History log of /freebsd-head/sys/sys/socketvar.h
Revision Date Author Comments
fb264c632664996fe170971f0e660a5eebe2aa95 23-Jul-2020 jhb <jhb@FreeBSD.org> Add support for KTLS RX via software decryption.

Allow TLS records to be decrypted in the kernel after being received
by a NIC. At a high level this is somewhat similar to software KTLS
for the transmit path except in reverse. Protocols enqueue mbufs
containing encrypted TLS records (or portions of records) into the
tail of a socket buffer and the KTLS layer decrypts those records
before returning them to userland applications. However, there is an
important difference:

- In the transmit case, the socket buffer is always a single "record"
holding a chain of mbufs. Not-yet-encrypted mbufs are marked not
ready (M_NOTREADY) and released to protocols for transmit by marking
mbufs ready once their data is encrypted.

- In the receive case, incoming (encrypted) data appended to the
socket buffer is still a single stream of data from the protocol,
but decrypted TLS records are stored as separate records in the
socket buffer and read individually via recvmsg().

Initially I tried to make this work by marking incoming mbufs as
M_NOTREADY, but there didn't seemed to be a non-gross way to deal with
picking a portion of the mbuf chain and turning it into a new record
in the socket buffer after decrypting the TLS record it contained
(along with prepending a control message). Also, such mbufs would
also need to be "pinned" in some way while they are being decrypted
such that a concurrent sbcut() wouldn't free them out from under the
thread performing decryption.

As such, I settled on the following solution:

- Socket buffers now contain an additional chain of mbufs (sb_mtls,
sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by
the protocol layer. These mbufs are still marked M_NOTREADY, but
soreceive*() generally don't know about them (except that they will
block waiting for data to be decrypted for a blocking read).

- Each time a new mbuf is appended to this TLS mbuf chain, the socket
buffer peeks at the TLS record header at the head of the chain to
determine the encrypted record's length. If enough data is queued
for the TLS record, the socket is placed on a per-CPU TLS workqueue
(reusing the existing KTLS workqueues and worker threads).

- The worker thread loops over the TLS mbuf chain decrypting records
until it runs out of data. Each record is detached from the TLS
mbuf chain while it is being decrypted to keep the mbufs "pinned".
However, a new sb_dtlscc field tracks the character count of the
detached record and sbcut()/sbdrop() is updated to account for the
detached record. After the record is decrypted, the worker thread
first checks to see if sbcut() dropped the record. If so, it is
freed (can happen when a socket is closed with pending data).
Otherwise, the header and trailer are stripped from the original
mbufs, a control message is created holding the decrypted TLS
header, and the decrypted TLS record is appended to the "normal"
socket buffer chain.

(Side note: the SBCHECK() infrastucture was very useful as I was
able to add assertions there about the TLS chain that caught several
bugs during development.)

Tested by: rmacklem (various versions)
Relnotes: yes
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D24628
46c2358a6391998027a71678c0134d37a9bf3b08 21-May-2020 markj <markj@FreeBSD.org> Fix ACCEPT_FILTER_DEFINE to pass the version to MODULE_VERSION.

MFC with: r361263
b20c56f366b63be2e4ebdd47808bbc31a2c3dd5e 19-May-2020 markj <markj@FreeBSD.org> Define a module version for accept filter modules.

Otherwise accept filters compiled into the kernel do not preempt
preloaded accept filter modules. Then, the preloaded file registers its
accept filter module before the kernel, and the kernel's attempt fails
since duplicate accept filter list entries are not permitted. This
causes the preloaded file's module to be released, since
module_register_init() does a lookup by name, so the preloaded file is
unloaded, and the accept filter's callback points to random memory since
preload_delete_name() unmaps the file on x86 as of r336505.

Add a new ACCEPT_FILTER_DEFINE macro which wraps the accept filter and
module definitions, and ensures that a module version is defined.

PR: 245870
Reported by: Thomas von Dein <freebsd@daemon.de>
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
a0a0568c0f34f37aea702fd749b8added4041770 18-Apr-2020 sjg <sjg@FreeBSD.org> Define enum for so_qstate outside of struct.

LLVM-9.0 clang++ throws an error for enum defined within
an anonymous struct.

Reviewed by: jtl, rpokala
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org//D24477
5c3de7e0d4feb3082f6f948232fa778af069efa7 14-Apr-2020 jtl <jtl@FreeBSD.org> Make sonewconn() overflow messages have per-socket rate-limits and values.

sonewconn() emits debug-level messages when a listen socket's queue
overflows. Currently, sonewconn() tracks overflows on a global basis. It
will only log one message every 60 seconds, regardless of how many sockets
experience overflows. And, when it next logs at the end of the 60 seconds,
it records a single message referencing a single PCB with the total number
of overflows across all sockets.

This commit changes to per-socket overflow tracking. The code will now
log one message every 60 seconds per socket. And, the code will provide
per-socket queue length and overflow counts. It also provides a way to
change the period between log messages using a sysctl.

Reviewed by: jhb (previous version), bcr (manpages)
MFC after: 2 weeks
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D24316
343e0898f34efe4240c93f8c38ecdb71ea11e63d 09-Mar-2019 bz <bz@FreeBSD.org> Try to improve comment for socket state bits.

In r324227 the comment moved into socketvar.h originally from
sockstate.h r180948. Try to improve English and as a consequence rewrap
the comment.

No functional changes.

Reviewed by: jhb (a wording suggestion)
Differential Revision: https://reviews.freebsd.org/D13865
c2135dd9fd45a9788c7605a2e2a4fc20240cd4b2 06-Nov-2018 brooks <brooks@FreeBSD.org> Regen after r340199: Use declared types for caddr_t arguments.

Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17852
8a6f698b859241cbaeeb02fcf98ea7e43cea82f6 04-Aug-2018 glebius <glebius@FreeBSD.org> Now that after r335979 the kernel addresses in API structures are
fixed size, there is no reason left for the unions.

Discussed with: brooks
6615ed4c6149c9810ba766e318f591d44a0596df 05-Jul-2018 brooks <brooks@FreeBSD.org> Make struct xinpcb and friends word-size independent.

Replace size_t members with ksize_t (uint64_t) and pointer members
(never used as pointers in userspace, but instead as unique
idenitifiers) with kvaddr_t (uint64_t). This makes the structs
identical between 32-bit and 64-bit ABIs.

On 64-bit bit systems, the ABI is maintained. On 32-bit systems,
this is an ABI breaking change. The ABI of most of these structs
was previously broken in r315662. This also imposes a small API
change on userspace consumers who must handle kernel pointers
becoming virtual addresses.

PR: 228301 (exp-run by antoine)
Reviewed by: jtl, kib, rwatson (various versions)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15386
7cf8a13d285b4f9fcdbf23f48496029a8160644f 08-Jun-2018 jtl <jtl@FreeBSD.org> Add a socket destructor callback. This allows kernel providers to set
callbacks to perform additional cleanup actions at the time a socket is

Michio Honda presented a use for this at BSDCan 2018.
(See https://www.bsdcan.org/2018/schedule/events/965.en.html .)

Submitted by: Michio Honda <micchie at sfc.wide.ad.jp> (previous version)
Reviewed by: lstewart (previous version)
Differential Revision: https://reviews.freebsd.org/D15706
d0aeaa5af7f77964d05bcb54e2f5cfdeb187edaf 06-Jun-2018 sbruno <sbruno@FreeBSD.org> Load balance sockets with new SO_REUSEPORT_LB option.

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by: Johannes Lundberg <johalun0@gmail.com>
Obtained from: DragonflyBSD
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
257e6e5563df4d5f0a939c72acdef3c49c440e38 24-Apr-2018 sbruno <sbruno@FreeBSD.org> Revert r332894 at the request of the submitter.

Submitted by: Johannes Lundberg <johalun0_gmail.com>
Sponsored by: Limelight Networks
bbf7d4dd035a71710ac94fe1ada4d99244102159 23-Apr-2018 sbruno <sbruno@FreeBSD.org> Load balance sockets with new SO_REUSEPORT_LB option

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

As DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by: Johannes Lundberg <johanlun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
daa4277b7c5fba0db519601f83a3a84cc3760079 07-Dec-2017 bz <bz@FreeBSD.org> Use correct field in the description for the lock after r319722.

Reviewed by: glebius
Sponsored by: iXsystems, Inc.
4736ccfd9c3411d50371d7f21f9450a47c19047e 20-Nov-2017 pfg <pfg@FreeBSD.org> sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
superceed or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
7168fac38845f2e3dded297a95af76ede2fc01f2 02-Oct-2017 glebius <glebius@FreeBSD.org> Hide struct socket and struct unpcb from the userland.

Violators may define _WANT_SOCKET and _WANT_UNPCB respectively and
are not guaranteed for stability of the structures. The violators
list is the the usual one: libprocstat(3) and netstat(1) internally
and lsof in ports.

In struct xunpcb remove the inclusion of kernel structure and add
a bunch of spare fields. The xsocket already has socket not included,
but add there spares as well. Embed xsockbuf into xsocket.

Sort declarations in sys/socketvar.h to separate kernel only from
userland available ones.

PR: 221820 (exp-run)
e35d543ec17b735ba76bb3311ed4a430d6cc945e 08-Jun-2017 glebius <glebius@FreeBSD.org> Listening sockets improvements.

o Separate fields of struct socket that belong to listening from
fields that belong to normal dataflow, and unionize them. This
shrinks the structure a bit.
- Take out selinfo's from the socket buffers into the socket. The
first reason is to support braindamaged scenario when a socket is
added to kevent(2) and then listen(2) is cast on it. The second
reason is that there is future plan to make socket buffers pluggable,
so that for a dataflow socket a socket buffer can be changed, and
in this case we also want to keep same selinfos through the lifetime
of a socket.
- Remove struct struct so_accf. Since now listening stuff no longer
affects struct socket size, just move its fields into listening part
of the union.
- Provide sol_upcall field and enforce that so_upcall_set() may be called
only on a dataflow socket, which has buffers, and for listening sockets
provide solisten_upcall_set().

o Remove ACCEPT_LOCK() global.
- Add a mutex to socket, to be used instead of socket buffer lock to lock
fields of struct socket that don't belong to a socket buffer.
- Allow to acquire two socket locks, but the first one must belong to a
listening socket.
- Make soref()/sorele() to use atomic(9). This allows in some situations
to do soref() without owning socket lock. There is place for improvement
here, it is possible to make sorele() also to lock optionally.
- Most protocols aren't touched by this change, except UNIX local sockets.
See below for more information.

o Reduce copy-and-paste in kernel modules that accept connections from
listening sockets: provide function solisten_dequeue(), and use it in
the following modules: ctl(4), iscsi(4), ng_btsocket(4), ng_ksocket(4),
infiniband, rpc.

o UNIX local sockets.
- Removal of ACCEPT_LOCK() global uncovered several races in the UNIX
local sockets. Most races exist around spawning a new socket, when we
are connecting to a local listening socket. To cover them, we need to
hold locks on both PCBs when spawning a third one. This means holding
them across sonewconn(). This creates a LOR between pcb locks and
- To fix the new LOR, abandon the global unp_list_lock in favor of global
unp_link_lock. Indeed, separating these two locks didn't provide us any
extra parralelism in the UNIX sockets.
- Now call into uipc_attach() may happen with unp_link_lock hold if, we
are accepting, or without unp_link_lock in case if we are just creating
a socket.
- Another problem in UNIX sockets is that uipc_close() basicly did nothing
for a listening socket. The vnode remained opened for connections. This
is fixed by removing vnode in uipc_close(). Maybe the right way would be
to do it for all sockets (not only listening), simply move the vnode
teardown from uipc_detach() to uipc_close()?

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D9770
3ef6278d8eb4172540967c5453c0764723d54988 07-Jun-2017 glebius <glebius@FreeBSD.org> Provide typedef for socket upcall function.
While here change so_gen_t type to modern uint64_t.
7e6cabd06e6caa6a02eeb86308dc0cb3f27e10da 28-Feb-2017 imp <imp@FreeBSD.org> Renumber copyright clause 4

Renumber cluase 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistance on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
efa6326974ec2cdb6721fec731bcd86758d0877c 18-Jan-2017 hselasky <hselasky@FreeBSD.org> Implement kernel support for hardware rate limited sockets.

- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network

Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
701697521cdd74f2f7cfee4e18bd672544300bb3 16-Jan-2017 sobomax <sobomax@FreeBSD.org> Add a new socket option SO_TS_CLOCK to pick from several different clock
sources to return timestamps when SO_TIMESTAMP is enabled. Two additional
clock sources are:

o nanosecond resolution realtime clock (equivalent of CLOCK_REALTIME);
o nanosecond resolution monotonic clock (equivalent of CLOCK_MONOTONIC).

In addition to this, this option provides unified interface to get bintime
(equivalent of using SO_BINTIME), except it also supported with IPv6 where
SO_BINTIME has never been supported. The long term plan is to depreciate
SO_BINTIME and move everything to using SO_TS_CLOCK.

Idea for this enhancement has been briefly discussed on the Net session
during dev summit in Ottawa last June and the general input was positive.

This change is believed to benefit network benchmarks/profiling as well
as other scenarios where precise time of arrival measurement is necessary.

There are two regression test cases as part of this commit: one extends unix
domain test code (unix_cmsg) to test new SCM_XXX types and another one
implementis totally new test case which exchanges UDP packets between two
processes using both conventional methods (i.e. calling clock_gettime(2)
before recv(2) and after send(2)), as well as using setsockopt()+recv() in
receive path. The resulting delays are checked for sanity for all supported
clock types.

Reviewed by: adrian, gnn
Differential Revision: https://reviews.freebsd.org/D9171
41574d71106b68fd70cfa13d6e2fae3b82511b89 22-Sep-2016 oshogbo <oshogbo@FreeBSD.org> capsicum: propagate rights on accept(2)

Descriptor returned by accept(2) should inherits capabilities rights from
the listening socket.

PR: 201052
Reviewed by: emaste, jonathan
Discussed with: many
Differential Revision: https://reviews.freebsd.org/D7724
af533198e352d1ee97e831497a7a004f6f1f2740 23-Jun-2016 np <np@FreeBSD.org> Add spares to struct ifnet and socket for packet pacing and/or general
use. Update comments regarding the spare fields in struct inpcb.

Bump __FreeBSD_version for the changes to the size of the structures.

Reviewed by: gnn@
Approved by: re@ (gjb@)
Sponsored by: Chelsio Communications
00d578928eca75be320b36d37543a7e2a4f9fbdb 27-May-2016 grehan <grehan@FreeBSD.org> Create branch for bhyve graphics import.
047e60c94a85e1381cd4e18c9e845ec34d5164a9 18-May-2016 glebius <glebius@FreeBSD.org> The SA-16:19 wouldn't have happened if the sockargs() had properly typed
argument for length. While here make it static and convert to ANSI C.

Reviewed by: C Turt
1a67ba1558f52a2f48bad071679228fd43f57cb9 29-Apr-2016 jhb <jhb@FreeBSD.org> Expose soaio_enqueue().

This can be used by protocol-specific AIO handlers to queue work to the
socket AIO daemon pool.

Sponsored by: Chelsio Communications
be47bc68fb065fc834ff51fea7df108abeae031c 01-Mar-2016 jhb <jhb@FreeBSD.org> Refactor the AIO subsystem to permit file-type-specific handling and
improve cancellation robustness.

Introduce a new file operation, fo_aio_queue, which is responsible for
queueing and completing an asynchronous I/O request for a given file.
The AIO subystem now exports library of routines to manipulate AIO
requests as well as the ability to run a handler function in the
"default" pool of AIO daemons to service a request.

A default implementation for file types which do not include an
fo_aio_queue method queues requests to the "default" pool invoking the
fo_read or fo_write methods as before.

The AIO subsystem permits file types to install a private "cancel"
routine when a request is queued to permit safe dequeueing and cleanup
of cancelled requests.

Sockets now use their own pool of AIO daemons and service per-socket
requests in FIFO order. Socket requests will not block indefinitely
permitting timely cancellation of all requests.

Due to the now-tight coupling of the AIO subsystem with file types,
the AIO subsystem is now a standard part of all kernels. The VFS_AIO
kernel option and aio.ko module are gone.

Many file types may block indefinitely in their fo_read or fo_write
callbacks resulting in a hung AIO daemon. This can result in hung
user processes (when processes attempt to cancel all outstanding
requests during exit) or a hung system. To protect against this, AIO
requests are only permitted for known "safe" files by default. AIO
requests for all file types can be enabled by setting the new
vfs.aio.enable_usafe sysctl to a non-zero value. The AIO tests have
been updated to skip operations on unsafe file types if the sysctl is

Currently, AIO requests on sockets and raw disks are considered safe
and are enabled by default. aio_mlock() is also enabled by default.

Reviewed by: cem, jilles
Discussed with: kib (earlier version)
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5289
bc275cae2e0ec60ed7c7fbab8494a5a444949d84 16-Feb-2016 jhb <jhb@FreeBSD.org> The locking annotations for struct sockbuf originally used the key from
struct socket. When sockbuf.h was moved out of socketvar.h, the locking
key was no longer nearby. Instead, add a new key for sockbuf and use
a single item for the socket buffer lock instead of separate entries for
receive vs send buffers.

Reviewed by: adrian
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D4901
21434c7a701096dfe0b99fcfe37d637fae83cb2b 05-Feb-2016 jhb <jhb@FreeBSD.org> Rename aiocblist to kaiocb and use consistent variable names.

Typically <foo>list is used for a structure that holds a list head in
FreeBSD, not for members of a list. As such, rename 'struct aiocblist'
to 'struct kaiocb' (the kernel version of 'struct aiocb').

While here, use more consistent variable names for AIO control blocks:

- Use 'job' instead of 'aiocbe', 'cb', 'cbe', or 'iocb' for kernel job
- Use 'jobn' instead of 'cbn' for use with TAILQ_FOREACH_SAFE().
- Use 'sjob' and 'sjobn' instead of 'scb' and 'scbn' for fsync jobs.
- Use 'ujob' instead of 'aiocbp', 'job', 'uaiocb', or 'uuaiocb' to hold
a user pointer to a 'struct aiocb'.
- Use 'ujobp' instead of 'aiocbp' for a user pointer to a 'struct aiocb *'.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5125
b04e8a547e7e868f83d0fab68feb768770323303 02-Feb-2016 alfred <alfred@FreeBSD.org> Increase max allowed backlog for listen sockets
from short to int.

PR: 203922
Submitted by: White Knight <white_knight@2ch.net>
MFC After: 4 weeks
fb04ebcd378613b7afed6327eab75e4ff940797a 31-Aug-2015 delphij <delphij@FreeBSD.org> MFC r286836:

so_vnet is constant after creation and no locking is necessary,
document this fact.
51cae3f88c82a54760d37ec2fede4d85c837aaa1 17-Aug-2015 delphij <delphij@FreeBSD.org> so_vnet is constant after creation and no locking is necessary,
document this fact.

(netmap have an assignment too but that socket object is on stack).

MFC after: 2 weeks
7f65d178b4fad420cb01a2ff4b3163278d3025dc 11-Apr-2015 mjg <mjg@FreeBSD.org> Replace struct filedesc argument in getsock_cap with struct thread

This is is a step towards removal of spurious arguments.
c0b38b545a543b8ec276ed09783b02222303d543 12-Nov-2014 glebius <glebius@FreeBSD.org> In preparation of merging projects/sendfile, transform bare access to
sb_cc member of struct sockbuf to a couple of inline functions:

sbavail() and sbused()

Right now they are equal, but once notion of "not ready socket buffer data",
will be checked in, they are going to be different.

Sponsored by: Netflix
Sponsored by: Nginx, Inc.
42d9d5479ec3ef6f25f954395919f8bb3d8575b2 09-Oct-2014 marcel <marcel@FreeBSD.org> Move the SCTP syscalls to netinet with the rest of the SCTP code. The
syscalls themselves are tightly coupled with the network stack and
therefore should not be in the generic socket code.

The following four syscalls have been marked as NOSTD so they can be
dynamically registered in sctp_syscalls_init() function:

The syscalls are also set up to be dynamically registered when COMPAT32
option is configured.

As a side effect of moving the SCTP syscalls, getsock_cap needs to be
made available outside of the uipc_syscalls.c source file. A proper
prototype has been added to the sys/socketvar.h header file.

API tests from the SCTP reference implementation have been run to ensure
compatibility. (http://code.google.com/p/sctp-refimpl/source/checkout)

Submitted by: Steve Kiernan <stevek@juniper.net>
Reviewed by: tuexen, rrs
Obtained from: Juniper Networks, Inc.
8bec6893f3070a04f4ac88a75e3ce142c178e1ea 18-Aug-2014 marcel <marcel@FreeBSD.org> For vendors like Juniper, extensibility for sockets is important. A
good example is socket options that aren't necessarily generic. To
this end, OSD is added to the socket structure and hooks are defined
for key operations on sockets. These are:
o soalloc() and sodealloc()
o Get and set socket options
o Socket related kevent filters.

One aspect about hhook that appears to be not fully baked is the return
semantics (the return value from the hook is ignored in hhook_run_hooks()
at the time of commit). To support return values, the socket_hhook_data
structure contains a 'status' field to hold return values.

Submitted by: Anuranjan Shukla <anshukla@juniper.net>
Obtained from: Juniper Networks, Inc.
eb1a5f8de9f7ea602c373a710f531abbf81141c4 21-Feb-2014 gjb <gjb@FreeBSD.org> Move ^/user/gjb/hacking/release-embedded up one directory, and remove
^/user/gjb/hacking since this is likely to be merged to head/ soon.

Sponsored by: The FreeBSD Foundation
6b01bbf146ab195243a8e7d43bb11f8835c76af8 27-Dec-2013 gjb <gjb@FreeBSD.org> Copy head@r259933 -> user/gjb/hacking/release-embedded for initial
inclusion of (at least) arm builds with the release.

Sponsored by: The FreeBSD Foundation
6796656333337e5530946dca854ffe0ce55b0cf0 16-Sep-2013 kib <kib@FreeBSD.org> Remove zero-copy sockets code. It only worked for anonymous memory,
and the equivalent functionality is now provided by sendfile(2) over
posix shared memory filedescriptor.

Remove the cow member of struct vm_page, and rearrange the remaining
members. While there, make hold_count unsigned.

Requested and reviewed by: alc
Tested by: pho
Sponsored by: The FreeBSD Foundation
Approved by: re (delphij)
0dd1d9c578ddc35507ac2072c5062f5d57c53147 28-Jun-2013 davide <davide@FreeBSD.org> - Trim an unused and bogus Makefile for mount_smbfs.
- Reconnect with some minor modifications, in particular now selsocket()
internals are adapted to use sbintime units after recent'ish calloutng
cc8c6e4d0185c640c9d03ed2804e3020ff84fed0 06-May-2013 andre <andre@FreeBSD.org> Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
e79bb9704b0801e384d3cf9953bd20277e72961d 10-Apr-2013 glebius <glebius@FreeBSD.org> Fix build.
702516e70b2669b5076691a0b760b4a37a8c06a2 02-Mar-2013 pjd <pjd@FreeBSD.org> - Implement two new system calls:

int bindat(int fd, int s, const struct sockaddr *addr, socklen_t addrlen);
int connectat(int fd, int s, const struct sockaddr *name, socklen_t namelen);

which allow to bind and connect respectively to a UNIX domain socket with a
path relative to the directory associated with the given file descriptor 'fd'.

- Add manual pages for the new syscalls.

- Make the new syscalls available for processes in capability mode sandbox.

- Add capability rights CAP_BINDAT and CAP_CONNECTAT that has to be present on
the directory descriptor for the syscalls to work.

- Update audit(4) to support those two new syscalls and to handle path
in sockaddr_un structure relative to the given directory descriptor.

- Update procstat(1) to recognize the new capability rights.

- Document the new capability rights in cap_rights_limit(2).

Sponsored by: The FreeBSD Foundation
Discussed with: rwatson, jilles, kib, des
d82ca0a363f911cee0897653d0a694a9118b9e07 08-Dec-2012 pjd <pjd@FreeBSD.org> The socket_zone UMA zone is now private to uipc_socket.c.
65d8b7120dda17d3319b5cb108caf8e3596f905f 18-Oct-2012 attilio <attilio@FreeBSD.org> Disconnect non-MPSAFE SMBFS from the build in preparation for dropping
GIANT from VFS. In addition, disconnect also netsmb, which is a base
requirement for SMBFS.

In the while SMBFS regular users can use FUSE interface and smbnetfs
port to work with their SMBFS partitions.

Also, there are ongoing efforts by vendor to support in-kernel smbfs,
so there are good chances that it will get relinked once properly locked.

This is not targeted for MFC.
d5e8d236f4009fc2611f996c317e94b2c8649cf5 12-Nov-2010 luigi <luigi@FreeBSD.org> This commit implements the SO_USER_COOKIE socket option, which lets
you tag a socket with an uint32_t value. The cookie can then be
used by the kernel for various purposes, e.g. setting the skipto
rule or pipe number in ipfw (this is the reason SO_USER_COOKIE has
been implemented; however there is nothing ipfw-specific in its

The ipfw-related code that uses the optopn will be committed separately.

This change adds a field to 'struct socket', but the struct is not
part of any driver or userland-visible ABI so the change should be

See the discussion at

Idea and code from Paul Joe, small modifications and manpage
changes by myself.

Submitted by: Paul Joe
MFC after: 1 week
09f9c897d33c41618ada06fbbcf1a9b3812dee53 19-Oct-2010 jamie <jamie@FreeBSD.org> A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.
b9d3291981a5e5a3fccba50333b2a4cdbe6486ac 18-Sep-2010 rwatson <rwatson@FreeBSD.org> With reworking of the socket life cycle in 7.x, the need for a "sotryfree()"
was eliminated: all references to sockets are explicitly managed by sorele()
and the protocols. As such, garbage collect sotryfree(), and update
sofree() comments to make the new world order more clear.

MFC after: 3 days
Reported by: Anuranjan Shukla <anshukla at juniper dot net>
3e540217974c33a858d099e8b23a14382c6aaffa 18-Jul-2010 trasz <trasz@FreeBSD.org> Revert r210225 - turns out I was wrong; the "/*-" is not license-only
thing; it's also used to indicate that the comment should not be automatically

Explained by: cperciva@
935237a66a797429103b05be92d72e3bada85e69 18-Jul-2010 trasz <trasz@FreeBSD.org> The "/*-" comment marker is supposed to denote copyrights. Remove non-copyright
occurences from sys/sys/ and sys/kern/.
f1216d1f0ade038907195fc114b7e630623b402c 19-Mar-2010 delphij <delphij@FreeBSD.org> Create a custom branch where I will be able to do the merge.
6c6bda868d12c26d6e0fe65000f5f3c00d0112d7 07-Jul-2009 kib <kib@FreeBSD.org> Fix poll(2) and select(2) for named pipes to return "ready for read"
when all writers, observed by reader, exited. Use writer generation
counter for fifo, and store the snapshot of the fifo generation in the
f_seqcount field of struct file, that is otherwise unused for fifos.
Set FreeBSD-undocumented POLLINIGNEOF flag only when file f_seqcount is
equal to fifo' fi_wgen, and revert r89376.

Fix POLLINIGNEOF for sockets and pipes, and return POLLHUP for them.
Note that the patch does not fix not returning POLLHUP for fifos.

PR: kern/94772
Submitted by: bde (original version)
Reviewed by: rwatson, jilles
Approved by: re (kensmith)
MFC after: 6 weeks (might be)
e66ed06df4c21fd9f4560ed95be34c4f650648dc 22-Jun-2009 andre <andre@FreeBSD.org> Add soreceive_stream(), an optimized version of soreceive() for
stream (TCP) sockets.

It is functionally identical to generic soreceive() but has a
number stream specific optimizations:
o does only one sockbuf unlock/lock per receive independent of
the length of data to be moved into the uio compared to
soreceive() which unlocks/locks per *mbuf*.
o uses m_mbuftouio() instead of its own copy(out) variant.
o much more compact code flow as a large number of special
cases is removed.
o much improved reability.

It offers significantly reduced CPU usage and lock contention
when receiving fast TCP streams. Additional gains are obtained
when the receiving application is using SO_RCVLOWAT to batch up
some data before a read (and wakeup) is done.

This function was written by "reverse engineering" and is not
just a stripped down variant of soreceive().

It is not yet enabled by default on TCP sockets. Instead it is
commented out in the protocol initialization in tcp_usrreq.c
until more widespread testing has been done.

Testers, especially with 10GigE gear, are welcome.

MFP4: r164817 //depot/user/andre/soreceive_stream/
a1af9ecca44f362b24fe3a8342ca6ed8676a399c 01-Jun-2009 jhb <jhb@FreeBSD.org> Rework socket upcalls to close some races with setup/teardown of upcalls.
- Each socket upcall is now invoked with the appropriate socket buffer
locked. It is not permissible to call soisconnected() with this lock
held; however, so socket upcalls now return an integer value. The two
possible values are SU_OK and SU_ISCONNECTED. If an upcall returns
SU_ISCONNECTED, then the soisconnected() will be invoked on the
socket after the socket buffer lock is dropped.
- A new API is provided for setting and clearing socket upcalls. The
API consists of soupcall_set() and soupcall_clear().
- To simplify locking, each socket buffer now has a separate upcall.
- When a socket upcall returns SU_ISCONNECTED, the upcall is cleared from
the receive socket buffer automatically. Note that a SO_SND upcall
should never return SU_ISCONNECTED.
- All this means that accept filters should now return SU_ISCONNECTED
instead of calling soisconnected() directly. They also no longer need
to explicitly clear the upcall on the new socket.
- The HTTP accept filter still uses soupcall_set() to manage its internal
state machine, but other accept filters no longer have any explicit
knowlege of socket upcall internals aside from their return value.
- The various RPC client upcalls currently drop the socket buffer lock
while invoking soreceive() as a temporary band-aid. The plan for
the future is to add a new flag to allow soreceive() to be called with
the socket buffer locked.
- The AIO callback for socket I/O is now also invoked with the socket
buffer locked. Previously sowakeup() would drop the socket buffer
lock only to call aio_swake() which immediately re-acquired the socket
buffer lock for the duration of the function call.

Discussed with: rwatson, rmacklem
39b6dc8ba2de1c81754454858aae4fc4b706bdbf 30-Apr-2009 zec <zec@FreeBSD.org> Permit buiding kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with additional two fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by: bz, rwatson
Approved by: julian (mentor)
19b6af98ec71398e77874582eb84ec5310c7156f 22-Nov-2008 dfr <dfr@FreeBSD.org> Clone Kip's Xen on stable/6 tree so that I can work on improving FreeBSD/amd64
performance in Xen's HVM mode.
cf5320822f93810742e3d4a1ac8202db8482e633 19-Oct-2008 lulf <lulf@FreeBSD.org> - Import the HEAD csup code which is the basis for the cvsmode work.
8437a9c28f5c1146c87f6e46ed31da910f2c4ea0 15-Sep-2008 rwatson <rwatson@FreeBSD.org> Merge r180198, r180211, r180365, r182682 from head to stable/7:

Add soreceive_dgram(9), an optimized socket receive function for use by
datagram-only protocols, such as UDP. This version removes use of
sblock(), which is not required due to an inability to interlace data
improperly with datagrams, as well as avoiding some of the larger loops
and state management that don't apply on datagram sockets.

This is experimental code, so hook it up only for UDPv4 for testing; if
there are problems we may need to revise it or turn it off by default,
but it offers *significant* performance improvements for threaded UDP
applications such as BIND9, nsd, and memcached using UDP.

Tested by: kris, ps

Update copyright date in light of soreceive_dgram(9).

Use soreceive_dgram() and sosend_dgram() with UDPv6, as we do with UDPv4.

Tested by: ps

Remove XXXRW in soreceive_dgram that proves unnecessary.

Remove unused orig_resid variable in soreceive_dgram.

Submitted by: alfred

Note: in the MFC, we do enable sosend_dgram for UDPv6 by default (it was
already used for UDPv4), but use of soreceive_dgram for both UDPv4 and
UDPv6 is controlled by a new loader tunable,
net.inet.udp.soreceive_dgram_enabled as soreceive_dgram has less testing
exposure than sosend_dgram. We may wish to change the default (and
eliminate the tunable) in 7.2.

MFC requested by: gnn, kris, ps
Approved by: re (kib)
eb03f0a2ab2e6b52c21928916540cce7735ae97f 31-Jul-2008 kmacy <kmacy@FreeBSD.org> MFC sockbuf refactoring and fixes to cxgb TOE
3bdaf5becf92a80bf4532a43163b41c1aca24d62 31-Jul-2008 kmacy <kmacy@FreeBSD.org> move sockbuf locking macros in to sockbuf.h
b51643853ae77a26f1758d2d8fff5e131a2ac9e1 29-Jul-2008 kmacy <kmacy@FreeBSD.org> Factor sockbuf, sockopt, and sockstate out of socketvar.h in to separate headers.

Reviewed by: rwatson
MFC after: 3 days
dc8d54c205784683ec1aae7ecf1f24fe1f6cb2c0 24-Jul-2008 julian <julian@FreeBSD.org> MFC an ABI compatible implementation of Multiple routing tables.
See the commit message for
version 1.129 (svn change # 178888) for more info.

Obtained from: Ironport (Cisco Systems)
0c50a62527ee21e24fb9e93a0ebf67df0043d601 02-Jul-2008 rwatson <rwatson@FreeBSD.org> Add soreceive_dgram(9), an optimized socket receive function for use by
datagram-only protocols, such as UDP. This version removes use of
sblock(), which is not required due to an inability to interlace data
improperly with datagrams, as well as avoiding some of the larger loops
and state management that don't apply on datagram sockets.

This is experimental code, so hook it up only for UDPv4 for testing; if
there are problems we may need to revise it or turn it off by default,
but it offers *significant* performance improvements for threaded UDP
applications such as BIND9, nsd, and memcached using UDP.

Tested by: kris, ps
368bdf05e9188bfc74fa11ed733351ed1c1e5c2c 15-May-2008 gnn <gnn@FreeBSD.org> Update the kernel to count the number of mbufs and clusters
(all types) used per socket buffer.

Add support to netstat to print out all of the socket buffer

Update the netstat manual page to describe the new -x flag
which gives the extended output.

Reviewed by: rwatson, julian
1dfc5c98a4f7c32163dfdc61e390ccf805385108 09-May-2008 julian <julian@FreeBSD.org> Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:


One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
packet streams to be routed by more than just the destination address.


I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extent that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X] when for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us as to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. if nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility call setfib
that acts a bit like nice..

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being reponded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect withthe proces that set up the tunnel.
thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes

ipfw has grown 2 new keywords:

setfib N ip from anay to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. it will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. this will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing the data.
instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibilty of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the desination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlair (parts each)
Obtained from: Ironport systems/Cisco
47fbdc3fb8c8646192c926a5c5aa3f6005444d7c 01-Mar-2008 rwatson <rwatson@FreeBSD.org> Merge uipc_sockbuf.c:1.176, uipc_socket.c:1.305, socketvar.h:1.162 from

Further clean up sorflush:

- Expose sbrelease_internal(), a variant of sbrelease() with no
expectations about the validity of locks in the socket buffer.
- Use sbrelease_internel() in sorflush(), and as a result avoid
initializing and destroying a socket buffer lock for the temporary
stack copy of the actual buffer, asb.
- Add a comment indicating why we do what we do, and remove an XXX
since things have gotten less ugly in sorflush() lately.

This makes socket close cleaner, and possibly also marginally faster.
f23198af5c5d671548e3ad76cf9a569dbc55bd0d 04-Feb-2008 rwatson <rwatson@FreeBSD.org> Further clean up sorflush:

- Expose sbrelease_internal(), a variant of sbrelease() with no
expectations about the validity of locks in the socket buffer.
- Use sbrelease_internel() in sorflush(), and as a result avoid intializing
and destroying a socket buffer lock for the temporary stack copy of the
actual buffer, asb.
- Add a comment indicating why we do what we do, and remove an XXX since
things have gotten less ugly in sorflush() lately.

This makes socket close cleaner, and possibly also marginally faster.

MFC after: 3 weeks
cd6e8bfd1a7c8bca0a7af4785bae62c85b7971b1 01-Feb-2008 rwatson <rwatson@FreeBSD.org> Merge uipc_sockbuf.c:1.175, uipc_socket.c:1.304, uipc_syscalls.c:1.264,
sctp_input.c:1.67, sctp_peeloff.c:1.17, sctputil.c:1.73,
socketvar.h:1.161 from HEAD to RELENG_7:

Correct two problems relating to sorflush(), which is called to flush
read socket buffers in shutdown() and close():

- Call socantrcvmore() before sblock() to dislodge any threads that
might be sleeping (potentially indefinitely) while holding sblock(),
such as a thread blocked in recv().

- Flag the sblock() call as non-interruptible so that a signal
delivered to the thread calling sorflush() doesn't cause sblock() to
fail. The sblock() is required to ensure that all other socket
consumer threads have, in fact, left, and do not enter, the socket
buffer until we're done flushin it.

To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag. When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.

Reviewed by: bz, kmacy
Reported by: Jos Backus <jos at catnook dot com>
c57fa547596c416598879f29dc61157e959392bd 31-Jan-2008 rwatson <rwatson@FreeBSD.org> Correct two problems relating to sorflush(), which is called to flush
read socket buffers in shutdown() and close():

- Call socantrcvmore() before sblock() to dislodge any threads that
might be sleeping (potentially indefinitely) while holding sblock(),
such as a thread blocked in recv().

- Flag the sblock() call as non-interruptible so that a signal
delivered to the thread calling sorflush() doesn't cause sblock() to
fail. The sblock() is required to ensure that all other socket
consumer threads have, in fact, left, and do not enter, the socket
buffer until we're done flushin it.

To implement the latter, change the 'flags' argument to sblock() to
accept two flags, SBL_WAIT and SBL_NOINTR, rather than one M_WAITOK
flag. When SBL_NOINTR is set, it forces a non-interruptible sx
acquisition, regardless of the setting of the disposition of SB_NOINTR
on the socket buffer; without this change it would be possible for
another thread to clear SB_NOINTR between when the socket buffer mutex
is released and sblock() is invoked.

Reviewed by: bz, kmacy
Reported by: Jos Backus <jos at catnook dot com>
41d59439f83322b0bb95a3ae4e5ddb1d9886f5c1 17-Dec-2007 kmacy <kmacy@FreeBSD.org> Add SB_NOCOALESCE flag to disable socket buffer update in place
4ec9caf00c407884fc51ff18adb8ff16a5eb75c0 16-Dec-2007 jeff <jeff@FreeBSD.org> Refactor select to reduce contention and hide internal implementation
details from consumers.

- Track individual selecters on a per-descriptor basis such that there
are no longer collisions and after sleeping for events only those
descriptors which triggered events must be rescaned.
- Protect the selinfo (per descriptor) structure with a mtx pool mutex.
mtx pool mutexes were chosen to preserve api compatibility with
existing code which does nothing but bzero() to setup selinfo
- Use a per-thread wait channel rather than a global wait channel.
- Hide select implementation details in a seltd structure which is
opaque to the rest of the kernel.
- Provide a 'selsocket' interface for those kernel consumers who wish to
select on a socket when they have no fd so they no longer have to
be aware of select implementation details.

Tested by: kris
Reviewed on: arch
0166bdd415d715f2adb19684f25e6ef8b3d16993 03-Jul-2007 rwatson <rwatson@FreeBSD.org> Fix a bug in sblock() that has existed since revision 1.1 from BSD:
correctly return an error if M_NOWAIT is passed to sblock() and the
operation might block. This remarkably subtle macro bug appears to
be responsible for quite a few undiagnosed socket buffer corruption
and mbuf-related kernel panics.

This bug has already been fixed in 7-CURRENT as part of the move to
using sx(9) locks to serialize simultaneous socket I/O, but is an
MFC candidate for all earlier FreeBSD -STABLE branches.

MFC after: 2 weeks
Found by: Isilon
Submitted by: jeff
Tested by: jhb, Yahoo!
20848234d96d64f1050c92166ebffc88afcc2fae 03-May-2007 rwatson <rwatson@FreeBSD.org> sblock() implements a sleep lock by interlocking SB_WANT and SB_LOCK flags
on each socket buffer with the socket buffer's mutex. This sleep lock is
used to serialize I/O on sockets in order to prevent I/O interlacing.

This change replaces the custom sleep lock with an sx(9) lock, which
results in marginally better performance, better handling of contention
during simultaneous socket I/O across multiple threads, and a cleaner
separation between the different layers of locking in socket buffers.
Specifically, the socket buffer mutex is now solely responsible for
serializing simultaneous operation on the socket buffer data structure,
and not for I/O serialization.

While here, fix two historic bugs:

(1) a bug allowing I/O to be occasionally interlaced during long I/O
operations (discovere by Isilon).

(2) a bug in which failed non-blocking acquisition of the socket buffer
I/O serialization lock might be ignored (discovered by sam).

SCTP portion of this patch submitted by rrs.
693f8d3cab1392ad4c214eadd9de209ea6b628e4 19-Mar-2007 andre <andre@FreeBSD.org> Space to tab in SB_* defines to match with rest of file.
515a501549ecde68391cfcda8fa7589d3be40b09 19-Mar-2007 andre <andre@FreeBSD.org> Maintain a pointer and offset pair into the socket buffer mbuf chain to
avoid traversal of the entire socket buffer for larger offsets on stream

Adjust tcp_output() make use of it.

Tested by: gallatin
7289d7c6981ff64f8c9d112326c7dfcc5b1be0bf 12-Mar-2007 ru <ru@FreeBSD.org> MFC: Don't block on the socket zone limit during the socket()
syscall which can lock up a system otherwise; instead, return
ENOBUFS as documented, which matches the FreeBSD 4.x behavior.
ad9bb7722c1df7084e512c80e86b0b022110ed5b 01-Feb-2007 andre <andre@FreeBSD.org> Generic socket buffer auto sizing support, header defines, flag inheritance.

MFC after: 1 month
cedd19512d5ec8d27159476420b827e2e01731e1 01-Aug-2006 rwatson <rwatson@FreeBSD.org> Reimplement socket buffer tear-down in sofree(): as the socket is no
longer referenced by other threads (hence our freeing it), we don't need
to set the can't send and can't receive flags, wake up the consumers,
perform two levels of locking, etc. Implement a fast-path teardown,
sbdestroy(), which flushes and releases each socket buffer. A manual
dom_dispose of the receive buffer is still required explicitly to GC
any in-flight file descriptors, etc, before flushing the buffer.

This results in a 9% UP performance improvement and 16% SMP performance
improvement on a tight loop of socket();close(); in micro-benchmarking,
but will likely also affect CPU-bound macro-benchmark performance.
40868fda8a777cd2639ad89e252401b8b27c8c91 24-Jul-2006 rwatson <rwatson@FreeBSD.org> soreceive_generic(), and sopoll_generic(). Add new functions sosend(),
soreceive(), and sopoll(), which are wrappers for pru_sosend,
pru_soreceive, and pru_sopoll, and are now used univerally by socket
consumers rather than either directly invoking the old so*() functions
or directly invoking the protocol switch method (about an even split
prior to this commit).

This completes an architectural change that was begun in 1996 to permit
protocols to provide substitute implementations, as now used by UDP.
Consumers now uniformly invoke sosend(), soreceive(), and sopoll() to
perform these operations on sockets -- in particular, distributed file
systems and socket system calls.

Architectural head nod: sam, gnn, wollman
785a134681af45ce9e5b4061151f88bc872206bf 24-Jul-2006 rwatson <rwatson@FreeBSD.org> Tweak so_gencnt comment: it was once last, but that is no longer the
7d6ff37cf67ce6f0e21be705c1ebae379018f92c 24-Jul-2006 rwatson <rwatson@FreeBSD.org> Tweak comment for so_head: it is a pointer to the listen socket, rather
than the accept socket.
7880900fff58afaf260770bed12e821c823daf49 28-Jun-2006 rwatson <rwatson@FreeBSD.org> Merge uipc_socket2.c:1.158 and socketvar.h:1.150 from HEAD to RELENG_6:

Remove sbinsertoob(), sbinsertoob_locked(). They violate (and have
basically always violated) invariannts of soreceive(), which assume
that the first mbuf pointer in a receive socket buffer can't change
while the SB_LOCK sleepable lock is held on the socket buffer,
which is precisely what these functions do. No current protocols
invoke these functions, and removing them will help discourage them
from ever being used. I should have removed them years ago, but
lost track of it.

Prodded almost by accident by: peter
a1677cd65490dc2dd3c35b441958968477f1181b 17-Jun-2006 rwatson <rwatson@FreeBSD.org> Remove sbinsertoob(), sbinsertoob_locked(). They violate (and have
basically always violated) invariannts of soreceive(), which assume
that the first mbuf pointer in a receive socket buffer can't change
while the SB_LOCK sleepable lock is held on the socket buffer,
which is precisely what these functions do. No current protocols
invoke these functions, and removing them will help discourage them
from ever being used. I should have removed them years ago, but
lost track of it.

MFC after: 1 week
Prodded almost by accident by: peter
120490c1a53b9a5e5df4f242c0ea5afd571503e5 10-Jun-2006 rwatson <rwatson@FreeBSD.org> Move some functions and definitions from uipc_socket2.c to uipc_socket.c:

- Move sonewconn(), which creates new sockets for incoming connections on
listen sockets, so that all socket allocate code is together in

- Move 'maxsockets' and associated sysctls to uipc_socket.c with the
socket allocation code.

- Move kern.ipc sysctl node to uipc_socket.c, add a SYSCTL_DECL() for it
to sysctl.h and remove lots of scattered implementations in various
IPC modules.

- Sort sodealloc() after soalloc() in uipc_socket.c for dependency order
reasons. Statisticize soalloc() and sodealloc() as they are now
required only in uipc_socket.c, and are internal to the socket

After this change, socket allocation and deallocation is entirely
centralized in one file, and uipc_socket2.c consists entirely of socket
buffer manipulation and default protocol switch functions.

MFC after: 1 month
7f08bc3477d0953ebfa9c4a06cbee4950933d1d4 01-Apr-2006 rwatson <rwatson@FreeBSD.org> Add a comment describing SS_PROTOREF in detail. This will eventually be
in socket(9).

MFC after: 3 months
053507bd40c99c756879636521e0827573f88946 16-Mar-2006 rwatson <rwatson@FreeBSD.org> Change soabort() from returning int to returning void, since all
consumers ignore the return value, soabort() is required to succeed,
and protocols produce errors here to report multiple freeing of the
pcb, which we hope to eliminate.
bd6bdecfd4708ce7d5a0e17b281315f931e3ed47 15-Mar-2006 rwatson <rwatson@FreeBSD.org> Correct spelling of 0x4000 in previous commit. This one line change from
a 42k patch seemed easier to retype than apply, but apparently not. :-)

Submitted by: pjd
ca3dcadafa770c452b90dc4d96ecdb7eebdb1937 15-Mar-2006 rwatson <rwatson@FreeBSD.org> Add SS_PROTOREF socket flag, which represents a strong reference by the
protocol to the socket. Normally protocol references are weak: that is,
the socket layer can tear down the socket (and hence protocol state)
when it finds convenient. This flag will allow the protocol to
explicitly declare to the socket layer that it is maintaining a
strong reference, rather than the current implicit model associated
with so_pcb pointer values and repeated attempts to possibly free the
6bf4e0e89f721a5ea3aabca5bfc505748b6e7fd9 13-Jan-2006 rwatson <rwatson@FreeBSD.org> Add sosend_dgram(), a greatly reduced and simplified version of sosend()
intended for use solely with atomic datagram socket types, and relies
on the previous break-out of sosend_copyin(). Changes to allow UDP to
optionally use this instead of sosend() will be committed as a
0df84f5a83def3e1548e39bb43b0c177d50265fa 02-Nov-2005 andre <andre@FreeBSD.org> Retire MT_HEADER mbuf type and change its users to use MT_DATA.

Having an additional MT_HEADER mbuf type is superfluous and redundant
as nothing depends on it. It only adds a layer of confusion. The
distinction between header mbuf's and data mbuf's is solely done
through the m->m_flags M_PKTHDR flag.

Non-native code is not changed in this commit. For compatibility
MT_HEADER is mapped to MT_DATA.

Sponsored by: TCP/IP Optimization Fundraise 2005
49831ed8da06ffadf684484671b4561c9ac55f9f 30-Oct-2005 rwatson <rwatson@FreeBSD.org> Push the assignment of a new or updated so_qlimit from solisten()
following the protocol pru_listen() call to solisten_proto(), so
that it occurs under the socket lock acquisition that also sets
SO_ACCEPTCONN. This requires passing the new backlog parameter
to the protocol, which also allows the protocol to be aware of
changes in queue limit should it wish to do something about the
new queue limit. This continues a move towards the socket layer
acting as a library for the protocol.

Bump __FreeBSD_version due to a change in the in-kernel protocol
interface. This change has been tested with IPv4 and UNIX domain
sockets, but not other protocols.
8c5a4b8551b185d28592d3d03103bfcf97ac4955 09-Jul-2005 jhb <jhb@FreeBSD.org> Document that SOCK_LOCK is used to protect so_emuldata.

Approved by: re (scottl)
20482cce4b243e3d1213d9b49edd0bae22d03c37 12-Mar-2005 rwatson <rwatson@FreeBSD.org> Move the logic implementing retrieval of the SO_ACCEPTFILTER socket option
from uipc_socket.c to uipc_accf.c in do_getopt_accept_filter(), so that it
now matches do_setopt_accept_filter(). Slightly reformulate the logic to
match the optimistic allocation of storage for the argument in advance,
and slightly expand the coverage of the socket lock.
26df80bf2cb36ff3fb93ad6eef085062a90896c6 21-Feb-2005 rwatson <rwatson@FreeBSD.org> In the current world order, solisten() implements the state transition of
a socket from a regular socket to a listening socket able to accept new
connections. As part of this state transition, solisten() calls into the
protocol to update protocol-layer state. There were several bugs in this
implementation that could result in a race wherein a TCP SYN received
in the interval between the protocol state transition and the shortly
following socket layer transition would result in a panic in the TCP code,
as the socket would be in the TCPS_LISTEN state, but the socket would not
have the SO_ACCEPTCONN flag set.

This change does the following:

- Pushes the socket state transition from the socket layer solisten() to
to socket "library" routines called from the protocol. This permits
the socket routines to be called while holding the protocol mutexes,
preventing a race exposing the incomplete socket state transition to TCP
after the TCP state transition has completed. The check for a socket
layer state transition is performed by solisten_proto_check(), and the
actual transition is performed by solisten_proto().

- Holds the socket lock for the duration of the socket state test and set,
and over the protocol layer state transition, which is now possible as
the socket lock is acquired by the protocol layer, rather than vice
versa. This prevents additional state related races in the socket

This permits the dual transition of socket layer and protocol layer state
to occur while holding locks for both layers, making the two changes
atomic with respect to one another. Similar changes are likely require
elsewhere in the socket/protocol code.

Reported by: Peter Holm <peter@holm.cc>
Review and fixes from: emax, Antoine Brodin <antoine.brodin@laposte.net>
Philosophical head nod: gnn
cb47ade08f79b52cb97c0c7593f4925607de5733 18-Feb-2005 rwatson <rwatson@FreeBSD.org> Move do_setopt_accept_filter() from uipc_socket.c to uipc_accf.c, where
the rest of the accept filter code currently lives.

MFC after: 3 days
dbb0b5a00bb5d064b3dd792dda81f23148866116 30-Jan-2005 glebius <glebius@FreeBSD.org> Move sb_state to the beginning of structure, above sb_startzero member.
sb_state shouldn't be erased, when socket buffer is flushed by sorflush().

When sb_state was bzero'ed, a recently set SBS_CANTRCVMORE flag was cleared.
If a socket was shutdown(SHUT_RD), a subsequent read() would block on it.

Reported by: Ed Maste, Gerrit Nagelhout
Reviewed by: rwatson
d084122f3697505f1422b27c5d64204b1bcf66d4 24-Jan-2005 glebius <glebius@FreeBSD.org> - Convert so_qlen, so_incqlen, so_qlimit fields of struct socket from
short to unsigned short.
- Add SYSCTL_PROC() around somaxconn, not accepting values < 1 or > U_SHRTMAX.

Before this change setting somaxconn to smth above 32767 and calling
listen(fd, -1) lead to a socket, which doesn't accept connections at all.

Reviewed by: rwatson
Reported by: Igor Sysoev
4b81ce6dd2658abba782e835143f8008092c1c6f 18-Oct-2004 rwatson <rwatson@FreeBSD.org> Push acquisition of the accept mutex out of sofree() into the caller

- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>
405e05f570f5bcf7eecd7cef8ee07b5300a8e0c3 09-Oct-2004 rwatson <rwatson@FreeBSD.org> Add SOCKBUF_UNLOCK_ASSERT(), which asserts that the current thread does
not hold the mutex for a socket buffer.
6ff1185c1dcb3500ceef5f5cd01d8ecafa452497 12-Jul-2004 dwmalone <dwmalone@FreeBSD.org> Rename Alfred's kern_setsockopt to so_setsockopt, as this seems a
a better name. I have a kern_[sg]etsockopt which I plan to commit
shortly, but the arguments to these function will be quite different
from so_setsockopt.

Approved by: alfred
031e087d2c3a8f23b40ec3cd62b741f935e1e2e6 12-Jul-2004 alfred <alfred@FreeBSD.org> Use SO_REUSEADDR and SO_REUSEPORT when reconnecting NFS mounts.
Tune the timeout from 5 seconds to 12 seconds.
Provide a sysctl to show how many reconnects the NFS client has done.

Seems to fix IPv6 from: kuriyama
0fb2a4694cbff42e959de64a46ae35cd03964456 27-Jun-2004 rwatson <rwatson@FreeBSD.org> Annotate so_gencnt field of struct socket locked by so_global_mtx.
856e57e291cf451220a7c3a99ee6eafc5791133c 24-Jun-2004 rwatson <rwatson@FreeBSD.org> Annotate so_error as being used for simple assignment and reads, and
therefore not locked.

Assert the socket buffer lock in sowwakeup_locked() to match
933bf044b5624ccde9f51b2a3ef6754b1b318c54 24-Jun-2004 rwatson <rwatson@FreeBSD.org> Annotate which SB_ constants are for sb_flags fields.
60a4c150d340370b7b6f2d5b41101618182638fe 24-Jun-2004 rwatson <rwatson@FreeBSD.org> Protect so_oobmark with with SOCKBUF_LOCK(&so->so_rcv), and broaden
locking in tcp_input() for TCP packets with urgent data pointers to
hold the socket buffer lock across testing and updating oobmark
from just protecting sb_state.

Update socket locking annotations
caac080ec9db76e581f66abcc30718b1968b0d1c 24-Jun-2004 rwatson <rwatson@FreeBSD.org> Introduce sbreserve_locked(), which asserts the socket buffer lock on
the socket buffer having its limits adjusted. sbreserve() now acquires
the lock before calling sbreserve_locked(). In soreserve(), acquire
socket buffer locks across read-modify-writes of socket buffer fields,
and calls into sbreserve/sbrelease; make sure to acquire in keeping
with the socket buffer lock order. In tcp_mss(), acquire the socket
buffer lock in the calling context so that we have atomic read-modify
-write on buffer sizes.
21164a78ac496af0257d2d8e6a9cac3101bc6005 21-Jun-2004 rwatson <rwatson@FreeBSD.org> Merge next step in socket buffer locking:

- sowakeup() now asserts the socket buffer lock on entry. Move
the call to KNOTE higher in sowakeup() so that it is made with
the socket buffer lock held for consistency with other calls.
Release the socket buffer lock prior to calling into pgsigio(),
so_upcall(), or aio_swake(). Locking for this event management
will need revisiting in the future, but this model avoids lock
order reversals when upcalls into other subsystems result in
socket/socket buffer operations. Assert that the socket buffer
lock is not held at the end of the function.

- Wrapper macros for sowakeup(), sorwakeup() and sowwakeup(), now
have _locked versions which assert the socket buffer lock on
entry. If a wakeup is required by sb_notify(), invoke
sowakeup(); otherwise, unconditionally release the socket buffer
lock. This results in the socket buffer lock being released
whether a wakeup is required or not.

- Break out socantsendmore() into socantsendmore_locked() that
asserts the socket buffer lock. socantsendmore()
unconditionally locks the socket buffer before calling
socantsendmore_locked(). Note that both functions return with
the socket buffer unlocked as socantsendmore_locked() calls
sowwakeup_locked() which has the same properties. Assert that
the socket buffer is unlocked on return.

- Break out socantrcvmore() into socantrcvmore_locked() that
asserts the socket buffer lock. socantrcvmore() unconditionally
locks the socket buffer before calling socantrcvmore_locked().
Note that both functions return with the socket buffer unlocked
as socantrcvmore_locked() calls sorwakeup_locked() which has
similar properties. Assert that the socket buffer is unlocked
on return.

- Break out sbrelease() into a sbrelease_locked() that asserts the
socket buffer lock. sbrelease() unconditionally locks the
socket buffer before calling sbrelease_locked().
sbrelease_locked() now invokes sbflush_locked() instead of

- Assert the socket buffer lock in socket buffer sanity check
functions sblastrecordchk(), sblastmbufchk().

- Assert the socket buffer lock in SBLINKRECORD().

- Break out various sbappend() functions into sbappend_locked()
(and variations on that name) that assert the socket buffer
lock. The !_locked() variations unconditionally lock the socket
buffer before calling their _locked counterparts. Internally,
make sure to call _locked() support routines, etc, if already
holding the socket buffer lock.

- Break out sbinsertoob() into sbinsertoob_locked() that asserts
the socket buffer lock. sbinsertoob() unconditionally locks the
socket buffer before calling sbinsertoob_locked().

- Break out sbflush() into sbflush_locked() that asserts the
socket buffer lock. sbflush() unconditionally locks the socket
buffer before calling sbflush_locked(). Update panic strings
for new function names.

- Break out sbdrop() into sbdrop_locked() that asserts the socket
buffer lock. sbdrop() unconditionally locks the socket buffer
before calling sbdrop_locked().

- Break out sbdroprecord() into sbdroprecord_locked() that asserts
the socket buffer lock. sbdroprecord() unconditionally locks
the socket buffer before calling sbdroprecord_locked().

- sofree() now calls socantsendmore_locked() and re-acquires the
socket buffer lock on return. It also now calls

- sorflush() now calls socantrcvmore_locked() and re-acquires the
socket buffer lock on return. Clean up/mess up other behavior
in sorflush() relating to the temporary stack copy of the socket
buffer used with dom_dispose by more properly initializing the
temporary copy, and selectively bzeroing/copying more carefully
to prevent WITNESS from getting confused by improperly
initialized mutexes. Annotate why that's necessary, or at
least, needed.

- soisconnected() now calls sbdrop_locked() before unlocking the
socket buffer to avoid locking overhead.

Some parts of this change were:

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
e9c176f24779f794f2b4d84ff880a7682c36d086 20-Jun-2004 rwatson <rwatson@FreeBSD.org> Annotate so_state as locked with SOCK_LOCK(so).

Add a commenting indicating that the SB_ constants apply to sb_flags.
645f8869ff3381fc306945af41297d833cf9ae6a 15-Jun-2004 rwatson <rwatson@FreeBSD.org> Fill in locking annotation for additional socket fields:

so_timeo Used as a sleep/wakeup address, no locking.
sb_* Almost all socket buffer fields locked with
sockbuf lock for the oskcet buffer.
so_cred Static after socket creation.
1288c81d474991b1b36834f0e8e12358cf2b2479 14-Jun-2004 rwatson <rwatson@FreeBSD.org> Remove unneeded '-' from comment header; this comment contains only
English text paragraphs that shouldn't have problems when run through
f2c0db1521299639d6b16b7d477a777fbd81ada6 14-Jun-2004 rwatson <rwatson@FreeBSD.org> The socket field so_state is used to hold a variety of socket related
flags relating to several aspects of socket functionality. This change
breaks out several bits relating to send and receive operation into a
new per-socket buffer field, sb_state, in order to facilitate locking.
This is required because, in order to provide more granular locking of
sockets, different state fields have different locking properties. The
following fields are moved to sb_state:

SS_RCVATMARK (so_state)

Rename respectively to:

SBS_CANTRCVMORE (so_rcv.sb_state)
SBS_CANTSENDMORE (so_snd.sb_state)
SBS_RCVATMARK (so_rcv.sb_state)

This facilitates locking by isolating fields to be located with other
identically locked fields, and permits greater granularity in socket
locking by avoiding storing fields with different locking semantics in
the same short (avoiding locking conflicts). In the future, we may
wish to coallesce sb_state and sb_flags; for the time being I leave
them separate and there is no additional memory overhead due to the
packing/alignment of shorts in the socket buffer structure.
f1bc833e9552e6874a5343bfd4a0b2999a185b42 13-Jun-2004 rwatson <rwatson@FreeBSD.org> Socket MAC labels so_label and so_peerlabel are now protected by

- Hold socket lock over calls to MAC entry points reading or
manipulating socket labels.

- Assert socket lock in MAC entry point implementations.

- When externalizing the socket label, first make a thread-local
copy while holding the socket lock, then release the socket lock
to externalize to userspace.
56bd265f9d195c47ad6a39c09d3d5fc3287e3ea7 12-Jun-2004 rwatson <rwatson@FreeBSD.org> Move #ifdef _KERNEL higher in socketvar.h to cover various socket
buffer related macros.
82295697cd4bae93852c3a10a939f20227018fbd 12-Jun-2004 rwatson <rwatson@FreeBSD.org> Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol

- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
dc268bcf6a44b920628b2aa0d735b2f3d95e2958 12-Jun-2004 rwatson <rwatson@FreeBSD.org> Whitespace-only restyling of socket reference count macros.
7bfe3e80fc4fa11f091353498f3ce7475f7beec8 12-Jun-2004 rwatson <rwatson@FreeBSD.org> Introduce a mutex into struct sockbuf, sb_mtx, which will be used to
protect fields in the socket buffer. Add accessor macros to use the
mutex (SOCKBUF_*()). Initialize the mutex in soalloc(), and destroy
it in sodealloc(). Add addition, add SOCK_*() access macros which
will protect most remaining fields in the socket; for the time being,
use the receive socket buffer mutex to implement socket level locking
to reduce memory overhead.

Submitted by: sam
Sponosored by: FreeBSD Foundation
Obtained from: BSD/OS
d94a5b2b916d48d866be79a8a0a43cf7a8087a51 11-Jun-2004 rwatson <rwatson@FreeBSD.org> Use tabs instead of spaces between #define and macro name; a merge
mistake as they are in rwatson_netperf.
87449e4f90b8aa5016bd4a60334bb762a0e456d9 04-Jun-2004 rwatson <rwatson@FreeBSD.org> Mark sun_noname as const since it's immutable. Update definitions
of functions that potentially accept &sun_noname (sbappendaddr(),
et al) to accept a const sockaddr pointer.
576b26bafdee61ede4fafa63fd78e21930f5bc2e 02-Jun-2004 rwatson <rwatson@FreeBSD.org> Integrate accept locking from rwatson_netperf, introducing a new
global mutex, accept_mtx, which serializes access to the following
fields across all sockets:

so_qlen so_incqlen so_qstate
so_comp so_incomp so_list

While providing only coarse granularity, this approach avoids lock
order issues between sockets by avoiding ownership of the fields
by a specific socket and its per-socket mutexes.

While here, rewrite soclose(), sofree(), soaccept(), and
sonewconn() to add assertions, close additional races and address
lock order concerns. In particular:

- Reorganize the optimistic concurrency behavior in accept1() to
always allocate a file descriptor with falloc() so that if we do
find a socket, we don't have to encounter the "Oh, there wasn't
a socket" race that can occur if falloc() sleeps in the current
code, which broke inbound accept() ordering, not to mention
requiring backing out socket state changes in a way that raced
with the protocol level. We may want to add a lockless read of
the queue state if polling of empty queues proves to be important
to optimize.

- In accept1(), soref() the socket while holding the accept lock
so that the socket cannot be free'd in a race with the protocol
layer. Likewise in netgraph equivilents of the accept1() code.

- In sonewconn(), loop waiting for the queue to be small enough to
insert our new socket once we've committed to inserting it, or
races can occur that cause the incomplete socket queue to
overfill. In the previously implementation, it was sufficient
to simply tested once since calling soabort() didn't release
synchronization permitting another thread to insert a socket as
we discard a previous one.

- In soclose()/sofree()/et al, it is the responsibility of the
caller to remove a socket from the incomplete connection queue
before calling soabort(), which prevents soabort() from having
to walk into the accept socket to release the socket from its
queue, and avoids races when releasing the accept mutex to enter
soabort(), permitting soabort() to avoid lock ordering issues
with the caller.

- Generally cluster accept queue related operations together
throughout these functions in order to facilitate locking.

Annotate new locking in socketvar.h.
87f869d1e968df0d8fcf5fc474d092d13dc4f70d 01-Jun-2004 rwatson <rwatson@FreeBSD.org> Replace current locking comments for struct socket/struct sockbuf
with new ones. Annotate constant-after-creation fields as such. The
comments describe a number of locks that are not yet merged.
bddadcf71a191234c652f1a57c52259d99eac58d 01-Jun-2004 rwatson <rwatson@FreeBSD.org> The SS_COMP and SS_INCOMP flags in the so_state field indicate whether
the socket is on an accept queue of a listen socket. This change
renames the flags to SQ_COMP and SQ_INCOMP, and moves them to a new
state field on the socket, so_qstate, as the locking for these flags
is substantially different for the locking on the remainder of the
flags in so_state.
9e1925f5605eb84767aaedbc35ab643a8f9f90a5 07-Apr-2004 imp <imp@FreeBSD.org> Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
b0b5f961bd295aee20ca2584b96ebf00e9cb92f3 01-Mar-2004 rwatson <rwatson@FreeBSD.org> Rename dup_sockaddr() to sodupsockaddr() for consistency with other
functions in kern_socket.c.

Rename the "canwait" field to "mflags" and pass M_WAITOK and M_NOWAIT
in from the caller context rather than "1" or "0".

Correct mflags pass into mac_init_socket() from previous commit to not
include M_ZERO.

Submitted by: sam
94d29f742676d86484a59179b2c75fc46b7767ea 29-Feb-2004 rwatson <rwatson@FreeBSD.org> Modify soalloc() API so that it accepts a malloc flags argument rather
than a "waitok" argument. Callers now passing M_WAITOK or M_NOWAIT
rather than 0 or 1. This simplifies the soalloc() logic, and also
makes the waiting behavior of soalloc() more clear in the calling

Submitted by: sam
74614e7f63a3fda41bd634356405cc328a9e2284 16-Nov-2003 alc <alc@FreeBSD.org> - Modify alpha's sf_buf implementation to use the direct virtual-to-
physical mapping.
- Move the sf_buf API to its own header file; make struct sf_buf's
definition machine dependent. In this commit, we remove an
unnecessary field from struct sf_buf on the alpha, amd64, and ia64.
Ultimately, we may eliminate struct sf_buf on those architecures
except as an opaque pointer that references a vm page.
77ed6e2d1cbbf9a46dd5ae6d089eeb45ab81fbcb 12-Nov-2003 rwatson <rwatson@FreeBSD.org> Modify the MAC Framework so that instead of embedding a (struct label)
in various kernel objects to represent security data, we embed a
(struct label *) pointer, which now references labels allocated using
a UMA zone (mac_label.c). This allows the size and shape of struct
label to be varied without changing the size and shape of these kernel
objects, which become part of the frozen ABI with 5-STABLE. This opens
the door for boot-time selection of the number of label slots, and hence
changes to the bound on the number of simultaneous labeled policies
at boot-time instead of compile-time. This also makes it easier to
embed label references in new objects as required for locking/caching
with fine-grained network stack locking, such as inpcb structures.

This change also moves us further in the direction of hiding the
structure of kernel objects from MAC policy modules, not to mention
dramatically reducing the number of '&' symbols appearing in both the
MAC Framework and MAC policy modules, and improving readability.

While this results in minimal performance change with MAC enabled, it
will observably shrink the size of a number of critical kernel data
structures for the !MAC case, and should have a small (but measurable)
performance benefit (i.e., struct vnode, struct socket) do to memory
conservation and reduced cost of zeroing memory.

NOTE: Users of MAC must recompile their kernel and all MAC modules as a
result of this change. Because this is an API change, third party
MAC modules will also need to be updated to make less use of the '&'

Suggestions from: bmilekic
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
39ba2e1c90c5d5fc0d01568719c540b15001528d 28-Oct-2003 sam <sam@FreeBSD.org> speedup stream socket recv handling by tracking the tail of
the mbuf chain instead of walking the list for each append

Submitted by: ps/jayanth
Obtained from: netbsd (jason thorpe)
fb82c18f66d416fe2da9e9472fea685cd3fb8d87 05-Aug-2003 hsu <hsu@FreeBSD.org> Make the second argument to sooptcopyout() constant in order to
simplify the upcoming PIM patches.

Submitted by: Pavlin Radoslavov <pavlin@icir.org>
ab57004058eee4b698f7d9cece99440ad9b49bbb 17-Jul-2003 robert <robert@FreeBSD.org> To avoid a kernel panic provoked by a NULL pointer dereference,
do not clear the `sb_sel' member of the sockbuf structure
while invalidating the receive sockbuf in sorflush(), called
from soshutdown().

The panic was reproduceable from user land by attaching a knote
with EVFILT_READ filters to a socket, disabling further reads
from it using shutdown(2), and then closing it. knote_remove()
was called to remove all knotes from the socket file descriptor
by detaching each using its associated filterops' detach call-
back function, sordetach() in this case, which tried to remove
itself from the invalidated sockbuf's klist (sb_sel.si_note).

PR: kern/54331
6f59be774d7b820d9fc3ec62a551dbb881c00d04 29-Mar-2003 alc <alc@FreeBSD.org> Pass the vm_page's address to sf_buf_alloc(); map the vm_page as part
of sf_buf_alloc() instead of expecting sf_buf_alloc()'s caller to map it.

The ultimate reason for this change is to enable two optimizations:
(1) that there never be more than one sf_buf mapping a vm_page at a time
and (2) 64-bit architectures can transparently use their 1-1 virtual
to physical mapping (e.g., "K0SEG") avoiding the overhead of pmap_qenter()
and pmap_qremove().
2756b6c9641bd9899a346582c191310de25cccc5 02-Mar-2003 des <des@FreeBSD.org> More low-hanging fruit: kill caddr_t in calls to wakeup(9) / [mt]sleep(9).
cf874b345d0f766fb64cf4737e1c85ccc78d2bee 19-Feb-2003 imp <imp@FreeBSD.org> Back out M_* changes, per decision of the TRB.

Approved by: trb
bf8e8a6e8f0bd9165109f0a258730dd242299815 21-Jan-2003 alfred <alfred@FreeBSD.org> Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
1a26cd3da49a6f412ad7131f6a447fe3d7b75b5a 11-Jan-2003 tjr <tjr@FreeBSD.org> Don't count mbufs with m_type == MT_HEADER or MT_OOBDATA as control data
in sballoc(), sbcompress(), sbdrop() and sbfree(). Fixes fstat() st_size
reporting and kevent() EVFILT_READ on TCP sockets.
26984a400376cf63b49953e6661acbf02e70c62d 23-Dec-2002 phk <phk@FreeBSD.org> Move the declaration of the socket fileops from socketvar.h to file.h.
This allows us to use the new typedefs and removes the needs for a number
of forward struct declarations in socketvar.h
fedfaf9f8b1045aeb82f674a281267ce7853224b 23-Dec-2002 phk <phk@FreeBSD.org> s/sokqfilter/soo_kqfilter/ for consistency with the naming of all
other socket/file operations.
328836c626572cbb92db3e480fbfe6f839592623 02-Nov-2002 alc <alc@FreeBSD.org> Revert the change in revision 1.77 of kern/uipc_socket2.c. It is causing
a panic because the socket's state isn't as expected by sofree().

Discussed with: dillon, fenner
c70af01e8027513cb16aa9d60244c105336fe3e6 01-Nov-2002 kbyanc <kbyanc@FreeBSD.org> Track the number of non-data chararacters stored in socket buffers so that
the data value returned by kevent()'s EVFILT_READ filter on non-TCP
sockets accurately reflects the amount of data that can be read from the
sockets by applications.

PR: 30634
Reviewed by: -net, -arch
Sponsored by: NTT Multimedia Communications Labs
MFC after: 2 weeks
3246fbf45f089a96288563f2d5071bfbde5f99df 17-Aug-2002 rwatson <rwatson@FreeBSD.org> In continuation of early fileop credential changes, modify fo_ioctl() to
accept an 'active_cred' argument reflecting the credential of the thread
initiating the ioctl operation.

- Change fo_ioctl() to accept active_cred; change consumers of the
fo_ioctl() interface to generally pass active_cred from td->td_ucred.
- In fifofs, initialize filetmp.f_cred to ap->a_cred so that the
invocations of soo_ioctl() are provided access to the calling f_cred.
Pass ap->a_td->td_ucred as the active_cred, but note that this is
required because we don't yet distinguish file_cred and active_cred
in invoking VOP's.
- Update kqueue_ioctl() for its new argument.
- Update pipe_ioctl() for its new argument, pass active_cred rather
than td_ucred to MAC for authorization.
- Update soo_ioctl() for its new argument.
- Update vn_ioctl() for its new argument, use active_cred rather than
td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
2b82cd24f10c789221e2b4edc59b96a7733b9e71 16-Aug-2002 rwatson <rwatson@FreeBSD.org> Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): moidfy arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintian current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
44404e4547aee87b255582d4e6395551869e29b1 15-Aug-2002 rwatson <rwatson@FreeBSD.org> In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
20c1b55c0bf8d21af127bf49d9fa72117edaf7d3 14-Aug-2002 rwatson <rwatson@FreeBSD.org> Move to a nested include of _label.h instead of mac.h in sys/sys/*.h
(Most of the places where mac.h was recursively included from another
kernel header file. net/netinet to follow.)

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
Suggested by: bde
6dce2e7eff44abde5af8dc5ee18ca69ee0a491b1 13-Aug-2002 dg <dg@FreeBSD.org> Moved sf_buf_alloc and sf_buf_free function declarations to sys/socketvar.h
so that they can be seen by external callers.
e623365f830c6acf0ff0579f0ebf4cb7ce9ffc40 30-Jul-2002 rwatson <rwatson@FreeBSD.org> Introduce support for Mandatory Access Control and extensible kernel
access control.

Label socket IPC objects, permitting security features to be maintained
at the granularity of the socket. Two labels are stored for each
socket: the label of the socket itself, and a cached peer label
permitting interogation of the remote endpoint. Since socket locking
is not yet present in the base tree, these objects are not locked,
but are assumed to follow the same semantics as other modifiable
entries in the socket structure.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
ba178e4b25fd9285d77312727825985585dba80c 27-Jul-2002 rwatson <rwatson@FreeBSD.org> Remote socheckproc(), which was removed when p_can*() was introduced
ages ago. The prototype was missed.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
86b3836232069126cbad3fccd52778f5adc4a676 24-Jul-2002 jdp <jdp@FreeBSD.org> Widen struct sockbuf's sb_timeo member to int from short. With
non-default but reasonable values of hz this member overflowed,
breaking NFS over UDP.

Also, as long as I'm plowing up struct sockbuf ... Change certain
members from u_long/long to u_int/int in order to reduce wasted
space on 64-bit machines. This change was requested by Andrew

Netstat and systat need to be rebuilt. I am incrementing
__FreeBSD_version in case any ports need to change.
d1cbf6a1d1f96288005329dfaca2aaffbd884d81 29-Jun-2002 alfred <alfred@FreeBSD.org> More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t.
c5ef3e2f452f05ba20a0e236aee5d1877f9cec35 28-Jun-2002 alfred <alfred@FreeBSD.org> change struct socket -> so_pcb from caddr_t to void *.
0d3a835f3f94caa05448371062a46d2485185ee1 26-Jun-2002 ken <ken@FreeBSD.org> At long last, commit the zero copy sockets code.

MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes.

ti.4: Update the ti(4) man page to include information on the
and also include information about the new character
device interface and the associated ioctls.

man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated

jumbo.9: New man page describing the jumbo buffer allocator
interface and operation.

zero_copy.9: New man page describing the general characteristics of
the zero copy send and receive code, and what an
application author should do to take advantage of the
zero copy functionality.


conf/files: Add uipc_jumbo.c and uipc_cow.c.

conf/options: Add the 5 options mentioned above.

kern_subr.c: Receive side zero copy implementation. This takes
"disposable" pages attached to an mbuf, gives them to
a user process, and then recycles the user's page.
This is only active when ZERO_COPY_SOCKETS is turned on
and the kern.ipc.zero_copy.receive sysctl variable is
set to 1.

uipc_cow.c: Send side zero copy functions. Takes a page written
by the user and maps it copy on write and assigns it
kernel virtual address space. Removes copy on write
mapping once the buffer has been freed by the network

uipc_jumbo.c: Jumbo disposable page allocator code. This allocates
(optionally) disposable pages for network drivers that
want to give the user the option of doing zero copy

uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are
enabled if ZERO_COPY_SOCKETS is turned on.

Add zero copy send support to sosend() -- pages get
mapped into the kernel instead of getting copied if
they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
can be used elsewhere. (uipc_cow.c)

if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
calling malloc() with M_WAITOK. Return an error if
the M_NOWAIT malloc fails.

The ti(4) driver and the wi(4) driver, at least, call
this with a mutex held. This causes witness warnings
for 'ifconfig -a' with a wi(4) or ti(4) board in the
system. (I've only verified for ti(4)).

ip_output.c: Fragment large datagrams so that each segment contains
a multiple of PAGE_SIZE amount of data plus headers.
This allows the receiver to potentially do page
flipping on receives.

if_ti.c: Add zero copy receive support to the ti(4) driver. If
TI_PRIVATE_JUMBOS is not defined, it now uses the
jumbo(9) buffer allocator for jumbo receive buffers.

Add a new character device interface for the ti(4)
driver for the new debugging interface. This allows
(a patched version of) gdb to talk to the Tigon board
and debug the firmware. There are also a few additional
debugging ioctls available through this interface.

Add header splitting support to the ti(4) driver.

Tweak some of the default interrupt coalescing
parameters to more useful defaults.

Add hooks for supporting transmit flow control, but
leave it turned off with a comment describing why it
is turned off.

if_tireg.h: Change the firmware rev to 12.4.11, since we're really
at 12.4.11 plus fixes from 12.4.13.

Add defines needed for debugging.

Remove the ti_stats structure, it is now defined in

ti_fw.h: 12.4.11 firmware.

ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13,
and my header splitting patches. Revision 12.4.13
doesn't handle 10/100 negotiation properly. (This
firmware is the same as what was in the tree previously,
with the addition of header splitting support.)

sys/jumbo.h: Jumbo buffer allocator interface.

sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to
indicate that the payload buffer can be thrown away /
flipped to a userland process.

socketvar.h: Add prototype for socow_setup.

tiio.h: ioctl interface to the character portion of the ti(4)
driver, plus associated structure/type definitions.

uio.h: Change prototype for uiomoveco() so that we'll know
whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c: In vm_fault(), check to see whether we need to do a page
based copy on write fault.

vm_object.c: Add a new function, vm_object_allocate_wait(). This
does the same thing that vm_object allocate does, except
that it gives the caller the opportunity to specify whether
it should wait on the uma_zalloc() of the object structre.

This allows vm objects to be allocated while holding a
mutex. (Without generating WITNESS warnings.)

vm_object_allocate() is implemented as a call to
vm_object_allocate_wait() with the malloc flag set to

vm_object.h: Add prototype for vm_object_allocate_wait().

vm_page.c: Add page-based copy on write setup, clear and fault

vm_page.h: Add page based COW function prototypes and variable in
the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
cb3347e926b2f55441b5b402a984b3cf1f946ff7 18-Jun-2002 tanimura <tanimura@FreeBSD.org> Remove so*_locked(), which were backed out by mistake.
e6fa9b9e922913444c2e6b2b58bf3de5eaed868d 31-May-2002 tanimura <tanimura@FreeBSD.org> Back out my lats commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu
92d8381dd544a8237b3fd68c4e7fce9bd0903fb2 20-May-2002 tanimura <tanimura@FreeBSD.org> Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each members in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred
7abfc9c5e6632b9279d66ad677080588e1719a88 02-May-2002 alfred <alfred@FreeBSD.org> Cleanup, quote:

This leaves some vestiges of the old locking, including style
bugs in it. I've only noticed anachronisms in socketvar.h so far
(I've merged net* but not kern or all of sys). The patch also
has old fixes for style bugs in accf stuff and namespace pollution
in uma... The largest style bugs are line continued backslashes
in column 80 and (these are fixed), and starting the do-while
code for the new macros in column 40, which is quite unlike the
usual indentation (see sys/queue.h) and not even like the indentation
for the old macros (column 32) (this is not fixed).

Submitted by: bde
798c53d495a4eb1c10dc65a1d2ca87e2cb12f8df 01-May-2002 alfred <alfred@FreeBSD.org> Redo the sigio locking.

Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
89ec521d918cc9582731bc57c7d9d39010f3ccba 30-Apr-2002 tanimura <tanimura@FreeBSD.org> Revert the change of #includes in sys/filedesc.h and sys/socketvar.h.

Requested by: bde

Since locking sigio_lock is usually followed by calling pgsigio(),
move the declaration of sigio_lock and the definitions of SIGIO_*() to

While I am here, sort include files alphabetically, where possible.
dbb4756491715a06ce4578841f6eba43fc62fa70 27-Apr-2002 tanimura <tanimura@FreeBSD.org> Add a global sx sigio_lock to protect the pointer to the sigio object
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().

sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.
b4055530fc4c41c58a6ac51b6b5dece2bbe25f1b 24-Apr-2002 silby <silby@FreeBSD.org> Remove sodropablereq - this function hasn't been used since the
syncache went in.

MFC after: 3 days
d9992faaf2001cc6dca237e00491cc57a7426909 08-Apr-2002 hsu <hsu@FreeBSD.org> There's only one socket zone so we don't need to remember it
in every socket structure.
8362010d3337dbf98da10017f6603904e740724b 26-Mar-2002 bde <bde@FreeBSD.org> Removed some namespace pollution (unnecessary nested includes).
d27d5e3b44ff619a0151aa86993697a211e61bc7 23-Mar-2002 bde <bde@FreeBSD.org> Fixed some style bugs in the removal of __P(()). The main ones were
not removing tabs before "__P((", and not outdenting continuation lines
to preserve non-KNF lining up of code with parentheses. Switch to KNF
formatting and/or rewrap the whole prototype in some cases.
803cb2a2ba1f8847bbb7ea2f1c36cfc14a645091 20-Mar-2002 jeff <jeff@FreeBSD.org> Backout part of my previous commit; I was wrong about vm_zone's handling of
limits on zones w/o objects.
318cbeeecf54d416eb936f4bb65c00b18aab686b 20-Mar-2002 jeff <jeff@FreeBSD.org> Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.
3b2d03b60a11ce28e58a87212bcccedd306f2c81 19-Mar-2002 alfred <alfred@FreeBSD.org> Remove __P
2923687da3c046deea227e675d5af075b9fa52d4 19-Mar-2002 jeff <jeff@FreeBSD.org> This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@
c4988e25d265bba2c63409a8c8b8708c13d8525e 13-Jan-2002 alfred <alfred@FreeBSD.org> Add parens around macro args.

Forgotten by: dillon
4b04cada9319ee4ce88335eead3bb690bfb630bc 09-Jan-2002 alfred <alfred@FreeBSD.org> holdsock is gone, remove the prototype
5eea21cccab61c0a7e31c0025f3f57feeb99870a 31-Dec-2001 rwatson <rwatson@FreeBSD.org> o Make the credential used by socreate() an explicit argument to
socreate(), rather than getting it implicitly from the thread

o Make NFS cache the credential provided at mount-time, and use
the cached credential (nfsmount->nm_cred) when making calls to
socreate() on initially connecting, or reconnecting the socket.

This fixes bugs involving NFS over TCP and ipfw uid/gid rules, as well
as bugs involving NFS and mandatory access control implementations.

Reviewed by: freebsd-arch
91e55e40ce426978db952fe0bc2bf5eaaadffcb1 13-Dec-2001 green <green@FreeBSD.org> Remove stale prototype for sonewconn3().
86ed17d675cb503ddb3f71f8b6f7c3af530bb29a 17-Nov-2001 dillon <dillon@FreeBSD.org> Give struct socket structures a ref counting interface similar to
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.
eea321b9ba4d9ca3dac5d200ac87a3c2574a470a 25-Oct-2001 rwatson <rwatson@FreeBSD.org> o Remove extern showallsockets, defunct as of the change to
kern.security.seeotheruids_permitted. This was missed in the
commit that made this change elsewhere.
38383190d52e794a70d2b71ee33fa321e5109e7a 05-Oct-2001 ps <ps@FreeBSD.org> Only allow users to see their own socket connections if
kern.ipc.showallsockets is set to 0.

Submitted by: billf (with modifications by me)
Inspired by: Dave McKay (aka pm aka Packet Magnet)
Reviewed by: peter
MFC after: 2 weeks
1d2d921ea8dc1db60c8ed47fe6c7ce2048325f76 13-Sep-2001 obrien <obrien@FreeBSD.org> Re-apply rev 1.178 -- style(9) the structure definitions.
I have to wonder how many other changes were lost in the KSE mildstone 2 merge.
5596676e6c6c1e81e899cd0531f9b1c28a292669 12-Sep-2001 julian <julian@FreeBSD.org> KSE Milestone 2
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha
40da6b02e6f93a023ffa59f81d5dbb34506aac5c 05-Sep-2001 obrien <obrien@FreeBSD.org> style(9) the structure definitions.
2e65ae5305ced150bd070ec843c291d0201137f6 31-Aug-2001 jlemon <jlemon@FreeBSD.org> Whitespace change.
2d6e61953672bd0e974515b5bee2c5317fc757c1 28-Jun-2001 jlemon <jlemon@FreeBSD.org> Correct comment: so_q -> so_comp, so_q0 -> so_incomp.

Submitted by: Adagio Vangogh <adagio_v@pacbell.net>
11781a7431fab609cd00058a63ac09ccddb16854 15-Feb-2001 jlemon <jlemon@FreeBSD.org> Extend kqueue down to the device layer.

Backwards compatible approach suggested by: peter
70c88bb8dad8043bc6de6c2381c0f3d55e803379 09-Jan-2001 wollman <wollman@FreeBSD.org> select() DKI is now in <sys/selinfo.h>.
5de479435ac7f4124fce758194096f1de151aea4 31-Dec-2000 phk <phk@FreeBSD.org> Use macro API to <sys/queue.h>
15a44d16ca10bf52da55462560c345940cd19b38 18-Nov-2000 dillon <dillon@FreeBSD.org> This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>
baac93215b7e569a555cb4ba9d3c4cf86f39f708 06-Sep-2000 alfred <alfred@FreeBSD.org> Accept filter maintainance

Update copyrights.

Introduce a new sysctl node:

Although acceptfilters need refcounting to be properly (safely) unloaded
as a temporary hack allow them to be unloaded if the sysctl
net.inet.accf.unloadable is set, this is really for developers who want
to work on thier own filters.

A near complete re-write of the accf_http filter:
1) Parse check if the request is HTTP/1.0 or HTTP/1.1 if not dump
to the application.
Because of the performance implications of this there is a sysctl
'net.inet.accf.http.parsehttpversion' that when set to non-zero
parses the HTTP version.
The default is to parse the version.
2) Check if a socket has filled and dump to the listener
3) optimize the way that mbuf boundries are handled using some voodoo
4) even though you'd expect accept filters to only be used on TCP
connections that don't use m_nextpkt I've fixed the accept filter
for socket connections that use this.

This rewrite of accf_http should allow someone to use them and maintain
full HTTP compliance as long as net.inet.accf.http.parsehttpversion is
df0e25bf6c3619217f1f2c8b5a35a6e706f2a0b4 19-Aug-2000 dwmalone <dwmalone@FreeBSD.org> Replace the mbuf external reference counting code with something
that should be better.

The old code counted references to mbuf clusters by using the offset
of the cluster from the start of memory allocated for mbufs and
clusters as an index into an array of chars, which did the reference
counting. If the external storage was not a cluster then reference
counting had to be done by the code using that external storage.

NetBSD's system of linked lists of mbufs was cosidered, but Alfred
felt it would have locking issues when the kernel was made more
SMP friendly.

The system implimented uses a pool of unions to track external
storage. The union contains an int for counting the references and
a pointer for forming a free list. The reference counts are
incremented and decremented atomically and so should be SMP friendly.
This system can track reference counts for any sort of external

Access to the reference counting stuff is now through macros defined
in mbuf.h, so it should be easier to make changes to the system in
the future.

The possibility of storing the reference count in one of the
referencing mbufs was considered, but was rejected 'cos it would
often leave extra mbufs allocated. Storing the reference count in
the cluster was also considered, but because the external storage
may not be a cluster this isn't an option.

The size of the pool of reference counters is available in the
stats provided by "netstat -m".

PR: 19866
Submitted by: Bosko Milekic <bmilekic@dsuper.net>
Reviewed by: alfred (glanced at by others on -net)
e3e72a583b17917255252eaad0aded3476d3652b 20-Jun-2000 alfred <alfred@FreeBSD.org> return of the accept filter part II

accept filters are now loadable as well as able to be compiled into
the kernel.

two accept filters are provided, one that returns sockets when data
arrives the other when an http request is completed (doesn't work
with 0.9 requests)

Reviewed by: jmg
961b97d43458f3c57241940cabebb3bedf7e4c00 26-May-2000 jake <jake@FreeBSD.org> Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others
d93fbc99166053b75c2eeb69b5cb603cfaf79ec0 23-May-2000 jake <jake@FreeBSD.org> Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd
c41c876463ee8c302f06554537e0fb22a3fcdca4 16-Apr-2000 jlemon <jlemon@FreeBSD.org> Introduce kqueue() and kevent(), a kernel event notification facility.
b42951578188c5aab5c9f8cbcde4a743f8092cdc 02-Apr-2000 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'ALSA'.
241bd93929c50045d997159647c990ce30a09b38 14-Jan-2000 jasone <jasone@FreeBSD.org> Add aio_waitcomplete(). Make aio work correctly for socket descriptors.
Make gratuitous style(9) fixes (me, not the submitter) to make the aio
code more readable.

PR: kern/12053
Submitted by: Chris Sedore <cmsedore@maxwell.syr.edu>
15b9bcb121e1f3735a2c98a11afdb52a03301d7e 29-Dec-1999 peter <peter@FreeBSD.org> Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistant with the other
BSD's who made this change quite some time ago. More commits to come.
cad2014b2749528351ec5180e88a5929efebbfc4 22-Nov-1999 shin <shin@FreeBSD.org> KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assigne IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
be8eba35406cc6d689af2b06690b08bea32aec04 08-Nov-1999 peter <peter@FreeBSD.org> Update socket file type for fo_stat(). soo_stat() becomes a fileops
switch entry point rather than being used externally with knowledge of the
internals of the DTYPE_SOCKET f_data contents.
f980526bf65a754175b095da7e9c65301bc51ef6 09-Oct-1999 green <green@FreeBSD.org> Implement RLIMIT_SBSIZE in the kernel. This is a per-uid sockbuf total
usage limit.
140cb4ff83b0061eeba0756f708f3f7c117e76e5 19-Sep-1999 green <green@FreeBSD.org> This is what was "fdfix2.patch," a fix for fd sharing. It's pretty
far-reaching in fd-land, so you'll want to consult the code for
changes. The biggest change is that now, you don't use
fp->f_ops->fo_foo(fp, bar)
but instead
fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided. Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.

Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)

Reviewed by: peter
4395e552e2eb08e6dc53ccf82bfcc4040c59bda6 19-Sep-1999 green <green@FreeBSD.org> Change so_cred's type to a ucred, not a pcred. THis makes more sense, actually.
Make a sonewconn3() which takes an extra argument (proc) so new sockets created
with sonewconn() from a user's system call get the correct credentials, not
just the parent's credentials.
3b842d34e82312a8004a7ecd65ccdb837ef72ac1 28-Aug-1999 peter <peter@FreeBSD.org> $Id$ -> $FreeBSD$
4c7609f41f8c404601cc2d10cfc7b5a379050636 17-Jun-1999 green <green@FreeBSD.org> Reviewed by: the cast of thousands

This is the change to struct sockets that gets rid of so_uid and replaces
it with a much more useful struct pcred *so_cred. This is here to be able
to do socket-level credential checks (i.e. IPFW uid/gid support, to be added
to HEAD soon). Along with this comes an update to pidentd which greatly
simplifies the code necessary to get a uid from a socket. Soon to come:
a sysctl() interface to finding individual sockets' credentials.
f13dd5fa6d2de09245e582c2792228f2653fd2a8 04-Apr-1999 dt <dt@FreeBSD.org> Add standard padding argument to pread and pwrite syscall. That should make them
NetBSD compatible.

Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET mean that
the offset is set in the struct uio).

Factor out some common code from read/pread/write/pwrite syscalls.
a116fafdf6d8a5ed6064da8cadc31d375c502eab 01-Feb-1999 newton <newton@FreeBSD.org> Moved prototypes for soo_{read,write,close} into socketvar.h where they

Suggested by: bde
3981a0cad876cb3ebb719faf6988494522fe5a7e 31-Jan-1999 bde <bde@FreeBSD.org> Fixed smashed tabs and inconstent comment style in previous commit.
698a7e12d0fbb51c6126b8d49e032b7fd2f09382 30-Jan-1999 newton <newton@FreeBSD.org> Changed struct socket to include a new field (at the end, so as not
to break existing software) acting as a pointer to emulator-specific
state data that some emulators may (or may not) need to maintain about
a socket.

Used by the svr4 module as a place for maintaining STREAMS emulation state.

Discussed with: Mike Smith, Garrett Wollman back in Sept 98
56bddd51c28ba7ae74608ff9e0278bf9b2ac4ad4 25-Jan-1999 fenner <fenner@FreeBSD.org> Port NetBSD's 19990120-accept bug fix. This works around the race condition
where select(2) can return that a listening socket has a connected socket
queued, the connection is broken, and the user calls accept(2), which then
blocks because there are no connections queued.

Reviewed by: wollman
Obtained from: NetBSD
d869e3568091433f9d3beb11c9f29c4fdd08f39a 11-Nov-1998 truckman <truckman@FreeBSD.org> I got another batch of suggestions for cosmetic changes from bde.
de184682fa22833c7b18a96a136bc031ae786434 11-Nov-1998 truckman <truckman@FreeBSD.org> Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR: kern/7899
Reviewed by: bde, elvind
b178f74f12a0446656640ae873e0bc71057f5de3 05-Nov-1998 dg <dg@FreeBSD.org> Implemented zero-copy TCP/IP extensions via sendfile(2) - send a
file to a stream socket. sendfile(2) is similar to implementations in
HP-UX, Linux, and other systems, but the API is more extensive and
addresses many of the complaints that the Apache Group and others have
had with those other implementations. Thanks to Marc Slemko of the
Apache Group for helping me work out the best API for this.
Anyway, this has the "net" result of speeding up sends of files over
TCP/IP sockets by about 10X (that is to say, uses 1/10th of the CPU
cycles) when compared to a traditional read/write loop.
a76fb5eefabdc9418c911bf0b61768d533c15cbd 23-Aug-1998 wollman <wollman@FreeBSD.org> Yow! Completely change the way socket options are handled, eliminating
another specialized mbuf type in the process. Also clean up some
of the cruft surrounding IPFW, multicast routing, RSVP, and other
ill-explored corners.
1d5f38ac2264102518a09c66a7b285f57e81e67e 07-Jun-1998 dfr <dfr@FreeBSD.org> This commit fixes various 64bit portability problems required for
FreeBSD/alpha. The most significant item is to change the command
argument to ioctl functions from int to u_long. This change brings us
inline with various other BSD versions. Driver writers may like to
use (__FreeBSD_version == 300003) to detect this change.

The prototype FreeBSD/alpha machdep will follow in a couple of days
55de15ff4348a19da5300c33637af428c596e934 31-May-1998 peter <peter@FreeBSD.org> Have the sorwakeup and sowwakeup check the upcall flags.

Obtained from: NetBSD
bbc4497adab2d7702eab9a609897b2e5f672289e 15-May-1998 wollman <wollman@FreeBSD.org> Convert socket structures to be type-stable and add a version number.

Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for infomation about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).
406aea3e09766ad7004a2dd800fe137e6769b856 01-Mar-1998 guido <guido@FreeBSD.org> Make sure that you can only bind a more specific address when it is
done by the same uid.
Obtained from: OpenBSD
2adc6309fe27cdb87a4f07ae1e75c44c5da4ceed 01-Feb-1998 bde <bde@FreeBSD.org> Forward declare more structs that are used in prototypes here - don't
depend on <sys/types.h> forward declaring common ones.
0506343883d62f6649f7bbaf1a436133cef6261d 11-Jan-1998 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'jb'.
7c6e96080c4fb49bf912942804477d202a53396c 10-Jan-1998 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'JB'.
7fb46d49218076b8a6b7f0ddeb7afd3d18187df8 21-Dec-1997 bde <bde@FreeBSD.org> Moved some declarations from <sys/socket.h> to the correct places, and
fixed everything that depended on them being misplaced.
a29a9749e55aaa01b56952729af076163aa661ed 14-Sep-1997 peter <peter@FreeBSD.org> Update interfaces for poll()
4542c1cf5d7077caf33d6d9468f5e647cd9d19e5 16-Aug-1997 wollman <wollman@FreeBSD.org> Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to not malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.
38192cf9a58e44202d926c75756a54dfbf5f295a 19-Jul-1997 fenner <fenner@FreeBSD.org> Remove sonewconn() macro kludge, introduced in 4.3-Reno to catch argument
mismatches. Prototypes do a much better job these days.

Noticed by: bde
6afbf203bd570424ecf3f9d9d9ced17f82c81adc 27-Apr-1997 wollman <wollman@FreeBSD.org> The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled

2) Certain protocol entry points are modified to take a process structure,
so they they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.
94b6d727947e1242356988da003ea702d41a97de 22-Feb-1997 peter <peter@FreeBSD.org> Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.
808a36ef658c1810327b5d329469bcf5dad24b28 14-Jan-1997 jkh <jkh@FreeBSD.org> Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
0ae40bc63fdf12cc3cf21682a3c69d5600c5f15c 12-Nov-1996 bde <bde@FreeBSD.org> Added missing prototype for new function sbcreatecontrol().

Should be in 2.2.
b51353f335c040b93b8b1ff67548f54b0583bf05 07-Oct-1996 pst <pst@FreeBSD.org> Increase robustness of FreeBSD against high-rate connection attempt
denial of service attacks.

Reviewed by: bde,wollman,olah
Inspired by: vjs@sgi.com
b3ded284a077f0b9c8fb148958c083eb07762578 01-May-1996 bde <bde@FreeBSD.org> Made this self-sufficent (apart from <sys/types.h>) again. It included
<sys/stat.h> and <sys/filedesc.h> just to get struct tags and depended
on a previous #include for <sys/queue.h>
73a498e93ef77f792f958b4a1ea0d9ad0490888a 11-Mar-1996 peter <peter@FreeBSD.org> Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[new sys/syscallargs.h file, to be "cvs rm"ed]
a30b0a83b7ea8a0a022098b93d8cb4522d0b4cae 11-Mar-1996 dg <dg@FreeBSD.org> Changed socket code to use 4.4BSD queue macros. This includes removing
the obsolete soqinsque and soqremque functions as well as collapsing
so_q0len and so_qlen into a single queue length of unaccepted connections.
Now the queue of unaccepted & complete connections is checked directly
for queued sockets. The new code should be functionally equivilent to
the old while being substantially faster - especially in cases where
large numbers of connections are often queued for accept (e.g. http).
c8955bb47d1a02e6f77fecec085751b4ad0f2346 11-Mar-1996 hsu <hsu@FreeBSD.org> Merge in Lite2: clean up function prototypes.
Did not accept change of second argument to ioctl from int to u_long.
Reviewed by: davidg & bde
5c25078715eec4e8c4bd3113070c61741eda267e 13-Feb-1996 wollman <wollman@FreeBSD.org> Kill XNS.
While we're at it, fix socreate() to take a process argument. (This
was supposed to get committed days ago...)
f3dd75a38d66ed54a0f2660b0a27d177fb33f068 30-Jan-1996 mpp <mpp@FreeBSD.org> Fix a bunch of spelling errors in the comment fields of
a bunch of system include files.
d7acdbc572d46a43b184ec3a1ca01c259b6e66bc 14-Dec-1995 bde <bde@FreeBSD.org> Nuked ambiguous sleep message strings:
old: new:
netcls[] = "netcls" "soclos"
netcon[] = "netcon" "accept", "connec"
netio[] = "netio" "sblock", "sbwait"
63ec2c0ae9b44c5394bae5d6ee7fea5be9659585 14-Dec-1995 phk <phk@FreeBSD.org> A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.
24ce87cc75daff742416402eb15baa8a2445fb87 21-Nov-1995 bde <bde@FreeBSD.org> Completed function declarations and/or added prototypes.
86f1bc4514fdcfd255f37f3218fe234bdc3664fc 05-Nov-1995 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'LINUX'.
c86f0c7a71e7ade3e38b325c186a9cf374e0411e 30-May-1995 rgrimes <rgrimes@FreeBSD.org> Remove trailing whitespace.
2e14d9ebc3d3592c67bdf625af9ebe0dfc386653 14-Mar-1995 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'MATT_THOMAS'.
b1b8768e6a30f06704cfedaeae8f3397ad760a4e 02-Oct-1994 phk <phk@FreeBSD.org> Prototypes, prototypes and even more prototypes. Not quite done yet, but
getting closer all the time.
34cd81d75f398ee455e61969b118639dacbfd7a6 23-Sep-1994 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'MACKERRAS'.
1f5dfa1be240017a7a4d938578b9a01ce1082391 21-Aug-1994 paul <paul@FreeBSD.org> Made them all idempotent.
Reviewed by:
Submitted by:
f9fc827448679cf1d41e56512c34521bf06ce37a 18-Aug-1994 wollman <wollman@FreeBSD.org> Fix up some sloppy coding practices:

- Delete redundant declarations.
- Add -Wredundant-declarations to Makefile.i386 so they don't come back.
- Delete sloppy COMMON-style declarations of uninitialized data in
header files.
- Add a few prototypes.
- Clean up warnings resulting from the above.

NB: ioconf.c will still generate a redundant-declaration warning, which
is unavoidable unless somebody volunteers to make `config' smarter.
e16baf7a5fe7ac1453381d0017ed1dcdeefbc995 07-Aug-1994 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'SUNRPC'.
8d205697aac53476badf354623abd4e1c7bc5aff 02-Aug-1994 dg <dg@FreeBSD.org> Added $Id$
8fb65ce818b3e3c6f165b583b910af24000768a5 24-May-1994 rgrimes <rgrimes@FreeBSD.org> BSD 4.4 Lite Kernel Sources