History log of /freebsd-head/sys/netinet/in_pcb.h
Revision Date Author Comments
738968b552d0ef05231688f9b4eb34746c0102d7 18-May-2020 karels <karels@FreeBSD.org> Allow TCP to reuse local port with different destinations

Previously, tcp_connect() would bind a local port before connecting,
forcing the local port to be unique across all outgoing TCP connections
for the address family. Instead, choose a local port after selecting
the destination and the local address, requiring only that the tuple
is unique and does not match a wildcard binding.

Reviewed by: tuexen (rscheff, rrs previous version)
MFC after: 1 month
Sponsored by: Forcepoint LLC
Differential Revision: https://reviews.freebsd.org/D24781
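
The selection rule described in this commit can be sketched as a small userland model. This is only an illustration, not the kernel code: the function name, the set-based tables, the "*" wildcard marker and the sequential port scan are all assumptions made for the example.

```python
from typing import Optional, Set, Tuple

# (local addr, local port, foreign addr, foreign port)
Conn = Tuple[str, int, str, int]


def pick_local_port(
    connected: Set[Conn],
    bound: Set[Tuple[str, int]],
    laddr: str,
    faddr: str,
    fport: int,
    port_range: range = range(49152, 65536),
) -> Optional[int]:
    """Return the first local port whose full 4-tuple is still free."""
    for lport in port_range:
        # A port held by a bind() on this address or on the wildcard ("*")
        # is off limits, as the commit message requires.
        if (laddr, lport) in bound or ("*", lport) in bound:
            continue
        # The port may already serve other destinations; only the complete
        # tuple has to be unique.
        if (laddr, lport, faddr, fport) not in connected:
            return lport
    return None
```

With one existing connection from 10.0.0.1:49152 to 192.0.2.1:80, a connect to 192.0.2.2:80 can reuse port 49152 in this model, while a second connect to 192.0.2.1:80 has to move on to 49153.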
9dd4070571f77676d0dcd3d2aa7acdce654d68f4 12-Feb-2020 rrs <rrs@FreeBSD.org> White space cleanup -- remove trailing tabs or spaces
from any line.

Sponsored by: Netflix Inc.
9aa583df2066e84e77ab179af2d3ee900fd5ba63 15-Jan-2020 glebius <glebius@FreeBSD.org> Stop header pollution and don't include if_var.h via in_pcb.h.
58c0da1e84e2c63365cada56594eae15e1cf77b1 12-Jan-2020 tuexen <tuexen@FreeBSD.org> Fix race when accepting TCP connections.

When expanding a SYN-cache entry to a socket/inp a two-step approach was
taken:
1) The local address was filled in, then the inp was added to the hash
table.
2) The remote address was filled in and the inp was relocated in the
hash table.
Before the epoch changes, a write lock was held when this happened and
the code looking up entries was holding a corresponding read lock.
Since the read lock went away with the introduction of the
epochs, the half-populated inp could be found during lookup.
This resulted in processing TCP segments in the context of the wrong
TCP connection.
This patch changes the above procedure so that the inp is fully
populated before being inserted into the hash table.

Thanks to Paul <devgs@ukr.net> for reporting the issue on the net@
mailing list and for testing the patch!

Reviewed by: rrs@
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D22971
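
A toy model can make the race easier to see. The sketch below is an assumption-laden simplification: the `Inp` record, the list-based table and the lookup policy (exact 4-tuple first, then an entry with no foreign address) are invented for illustration and do not mirror the kernel's hash tables.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Inp:
    name: str
    laddr: str
    lport: int
    faddr: Optional[str] = None  # None models "not filled in yet"
    fport: Optional[int] = None


def lookup(table: List[Inp], laddr: str, lport: int,
           faddr: str, fport: int) -> Optional[Inp]:
    """Prefer an exact 4-tuple match, else an entry without a foreign side."""
    fallback = None
    for inp in table:
        if (inp.laddr, inp.lport) != (laddr, lport):
            continue
        if (inp.faddr, inp.fport) == (faddr, fport):
            return inp
        if inp.faddr is None and fallback is None:
            fallback = inp
    return fallback


# Old two-step procedure: the entry is visible with only its local side set,
# so a segment from an unrelated peer can be matched to it.
half_populated = [Inp("new-conn", "10.0.0.1", 80)]

# Fixed procedure: the entry becomes visible only once fully populated.
fully_populated = [Inp("new-conn", "10.0.0.1", 80, "192.0.2.9", 4444)]
```

In this model, a lookup for a segment from 198.51.100.7:5555 returns "new-conn" against `half_populated` but finds nothing against `fully_populated`.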
92dda61f135296d64199e5d1ce01bd22ee1e7ef4 07-Nov-2019 glebius <glebius@FreeBSD.org> Remove now unused INP_INFO_RLOCK macros.
e3012914f17747826a890e11e4efaac8340ec908 07-Nov-2019 glebius <glebius@FreeBSD.org> Remove now unused INP_HASH_RLOCK() macros.
deb3c19c5505f5f062687d328dd3ffc7736d0a2e 07-Nov-2019 glebius <glebius@FreeBSD.org> Add INP_UNLOCK() which will do whatever R/W unlock is required.
f66d5bcdd266eb22421e6f81a8f4530d1627b4a8 02-Aug-2019 bz <bz@FreeBSD.org> IPv6 cleanup: kernel

Finish what was started a few years ago and harmonize IPv6 and IPv4
kernel names. We are down to very few places now that it is feasible
to do the change for everything remaining without causing too much disturbance.

Remove "aliases" for IPv6 names which confusingly could indicate
that we are talking about a different data structure or field or
have two fields, one for each address family.
Try to follow common conventions used in FreeBSD.

* Rename sin6p to sin6 as that is how it is spelt in most places.
* Remove "aliases" (#defines) for:
- in6pcb which really is an inpcb and nothing separate
- sotoin6pcb which is sotoinpcb (as per above)
- in6p_sp which is inp_sp
- in6p_flowinfo which is inp_flow
* Try to use ia6 for in6_addr rather than in6p.
* With all these gone also rename the in6p variables to inp as
that is what we call it in most of the network stack including
parts of netinet6.

The reasons behind this cleanup are that we try to further
unify netinet and netinet6 code where possible and that people
will be less likely to overlook one or the other protocol family
when doing code changes, as they may not have spotted places due to
different names for the same thing.

No functional changes.

Discussed with: tuexen (SCTP changes)
MFC after: 3 months
Sponsored by: Netflix
8a34b17735d7079d0019f78ac9030811e8670d30 01-Aug-2019 rrs <rrs@FreeBSD.org> This adds the third step in getting BBR into the tree. BBR and
an updated rack depend on having access to the new
ratelimit api in this commit.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D20953
b80b5fa3897e6c0d1117d8a0dca548d130a3ed84 10-Jul-2019 rrs <rrs@FreeBSD.org> This commit updates rack to what is basically being used at NF and
also lays some of the groundwork for committing BBR. The
hpts system is updated as well as some other needed utilities
for the entrance of BBR. This is actually part 1 of 3 more
needed commits which will finally complete with BBRv1 being
added as a new tcp stack.

Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D20834
63aec3850f781558523c875c46cc39bd917ee4fa 25-Apr-2019 gallatin <gallatin@FreeBSD.org> Track TCP connection's NUMA domain in the inpcb

Drivers can now pass up numa domain information via the
mbuf numa domain field. This information is then used
by TCP syncache_socket() to associate that information
with the inpcb. The domain information is then fed back
into transmitted mbufs in ip{6}_output(). This mechanism
is nearly identical to what is done to track RSS hash values
in the inp_flowid.

Follow on changes will use this information for lacp egress
port selection, binding TCP pacers to the appropriate NUMA
domain, etc.

Reviewed by: markj, kib, slavash, bz, scottl, jtl, tuexen
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20028
6d8cc191f953b3680c5e5911afc66b7c1f8e6c4b 09-Jan-2019 glebius <glebius@FreeBSD.org> Mechanical cleanup of epoch(9) usage in network stack.

- Remove macros that covertly create epoch_tracker on thread stack. Such
macros are quite unsafe, e.g. they will produce buggy code if the same macro is
used in nested scopes. Always declare epoch_tracker explicitly.

- Unmask interface list IFNET_RLOCK_NOSLEEP(), interface address list
IF_ADDR_RLOCK() and interface AF specific data IF_AFDATA_RLOCK() read
locking macros to what they actually are - the net_epoch.
Keeping them as is is very misleading. They all are named FOO_RLOCK(),
while they no longer have lock semantics. Now they allow recursion and
what's more important they now no longer guarantee protection against
their companion WLOCK macros.
Note: INP_HASH_RLOCK() has the same problems, but is not touched by this commit.

This is a non-functional mechanical change. The only functionally changed
functions are ni6_addrs() and ni6_store_addrs(), where we no longer enter
the epoch recursively.

Discussed with: jtl, gallatin
a74ba1d4315b29b32f134692295d2bbc8d345e55 05-Dec-2018 markj <markj@FreeBSD.org> Clamp the INPCB port hash tables to IPPORT_MAX + 1 chains.

Memory beyond that limit was previously unused, wasting roughly 1MB per
8GB of RAM. Also retire INP_PCBLBGROUP_PORTHASH, which was identical to

Reviewed by: glebius
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D17803
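
Why chains beyond IPPORT_MAX + 1 can never be used follows from the shape of the port hash. The sketch below assumes the hash is simply the host-order port masked by the table mask; the helper names are made up for the example.

```python
IPPORT_MAX = 65535


def port_hash(lport: int, mask: int) -> int:
    """Chain index for a local port (assumed form: port AND mask)."""
    return lport & mask


def clamped_chains(requested: int) -> int:
    """Number of chains worth allocating for a port hash table."""
    return min(requested, IPPORT_MAX + 1)
```

With a table of 2**20 chains, the largest index any 16-bit port can produce is still 65535, so `clamped_chains(1 << 20)` returns 65536 and the remaining chains would only have consumed memory.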
001b7b7b0f48cab0f7d8a77d34a4efb6b9dd2fa1 01-Oct-2018 ae <ae@FreeBSD.org> Add INP_INFO_WUNLOCK_ASSERT() macro and use it instead of
INP_INFO_UNLOCK_ASSERT() in TCP-related code. For encapsulated traffic
it is possible that the code is running in a net_epoch_preempt section,
and INP_INFO_UNLOCK_ASSERT() is a very strict assertion for such a case.

PR: 231428
Reviewed by: mmacy, tuexen
Approved by: re (kib)
Differential Revision: https://reviews.freebsd.org/D17335
6131d60f6a38d1bb8b73c3104e6d318e0393cab6 10-Sep-2018 markj <markj@FreeBSD.org> Fix synchronization of LB group access.

Lookups are protected by an epoch section, so the LB group linkage must
be a CK_LIST rather than a plain LIST. Furthermore, we were not
deferring LB group frees, so in_pcbremlbgrouphash() could race with
readers and cause a use-after-free.

Reviewed by: sbruno, Johannes Lundberg <johalun0@gmail.com>
Tested by: gallatin
Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D17031
f7f4373428c89a2fbea05d617d3960b0f3960bb6 06-Sep-2018 bz <bz@FreeBSD.org> The inp_lle field in struct inpcb, along with two "valid" flags
for the rt and lle cache, was added in r191129 (2009).
To the best of my knowledge they have never been used, and route caching
has converted the inp_rt field from that commit to inp_route,
rendering this field and these flags obsolete.

Convert the pointer into a spare pointer to not change the size of
the structure anymore (and to have a spare pointer) and mark the
two fields as unused.

Reviewed by: markj, karels
Approved by: re (gjb)
Differential Revision: https://reviews.freebsd.org/D17062
3055b3b3269e2ef2e9d85dbd2cc6f6ea929e9109 21-Aug-2018 tuexen <tuexen@FreeBSD.org> Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP
socket resulted in sending fragmented IPv6 packets.

This is fixed by reducing the MSS to the appropriate value. In addition,
if the socket option is set before the handshake happens, announce this
MSS to the peer. This is not strictly required, but done since TCP
is conservative.

PR: 173444
Reviewed by: bz@, rrs@
MFC after: 1 month
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D16796
cc0e36203dafc6d9f1595573d5f4cb658bcff27f 20-Aug-2018 bz <bz@FreeBSD.org> GC inc_isipv6; it was added for "temp" compatibility in 2001, r86764
and does not seem to be used.
8a6f698b859241cbaeeb02fcf98ea7e43cea82f6 04-Aug-2018 glebius <glebius@FreeBSD.org> Now that after r335979 the kernel addresses in API structures are
fixed size, there is no reason left for the unions.

Discussed with: brooks
6615ed4c6149c9810ba766e318f591d44a0596df 05-Jul-2018 brooks <brooks@FreeBSD.org> Make struct xinpcb and friends word-size independent.

Replace size_t members with ksize_t (uint64_t) and pointer members
(never used as pointers in userspace, but instead as unique
identifiers) with kvaddr_t (uint64_t). This makes the structs
identical between 32-bit and 64-bit ABIs.

On 64-bit systems, the ABI is maintained. On 32-bit systems,
this is an ABI breaking change. The ABI of most of these structs
was previously broken in r315662. This also imposes a small API
change on userspace consumers who must handle kernel pointers
becoming virtual addresses.

PR: 228301 (exp-run by antoine)
Reviewed by: jtl, kib, rwatson (various versions)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15386
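
The effect of swapping size_t and pointer members for fixed 64-bit types can be checked from userland with ctypes. The two structures below are hypothetical stand-ins with invented field names, not the real struct xinpcb layout.

```python
import ctypes


class NativeLayout(ctypes.Structure):
    # Member sizes follow the ABI of the running interpreter.
    _fields_ = [("len", ctypes.c_size_t), ("sock", ctypes.c_void_p)]


class FixedLayout(ctypes.Structure):
    # Stand-ins for ksize_t and kvaddr_t, both uint64_t per the commit.
    _fields_ = [("len", ctypes.c_uint64), ("sock", ctypes.c_uint64)]


native_size = ctypes.sizeof(NativeLayout)  # 8 on 32-bit, 16 on 64-bit ABIs
fixed_size = ctypes.sizeof(FixedLayout)    # 16 on both
```

Only the fixed layout has the same size under a 32-bit and a 64-bit interpreter, which is the property that lets one userland tool parse the exported structure on both ABIs.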
14de8a2820efdf121114eefd291e6427fa353690 04-Jul-2018 mmacy <mmacy@FreeBSD.org> epoch(9): allow preemptible epochs to compose

- Add tracker argument to preemptible epochs
- Inline epoch read path in kernel and tied modules
- Change in_epoch to take an epoch as argument
- Simplify tfb_tcp_do_segment to not take a ti_locked argument,
there's no longer any benefit to dropping the pcbinfo lock
and trying to do so just adds an error prone branchfest to
these functions
- Remove cases of same function recursion on the epoch as
recursing is no longer free.
- Remove the TAILQ_ENTRY and epoch_section from struct
thread as the tracker field is now stack or heap allocated
as appropriate.

Tested by: pho and Limelight Networks
Reviewed by: kbowling at llnw dot com
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D16066
79793784f78622f0d53b194ead20a250ba529bcb 19-Jun-2018 mmacy <mmacy@FreeBSD.org> convert inpcbinfo hash and info rwlocks to epoch + mutex

- Convert inpcbinfo info & hash locks to epoch for read and mutex for write
- Garbage collect code that handled INP_INFO_TRY_RLOCK failures, as
INP_INFO_RLOCK can no longer fail

When running 64 netperfs sending minimal sized packets on a 2x8x2 reduces
unhalted core cycles samples in rwlock rlock/runlock in udp_send from 51% to

Overall packet throughput rate limited by CPU affinity and NIC driver design

On the receiver unhalted core cycles samples in in_pcblookup_hash went from
13% to 1.6%

Tested by LLNW and pho@

Reviewed by: jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15686
255aa2f16f5f00dce267e48d5d74f843554f3c3c 13-Jun-2018 mmacy <mmacy@FreeBSD.org> Fix PCBGROUPS build post CK conversion of pcbinfo
6e4e86f96e714c757adf0947d64d400ab1450670 12-Jun-2018 mmacy <mmacy@FreeBSD.org> Defer inpcbport free until after a grace period has elapsed

This is a dependency for inpcbinfo rlock conversion to epoch
1cbc14be824365378877a193776933041e352feb 12-Jun-2018 mmacy <mmacy@FreeBSD.org> mechanical CK macro conversion of inpcbinfo lists

This is a dependency for converting the inpcbinfo hash and info rlocks
to epoch.
f2fc01c6c797b053d42a3873c287c76b3b233cbe 12-Jun-2018 mmacy <mmacy@FreeBSD.org> Defer inpcb deletion until after a grace period has elapsed

Deferring the actual free of the inpcb until after a grace
period has elapsed will allow us to convert the inpcbinfo
info and hash read locks to epoch.

Reviewed by: gallatin, jtl
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D15510
d0aeaa5af7f77964d05bcb54e2f5cfdeb187edaf 06-Jun-2018 sbruno <sbruno@FreeBSD.org> Load balance sockets with new SO_REUSEPORT_LB option.

This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and cannot
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures:
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

As in DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or
threads sharing the same socket).

This is a substantially different contribution as compared to its original
incarnation at svn r332894 and reverted at svn r332967. Thanks to rwatson@
for the substantive feedback that is included in this commit.

Submitted by: Johannes Lundberg <johalun0@gmail.com>
Obtained from: DragonflyBSD
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
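
The load-balance group described above can be pictured with a small model. Everything here is illustrative: the class and method names are invented, and CRC32 stands in for whatever hash function the kernel applies to the incoming connection.

```python
import zlib
from typing import List

LB_GROUP_MAX = 256  # group size limit stated in the commit message


class LBGroup:
    """Sockets bound to the same port with SO_REUSEPORT_LB (toy model)."""

    def __init__(self) -> None:
        self.members: List[str] = []

    def add(self, pcb: str) -> bool:
        """Join the group; refuse once the size limit is reached."""
        if len(self.members) >= LB_GROUP_MAX:
            return False
        self.members.append(pcb)
        return True

    def select(self, faddr: str, fport: int) -> str:
        """Pick a member by hashing the foreign address and port."""
        if not self.members:
            raise LookupError("empty load-balance group")
        key = f"{faddr}:{fport}".encode()
        return self.members[zlib.crc32(key) % len(self.members)]
```

Because the choice depends only on the hash of the peer's address and port, the same peer is steered to the same member for as long as the group membership does not change.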
482d0f575f76dd009c45a07a7fbdd34af77178a0 21-May-2018 mmacy <mmacy@FreeBSD.org> inpcb: revert deferred inpcb free pending further review
7110d4d79667d03917e95c1908a87432606b615b 20-May-2018 mmacy <mmacy@FreeBSD.org> inpcb: defer destruction of inpcb until after a grace period has elapsed

in_pcbfree will remove the inpcb from the list and release the rtentry
while the vnet is set, but the actual destruction will be deferred
until any threads in a (not yet used) epoch section no longer potentially
have references.
99ec59840bac8a84c993ebd2a0eb62a8bb369e87 20-May-2018 mmacy <mmacy@FreeBSD.org> in_pcb: add helper for deferring inpcb rele calls from list functions
257e6e5563df4d5f0a939c72acdef3c49c440e38 24-Apr-2018 sbruno <sbruno@FreeBSD.org> Revert r332894 at the request of the submitter.

Submitted by: Johannes Lundberg <johalun0_gmail.com>
Sponsored by: Limelight Networks
bbf7d4dd035a71710ac94fe1ada4d99244102159 23-Apr-2018 sbruno <sbruno@FreeBSD.org> Load balance sockets with new SO_REUSEPORT_LB option

This patch adds a new socket option, SO_REUSEPORT_LB, which allows multiple
programs or threads to bind to the same port and incoming connections will be
load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and cannot
be set per socket. This patch allows for simultaneous use of both the current
SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

As in DragonflyBSD, a load balance group is limited to 256 pcbs
(256 programs or threads sharing the same socket).

Submitted by: Johannes Lundberg <johanlun0@gmail.com>
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D11003
863f90dbfae4122e483fff0604d5919b560d4fe0 19-Apr-2018 rrs <rrs@FreeBSD.org> This commit brings in the TCP high precision timer system (tcp_hpts).
It is the forerunner/foundational work of bringing in both Rack and BBR
which use hpts for pacing out packets. The feature is optional and requires
the TCPHPTS option to be enabled before the feature will be active. TCP
modules that use it must ensure that the base component is compiled into
the kernel in which they are loaded.

MFC after: Never
Sponsored by: Netflix Inc.
Differential Revision: https://reviews.freebsd.org/D15020
4736ccfd9c3411d50371d7f21f9450a47c19047e 20-Nov-2017 pfg <pfg@FreeBSD.org> sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well-known
open source licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
7c2ab1d9f6ff94b01027808fde17933a54ea8c82 06-Sep-2017 hselasky <hselasky@FreeBSD.org> Add support for generic backpressure indicator for ratelimited
transmit queues as well as non-ratelimited ones.

Add the required structure bits in order to support a backpressure
indication with ratelimited connections as well as non-ratelimited
ones. The backpressure indicator is a value between zero and 65535
inclusive, indicating if the destination transmit queue is empty or
full respectively. Applications can use this value as a decision point
for when to stop transmitting data to avoid endless ENOBUFS error
codes upon transmitting an mbuf. This indicator is also useful to
reduce the latency for ratelimited queues.

Reviewed by: gallatin, kib, gnn
Differential Revision: https://reviews.freebsd.org/D11518
Sponsored by: Mellanox Technologies
7845c5b75cd9864c329b8ecf859f9c3479f12df2 24-May-2017 glebius <glebius@FreeBSD.org> o Rearrange struct inpcb fields to optimize the TCP output code path
considering cache line hits and misses. Put the lock and hash list
glue into the first cache line, put inp_refcount inp_flags inp_socket
into the second cache line.
o On allocation zero out entire structure except the lock and list entries,
including inp_route inp_lle inp_gencnt. When inp_route and inp_lle were
introduced, they were added below inp_zero_size, resulting in them not being
cleared after free/alloc. This definitely was a source of bugs with route
caching. Could be that r315956 has just fixed one of them.
The inp_gencnt is reinitialized on every alloc, so it is safe to clear it.

This has been proved to improve TCP performance at Netflix.

Obtained from: rrs
Differential Revision: D10686
19fea9c3e06195766dd4017d5bd5c644486a5442 15-May-2017 glebius <glebius@FreeBSD.org> Reduce in_pcbinfo_init() by two params. No users supply any flags to this
function (they used to say UMA_ZONE_NOFREE), so flag parameter goes away.
The zone_fini parameter also goes away. Previously no protocols (except
divert) supplied zone_fini function, so inpcb locks were leaked with slabs.
This was okay while zones were allocated with UMA_ZONE_NOFREE flag, but now
this is a leak. Fix that by supplying inpcb_fini() function as fini method
for all inpcb zones.
a58d0019c110a1656b15c71152374f0decd4a8de 21-Mar-2017 glebius <glebius@FreeBSD.org> Force same alignment on struct xinpgen as we have on struct xinpcb. This
fixes 32-bit builds.
3a5c9aaf2b2ea107bcaf0ba28b706238d92bdbbd 21-Mar-2017 glebius <glebius@FreeBSD.org> Hide struct inpcb, struct tcpcb from the userland.

This is a painful change, but it is needed. On the one hand, we avoid
modifying them, and this slows down some ideas, on the other hand we still
eventually modify them and tools like netstat(1) never work on next version of
FreeBSD. We maintain a ton of spares in them, and we already got some ifdef
hell at the end of tcpcb.

- Hide struct inpcb, struct tcpcb under _KERNEL || _WANT_FOO.
- Make struct xinpcb, struct xtcpcb pure API structures, not including
kernel structures inpcb and tcpcb inside. Export into these structures
the fields from inpcb and tcpcb that are known to be used, and put there
a ton of spare space.
- Make kernel and userland utilities compilable after these changes.
- Bump __FreeBSD_version.

Reviewed by: rrs, gnn
Differential Revision: D10018
65907cccb64065edee63501fdacb909884400fe1 09-Mar-2017 glebius <glebius@FreeBSD.org> Make inp_lock_assert() depend on INVARIANT_SUPPORT, not INVARIANTS.
This will make INVARIANTS-enabled modules that use this function load
successfully on a kernel that has INVARIANT_SUPPORT only.
f2a480c25cb3d60d5b3221a43f3c0c5b763099eb 06-Mar-2017 eri <eri@FreeBSD.org> The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately the two will have different integer values, because the Linux value is already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets, because the complete information on forwarded packets is available.

Reviewed by: adrian, aw
Approved by: ae (mentor)
Sponsored by: rsync.net
Differential Revision: D9235
7e6cabd06e6caa6a02eeb86308dc0cb3f27e10da 28-Feb-2017 imp <imp@FreeBSD.org> Renumber copyright clause 4

Renumber clause 4 to 3, per what everybody else did when BSD granted
them permission to remove clause 3. My insistence on keeping the same
numbering for legal reasons is too pedantic, so give up on that point.

Submitted by: Jan Schaumann <jschauma@stevens.edu>
Pull Request: https://github.com/freebsd/freebsd/pull/96
bb348ba959fe20fe2e80e46b6549e83bb2864d80 12-Feb-2017 eri <eri@FreeBSD.org> Committed without approval from mentor.

Reported by: gnn
6898c4334b816da44ea78b83e61db9b78b581c67 10-Feb-2017 eri <eri@FreeBSD.org> Revert r313527

Heh svn is not git
b429db62bced19ed2003c67ca849f9e35f9ca234 10-Feb-2017 eri <eri@FreeBSD.org> Correct missed variable name.

Reported-by: ohartmann@walstatt.org
ed45b3149493fb1e83faa4fc28cc0acf91aae040 10-Feb-2017 eri <eri@FreeBSD.org> The patch provides the same socket option as Linux IP_ORIGDSTADDR.
Unfortunately the two will have different integer values, because the Linux value is already assigned in FreeBSD.

The patch is similar to IP_RECVDSTADDR but also provides the destination port value to the application.

This allows/improves implementation of transparent proxies on UDP sockets, because the complete information on forwarded packets is available.

Sponsored-by: rsync.net
Differential Revision: D9235
Reviewed-by: adrian
efa6326974ec2cdb6721fec731bcd86758d0877c 18-Jan-2017 hselasky <hselasky@FreeBSD.org> Implement kernel support for hardware rate limited sockets.

- Add RATELIMIT kernel configuration keyword which must be set to
enable the new functionality.

- Add support for hardware driven, Receive Side Scaling, RSS aware, rate
limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API supports rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

- Add rate limit function callback API to "struct ifnet" which supports
the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

- Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT
flag, which tells if a network driver supports rate limiting or not.

- This patch also adds support for rate limiting through VLAN and LAGG
intermediate network devices.

- How rate limiting works:

1) The userspace application calls setsockopt() after accepting or
making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

2) When the network driver sees the "m->m_pkthdr.snd_tag" different
from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

3) Route changes are detected by the NIC drivers in the ifp->if_transmit()
routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon the next ip_output() call, allocation of a new "snd_tag" will be attempted.

4) When the PCB is detached the custom sendqueue will be released by a
non-blocking ifp->if_snd_tag_free() call to the currently bound network

Reviewed by: wblock (manpages), adrian, gallatin, scottl (network)
Differential Revision: https://reviews.freebsd.org/D3687
Sponsored by: Mellanox Technologies
MFC after: 3 months
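
The send tag life cycle in steps 1) and 3) above can be sketched as follows. This is a simplified model under stated assumptions: the class names, the interface names "mce0" and "mce1", and the reduction of the driver to a single function are all invented for the example, and rate changes and tag freeing on PCB detach are left out.

```python
import errno
from dataclasses import dataclass
from typing import Optional


@dataclass
class SndTag:
    ifname: str  # interface that owns the rate limited send queue
    rate: int


@dataclass
class Pcb:
    rate: int
    snd_tag: Optional[SndTag] = None


def if_transmit(ifname: str, tag: Optional[SndTag]) -> int:
    """Driver side: reject packets whose tag belongs to another interface."""
    if tag is not None and tag.ifname != ifname:
        return errno.EAGAIN
    return 0


def ip_output(pcb: Pcb, route_ifname: str) -> int:
    """Stack side: allocate a tag on demand, drop it when the route moved."""
    if pcb.snd_tag is None:
        pcb.snd_tag = SndTag(route_ifname, pcb.rate)  # models if_snd_tag_alloc()
    error = if_transmit(route_ifname, pcb.snd_tag)
    if error == errno.EAGAIN:
        pcb.snd_tag = None  # release and clear; the next call allocates anew
    return error
```

Sending over "mce0" and then over "mce1" yields one EAGAIN in this model, after which the next call allocates a fresh tag on "mce1" and succeeds.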
af533198e352d1ee97e831497a7a004f6f1f2740 23-Jun-2016 np <np@FreeBSD.org> Add spares to struct ifnet and socket for packet pacing and/or general
use. Update comments regarding the spare fields in struct inpcb.

Bump __FreeBSD_version for the changes to the size of the structures.

Reviewed by: gnn@
Approved by: re@ (gjb@)
Sponsored by: Chelsio Communications
00d578928eca75be320b36d37543a7e2a4f9fbdb 27-May-2016 grehan <grehan@FreeBSD.org> Create branch for bhyve graphics import.
d9c9113377a2f19d01848ae8dcc470e9306ce932 03-May-2016 pfg <pfg@FreeBSD.org> sys/net*: minor spelling fixes.

No functional change.
c3d5404bbe9b51c8373832a220b2568fc8b806fe 24-Mar-2016 gnn <gnn@FreeBSD.org> FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.

Submitted by: Mike Karels
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D4306
790dc6f94a18458e1890c6c82b22e51da4285e04 05-Sep-2015 glebius <glebius@FreeBSD.org> Use Jenkins hash for TCP syncache.

o Unlike xor, in Jenkins hash every bit of input affects virtually
every bit of output, thus salting the hash actually works. With
xor salting only provides a false sense of security, since if
hash(x) collides with hash(y), then of course, hash(x) ^ salt
would also collide with hash(y) ^ salt. [1]
o Jenkins provides much better distribution than xor, very close to

TCP connection setup/teardown benchmark has shown a 10% increase
with default hash size, and with bigger hashes that still provide
possibility for collisions. With enormous hash size, when dataset is
by an order of magnitude smaller than hash size, the benchmark has
shown a 4% decrease in performance, which is expected and

Noticed by: Jeffrey Knockel <jeffk cs.unm.edu> [1]
Benchmarks by: jch
Reviewed by: jch, pkelsey, delphij
Security: strengthens protection against hash collision DoS
Sponsored by: Nginx, Inc.
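
The argument about salting in point [1] can be reproduced in a few lines. The XOR fold below is a deliberately naive stand-in for the old hash, and the Jenkins one-at-a-time function is used only as a familiar member of the Jenkins family; neither is claimed to be the function the kernel uses.

```python
def xor_hash(data: bytes) -> int:
    """Fold the input into 32 bits by XOR-ing its 4-byte words."""
    h = 0
    for i in range(0, len(data), 4):
        h ^= int.from_bytes(data[i:i + 4].ljust(4, b"\0"), "big")
    return h


def salted_xor_hash(data: bytes, salt: int) -> int:
    """XOR hash with the salt applied the only way XOR allows."""
    return xor_hash(data) ^ salt


def jenkins_oaat(data: bytes, salt: int) -> int:
    """Jenkins one-at-a-time hash, seeded with the salt."""
    h = salt & 0xFFFFFFFF
    for byte in data:
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h


# Two inputs built to collide under the XOR fold: both fold to 3.
x = bytes([0, 0, 0, 1, 0, 0, 0, 2])
y = bytes([0, 0, 0, 2, 0, 0, 0, 1])
```

`salted_xor_hash(x, s)` equals `salted_xor_hash(y, s)` for every salt `s`, so an attacker who finds one colliding pair keeps it regardless of the secret. With a mixing hash such as `jenkins_oaat`, the salt feeds every round, so in general a pair that collides under one salt need not collide under another.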
349429fe8270f2579de42235749bb5ed010ee83a 08-Aug-2015 jch <jch@FreeBSD.org> Fix a kernel assertion issue introduced with r286227:
Avoid too strict INP_INFO_RLOCK_ASSERT checks due to
tcp_notify() being called from in6_pcbnotify().

Reported by: Larry Rosenman <ler@lerctr.org>
Submitted by: markj, jch
67927a7a7c96545feb52784dea33376dcf127e76 03-Aug-2015 jch <jch@FreeBSD.org> Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:

- The existing TCP INP_INFO lock continues to protect the global inpcb list
stability during full list traversal (e.g. tcp_pcblist()).

- A new INP_LIST lock protects inpcb list actual modifications (inp allocation
and free) and inpcb global counters.

This allows the TCP INP_INFO_RLOCK lock to be used in critical paths (e.g. tcp_input())
and INP_INFO_WLOCK only in occasional operations that walk all connections.

PR: 183659
Differential Revision: https://reviews.freebsd.org/D2599
Reviewed by: jhb, adrian
Tested by: adrian, nitroboost-gmail.com
Sponsored by: Verisign, Inc.
b09afc6f3f088fa610e8e85066b0efc23f29fee1 24-Apr-2015 hiren <hiren@FreeBSD.org> MFC r275358 r275483 r276982 - Removing M_FLOWID by hps@

Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but no longer has any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Remove M_FLOWID from SCTP code.

Remove no longer used "M_FLOWID" flag from mbuf.h and update the netisr

Note: The FreeBSD version has been bumped.

Reviewed by: hps, tuexen
Sponsored by: Limelight Networks
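
The change in how a driver decides whether the flowid is usable can be modelled briefly. The numeric constants are placeholders chosen for the example, and the `Mbuf` class and helper names are hypothetical; only the field and macro names come from the commit message.

```python
from dataclasses import dataclass

M_HASHTYPE_NONE = 0      # placeholder value for illustration
M_HASHTYPE_OPAQUE = 255  # placeholder value for illustration


@dataclass
class Mbuf:
    flowid: int = 0
    rsstype: int = M_HASHTYPE_NONE


def flowid_is_valid(m: Mbuf) -> bool:
    """Driver-side check that replaces testing M_FLOWID in m_flags."""
    return m.rsstype != M_HASHTYPE_NONE


def tag_outgoing(m: Mbuf, inp_flowid: int, inp_flowtype: int) -> None:
    """Transmit side: plain assignments, no conditional on a valid flowid."""
    m.flowid = inp_flowid
    m.rsstype = inp_flowtype
```

A connection without flow information simply copies `M_HASHTYPE_NONE` into the mbuf, and the driver's check then ignores the flowid, which is why the transmit path no longer needs its own "if".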
39fb34452ff3bc4927f319c53cfad417efb4e883 06-Apr-2015 hiren <hiren@FreeBSD.org> MFC r266418, r266448

Add the flowtype to the inpcb.
Add -R to netstat to dump RSS/flow information.

Reviewed by: delphij
Relnotes: yes (for r266448)
Sponsored by: Limelight Networks
78c1f8fbf46602fa3a316af9a8cb3221b76a24b2 01-Dec-2014 dim <dim@FreeBSD.org> Merge ^/head r275262 through r275363.
12fec3618b88732ec95820e4717a392d431ddb61 01-Dec-2014 hselasky <hselasky@FreeBSD.org> Start process of removing the use of the deprecated "M_FLOWID" flag
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but no longer has any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.

This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.

"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.

Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.

MFC after: 1 month
Sponsored by: Mellanox Technologies
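
The transmit-side check described in this commit can be sketched in userland C. This is only an illustrative model: the MODEL_* names, the struct layout and the queue-selection helper are hypothetical stand-ins, not the definitions found in "sys/mbuf.h" or in any driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the values from sys/mbuf.h. */
enum model_hashtype {
	MODEL_HASHTYPE_NONE = 0,	/* flowid carries no valid value */
	MODEL_HASHTYPE_OPAQUE = 1	/* flowid is valid, no particular type */
};

/* Hypothetical, reduced packet header holding only the two fields used. */
struct model_pkthdr {
	uint32_t flowid;
	uint8_t  rsstype;
};

/* New-style check: the hash type alone decides whether flowid is usable. */
static bool
model_flowid_valid(const struct model_pkthdr *ph)
{
	return (ph->rsstype != MODEL_HASHTYPE_NONE);
}

/* Pick a transmit queue; fall back to queue 0 without a valid flowid. */
static unsigned
model_select_txq(const struct model_pkthdr *ph, unsigned nqueues)
{
	if (nqueues == 0 || !model_flowid_valid(ph))
		return (0);
	return (ph->flowid % nqueues);
}
```

In this model a sender only has to assign a hash type together with the flowid; no separate flag bit needs to be tested before the assignment.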
b5d711d3a6940afdd3615f7ffc2dcfa3faacd446 09-Nov-2014 melifaro <melifaro@FreeBSD.org> Remove faith(4) and faithd(8) from base. It looks like industry
has chosen different (and more traditional) stateless/stateful
NAT64 as translation mechanism. Last non-trivial commits to both
faith(4) and faithd(8) happened more than 12 years ago, so I assume
it is time to drop RFC3142 in FreeBSD.

No objections from: net@
1576b695b6af7e812a76f08eaf9ff60baabeeafd 10-Sep-2014 ae <ae@FreeBSD.org> Add scope zone id to the in_endpoints and hc_metrics structures.

A non-global IPv6 address can be used in more than one zone of the same
scope. This zone index is used to identify to which zone a non-global
address belongs.

Also we can have many foreign hosts with equal non-global addresses,
but from different zones. So, they can have different metrics in the
host cache.

Obtained from: Yandex LLC
Sponsored by: Yandex LLC
82d0b71937338226c447858359bb3ba8b80dc66d 10-Sep-2014 ae <ae@FreeBSD.org> Introduce INP6_PCBHASHKEY macro. Replace usage of hardcoded part of
IPv6 address as hash key in all places.

Obtained from: Yandex LLC
e623d51cd5c4ea0255d03a6a082071e1ae700947 09-Sep-2014 adrian <adrian@FreeBSD.org> Add support for receiving and setting flowtype, flowid and RSS bucket
information as part of recvmsg().

This is primarily used for debugging/verification of the various
processing paths in the IP, PCB and driver layers.

Unfortunately the current implementation of the control message path
results in a ~10% or so drop in UDP frame throughput when it's used.

Differential Revision: https://reviews.freebsd.org/D527
Reviewed by: grehan
01670bf02723bd0e0904c85b3d43d99f44f197b9 12-Jul-2014 adrian <adrian@FreeBSD.org> Expose in_pcbbind_check_bindmulti() so the upcoming IPv6 RSS changes
can be made to use it.
627c6869c375d438267904bd1157d3129d6811e0 10-Jul-2014 adrian <adrian@FreeBSD.org> Implement the first stage of multi-bind listen sockets and RSS socket

* Introduce IP_BINDMULTI - indicating that it's okay to bind multiple
sockets on the same bind details.

Although the PCB code has been taught about this (see below) this patch
doesn't introduce the rest of the PCB changes necessary to distribute
lookups among multiple PCB entries in the global wildcard table.

* Introduce IP_RSS_LISTEN_BUCKET - placing a listen socket into the
given RSS bucket (and thus a single PCBGROUP hash.)

* Modify the PCB add path to be aware of IP_BINDMULTI:
+ Only allow further PCB entries to be added if the owner credentials
and IP_BINDMULTI has been specified. Ie, only allow further
IP_BINDMULTI sockets to appear if the first bind() was IP_BINDMULTI.

* Teach the PCBGROUP code about IP_RSS_LISTEN_BUCKET marked PCB entries.
Instead of using the wildcard logic and hashing, these sockets are
simply placed into the PCBGROUP and _not_ in the wildcard hash.

* When doing a PCBGROUP lookup, also do a wildcard match.
This allows for an RSS bucket PCB entry to appear in a PCBGROUP
rather than having to exist in the wildcard list.


* TCP IPv4 server testing with igb(4)
* TCP IPv4 server testing with ix(4)


* The pcbgroup lookup code duplicated the wildcard and wildcard-PCB
logic. This could be refactored into a single function.

* This doesn't yet work for IPv6 (The PCBGROUP code in netinet6/ doesn't
yet know about this); nor does it yet fully work for UDP.
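
One possible reading of the IP_BINDMULTI admission rule above, as a userland sketch. The struct, its fields and the function name are hypothetical, and treating "owner credentials" as a matching uid is an assumption made only for this illustration.

```c
#include <stdbool.h>

/* Hypothetical summary of a socket bound (or about to bind) to a tuple. */
struct model_bound {
	unsigned uid;		/* assumed meaning of "owner credentials" */
	bool	 bindmulti;	/* IP_BINDMULTI was set before bind() */
};

/*
 * A further bind to the same details is admitted only when both the
 * existing and the incoming socket asked for IP_BINDMULTI and they share
 * an owner.  If the first bind() was not IP_BINDMULTI, nothing else may
 * join it.
 */
static bool
model_bindmulti_allowed(const struct model_bound *existing,
    const struct model_bound *incoming)
{
	return (existing->bindmulti && incoming->bindmulti &&
	    existing->uid == incoming->uid);
}
```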
873b20c0ff82983e12fd6594ed634830f455f3ab 26-May-2014 smh <smh@FreeBSD.org> MFC r264879

Fix jailed raw sockets not setting the correct source address by
calling in_pcbladdr instead of prison_get_ip4.

Sponsored by: Multiplay
f91e4baca7e6ca89e8d188094776a02786928a35 18-May-2014 adrian <adrian@FreeBSD.org> Add the flowtype to the inpcb.

The flowid isn't enough to use as part of any RSS related CPU affinity
lookups - the RSS code would like to know what kind of hash it is.
977eea5c3ad3742e57b52c0c3ebd7d6af0d456d4 24-Apr-2014 smh <smh@FreeBSD.org> Fix jailed raw sockets not setting the correct source address by
calling in_pcbladdr instead of prison_get_ip4

MFC after: 1 month
eb1a5f8de9f7ea602c373a710f531abbf81141c4 21-Feb-2014 gjb <gjb@FreeBSD.org> Move ^/user/gjb/hacking/release-embedded up one directory, and remove
^/user/gjb/hacking since this is likely to be merged to head/ soon.

Sponsored by: The FreeBSD Foundation
6b01bbf146ab195243a8e7d43bb11f8835c76af8 27-Dec-2013 gjb <gjb@FreeBSD.org> Copy head@r259933 -> user/gjb/hacking/release-embedded for initial
inclusion of (at least) arm builds with the release.

Sponsored by: The FreeBSD Foundation
9b554dcd02e15b115e6aa09edcc4ed77c2019510 04-Jul-2013 trociny <trociny@FreeBSD.org> In r227207, to fix the issue with possible NULL inp_socket pointer
dereferencing, when checking for SO_REUSEPORT option (and SO_REUSEADDR
for multicast), INP_REUSEPORT flag was introduced to cache the socket
option. It was decided then that one flag would be enough to cache
setsockopt(2), it was checked if it was called for a multicast address
and INP_REUSEPORT was set accordingly.

Unfortunately that approach does not work when setsockopt(2) is called
before binding to a multicast address: the multicast check fails and
INP_REUSEPORT is not set.

Fix this by adding INP_REUSEADDR flag to unconditionally cache

PR: 179901
Submitted by: Michael Gmelin freebsd grem.de (initial version)
Reviewed by: rwatson
MFC after: 1 week
cc8c6e4d0185c640c9d03ed2804e3020ff84fed0 06-May-2013 andre <andre@FreeBSD.org> Back out r249318, r249320 and r249327 due to a heisenbug most
likely related to a race condition in the ipi_hash_lock with
the exact cause currently unknown but under investigation.
306fddaf7801d7fae1206025486e9d9a97f52ad4 09-Apr-2013 andre <andre@FreeBSD.org> Change certain heavily used network related mutexes and rwlocks to
reside on their own cache line to prevent false sharing with other
nearby structures, especially for those in the .bss segment.

NB: Those mutexes and rwlocks with variables next to them that get
changed on every invocation do not benefit from their own cache line.
Actually it may be net negative because two cache misses would be
incurred in those cases.
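
The idea of giving a lock its own cache line can be sketched with C11 alignment specifiers. The 64-byte line size and all MODEL_*/model_* names are assumptions for illustration; the kernel uses its own CACHE_LINE_SIZE constant and lock types.

```c
#include <stdalign.h>
#include <stddef.h>

#define MODEL_CACHE_LINE_SIZE 64	/* assumed line size for this sketch */

/* Hypothetical stand-in for a mutex or rwlock. */
struct model_lock {
	int word;
};

/*
 * Aligning the member forces the wrapper to start on a cache line
 * boundary and pads its size up to a full line, so two neighbouring
 * wrappers never share a line.
 */
struct model_padded_lock {
	alignas(MODEL_CACHE_LINE_SIZE) struct model_lock lock;
};

/* Two locks placed next to each other, each on its own line. */
struct model_locks {
	struct model_padded_lock lock_a;
	struct model_padded_lock lock_b;
};
```

As the note in the commit message says, this padding only helps when the data next to the lock is not itself modified on every acquisition.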
a73c365c3b7a51e662010c36af75a21348fd4284 02-Oct-2012 glebius <glebius@FreeBSD.org> There is a complex race in in_pcblookup_hash() and in_pcblookup_group().
Both functions need to obtain lock on the found PCB, and they can't do
classic inter-lock with the PCB hash lock, due to lock order reversal.
To keep the PCB stable, these functions put a reference on it and after PCB
lock is acquired drop it. If the reference was the last one, this means
we've raced with in_pcbfree() and the PCB is no longer valid.

This approach works okay only if we are acquiring writer-lock on the PCB.
In case of reader-lock, the following scenario can happen:

- 2 threads locate pcb, and do in_pcbref() on it.
- These 2 threads drop the inp hash lock.
- Another thread comes to delete pcb via in_pcbfree(), it obtains hash lock,
does in_pcbremlists(), drops hash lock, and runs in_pcbrele_wlocked(), which
doesn't free the pcb due to two references on it. Then it unlocks the pcb.
- 2 aforementioned threads acquire reader lock on the pcb and run
in_pcbrele_rlocked(). One gets 1 from in_pcbrele_rlocked() and continues,
second gets 0 and considers pcb freed, returns.
- The thread that got 1 continues working with detached pcb, which later
leads to panic in the underlying protocol level.

To plug that problem an additional INPCB flag is introduced - INP_FREED. We
check for that flag in the in_pcbrele_rlocked() and if it is set, we pretend
that that was the last reference.

Discussed with: rwatson, jhb
Reported by: Vladimir Medvedkin <medved rambler-co.ru>
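
The scenario above can be replayed with a small single-threaded model. Everything here is hypothetical (names, flag value, and the convention that "true" means "treat the pcb as gone"); locking and the actual memory release are left out on purpose.

```c
#include <stdbool.h>

#define MODEL_INP_FREED 0x1	/* hypothetical flag value */

/* Hypothetical, reduced pcb: only a reference count and a flags word. */
struct model_inpcb {
	unsigned refcount;
	unsigned flags;
};

/* Drop one reference; true when this was the last one. */
static bool
model_refcount_release(struct model_inpcb *inp)
{
	return (--inp->refcount == 0);
}

/* Model of the free path: mark the pcb freed, drop the owner reference. */
static void
model_pcbfree(struct model_inpcb *inp)
{
	inp->flags |= MODEL_INP_FREED;
	(void)model_refcount_release(inp);
}

/*
 * Model of the read-locked release used by lookups.  With the freed flag
 * set, every caller is told the pcb is gone, even if its own release was
 * not the one that dropped the count to zero.
 */
static bool
model_pcbrele_rlocked(struct model_inpcb *inp)
{
	bool last = model_refcount_release(inp);

	if (inp->flags & MODEL_INP_FREED)
		return (true);
	return (last);
}
```

With an initial count of three (owner plus two lookups), both lookup threads now see "gone" after the free path has run, so neither continues with a detached pcb.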
32041f44edbadf78cfaf57b4d6a30f5c41b4732d 12-Jun-2012 tuexen <tuexen@FreeBSD.org> Add an IP_RECVTOS socket option so that received UDP/IPv4
packets carry a cmsg of type IP_RECVTOS which contains the TOS byte,
much like IP_RECVTTL does for TTL. This allows a protocol implemented
on top of UDP to implement ECN.

MFC after: 3 days
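
A userland sketch of how an application might use this option. The helper names are hypothetical. The cmsg type checked is IP_RECVTOS, as stated in the commit message; other systems may deliver the byte under a different cmsg type (Linux is believed to use IP_TOS), so portable code would need to check for that as well.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <string.h>

/* Ask the stack to attach the TOS byte to received datagrams. */
static int
model_enable_recvtos(int fd)
{
	int on = 1;

	return (setsockopt(fd, IPPROTO_IP, IP_RECVTOS, &on, sizeof(on)));
}

/*
 * Walk the control messages filled in by recvmsg() and copy out the TOS
 * byte.  Returns 0 on success, -1 when no matching cmsg is present.
 */
static int
model_find_tos(struct msghdr *msg, unsigned char *tos)
{
	struct cmsghdr *c;

	for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
		if (c->cmsg_level == IPPROTO_IP &&
		    c->cmsg_type == IP_RECVTOS) {
			memcpy(tos, CMSG_DATA(c), 1);
			return (0);
		}
	}
	return (-1);
}
```

The two low-order bits of the returned byte are the ECN field, which is what a UDP-based protocol would inspect.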
26b44e3c9dec81c45a5cd5e5386ca5b6c9a06953 17-Mar-2012 rmh <rmh@FreeBSD.org> Hide a few declarations from userland (including `struct inpcbgroup'). This
removes the dependency on <machine/param.h> which was introduced with SVN
rev 222748 (due to CACHE_LINE_SIZE).

Reviewed by: bde
MFC after: 10 days
6e9e854884e3fcfa127843dd7969c7d3015c76be 08-Nov-2011 attilio <attilio@FreeBSD.org> MFC
f9135967f2f39cdfec7694004e7e3edd6d8663d9 06-Nov-2011 trociny <trociny@FreeBSD.org> Cache SO_REUSEPORT socket option in inpcb-layer in order to avoid
inp_socket->so_options dereference when we may not acquire the lock on
the inpcb.

This fixes the crash due to NULL pointer dereference in
in_pcbbind_setup() when inp_socket->so_options in a pcb returned by
in_pcblookup_local() was checked.

Reported by: dave jones <s.dave.jones@gmail.com>, Arnaud Lacombe <lacombar@gmail.com>
Suggested by: rwatson
Glanced by: rwatson
Tested by: dave jones <s.dave.jones@gmail.com>
352be4e985c0df0cf92cf64d89515b0b32bd1bf4 17-Jul-2011 bz <bz@FreeBSD.org> Add spares to the network stack for FreeBSD-9:
- TCP keep* timers
- TCP UTO (adjust from what was there already)
- netmap
- route caching
- user cookie (temporary to allow for the real fix)

Slightly re-shuffle struct ifnet moving fields out of the middle
of spares and to better align.

Discussed with: rwatson (slightly earlier version)
fa5ff41200519da6b30c30e3ba5e8dcdf171923e 06-Jun-2011 attilio <attilio@FreeBSD.org> MFC
45c14b9c114b5d1172b078f708d3bfe1f36c516f 06-Jun-2011 bz <bz@FreeBSD.org> Unbreak kernels with non-default PCBGROUP included but no WITNESS.
Rather than including lock.h in in_pcbgroup.c in right order, fix it
for all consumers of in_pcb.h by further header file pollution under
#ifdef KERNEL.

Reported by: Pan Tsu (inyaoo gmail.com)
6e29aea1dbf128b84b885f9acc6396c69ab080ce 06-Jun-2011 rwatson <rwatson@FreeBSD.org> Implement a CPU-affine TCP and UDP connection lookup data structure,
struct inpcbgroup. pcbgroups, or "connection groups", supplement the
existing inpcbinfo connection hash table, which when pcbgroups are
enabled, might now be thought of more usefully as a per-protocol
4-tuple reservation table.

Connections are assigned to connection groups based on a hash of their
4-tuple; wildcard sockets require special handling, and are members
of all connection groups. During a connection lookup, a
per-connection group lock is employed rather than the global pcbinfo
lock. By aligning connection groups with input path processing,
connection groups take on an effective CPU affinity, especially when
aligned with RSS work placement (see a forthcoming commit for
details). This eliminates cache line migration associated with
global, protocol-layer data structures in steady state TCP and UDP
processing (with the exception of protocol-layer statistics; further
commit to follow).

Elements of this approach were inspired by Willman, Rixner, and Cox's
2006 USENIX paper, "An Evaluation of Network Stack Parallelization
Strategies in Modern Operating Systems". However, there are also
significant differences: we maintain the inpcb lock, rather than using
the connection group lock for per-connection state.

Likewise, the focus of this implementation is alignment with NIC
packet distribution strategies such as RSS, rather than pure software
strategies. Despite that focus, software distribution is supported
through the parallel netisr implementation, and works well in
configurations where the number of hardware threads is greater than
the number of NIC input queues, such as in the RMI XLR threaded MIPS

Another important difference is the continued maintenance of existing
hash tables as "reservation tables" -- these are useful both to
distinguish the resource allocation aspect of protocol name management
and the more common-case lookup aspect. In configurations where
connection tables are aligned with hardware hashes, it is desirable to
use the traditional lookup tables for loopback or encapsulated traffic
rather than take the expense of hardware hashes that are hard to
implement efficiently in software (such as RSS Toeplitz).

Connection group support is enabled by compiling "options PCBGROUP"
into your kernel configuration; for the time being, this is an
experimental feature, and hence is not enabled by default.

Subject to the limited MFCability of change dependencies in inpcb,
and its change to the inpcbinfo init function signature, this change
in principle could be merged to FreeBSD 8.x.

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
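
A minimal sketch of assigning a connection to a group from a hash of its 4-tuple. The struct and function names are hypothetical, and the FNV-1a style mixing function is a placeholder chosen for brevity; it is not the hash used by the kernel or by RSS hardware (Toeplitz).

```c
#include <stdint.h>

/* Hypothetical IPv4 4-tuple. */
struct model_tuple {
	uint32_t laddr, faddr;
	uint16_t lport, fport;
};

/* Placeholder byte-wise FNV-1a hash over the tuple fields. */
static uint32_t
model_tuple_hash(const struct model_tuple *t)
{
	uint32_t words[3] = {
		t->laddr, t->faddr,
		((uint32_t)t->lport << 16) | t->fport
	};
	uint32_t h = 2166136261u;

	for (int i = 0; i < 3; i++) {
		for (int b = 0; b < 4; b++) {
			h ^= (words[i] >> (8 * b)) & 0xff;
			h *= 16777619u;
		}
	}
	return (h);
}

/* Map a connection to one of ngroups connection groups. */
static unsigned
model_group_index(const struct model_tuple *t, unsigned ngroups)
{
	return (ngroups == 0 ? 0 : model_tuple_hash(t) % ngroups);
}
```

Because the index depends only on the 4-tuple, every packet of a connection lands in the same group, which is what lets a per-group lock replace the global one. Wildcard sockets have no full 4-tuple and therefore need the special handling the commit message mentions.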
e9eb5d3b9cabfc492871c5e6a6b40f13063f17f9 04-Jun-2011 rwatson <rwatson@FreeBSD.org> Add _mbuf() variants of various inpcb-related interfaces, including lookup,
hash install, etc. For now, these arguments are unused, but as we add
RSS support, we will want to use hashes extracted from mbufs, rather than
manually calculated hashes of header fields, due to the expense of the
software version of Toeplitz (and similar hashes).

Add notes that it would be nice to be able to pass mbufs into lookup
routines in pf(4), optimising firewall lookup in the same way, but the
code structure there doesn't facilitate that currently.

(In principle there is no reason this couldn't be MFCed -- the change
extends rather than modifies the KBI. However, it won't be useful without
other previous possibly less MFCable changes.)

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
fdfdadb612a4b077d62094c7d4aa65de3524cf62 30-May-2011 rwatson <rwatson@FreeBSD.org> Decompose the current single inpcbinfo lock into two locks:

- The existing ipi_lock continues to protect the global inpcb list and
inpcb counter. This lock is now relegated to a small number of
allocation and free operations, and occasional operations that walk
all connections (including, awkwardly, certain UDP multicast receive
operations -- something to revisit).

- A new ipi_hash_lock protects the two inpcbinfo hash tables for
looking up connections and bound sockets, manipulated using new
INP_HASH_*() macros. This lock, combined with inpcb locks, protects
the 4-tuple address space.

Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb
connection locks, so may be acquired while manipulating a connection on
which a lock is already held, avoiding the need to acquire the inpcbinfo
lock preemptively when a binding change might later be required. As a
result, however, lookup operations necessarily go through a reference
acquire while holding the lookup lock, later acquiring an inpcb lock --
if required.

A new function in_pcblookup() looks up connections, and accepts flags
indicating how to return the inpcb. Due to lock order changes, callers
no longer need acquire locks before performing a lookup: the lookup
routine will acquire the ipi_hash_lock as needed. In the future, it will
also be able to use alternative lookup and locking strategies
transparently to callers, such as pcbgroup lookup. New lookup flags,
supplementing the existing INPLOOKUP_WILDCARD flag, are:

INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb
INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb

Callers must pass exactly one of these flags (for the time being).

Some notes:

- All protocols are updated to work within the new regime; especially,
TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely
eliminated, and global hash lock hold times are dramatically reduced
compared to previous locking.
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
- Support for reverting to the FreeBSD 7.x locking strategy in TCP input
is no longer available -- hash lookup locks are now held only very
briefly during inpcb lookup, rather than for potentially extended
periods. However, the pcbinfo ipi_lock will still be acquired if a
connection state might change such that a connection is added or
- Raw IP sockets continue to use the pcbinfo ipi_lock for protection,
due to maintaining their own hash tables.
- The interface in6_pcblookup_hash_locked() is maintained, which allows
callers to acquire hash locks and perform one or more lookups atomically
with 4-tuple allocation: this is required only for TCPv6, as there is no
in6_pcbconnect_setup(), which there should be.
- UDPv6 locking remains significantly more conservative than UDPv4
locking, which relates to source address selection. This needs
attention, as it likely significantly reduces parallelism in this code
for multithreaded socket use (such as in BIND).
- In the UDPv4 and UDPv6 multicast cases, we need to revisit locking
somewhat, as they relied on ipi_lock to stabilise 4-tuple matches, which
is no longer sufficient. A second check once the inpcb lock is held
should do the trick, keeping the general case from requiring the inpcb
lock for every inpcb visited.
- This work reminds us that we need to revisit locking of the v4/v6 flags,
which may be accessed lock-free both before and after this change.
- Right now, a single lock name is used for the pcbhash lock -- this is
undesirable, and probably another argument is required to take care of
this (or a char array name field in the pcbinfo?).

This is not an MFC candidate for 8.x due to its impact on lookup and
locking semantics. It's possible some of these issues could be worked
around with compatibility wrappers, if necessary.

Reviewed by: bz
Sponsored by: Juniper Networks, Inc.
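
The "exactly one of these flags" rule for the new lookup interface can be expressed as a small validity check. The MODEL_* flag values and the function name are hypothetical; only the rule itself comes from the commit message.

```c
#include <stdbool.h>

/* Hypothetical flag values, not those from in_pcb.h. */
#define MODEL_INPLOOKUP_WILDCARD 0x1
#define MODEL_INPLOOKUP_RLOCKPCB 0x2
#define MODEL_INPLOOKUP_WLOCKPCB 0x4

/*
 * A lookup request is well formed when exactly one of the two locking
 * flags is present; the wildcard flag may be combined with either.
 */
static bool
model_lookupflags_ok(int flags)
{
	int lock = flags &
	    (MODEL_INPLOOKUP_RLOCKPCB | MODEL_INPLOOKUP_WLOCKPCB);

	return (lock == MODEL_INPLOOKUP_RLOCKPCB ||
	    lock == MODEL_INPLOOKUP_WLOCKPCB);
}
```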
9879530ca1401176b647cf90fcd0e4ccf4f8869e 23-May-2011 attilio <attilio@FreeBSD.org> MFC
95a805600bb382ba9dc73eed13f32cd8a5de3ae7 23-May-2011 rwatson <rwatson@FreeBSD.org> Continue to refine inpcb reference counting and locking, in preparation for
reworking of inpcbinfo locking:

(1) Convert inpcb reference counting from manually manipulated integers to
the refcount(9) KPI. This allows the refcount to be managed atomically
with an inpcb read lock rather than write lock, or even with no inpcb
lock at all. As a result, in_pcbref() also no longer requires an inpcb
lock, so can be performed solely using the lock used to look up an

(2) Shift more inpcb freeing activity from the in_pcbrele() context (via
in_pcbfree_internal) to the explicit in_pcbfree() context. This means
that the inpcb refcount is increasingly used only to maintain memory
stability, not actually defer the clean up of inpcb protocol parts.
This is desirable as many of those protocol parts required the pcbinfo
lock, which we'd like not to acquire in in_pcbrele() contexts. Document
this in comments better.

(3) Introduce new read-locked and write-locked in_pcbrele() variations,
in_pcbrele_rlocked() and in_pcbrele_wlocked(), which allow the inpcb to
be properly unlocked as needed. in_pcbrele() is a wrapper around the
latter, and should probably go away at some point. This makes it
easier to use this weak reference model when holding only a read lock,
as will happen in the future.

This may well be safe to MFC, but some more KBI analysis is required.

Reviewed by: bz
MFC after: 3 weeks
Sponsored by: Juniper Networks, Inc.
56d91395adb1ef88fb6159f6751200ea542fada9 23-May-2011 rwatson <rwatson@FreeBSD.org> A number of quite incremental refinements to struct inpcbinfo's definition:

(1) Add a locking guide for inpcbinfo.
(2) Annotate inpcbinfo fields with synchronisation information; not all
annotations are 100% satisfactory.
(3) Reorder inpcbinfo fields so that the lock is at the head of the
structure, and close to fields it protects.
(4) Sort fields that will eventually be hashlock/pcbgroup-related together
even though they remain locked by ipi_lock for now.

Reviewed by: bz
Sponsored by: Juniper Networks
X-MFC after: KBI analysis required
25cccaf492f9174346c2e76cb28eefd79c356c61 20-Apr-2011 bz <bz@FreeBSD.org> MFp4 CH=191470:

Move the ipport_tick_callout and related functions from ip_input.c
to in_pcb.c. The random source port allocation code has been merged
and is now local to in_pcb.c only.
Use a SYSINIT to get the callout started and no longer depend on
initialization from the inet code, which would not work in an IPv6
only setup.

Reviewed by: gnn
Sponsored by: The FreeBSD Foundation
Sponsored by: iXsystems
MFC after: 4 days
524448845cfdfa08a56ba5dc26d7dc9ca075ec30 12-Mar-2011 bz <bz@FreeBSD.org> Merge the two identical implementations for local port selections from
in_pcbbind_setup() and in6_pcbsetport() in a single in_pcb_lport().

MFC after: 2 weeks
09f9c897d33c41618ada06fbbcf1a9b3812dee53 19-Oct-2010 jamie <jamie@FreeBSD.org> A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.
813b1149e9cb62b9d9412d1cdd65a49e0748ab4d 01-Jun-2010 rwatson <rwatson@FreeBSD.org> Merge r204806 from head to stable/8:

Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE()
to match other pcbinfo locking macros.

Approved by: re (bz)
f1216d1f0ade038907195fc114b7e630623b402c 19-Mar-2010 delphij <delphij@FreeBSD.org> Create a custom branch where I will be able to do the merge.
1fdd3bccc0217643bd336f00c0ffe11fcf22f906 14-Mar-2010 rwatson <rwatson@FreeBSD.org> Abstract out initialization of most aspects of struct inpcbinfo from
their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and
create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy()
to do this work in a central spot. As inpcbinfo becomes more complex
due to ongoing work to add connection groups, this will reduce code

MFC after: 1 month
Reviewed by: bz
Sponsored by: Juniper Networks
72ccf684117a9c31eec566ffb3d7466488a8c2e6 06-Mar-2010 rwatson <rwatson@FreeBSD.org> Wrap use of rw_try_upgrade() on pcbinfo with macro INP_INFO_TRY_UPGRADE()
to match other pcbinfo locking macros.

MFC after: 1 week
d249e9c3220f3b4c814b7bfbdfc4bdff1a2112ee 02-Aug-2009 rwatson <rwatson@FreeBSD.org> Add padding to struct inpcb, missed during our padding sweep earlier in
the release cycle.

Approved by: re (kensmith)
88f8de4d4001c74946458579ca0710df70161c90 16-Jul-2009 rwatson <rwatson@FreeBSD.org> Remove unused VNET_SET() and related macros; only VNET_GET() is
ever actually used. Rename VNET_GET() to VNET() to shorten
variable references.

Discussed with: bz, julian
Reviewed by: bz
Approved by: re (kensmith, kib)
57ca4583e728cab422fba8f15de10bd0b637b3dd 14-Jul-2009 rwatson <rwatson@FreeBSD.org> Build on Jeff Roberson's linker-set based dynamic per-CPU allocator
(DPCPU), as suggested by Peter Wemm, and implement a new per-virtual
network stack memory allocator. Modify vnet to use the allocator
instead of monolithic global container structures (vinet, ...). This
change solves many binary compatibility problems associated with
VIMAGE, and restores ELF symbols for virtualized global variables.

Each virtualized global variable exists as a "reference copy", and also
once per virtual network stack. Virtualized global variables are
tagged at compile-time, placing them in a special linker set, which is
loaded into a contiguous region of kernel memory. Virtualized global
variables in the base kernel are linked as normal, but those in modules
are copied and relocated to a reserved portion of the kernel's vnet
region with the help of the kernel linker.

Virtualized global variables exist in per-vnet memory set up when the
network stack instance is created, and are initialized statically from
the reference copy. Run-time access occurs via an accessor macro, which
converts from the current vnet and requested symbol to a per-vnet
address. When "options VIMAGE" is not compiled into the kernel, normal
global ELF symbols will be used instead and indirection is avoided.

This change restores static initialization for network stack global
variables, restores support for non-global symbols and types, eliminates
the need for many subsystem constructors, eliminates large per-subsystem
structures that caused many binary compatibility issues both for
monitoring applications (netstat) and kernel modules, removes the
per-function INIT_VNET_*() macros throughout the stack, eliminates the
need for vnet_symmap ksym(2) munging, and eliminates duplicate
definitions of virtualized globals under VIMAGE_GLOBALS.

Bump __FreeBSD_version and update UPDATING.

Portions submitted by: bz
Reviewed by: bz, zec
Discussed with: gnn, jamie, jeff, jhb, julian, sam
Suggested by: peter
Approved by: re (kensmith)
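
The "reference copy plus per-vnet copy" scheme can be imitated in userland. This sketch replaces the linker set with an ordinary struct and uses malloc() for the per-vnet region; all names (model_vnet, MODEL_VNET, the two example variables and their values) are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the linker set: all "virtualized globals" in one struct. */
struct model_vnet_data {
	int ip_defttl;		/* hypothetical example variable */
	int tcp_sendspace;	/* hypothetical example variable */
};

/* Reference copy carrying the static initial values. */
static const struct model_vnet_data model_reference = { 64, 32768 };

/* One network stack instance with its private copy of the data. */
struct model_vnet {
	void *data;
};

/* Create the per-vnet region and initialize it from the reference copy. */
static int
model_vnet_init(struct model_vnet *vnet)
{
	vnet->data = malloc(sizeof(model_reference));
	if (vnet->data == NULL)
		return (-1);
	memcpy(vnet->data, &model_reference, sizeof(model_reference));
	return (0);
}

/* Accessor: per-vnet base address plus the offset of the symbol. */
#define MODEL_VNET(vnet, type, field)					\
	(*(type *)((char *)(vnet)->data +				\
	    offsetof(struct model_vnet_data, field)))
```

Writing through MODEL_VNET() on one instance leaves the other instances and the reference copy untouched, which is the property the accessor macro described above provides.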
0808d0b1a67c8f05c240f53f38787bd0ab1209dd 23-Jun-2009 bz <bz@FreeBSD.org> After cleaning up rt_tables from vnet.h and cleaning up opt_route.h
a lot of files no longer need route.h either. Garbage collect them.
While here remove now unneeded vnet.h #includes as well.
5243d2d206ac372ee679c11bde715a4a4f2f93fd 01-Jun-2009 pjd <pjd@FreeBSD.org> - Rename IP_NONLOCALOK IP socket option to IP_BINDANY, to be more consistent
with OpenBSD (and BSD/OS originally). We can't easily make it a SOL_SOCKET option
as there is no more space for more SOL_SOCKET options, but this option also
fits better as an IP socket option, it seems.
- Implement this functionality also for IPv6 and RAW IP sockets.
- Always compile it in (don't use additional kernel options).
- Remove sysctl to turn this functionality on and off.
- Introduce new privilege - PRIV_NETINET_BINDANY, which allows use of this
functionality (currently only unjailed root can use it).

Discussed with: julian, adrian, jhb, rwatson, kmacy
a17331003b62d651244b335330ed0202c58d376e 14-May-2009 rwatson <rwatson@FreeBSD.org> Staticize two functions not used outside of in_pcb.c: in_pcbremlists() and

MFC after: 1 month
39b6dc8ba2de1c81754454858aae4fc4b706bdbf 30-Apr-2009 zec <zec@FreeBSD.org> Permit building kernels with options VIMAGE, restricted to only a single
active network stack instance. Turning on options VIMAGE at compile
time yields the following changes relative to default kernel build:

1) V_ accessor macros for virtualized variables resolve to structure
fields via base pointers, instead of being resolved as fields in global
structs or plain global variables. As an example, V_ifnet becomes:

options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet
default build: vnet_net_0._ifnet
options VIMAGE_GLOBALS: ifnet

2) INIT_VNET_* macros will declare and set up base pointers to be used
by V_ accessor macros, instead of resolving to whitespace:

INIT_VNET_NET(ifp->if_vnet); becomes

struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET];

3) Memory for vnet modules registered via vnet_mod_register() is now
allocated at run time in sys/kern/kern_vimage.c, instead of per vnet
module structs being declared as globals. If required, vnet modules
can now request the framework to provide them with allocated bzeroed
memory by filling in the vmi_size field in their vmi_modinfo structures.

4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are
extended to hold a pointer to the parent vnet. options VIMAGE builds
will fill in those fields as required.

5) curvnet is introduced as a new global variable in options VIMAGE
builds, always pointing to the default and only struct vnet.

6) struct sysctl_oid has been extended with two additional fields to
store major and minor virtualization module identifiers, oid_v_subs and
oid_v_mod. SYSCTL_V_* family of macros will fill in those fields
accordingly, and store the offset in the appropriate vnet container
struct in oid_arg1.
In sysctl handlers dealing with virtualized sysctls, the
SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target
variable and make it available in arg1 variable for further processing.

Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have
been deleted.

Reviewed by: bz, rwatson
Approved by: julian (mentor)
84c8e0e7557bd9f9cd5433330c9c0c6562382c39 16-Apr-2009 kmacy <kmacy@FreeBSD.org> s/void/void */
f35899195a21305052f137d1b95e2e56cfc429a4 16-Apr-2009 kmacy <kmacy@FreeBSD.org> restore spare pointers for MFCing
5767ab208711618fa55ff91914391b26324023cf 15-Apr-2009 kmacy <kmacy@FreeBSD.org> - convert pspare pointers in inpcb to an llentry and rtentry cache
- add flags to indicate their validity
a895457456e69a2ae8694cbe5dea3d25586103ee 15-Apr-2009 kmacy <kmacy@FreeBSD.org> - add second flags field to inpcb
- update comments in vflag
d35e8eca0371060b928d2555f07b152e2f32143b 15-Apr-2009 kmacy <kmacy@FreeBSD.org> provide additional convenience macros for inpcb locking (upgrade, downgrade, exclusive)
fbd36468425151a258c7a90df46c0de1a069a137 10-Apr-2009 kmacy <kmacy@FreeBSD.org> Import "flowid" support for serializing flows across transmit queues

Reviewed by: rwatson and jeli
038bfe209eeb1f951b217069a584edbcc92d0f2c 15-Mar-2009 rwatson <rwatson@FreeBSD.org> Correct a number of evolved problems with inp_vflag and inp_flags:
certain flags that should have been in inp_flags ended up in inp_vflag,
meaning that they were inconsistently locked, and in one case,
interpreted. Move the following flags from inp_vflag to gaps in the
inp_flags space (and clean up the inp_flags constants to make gaps
more obvious to future takers):


Some aspects of this change have no effect on kernel ABI at all, as these
are UDP/TCP/IP-internal uses; however, netstat and sockstat detect
INP_TIMEWAIT when listing TCP sockets, so any MFC will need to take this
into account.

MFC after: 1 week (or after dependencies are MFC'd)
Reviewed by: bz
fae6f1ab823ff29d6165da7eb2dcae9534b1428c 11-Mar-2009 rwatson <rwatson@FreeBSD.org> Add INP_INHASHLIST flag for inpcb->inp_flags to indicate whether
or not the inpcb is currently on various hash lookup lists, rather
than using (lport != 0) to detect this. This means that the full
4-tuple of a connection can be retained after close, which should
lead to more sensible netstat output in the window between TCP
close and socket close.

MFC after: 2 weeks
f0bf25503d9702e1e95bb364d811c1243d538b61 10-Mar-2009 rwatson <rwatson@FreeBSD.org> Remove unused v6 macro aliases for inpcb fields:


Remove unused v6 macro aliases for inpcb flags:


References to in6p_lport and in6_fport in sockstat are also replaced with
normal inp_lport and inp_fport references.

MFC after: 3 days
Reviewed by: bz
d085935615c53a5ff051ca4ba0309c43ea364749 10-Mar-2009 rwatson <rwatson@FreeBSD.org> Remove now-unused INP_UNMAPPABLEOPTS.

MFC after: 3 days
Discussed with: bz
e2eee65f2168a3fcb7a12e27d463de4003f878c8 09-Jan-2009 adrian <adrian@FreeBSD.org> Implement a new IP option (not compiled/enabled by default) to allow
applications to specify a non-local IP address when bind()'ing a socket
to a local endpoint.

This allows applications to spoof the client IP address of connections
if (obviously!) they somehow are able to receive the traffic normally
destined to said clients.

This patch doesn't include any changes to ipfw or the bridging code to
redirect the client traffic through the PCB checks so TCP gets a shot
at it. The normal behaviour is that packets with a non-local destination
IP address are not handled locally. This can be dealt with using some IPFW hackery;
modifications to IPFW to make this less hacky will occur in subsequent

Thanks to Julian Elischer and others at Ironport. This work was approved
and donated before Cisco acquired them.

Obtained from: Julian Elischer and others
MFC after: 2 weeks
d0cece42b9e5bb36b8b2473caed8df147b13c9c5 23-Dec-2008 kmacy <kmacy@FreeBSD.org> IF_RELENG7 185850:186420

merge latest from 7 stable
b1db56aa98b15cfd499cef9616516622f29ad0b4 17-Dec-2008 bz <bz@FreeBSD.org> Another step assimilating IPv[46] PCB code:
normalize IN6P_* compat flags usage to their equivalent
INP_* counterpart.

Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks
ea0d9d2e9af995a203d1871d5aded293a98f5d68 17-Dec-2008 bz <bz@FreeBSD.org> Use inc_flags instead of the inc_isipv6 alias which so far
had been the only flag with random usage patterns.
Switch inc_flags to be used as a real bit field by using
INC_ISIPV6 with bitops to check for the 'isipv6' condition.

While here fix a place or two where in case of v4 inc_flags
were not properly initialized before.[1]

Found by: rwatson during review [1]
Discussed with: rwatson
Reviewed by: rwatson
MFC after: 4 weeks
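
Treating inc_flags as a real bit field, as described above, amounts to setting and testing one bit. The MODEL_INC_ISIPV6 value, the reduced struct and the helper names are hypothetical; the point is only that the field is always initialized and is tested with a mask rather than read as a boolean alias.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_INC_ISIPV6 0x01	/* hypothetical bit value */

/* Hypothetical, reduced connection info holding only the flags byte. */
struct model_conninfo {
	uint8_t inc_flags;
};

/* Set or clear the IPv6 bit without disturbing other flag bits. */
static void
model_set_ipv6(struct model_conninfo *inc, bool isipv6)
{
	if (isipv6)
		inc->inc_flags |= MODEL_INC_ISIPV6;
	else
		inc->inc_flags &= (uint8_t)~MODEL_INC_ISIPV6;
}

/* Test the 'isipv6' condition with a bit operation. */
static bool
model_is_ipv6(const struct model_conninfo *inc)
{
	return ((inc->inc_flags & MODEL_INC_ISIPV6) != 0);
}
```

Clearing the bit explicitly in the IPv4 case mirrors the fix noted in the commit, where inc_flags had not always been initialized for v4.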
83a32f8750135cc82be5727b54cb42797923009b 11-Dec-2008 bz <bz@FreeBSD.org> Put global variables, which were virtualized but formerly
missed, under VIMAGE_GLOBAL.

Start putting the extern declarations of the virtualized globals
under VIMAGE_GLOBAL as the globals themselves are already.
This will help by the time when we are going to remove the globals

While there garbage collect a few dead externs from ip6_var.h.

Sponsored by: The FreeBSD Foundation
6f7ed797d320c33ec8c210b24fb80ef01c5e5f13 10-Dec-2008 kmacy <kmacy@FreeBSD.org> IF_RELENG7 184527:185849
7424137d3ce93d16fd9fc4f65c121198eca76d7e 09-Dec-2008 rwatson <rwatson@FreeBSD.org> Update comment on INP_TIMEWAIT to say what it's about, as we caution
regarding the misplacement of flags in inp_vflag in an earlier comment.

MFC after: pretty soon
6fc2ee1cb035510c1d55b6c664a8e463df5f06f0 09-Dec-2008 rwatson <rwatson@FreeBSD.org> Move macros defining flags and shortcuts to nested structure fields in
inpcbinfo below the structure definition in order to make inpcbinfo
fit on a single printed page; related style tweaks.

MFC after: pretty soon
9010318345e58def3a559ec0d3682e75e3f9cc56 08-Dec-2008 rwatson <rwatson@FreeBSD.org> Add a reference count to struct inpcb, which may be explicitly
incremented using in_pcbref(), and decremented using in_pcbfree()
or in_pcbrele(). Protocols using only current in_pcballoc() and
in_pcbfree() calls will see the same semantics, but it is now
possible for TCP to call in_pcbref() and in_pcbrele() to prevent
an inpcb from being freed when both tcbinfo and per-inpcb locks
are released. This makes it possible to safely transition from
holding only the inpcb lock to both tcbinfo and inpcb lock
without re-looking up a connection in the input path, timer
path, etc.

Notice that in_pcbrele() does not unlock the connection after
decrementing the refcount, if the connection remains, so that
the caller can continue to use it; in_pcbrele() returns a flag
indicating whether or not the inpcb pointer is still valid, and
in_pcbfree() is now a simple wrapper around in_pcbrele().
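
The reference-counting pattern described above can be sketched in
userland C as follows. This is only an illustration: all names ending
in _sketch are hypothetical, locking is omitted entirely, and the
polarity of the return value (true meaning "freed, pointer no longer
valid") is an assumption rather than the kernel's actual convention.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative stand-in for struct inpcb; only the refcount is modeled. */
struct pcb_sketch {
	unsigned int	refcount;
};

/* Allocate with one reference held by the creator. */
static struct pcb_sketch *
pcb_alloc_sketch(void)
{
	struct pcb_sketch *p = calloc(1, sizeof(*p));

	if (p != NULL)
		p->refcount = 1;
	return (p);
}

/* Take an extra reference so the object survives lock drops. */
static void
pcb_ref_sketch(struct pcb_sketch *p)
{
	p->refcount++;
}

/*
 * Drop one reference.  Returns true if this was the last reference and
 * the object was freed (the caller must not touch p again), false if
 * the object remains and the caller may continue to use it.
 */
static bool
pcb_rele_sketch(struct pcb_sketch *p)
{
	assert(p->refcount > 0);
	if (--p->refcount > 0)
		return (false);
	free(p);
	return (true);
}

/* "Free" is just a release whose result the caller ignores. */
static void
pcb_free_sketch(struct pcb_sketch *p)
{
	(void)pcb_rele_sketch(p);
}
```

A caller that only ever pairs allocation with pcb_free_sketch() sees
the old semantics, since the single initial reference is dropped and
the object is freed immediately.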

MFC after: 1 month
Discussed with: bz, kmacy
Reviewed by: bz, gnn, kmacy
Tested by: kmacy
19b6af98ec71398e77874582eb84ec5310c7156f 22-Nov-2008 dfr <dfr@FreeBSD.org> Clone Kip's Xen on stable/6 tree so that I can work on improving FreeBSD/amd64
performance in Xen's HVM mode.
815d52c5df6a76286604478e5223d2f2c87b2c04 19-Nov-2008 zec <zec@FreeBSD.org> Change the initialization methodology for global variables scheduled
for virtualization.

Instead of initializing the affected global variables at instantiation,
assign initial values to them in initializer functions. As a rule,
initialization at instantiation for such variables should never be
introduced again from now on. Furthermore, enclose all instantiations
of such global variables in #ifdef VIMAGE_GLOBALS blocks.

Essentially, this change should have zero functional impact. In the next
phase of merging network stack virtualization infrastructure from
p4/vimage branch, the new initialization methodology will allow us to
switch between using global variables and their counterparts residing in
virtualization containers with minimum code churn, and in the long run
allow us to initialize multiple instances of such container structures.

Discussed at: devsummit Strassburg
Reviewed by: bz, julian
Approved by: julian (mentor)
Obtained from: //depot/projects/vimage-commit2/...
X-MFC after: never
Sponsored by: NLnet Foundation, The FreeBSD Foundation
27c09cdcec9ef5edfa7f22d40d6925ab4e1b7f75 01-Nov-2008 kmacy <kmacy@FreeBSD.org> IF_RELENG7 183757:184526
0991899a98572c9cfb8466895fc7966cdd80fa66 20-Oct-2008 bz <bz@FreeBSD.org> Bring over the change switching from using sequential to random
ephemeral port allocation as implemented in netinet/in_pcb.c rev. 1.143
(initially from OpenBSD) and follow-up commits during the last four and
a half years including rev. 1.157, 1.162 and 1.199.
This now is relying on the same infrastructure as has been implemented
in in_pcb.c since rev. 1.199.

Reviewed by: silby, rpaulo, mlaier
MFC after: 2 months
cf5320822f93810742e3d4a1ac8202db8482e633 19-Oct-2008 lulf <lulf@FreeBSD.org> - Import the HEAD csup code which is the basis for the cvsmode work.
77f80e067299bffd274515d2db4724fd7d914bec 04-Oct-2008 bz <bz@FreeBSD.org> Cache so_cred as inp_cred in the inpcb.
This means that inp_cred is always there, even after the socket
has gone away. It also means that it is constant for the lifetime
of the inp.
Both facts lead to simpler code and possibly less locking.

Suggested by: rwatson
Reviewed by: rwatson
MFC after: 6 weeks
X-MFC Note: use a inp_pspare for inp_cred
4102fcb55a1a0ea3ac81004e40b16a685851174b 02-Oct-2008 rwatson <rwatson@FreeBSD.org> Merge r183460 and r183461 from head to stable/7:

Fix typo in comment.

Expand comments relating various detach/free/drop inpcb routines.

Approved by: re (kib)
402e931839664f52e93554a1f23e3a7b7dafa04d 29-Sep-2008 rwatson <rwatson@FreeBSD.org> Fix typo in comment.

MFC after: 3 days
e7d522f44036542b4f281e30911cae0cf7fb6241 01-Sep-2008 kmacy <kmacy@FreeBSD.org> MFC 177530:
Insulate inpcb consumers outside the stack from the lock type and offset within the pcb by adding accessor functions.
0a0ca493dd48297d8cdb9d051e7c05d731e61a4c 18-Aug-2008 rwatson <rwatson@FreeBSD.org> Merge r178285, r178318, r178319, r178320, r178321, r178322, r178325,
r178376, r178377, r178378, r178419, r179412, r179414, r180127, r180338,
r180343, r180344, r180346, r180348, r180368, r180422, r180429, r180536,
r180558, r180589, r181364, r181365 from head to stable/7:

Introduce and use rwlocks throughout the inpcbinfo and inpcb infrastructure,
and protocols that depend on that infrastructure, including UDP, TCP, and
IP raw sockets. Significant parts of this work were reviewed by Bjoern Zeeb,
and tested by Paul Saab, Kris Kennaway, and George Neville-Neil, whose
contributions to this work are greatly appreciated.

Tested by: ps, kris, gnn, Mike Tancsa <mike at sentex dot net>
Reviewed by: bz, des
f9ebb230ca1b1ab7963cd38728716b73299417dc 07-Aug-2008 rwatson <rwatson@FreeBSD.org> Minor white space tweaks.

MFC after: 1 week
41a2b3d77885c9a3d342b54377400116a3bf0ee9 02-Aug-2008 bz <bz@FreeBSD.org> MFC: r180427,
cvs rev. 1.209 in_pcb.c, 1.109 in_pcb.h
cvs rev. 1.93 in6_pcb.c, 1.22 in6_pcb.h, 1.54 in6_src.c

Pass the ucred along into in{,6}_pcblookup_local for upcoming
prison checks.
a6e52bf81b7c56274190da4f5c8558e6b45310d7 02-Aug-2008 bz <bz@FreeBSD.org> MFC: r180425,
cvs rev. 1.208 in_pcb.c, 1.108 in_pcb.h, 1.92 in6_pcb.c, 1.21 in6_pcb.h

For consistency take lport as u_short in in{,6}_pcblookup_local.
All callers either pass in an u_short or u_int16_t.
eec763b9607cc41b8ae0f622f506cbf8332d15f2 31-Jul-2008 kmacy <kmacy@FreeBSD.org> MFC inp accessor functions
c6ea9ce3a54532e261ad579528381ae724921de1 30-Jul-2008 kmacy <kmacy@FreeBSD.org> add INP_W(UN)LOCK forward compat macros
dc8d54c205784683ec1aae7ecf1f24fe1f6cb2c0 24-Jul-2008 julian <julian@FreeBSD.org> MFC an ABI compatible implementation of Multiple routing tables.
See the commit message for
version 1.129 (svn change # 178888) for more info.

Obtained from: Ironport (Cisco Systems)
d4098f774ebbce6aed8ba093eccf3f26eb13faef 22-Jul-2008 avatar <avatar@FreeBSD.org> Trying to fix compilation bustage:
- removing 'const' qualifier from an input parameter to conform to the type
required by rw_assert();
- using in_addr->s_addr to retrieve the 32-bit address value.

Observed by: tinderbox
887a78e4a470f3696639fe9f59dae6116cd858d3 21-Jul-2008 kmacy <kmacy@FreeBSD.org> make new accessor functions consistent with existing style
d24f4bd48af063b31a11caa2939d7cb9e7cb791e 21-Jul-2008 kmacy <kmacy@FreeBSD.org> add inpcb accessor functions for fields needed by TOE devices
6d9661b22428319f8d87375a3d6f69afda13469a 15-Jul-2008 rwatson <rwatson@FreeBSD.org> Merge last of a series of rwlock conversion changes to UDP, which
completes the move to a fully parallel UDP transmit path by using
global read, rather than write, locking of inpcbinfo in further
semi-connected cases:

- Add macros to allow try-locking of inpcb and inpcbinfo.
- Always acquire an inpcb read lock in udp_output(), which stabilizes the
local inpcb address and port bindings in order to determine what further
locking is required:
- If the inpcb is currently not bound (at all) and we are implicitly
connecting, we require inpcbinfo and inpcb write locks, so drop the
read lock and re-acquire.
- If the inpcb is bound for at least one of the port or address, but an
explicit source or destination is requested, trylock the inpcbinfo
lock, and if that fails, drop the inpcb lock, lock the global lock,
and relock the inpcb lock.
- Otherwise, no further locking is required (common case).
- Update comments.
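
The decision tree in the list above can be sketched as a pure function.
The enum, the structure, and the three boolean inputs are hypothetical
simplifications of the state udp_output() inspects; no real locks are
taken here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What extra locking a send needs beyond the initial inpcb read lock. */
enum udp_lock_plan {
	PLAN_READ_ONLY,		/* common case: no further locking */
	PLAN_TRY_INFO_LOCK,	/* trylock inpcbinfo; fall back to drop+relock */
	PLAN_WRITE_BOTH		/* drop read lock, take both write locks */
};

/* Hypothetical summary of the state inspected under the read lock. */
struct udp_send_state {
	bool	bound;			/* port or address already bound */
	bool	implicit_connect;	/* unbound send with a destination */
	bool	explicit_addr;		/* explicit source or destination */
};

static enum udp_lock_plan
udp_plan_locks(const struct udp_send_state *s)
{
	assert(s != NULL);
	if (!s->bound && s->implicit_connect)
		return (PLAN_WRITE_BOTH);
	if (s->bound && s->explicit_addr)
		return (PLAN_TRY_INFO_LOCK);
	return (PLAN_READ_ONLY);
}
```

Only the first two plans touch the global inpcbinfo lock, which is why
a bound, connected socket can transmit in parallel with others.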

In practice, this means that the vast majority of consumers of UDP sockets
will not acquire any exclusive locks at the socket or UDP levels of the
network stack. This leads to a marked performance improvement in several
important workloads, including BIND, nsd, and memcached over UDP, as well
as significant improvements in pps microbenchmarks.

The plan is to MFC all of the rwlock changes to RELENG_7 once they have
settled for a few weeks in the tree.

Tested by: ps, kris (older revision), bde
MFC after: 3 weeks
362cb79214c7ca17170d1530859aaf180e618d75 10-Jul-2008 bz <bz@FreeBSD.org> Pass the ucred along into in{,6}_pcblookup_local for upcoming
prison checks.

Reviewed by: rwatson
4b9bb0069f0a7dd4fc5ebacc519965f51c558c9d 10-Jul-2008 bz <bz@FreeBSD.org> For consistency take lport as u_short in in{,6}_pcblookup_local.
All callers either pass in an u_short or u_int16_t.

Reviewed by: rwatson
e31c8aa8e526df26ff15a5aab776940a6ebaf2aa 08-Jul-2008 rwatson <rwatson@FreeBSD.org> Provide some initial chicken-scratching annotations of locking for
struct inpcb.

Prodded by: bz
MFC after: 3 days
1dfc5c98a4f7c32163dfdc61e390ccf805385108 09-May-2008 julian <julian@FreeBSD.org> Add code to allow the system to handle multiple routing tables.
This particular implementation is designed to be fully backwards compatible
and to be MFC-able to 7.x (and 6.x)

Currently the only protocol that can make use of the multiple tables is IPv4
Similar functionality exists in OpenBSD and Linux.

From my notes:


One thing where FreeBSD has been falling behind, and which by chance I
have some time to work on is "policy based routing", which allows
packet streams to be routed by more than just the destination address.


I want to make some form of this available in the 6.x tree
(and by extension 7.x) , but FreeBSD in general needs it so I might as
well do it in -current and back port the portions I need.

One of the ways that this can be done is to have the ability to
instantiate multiple kernel routing tables (which I will now
refer to as "Forwarding Information Bases" or "FIBs" for political
correctness reasons). Which FIB a particular packet uses to make
the next hop decision can be decided by a number of mechanisms.
The policies these mechanisms implement are the "Policies" referred
to in "Policy based routing".

One of the constraints I have if I try to back port this work to
6.x is that it must be implemented as a EXTENSION to the existing
ABIs in 6.x so that third party applications do not need to be
recompiled in the timespan of the branch.

This first version will not have some of the bells and whistles that
will come with later versions. It will, for example, be limited to 16
tables in the first commit.
Implementation method, Compatible version. (part 1)
For this reason I have implemented a "sufficient subset" of a
multiple routing table solution in Perforce, and back-ported it
to 6.x. (also in Perforce though not always caught up with what I
have done in -current/P4). The subset allows a number of FIBs
to be defined at compile time (8 is sufficient for my purposes in 6.x)
and implements the changes needed to allow IPV4 to use them. I have not
done the changes for ipv6 simply because I do not need it, and I do not
have enough knowledge of ipv6 (e.g. neighbor discovery) needed to do it.

Other protocol families are left untouched and should there be
users with proprietary protocol families, they should continue to work
and be oblivious to the existence of the extra FIBs.

To understand how this is done, one must know that the current FIB
code starts everything off with a single dimensional array of
pointers to FIB head structures (One per protocol family), each of
which in turn points to the trie of routes available to that family.

The basic change in the ABI compatible version of the change is to
extend that array to be a 2 dimensional array, so that
instead of protocol family X looking at rt_tables[X] for the
table it needs, it looks at rt_tables[Y][X], where for all
protocol families except ipv4 Y is always 0.
Code that is unaware of the change always just sees the first row
of the table, which of course looks just like the one dimensional
array that existed before.
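
A minimal sketch of the two-dimensional layout described above. The
array name, the family count, and the IPv4 family index are assumptions
for illustration; only the [fib][family] indexing and the "row 0 for
everything except IPv4" rule come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_FIBS_SKETCH		16	/* first commit is limited to 16 tables */
#define NUM_FAMILIES_SKETCH	4	/* hypothetical family count */
#define FAM_INET_SKETCH		2	/* hypothetical index for IPv4 */

/* Illustrative stand-in for a per-family FIB head. */
struct fib_head_sketch {
	int	unused;
};

/* Row = FIB number (Y), column = protocol family (X). */
static struct fib_head_sketch *
    rt_tables_sketch[NUM_FIBS_SKETCH][NUM_FAMILIES_SKETCH];

/* Legacy callers only ever see row 0, i.e. the old 1-D array. */
static struct fib_head_sketch **
legacy_slot(int family)
{
	assert(family >= 0 && family < NUM_FAMILIES_SKETCH);
	return (&rt_tables_sketch[0][family]);
}

/* FIB-aware lookup: every family except IPv4 is forced to row 0. */
static struct fib_head_sketch **
fib_slot(int fib, int family)
{
	assert(fib >= 0 && fib < NUM_FIBS_SKETCH);
	assert(family >= 0 && family < NUM_FAMILIES_SKETCH);
	if (family != FAM_INET_SKETCH)
		fib = 0;
	return (&rt_tables_sketch[fib][family]);
}
```

Because row 0 occupies the same slots the one-dimensional array did,
code that is unaware of the extra dimension keeps working unchanged.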

The entry points rtrequest(), rtalloc(), rtalloc1(), rtalloc_ign()
are all maintained, but refer only to the first row of the array,
so that existing callers in proprietary protocols can continue to
do the "right thing".
Some new entry points are added, for the exclusive use of ipv4 code
called in_rtrequest(), in_rtalloc(), in_rtalloc1() and in_rtalloc_ign(),
which have an extra argument which refers the code to the correct row.

In addition, there are some new entry points (currently called
rtalloc_fib() and friends) that check the Address family being
looked up and call either rtalloc() (and friends) if the protocol
is not IPv4 forcing the action to row 0 or to the appropriate row
if it IS IPv4 (and that info is available). These are for calling
from code that is not specific to any particular protocol. The way
these are implemented would change in the non ABI preserving code
to be added later.

One feature of the first version of the code is that for ipv4,
the interface routes show up automatically on all the FIBs, so
that no matter what FIB you select you always have the basic
direct attached hosts available to you. (rtinit() does this

You CAN delete an interface route from one FIB should you want
to but by default it's there. ARP information is also available
in each FIB. It's assumed that the same machine would have the
same MAC address, regardless of which FIB you are using to get
to it.

This brings us to how the correct FIB is selected for an outgoing
IPV4 packet.

Firstly, all packets have a FIB associated with them. If nothing
has been done to change it, it will be FIB 0. The FIB is changed
in the following ways.

Packets fall into one of a number of classes.

1/ locally generated packets, coming from a socket/PCB.
Such packets select a FIB from a number associated with the
socket/PCB. This in turn is inherited from the process,
but can be changed by a socket option. The process in turn
inherits it on fork. I have written a utility called setfib
that acts a bit like nice.

setfib -3 ping target.example.com # will use fib 3 for ping.

It is an obvious extension to make it a property of a jail
but I have not done so. It can be achieved by combining the setfib and
jail commands.

2/ packets received on an interface for forwarding.
By default these packets would use table 0,
(or possibly a number settable in a sysctl(not yet)).
but prior to routing the firewall can inspect them (see below).
(possibly in the future you may be able to associate a FIB
with packets received on an interface.. An ifconfig arg, but not yet.)

3/ packets inspected by a packet classifier, which can arbitrarily
associate a fib with it on a packet by packet basis.
A fib assigned to a packet by a packet classifier
(such as ipfw) would over-ride a fib associated by
a more default source. (such as cases 1 or 2).

4/ a tcp listen socket associated with a fib will generate
accept sockets that are associated with that same fib.

5/ Packets generated in response to some other packet (e.g. reset
or icmp packets). These should use the FIB associated with the
packet being responded to.

6/ Packets generated during encapsulation.
gif, tun and other tunnel interfaces will encapsulate using the FIB
that was in effect with the process that set up the tunnel.
Thus setfib 1 ifconfig gif0 [tunnel instructions]
will set the fib for the tunnel to use to be fib 1.
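
The precedence between cases 1 to 3 above (a classifier-assigned FIB
overrides the one inherited from the socket or the forwarding default)
can be sketched as follows. FIB_NONE_SKETCH and the function name are
hypothetical; the kernel does not necessarily represent "no override"
this way.

```c
#include <assert.h>

/* Hypothetical marker: the packet classifier assigned no FIB. */
#define FIB_NONE_SKETCH	(-1)

/*
 * origin_fib is the FIB from the packet's source: the socket/PCB for
 * locally generated packets, or 0 for packets received for forwarding.
 * classifier_fib is the FIB set by a classifier such as ipfw, if any.
 */
static int
effective_fib(int origin_fib, int classifier_fib)
{
	assert(origin_fib >= 0);
	if (classifier_fib != FIB_NONE_SKETCH)
		return (classifier_fib);
	return (origin_fib);
}
```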

Routing messages would be associated with their
process, and thus select one FIB or another.
messages from the kernel would be associated with the fib they
refer to and would only be received by a routing socket associated
with that fib. (not yet implemented)

In addition Netstat has been edited to be able to cope with the
fact that the array is now 2 dimensional. (It looks in system
memory using libkvm (!)). Old versions of netstat see only the first FIB.

In addition two sysctls are added to give:
a) the number of FIBs compiled in (active)
b) the default FIB of the calling process.

Early testing experience:

Basically our (IronPort's) appliance does this functionality already
using ipfw fwd but that method has some drawbacks.

For example,
It can't fully simulate a routing table because it can't influence the
socket's choice of local address when a connect() is done.

Testing during the generating of these changes has been
remarkably smooth so far. Multiple tables have co-existed
with no notable side effects, and packets have been routes

ipfw has grown 2 new keywords:

setfib N ip from any to any
count ip from any to any fib N

In pf there seems to be a requirement to be able to give symbolic names to the
fibs but I do not have that capacity. I am not sure if it is required.

SCTP has interestingly enough built in support for this, called VRFs
in Cisco parlance. It will be interesting to see how that handles it
when it suddenly actually does something.

Where to next:

After committing the ABI compatible version and MFCing it, I'd
like to proceed in a forward direction in -current. This will
result in some roto-tilling in the routing code.

Firstly: the current code's idea of having a separate tree per
protocol family, all of the same format, and pointed to by the
1 dimensional array is a bit silly. Especially when one considers that
there is code that makes assumptions about every protocol having the
same internal structures there. Some protocols don't WANT that
sort of structure. (for example the whole idea of a netmask is foreign
to appletalk). This needs to be made opaque to the external code.

My suggested first change is to add routing method pointers to the
'domain' structure, along with information pointing to the data.
Instead of having an array of pointers to uniform structures,
there would be an array pointing to the 'domain' structures
for each protocol address domain (protocol family),
and the methods this reached would be called. The methods would have
an argument that gives FIB number, but the protocol would be free
to ignore it.

When the ABI can be changed it raises the possibility of the
addition of a fib entry into the "struct route". Currently,
the structure contains the sockaddr of the destination, and the resulting
fib entry. To make this work fully, one could add a fib number
so that given an address and a fib, one can find the third element, the
fib entry.

Interaction with the ARP layer/ LL layer would need to be
revisited as well. Qing Li has been working on this already.

This work was sponsored by Ironport Systems/Cisco

Reviewed by: several including rwatson, bz and mlaier (parts each)
Obtained from: Ironport systems/Cisco
ca47fccd6b260693108c5ee5634bd0e011c67f5e 17-Apr-2008 rwatson <rwatson@FreeBSD.org> Convert pcbinfo and inpcb mutexes to rwlocks, and modify macros to
explicitly select write locking for all use of the inpcb mutex.
Update some pcbinfo lock assertions to assert locked rather than
write-locked, although in practice almost all uses of the pcbinfo
rwlock remain exclusive, and all instances of inpcb lock acquisition
are exclusive.

This change should introduce (ideally) little functional change.
However, it lays the groundwork for significantly increased
parallelism in the TCP/IP code.

MFC after: 3 months
Tested by: kris (superset of committed patch)
59f40fe008f4f66b473a637491209c6601043f6d 24-Mar-2008 kmacy <kmacy@FreeBSD.org> change inp_wlock_assert to inp_lock_assert
08877248a38446e0ce22265220ccaa032ea39840 24-Mar-2008 kmacy <kmacy@FreeBSD.org> Label inp as unused in the non-INVARIANTS case
fb74f62b24c43c7cc8805f33c598d246a929f9f0 23-Mar-2008 kmacy <kmacy@FreeBSD.org> Insulate inpcb consumers outside the stack from the lock type and offset within the pcb by adding accessor functions.

Reviewed by: rwatson
MFC after: 3 weeks
a5002e6b8597d3911ca2106e12147c870fd0fbbc 07-Dec-2007 kmacy <kmacy@FreeBSD.org> Add padding for anticipated functionality
- vimage
- multiq
- host rtentry caching

Rename spare used by 80211 to if_llsoftc

Reviewed by: rwatson, gnn
Approved by: re(gnn)
12b5f9c8c99a01b1d40e88aaa1a58ce757e68d5e 07-Dec-2007 kmacy <kmacy@FreeBSD.org> Add padding for anticipated functionality
- vimage
- multiq
- host rtentry caching

Rename spare used by 80211 to if_llsoftc

Reviewed by: rwatson, gnn
MFC after: 1 day
23574c86734ab5cb088584d30345e698cbbeaef2 06-Aug-2007 rwatson <rwatson@FreeBSD.org> Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminating
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for

Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)
0cd74db89b7c7ca5bface8b05ae8263c0a54217b 01-Jul-2007 gnn <gnn@FreeBSD.org> Commit IPv6 support for FAST_IPSEC to the tree.
This commit includes only the kernel files, the rest of the files
will follow in a second commit.

Reviewed by: bz
Approved by: re
Supported by: Secure Computing
47d37a80be0931ad72e67db6ba915221afdfeb4f 11-May-2007 rwatson <rwatson@FreeBSD.org> Reduce network stack oddness: implement .pru_sockaddr and .pru_peeraddr
protocol entry points using functions named proto_getsockaddr and
proto_getpeeraddr rather than proto_setsockaddr and proto_setpeeraddr.
While it's true that sockaddrs are allocated and set, the net effect is
to retrieve (get) the socket address or peer address from a socket, not
set it, so align names to that intent.
a9656f2df2346810815a1d1181ab96c2aba09a45 01-May-2007 rwatson <rwatson@FreeBSD.org> Remove unused pcbinfo arguments to in_setsockaddr() and
c27ef034149379823d3f1a1ed091d766fe5abb17 30-Apr-2007 rwatson <rwatson@FreeBSD.org> Rename some fields of struct inpcbinfo to have the ipi_ prefix,
consistent with the naming of other structure field members, and
reducing improper grep matches. Clean up and comment structure
fields in structure definition.
82cdcabbd175ee88a1f8ed9d5aa090cf4b00e96a 04-Apr-2007 andre <andre@FreeBSD.org> Add INP_INFO_UNLOCK_ASSERT() and use it in tcp_input(). Also add some
further INP_INFO_WLOCK_ASSERT() while there.
890976965dc9f0e43608a4ef1ea5fc2e9fcd17ea 04-Apr-2007 andre <andre@FreeBSD.org> Some local and style(9) cleanups.
5723c5ca71375f8a43292be1432a69696c3efe62 28-Mar-2007 rwatson <rwatson@FreeBSD.org> Remove stale comment about not enabling inpcb and inpcbinfo lock assertions
when IPv6 is enabled.

MFC after: 3 days
6a5d54ffd20da2b170bed379fd7faebe76a1a71a 17-Feb-2007 rwatson <rwatson@FreeBSD.org> Add "show inpcb", "show tcpcb" DDB commands, which should come in handy
for debugging sblock and other network panics.
939d61a17e2c48529a105f9f0b43305f8bd1178e 16-Feb-2007 rwatson <rwatson@FreeBSD.org> Remove unused inp6_ifindex field from inpcb, as well as unused macro
shortcut for it.
9ff22a0b6e5f437b270f0c832eff0b0c19b987e3 16-Feb-2007 rwatson <rwatson@FreeBSD.org> Remove unused in6p_ip6_hlim macro shortcut for non-present
inp_depend6.inp6_hlim field in the inpcb.
6739423b1075a61edcf30703783cf178045fac1d 29-Dec-2006 jhb <jhb@FreeBSD.org> MFC: Close some races between enumerating inpcb's and tearing them down by
making the mutex portion of struct inpcb type-stable and never destroying
ee45008a04a10f664724c4c90ea2666a3a8fe8b8 20-Aug-2006 dwmalone <dwmalone@FreeBSD.org> MFC: Make net.inet.ip.portrange.reservedhigh and
net.inet.ip.portrange.reservedlow apply to IPv6 as well as IPv4. Update
a cut'n'paste comment so that it is a bit more up to date.
ee0a5eb928ae5ccdf1a0e619b4ba6e93d19db5fb 18-Jul-2006 ups <ups@FreeBSD.org> Fix race conditions on enumerating pcb lists by moving the initialization
( and where appropriate the destruction) of the pcb mutex to the init/finit
functions of the pcb zones.
This allows locking of the pcb entries and race condition free comparison
of the generation count.
Rearrange locking a bit to avoid extra locking operation to update the generation
count in in_pcballoc(). (in_pcballoc now returns the pcb locked)

I am planning to convert pcb list handling from a type safe to a reference count
model soon. ( As this allows really freeing the PCBs)

Reviewed by: rwatson@, mohans@
MFC after: 1 week
072b8792e4b1cf8b7ff80dcce1be50a18384d555 11-Jun-2006 rwatson <rwatson@FreeBSD.org> Merge in_pcb.h:1.84 from HEAD to RELENG_6:

Minor style tweak: tab after #define, not space.
5d598011b534415b6bfa0b82fe291c836516bbd0 25-Apr-2006 rwatson <rwatson@FreeBSD.org> Abstract inpcb drop logic, previously just setting of INP_DROPPED in TCP,
into in_pcbdrop(). Expand logic to detach the inpcb from its bound
address/port so that dropping a TCP connection releases the inpcb resource
reservation, which since the introduction of socket/pcb reference count
updates, has been persisting until the socket closed rather than being
released implicitly due to prior freeing of the inpcb on TCP drop.

MFC after: 3 months
d67aff8ec432c4f2d1e0c1acf831f238931c2429 03-Apr-2006 rwatson <rwatson@FreeBSD.org> Change inp_ppcb from caddr_t to void *, fix/remove associated related

Consistently use intotw() to cast inp_ppcb pointers to struct tcptw *

Consistently use intotcpcb() to cast inp_ppcb pointers to struct tcpcb *

Don't assign tp to the results of intotcpcb() during variable declaration
at the top of functions, as that is before the asserts relating to
locking have been performed. Do this later in the function after
appropriate assertions have run to allow that operation to be considered

MFC after: 3 months
71cc03392bbc78f93765e5550fc35f98c373df04 01-Apr-2006 rwatson <rwatson@FreeBSD.org> Break out in_pcbdetach() into two functions:

- in_pcbdetach(), which removes the link between an inpcb and its

- in_pcbfree(), which frees a detached pcb.

Unlike the previous in_pcbdetach(), neither of these functions will
attempt to conditionally free the socket, as they are responsible only
for managing in_pcb memory. Mirror these changes into in6_pcbdetach()
by breaking it into in6_pcbdetach() and in6_pcbfree().

While here, eliminate undesired checks for NULL inpcb pointers in
sockets, as we will now have as an invariant that sockets will always
have valid so_pcb pointers.

MFC after: 3 months
a3688cc84e2cc795f00344b147ed53c98c15520e 26-Mar-2006 rwatson <rwatson@FreeBSD.org> Define two new inpcb flags in the inp_vflag field, which for whatever
reason, seems to be where new flags are getting defined:

INP_DROPPED - The protocol has terminated this connection and the socket
is not reusable: when the socket code enters the protocol,
an error is immediately returned. This will substitute for
NULLing the so_pcb socket field, helping to implement the
invariant that all valid sockets have valid pcb's in TCP.

INP_SOCKREF - The protocol has become the owner of the socket reference,
and will need to free it when freeing the pcb, which will
be used when a TCP socket is closed but still has queued

MFC after: 1 month
864627f0332d2a6530b2c98bb783503e37c8c709 26-Mar-2006 rwatson <rwatson@FreeBSD.org> Minor style tweak: tab after #define, not space.

MFC after: 1 month
2dd230f5c3b1fcc26c7cbcf8fd6067cebfec0d10 19-Mar-2006 dwmalone <dwmalone@FreeBSD.org> Make net.inet.ip.portrange.reservedhigh and
net.inet.ip.portrange.reservedlow apply to IPv6 as well as IPv4.

We could have made new sysctls for IPv6, but that potentially makes
things complicated for mapped addresses. This seems like the least
confusing option and least likely to cause obscure problems in the

This change makes the mac_portacl module useful with IPv6 apps.

Reviewed by: ume
MFC after: 1 month
882a00088153db211bd33548fb783b20f2ee16b6 02-Oct-2005 andre <andre@FreeBSD.org> MFC IP_DONTFRAG IP socket option.

Approved by: re (scottl)
1d50cd7eb988660990181de5c74b6e63d57812ab 01-Oct-2005 andre <andre@FreeBSD.org> MFC: IP_MINTTL socket option.

Approved by: re (scottl)
bedcd4ace8e6c1ce8c4308a2e5dd2e0a92d9ac06 26-Sep-2005 andre <andre@FreeBSD.org> Implement IP_DONTFRAG IP socket option enabling the Don't Fragment
flag on IP packets. Currently this option is only respected on udp
and raw ip sockets. On tcp sockets the DF flag is controlled by the
path MTU discovery option.

Sending a packet larger than the MTU size of the egress interface
returns an EMSGSIZE error.

Discussed with: rwatson
Sponsored by: TCP/IP Optimization Fundraise 2005
573a9535a81268ee8fa937d020dad86235127d2c 22-Aug-2005 andre <andre@FreeBSD.org> Add socketoption IP_MINTTL. May be used to set the minimum acceptable
TTL a packet must have when received on a socket. All packets with a
lower TTL are silently dropped. Works on already connected/connecting
and listening sockets for RAW/UDP/TCP.

This option is only really useful when set to 255 preventing packets
from outside the directly connected networks reaching local listeners
on sockets.

Allows userland implementation of 'The Generalized TTL Security Mechanism
(GTSM)' according to RFC3682. Examples of such use include the Cisco IOS
BGP implementation command "neighbor ttl-security".

MFC after: 2 weeks
Sponsored by: TCP/IP Optimization Fundraise 2005
a50ffc29129a52835a39bf4868cd5facdc7dce30 07-Jan-2005 imp <imp@FreeBSD.org> /* -> /*- for license, minor formatting changes
c79cd91efc05ca91a24c2adea62738a1e660528a 02-Jan-2005 silby <silby@FreeBSD.org> Port randomization leads to extremely fast port reuse at high
connection rates, which is causing problems for some users.

To retain the security advantage of random ports and ensure
correct operation for high connection rate users, disable
port randomization during periods of high connection rates.

Whenever the connection rate exceeds randomcps (10 by default),
randomization will be disabled for randomtime (45 by default)
seconds. These thresholds may be tuned via sysctl.
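
The throttling rule described above can be sketched as a once-per-second
tick. Only the names randomcps and randomtime and their defaults come
from the text; the structure, the function names, and the exact
countdown mechanics are assumptions and may differ from in_pcb.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state for the rate-based randomization switch. */
struct port_rand_sketch {
	int	randomcps;	/* connections/sec threshold (default 10) */
	int	randomtime;	/* seconds to keep randomization off (45) */
	int	conns_this_sec;	/* connections seen in the current second */
	int	holdoff;	/* seconds of non-random allocation left */
};

/* Record one new outgoing connection. */
static void
port_rand_note_conn(struct port_rand_sketch *st)
{
	assert(st != NULL);
	st->conns_this_sec++;
}

/* Called once per second: start or count down the hold-off period. */
static void
port_rand_tick(struct port_rand_sketch *st)
{
	assert(st != NULL);
	if (st->conns_this_sec > st->randomcps)
		st->holdoff = st->randomtime;
	else if (st->holdoff > 0)
		st->holdoff--;
	st->conns_this_sec = 0;
}

/* Random ephemeral ports are used only outside a hold-off period. */
static bool
port_rand_enabled(const struct port_rand_sketch *st)
{
	assert(st != NULL);
	return (st->holdoff == 0);
}
```

With the defaults, a single second containing more than 10 connections
switches allocation to sequential ports until 45 quiet seconds pass.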

Many thanks to Igor Sysoev, who proved the necessity of this
change and tested many preliminary versions of the patch.

MFC After: 20 seconds
f2988d54dee0ffdc5f894558569060c50692848d 05-Dec-2004 rwatson <rwatson@FreeBSD.org> Define INP_UNLOCK_ASSERT() to assert that an inpcb is unlocked.

MFC after: 2 weeks
23afa2eef1039011de5f37f016a491ec611e7039 19-Oct-2004 andre <andre@FreeBSD.org> Add a macro for the destruction of INP_INFO_LOCK's used by loadable modules.
87aa99bbbbf620c4ce98996d472fdae45f077eae 16-Aug-2004 rwatson <rwatson@FreeBSD.org> White space cleanup for netinet before branch:

- Trailing tab/space cleanup
- Remove spurious spaces between or before tabs

This change avoids touching files that Andre likely has in his working
set for PFIL hooks changes for IPFW/DUMMYNET.

Approved by: re (scottl)
Submitted by: Xin LI <delphij@frontfree.net>
bc0c49120545212dcc8b185d81971a000bf866d1 04-Aug-2004 rwatson <rwatson@FreeBSD.org> Now that IPv6 performs basic in6pcb and inpcb locking, enable inpcb
lock assertions even if IPv6 is compiled into the kernel. Previously,
inclusion of IPv6 and locking assertions would result in a rapid
assertion failure as IPv6 was not properly locking inpcbs.
355a8ec49460b1417c4b242f64a7b6c04922f783 13-Jul-2004 stefanf <stefanf@FreeBSD.org> Remove erroneous semicolons.
93baf0b01a4613cd177b1fd2c4b2c26f01456e94 24-Jun-2004 rwatson <rwatson@FreeBSD.org> When asserting non-Giant locks in the network stack, also assert
Giant if debug.mpsafenet=0, as any points that require synchronization
in the SMPng world also required it in the Giant-world:

- inpcb locks (including IPv6)
- inpcbinfo locks (including IPv6)
- dummynet subsystem lock
- ipfw2 subsystem lock
b49b7fe7994689a25dfc2162fe02f1d030360089 07-Apr-2004 imp <imp@FreeBSD.org> Remove advertising clause from University of California Regent's
license, per letter dated July 22, 1999 and email from Peter Wemm,
Alan Cox and Robert Watson.

Approved by: core, peter, alc, rwatson
49554d1bd86202a6694cfff7c5fdc77511af0d50 27-Mar-2004 pjd <pjd@FreeBSD.org> Reduce 'td' argument to 'cred' (struct ucred) argument in those functions:
- in_pcbbind(),
- in_pcbbind_setup(),
- in_pcbconnect(),
- in_pcbconnect_setup(),
- in6_pcbbind(),
- in6_pcbconnect(),
- in6_pcbsetport().
"It should simplify/clarify things a great deal." --rwatson

Requested by: rwatson
Reviewed by: rwatson, ume
02bc13377989c8a404cdeacd93dd1dabc710c44a 27-Mar-2004 pjd <pjd@FreeBSD.org> Remove unused argument.

Reviewed by: ume
89f5b6c374a428764f3f2b3d77d7d9fc228a4c99 25-Mar-2004 pjd <pjd@FreeBSD.org> Remove unused function.
It was used in FreeBSD 4.x, but now we're using cr_canseesocket().
960b35f03a6c555c26345ba9adaa33039083e66f 26-Nov-2003 sam <sam@FreeBSD.org> Split the "inp" mutex class into separate classes for each of divert,
raw, tcp, udp, raw6, and udp6 sockets to avoid spurious witness complaints.

Reviewed by: rwatson
Approved by: re (rwatson)
6164d7c280688f20cf827e8374984c6e0175fab0 20-Nov-2003 andre <andre@FreeBSD.org> Introduce tcp_hostcache and remove the tcp specific metrics from
the routing table. Move all usage and references in the tcp stack
from the routing table metrics to the tcp hostcache.

It caches measured parameters of past tcp sessions to provide better
initial start values for following connections from or to the same
source or destination. Depending on the network parameters to/from
the remote host this can lead to significant speedups for new tcp
connections after the first one because they inherit and shortcut
the learning curve.

tcp_hostcache is designed for multiple concurrent access in SMP
environments with high contention and is hash indexed by remote
ip address.

It removes significant locking requirements from the tcp stack with
regard to the routing table.

Reviewed by: sam (mentor), bms
Reviewed by: -net, -current, core@kame.net (IPv6 parts)
Approved by: re (scottl)
9c969b771a32651104f16586408deb67d7039014 18-Nov-2003 rwatson <rwatson@FreeBSD.org> Introduce a MAC label reference in 'struct inpcb', which caches
the MAC label referenced from 'struct socket' in the IPv4 and
IPv6-based protocols. This permits MAC labels to be checked during
network delivery operations without dereferencing inp->inp_socket
to get to so->so_label, which will eventually avoid our having to
grab the socket lock during delivery at the network layer.

This change introduces 'struct inpcb' as a labeled object to the
MAC Framework, along with the normal circus of entry points:
initialization, creation from socket, destruction, as well as a
delivery access control check.

For most policies, the inpcb label will simply be a cache of the
socket label, so a new protocol switch method is introduced,
pr_sosetlabel() to notify protocols that the socket layer label
has been updated so that the cache can be updated while holding
appropriate locks. Most protocols implement this using
pru_sosetlabel_null(), but IPv4/IPv6 protocols using inpcbs use
the worker function in_pcbsosetlabel(), which calls into the
MAC Framework to perform a cache update.

Biba, LOMAC, and MLS implement these entry points, as do the stub
policy, and test policy.

Reviewed by: sam, bms
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
e0d3008a3fb80872d82c1b1ed81635db1b62364e 08-Nov-2003 sam <sam@FreeBSD.org> add locking assertions that turn into noops if INET6 is configured;
this is necessary because the ipv6 code shares the in_pcb code with
ipv4 but (presently) lacks proper locking

Supported by: FreeBSD Foundation
38391a78340c898e1eb86451d35f8c754330f7a0 24-Oct-2003 ume <ume@FreeBSD.org> correct tab and order.
881c4fa39150df7d0de2dae7ae808f6a73cb199a 24-Oct-2003 ume <ume@FreeBSD.org> Switch Advanced Sockets API for IPv6 from RFC2292 to RFC3542
(aka RFC2292bis). Though I believe this commit doesn't break
backward compatibility against existing binaries, it breaks
backward compatibility of API.
Now, the applications which use Advanced Sockets API such as
telnet, ping6, mld6query and traceroute6 use RFC3542 API.

Obtained from: KAME
3af3c5ae44ef98b9f2da135dcb64cfc12acd0f39 20-Aug-2003 bms <bms@FreeBSD.org> Add the IP_ONESBCAST option, to enable undirected IP broadcasts to be sent on
specific interfaces. This is required by aodvd, and may in future help us
in getting rid of the requirement for BPF from our import of isc-dhcp.

Suggested by: fenestro
Obtained from: BSD/OS
Reviewed by: mini, sam
Approved by: jake (mentor)
6afaafd2aaf53c793eefeb6d602c1038625e9bff 29-Apr-2003 mdodd <mdodd@FreeBSD.org> IP_RECVTTL socket option.

Reviewed by: Stuart Cheshire <cheshire@apple.com>
ccc6071f7ea7e2ba54dfcf45ff8afda2e395aa3d 02-Apr-2003 mdodd <mdodd@FreeBSD.org> Back out support for RFC3514.

RFC3514 poses an unacceptable risk to compliant systems.
e72fdee732ab55fc784034c81ccedda4b5279816 01-Apr-2003 mdodd <mdodd@FreeBSD.org> Implement support for RFC 3514 (The Security Flag in the IPv4 Header).
(See: ftp://ftp.rfc-editor.org/in-notes/rfc3514.txt)

This fulfills the host requirements for userland support by
way of the setsockopt() IP_EVIL_INTENT message.

There are three sysctl tunables provided to govern system behavior.


Enables support for rfc3514. As this is an
Informational RFC and support is not yet widespread
this option is disabled by default.


If set the host will discard all received evil packets.


If set the host will discard all transmitted evil packets.

The IP statistics counter 'ips_evil' (available via 'netstat') provides
information on the number of 'evil' packets received.

For reference, the '-E' option to 'ping' has been provided to demonstrate
and test the implementation.
a8bc02dcb257f24a8246bb1c31abe58bf12ebd04 19-Feb-2003 jlemon <jlemon@FreeBSD.org> Add a TCP TIMEWAIT state which uses less space than a fullblown TCP
control block. Allow the socket and tcpcb structures to be freed
earlier than inpcb. Update code to understand an inp w/o a socket.

Reviewed by: hsu, silby, jayanth
Sponsored by: DARPA, NAI Labs
da0bbc8eafcf817e5cc34a47ec84ff9c12e384d5 12-Nov-2002 hsu <hsu@FreeBSD.org> Turn off duplicate lock checking for inp locks because udp_input()
intentionally locks two inp records simultaneously.
a5bc5c7b7ede3570b7cb9530c8d8f2847945f0fa 21-Oct-2002 iedowse <iedowse@FreeBSD.org> Replace in_pcbladdr() with a more generic inner subroutine for
in_pcbconnect() called in_pcbconnect_setup(). This version performs
all of the functions of in_pcbconnect() except for the final
committing of changes to the PCB. In the case of an EADDRINUSE error
it can also provide to the caller the PCB of the duplicate connection,
avoiding an extra in_pcblookup_hash() lookup in tcp_connect().

This change will allow the "temporary connect" hack in udp_output()
to be removed and is part of the preparation for adding the
IP_SENDSRCADDR control message.

Discussed on: -net
Approved by: re
1b97e2dc441e3e2d9bc1df9086655dd77899b3fd 20-Oct-2002 iedowse <iedowse@FreeBSD.org> Split out most of the logic from in_pcbbind() into a new function
called in_pcbbind_setup() that does everything except commit the
changes to the PCB. There should be no functional change here, but
in_pcbbind_setup() will be used by the soon-to-appear IP_SENDSRCADDR
control message implementation to check or allocate the source
address and port.

Discussed on: -net
Approved by: re
0ef6c52bbcc67b0dce67c7ad7a6fc685828a6400 16-Oct-2002 sam <sam@FreeBSD.org> Tie new "Fast IPsec" code into the build. This involves the usual
configuration stuff as well as conditional code in the IPv4 and IPv6
areas. Everything is conditional on FAST_IPSEC which is mutually
exclusive with IPSEC (KAME IPsec implementation).

As noted previously, don't use FAST_IPSEC with INET6 at the moment.

Reviewed by: KAME, rwatson
Approved by: silence
Supported by: Vernier Networks
58f67268dfe45f4ceaf9573e37063e5ee431e691 05-Sep-2002 bde <bde@FreeBSD.org> Fixed namespace pollution in uma changes:
- use `struct uma_zone *' instead of uma_zone_t, so that <sys/uma.h> isn't
a prerequisite.
- don't include <sys/uma.h>.
Namespace pollution makes "opaque" types like uma_zone_t perfectly
non-opaque. Such types should never be used (see style(9)).

Fixed subsequently grown dependencies of this header on its own pollution:
- include <sys/_mutex.h> and its prerequisite <sys/_lock.h> instead of
depending on namespace pollution 2 layers deep in <sys/uma.h>.
7199888e8f06576febc8146d51d09b781470e8ce 21-Aug-2002 truckman <truckman@FreeBSD.org> Create new functions in_sockaddr(), in6_sockaddr(), and
in6_v4mapsin6_sockaddr() which allocate the appropriate sockaddr_in*
structure and initialize it with the address and port information passed
as arguments. Use calls to these new functions to replace code that is
replicated multiple times in in_setsockaddr(), in_setpeeraddr(),
in6_setsockaddr(), in6_setpeeraddr(), in6_mapped_sockaddr(), and
in6_mapped_peeraddr(). Inline COMMON_END in tcp_usr_accept() so that
we can call in_sockaddr() with temporary copies of the address and port
after the PCB is unlocked.

Fix the lock violation in tcp6_usr_accept() (caused by calling MALLOC()
inside in6_mapped_peeraddr() while the PCB is locked) by changing
the implementation of tcp6_usr_accept() to match tcp_usr_accept().

Reviewed by: suz
cd3f29ae66aa80d1901c45d70d4eff79ce3ad9e1 22-Jul-2002 ume <ume@FreeBSD.org> do not refer to IN6P_BINDV6ONLY anymore.

Obtained from: KAME
MFC after: 1 week
abda76de0b81d58e1eb0e275c4e384fe97cca491 14-Jun-2002 hsu <hsu@FreeBSD.org> Notify functions can destroy the pcb, so they have to return an
indication of whether this happened so the calling function
knows whether or not to unlock the pcb.

Submitted by: Jennifer Yang (yangjihui@yahoo.com)
Bug reported by: Sid Carter (sidcarter@symonds.net)
cd25d4648fdd5f53f76f460b7f57015bdc89bb56 10-Jun-2002 hsu <hsu@FreeBSD.org> Lock up inpcb.

Submitted by: Jennifer Yang <yangjihui@yahoo.com>
6615797e535b58bbb6a5cd4a450b6634a80713a1 09-Apr-2002 jhb <jhb@FreeBSD.org> Change the first argument of prison_xinpcb() to be a thread pointer instead
of a proc pointer so that prison_xinpcb() can use td_ucred.
867fc1ed1cceb768a61213161aa1a6ae8b1b4dc7 24-Mar-2002 bde <bde@FreeBSD.org> Fixed some style bugs in the removal of __P(()). Continuation lines
were not outdented to preserve non-KNF lining up of code with parentheses.
Switch to KNF formatting.
0a59f1223c856d6130a3ef3b5c5f27b2a6a2296f 20-Mar-2002 jeff <jeff@FreeBSD.org> Switch vm_zone.h with uma.h. Change over to uma interfaces.
357e37e023059920b1f80494e489797e2f69a3dd 19-Mar-2002 alfred <alfred@FreeBSD.org> Remove __P.
2923687da3c046deea227e675d5af075b9fa52d4 19-Mar-2002 jeff <jeff@FreeBSD.org> This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@
96af38570e5bb404afd2fddd82c10f429fe147b6 25-Feb-2002 alfred <alfred@FreeBSD.org> Document what inpcb->inp_vflag is for.

Submitted by: Marco Molteni <molter@tin.it>
0696a32b7bbace12fce1b3dbe9186f1e579dd591 27-Nov-2001 rwatson <rwatson@FreeBSD.org> Add include of net/route.h, as structures moved around due to the
syncache rely on 'struct route' being defined. This fixes the
LINT build some.
a3c1c9fdb4ec0da65e5e02c396bbd5bb22a16c8b 22-Nov-2001 jlemon <jlemon@FreeBSD.org> Introduce a syncache, which enables FreeBSD to withstand a SYN flood
DoS in an improved fashion over the existing code.

Reviewed by: silby (in a previous iteration)
Sponsored by: DARPA, NAI Labs
5596676e6c6c1e81e899cd0531f9b1c28a292669 12-Sep-2001 julian <julian@FreeBSD.org> KSE Milestone 2
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha
215c0c107efbdc79c1fc7eb07321d5285fc338d2 04-Aug-2001 ume <ume@FreeBSD.org> When an application has joined a multicast address and the
network card is then removed, imo_membership[].inm_ifp still refers
to the interface pointer of the removed interface.
When the application is killed, the socket and imo_membership are
released, and imo_membership uses the interface pointer that no
longer exists. The kernel then panics.

PR: 29345
Submitted by: Inoue Yuichi <inoue@nd.net.fujitsu.co.jp>
Obtained from: KAME
MFC after: 3 days
832f8d224926758a9ae0b23a6b45353e44fbc87a 11-Jun-2001 ume <ume@FreeBSD.org> Sync with recent KAME.
This work was based on kame-20010528-freebsd43-snap.tgz and some
critical problems found after the snap was out were fixed.
There are many many changes since last KAME merge.

- The definitions of SADB_* in sys/net/pfkeyv2.h are still different
from RFC2407/IANA assignment because of binary compatibility
issue. It should be fixed under 5-CURRENT.
- ip6po_m member of struct ip6_pktopts is no longer used. But, it
is still there because of binary compatibility issue. It should
be removed under 5-CURRENT.

Reviewed by: itojun
Obtained from: KAME
MFC after: 3 weeks
8260da124e28847c02e67eab46bceb9d967f5c75 26-Feb-2001 jlemon <jlemon@FreeBSD.org> Remove in_pcbnotify and use in_pcblookup_hash to find the cb directly.

For TCP, verify that the sequence number in the ICMP packet falls within
the tcp receive window before performing any actions indicated by the
icmp packet.

Clean up some layering violations (access to tcp internals from in_pcb)
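
The commit does not show how the window check is written. One common way
to test whether a 32-bit TCP sequence number lies inside a window, even
when the window wraps past 2^32, is unsigned modular subtraction. The
sketch below assumes that approach; the function and parameter names are
hypothetical and are not taken from the FreeBSD sources.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if `seq` lies in the half-open window
 * [win_start, win_start + win_len), computed modulo 2^32.
 */
static bool
seq_in_window(uint32_t seq, uint32_t win_start, uint32_t win_len)
{
	/*
	 * Unsigned subtraction wraps around, so the distance from the
	 * window start is correct even when the window crosses zero.
	 */
	return ((uint32_t)(seq - win_start) < win_len);
}
```

An ICMP message whose quoted sequence number fails such a test would be
ignored rather than acted upon, which is the behavior the commit
describes.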
65fa889a568f28016bc5b0bec0e238ae1ba5f299 22-Feb-2001 jesper <jesper@FreeBSD.org> Redo the security update done in rev 1.54 of src/sys/netinet/tcp_subr.c
and 1.84 of src/sys/netinet/udp_usrreq.c

The changes broken down:

- remove 0 as a wildcard for addresses and port numbers in
- add src/sys/netinet/in_pcb.c:in_pcbnotifyall() used to notify
all sessions with the specific remote address.
- change
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
to use in_pcbnotifyall() to notify multiple sessions, instead of
using in_pcbnotify() with 0 as src address and as port numbers.
- remove check for src port == 0 in
- src/sys/netinet/tcp_subr.c:tcp_ctlinput()
- src/sys/netinet/udp_usrreq.c:udp_ctlinput()
as they are no longer needed.
- move handling of redirects and host dead from in_pcbnotify() to
udp_ctlinput() and tcp_ctlinput(), so they will call
in_pcbnotifyall() to notify all sessions with the specific
remote address.

Approved by: jlemon
Inspired by: NetBSD
6bfb7240b822195a74d4fa5a8268f2143dc0102e 24-Dec-2000 phk <phk@FreeBSD.org> Update the "icmp_admin_prohib_like_rst" code to check the tcp-window and
to be configurable with respect to acting only in SYN or in all TCP states.

PR: 23665
Submitted by: Jesper Skriver <jesper@skriver.dk>
961b97d43458f3c57241940cabebb3bedf7e4c00 26-May-2000 jake <jake@FreeBSD.org> Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others
d93fbc99166053b75c2eeb69b5cb603cfaf79ec0 23-May-2000 jake <jake@FreeBSD.org> Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd
b42951578188c5aab5c9f8cbcde4a743f8092cdc 02-Apr-2000 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'ALSA'.
15b9bcb121e1f3735a2c98a11afdb52a03301d7e 29-Dec-1999 peter <peter@FreeBSD.org> Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot). This is consistent with the other
BSD's who made this change quite some time ago. More commits to come.
70f0bdf6818a73c858bc47a23afc1e9d7c56d716 07-Dec-1999 shin <shin@FreeBSD.org> udp IPv6 support, IPv6/IPv4 tunneling support in kernel,
packet divert at kernel for IPv6/IPv4 translator daemon

This includes queue related patch submitted by jburkhol@home.com.

Submitted by: queue related patch from jburkhol@home.com
Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
cad2014b2749528351ec5180e88a5929efebbfc4 22-Nov-1999 shin <shin@FreeBSD.org> KAME netinet6 basic part(no IPsec,no V6 Multicast Forwarding, no UDP/TCP
for IPv6 yet)

With this patch, you can assign IPv6 addr automatically, and can reply to
IPv6 ping.

Reviewed by: freebsd-arch, cvs-committers
Obtained from: KAME project
7efc91cadcfeb421fc4d02ba94db784616f3714c 05-Nov-1999 shin <shin@FreeBSD.org> KAME related header files additions and merges.
(only those which don't affect c source files so much)

Reviewed by: cvs-committers
Obtained from: KAME project
3b842d34e82312a8004a7ecd65ccdb837ef72ac1 28-Aug-1999 peter <peter@FreeBSD.org> $Id$ -> $FreeBSD$
ca21a25f173ed030b0093e4d83140e3b0b43db01 28-Apr-1999 phk <phk@FreeBSD.org> This Implements the mumbled about "Jail" feature.

This is a seriously beefed up chroot kind of thing. The process
is jailed along the same lines as a chroot does it, but with
additional tough restrictions imposed on what the superuser can do.

For all I know, it is safe to hand over the root bit inside a
prison to the customer living in that prison, this is what
it was developed for in fact: "real virtual servers".

Each prison has an ip number associated with it, which all IP
communications will be coerced to use and each prison has its own

Needless to say, you need more RAM this way, but the advantage is
that each customer can run their own particular version of apache
and not stomp on the toes of their neighbors.

It generally does what one would expect, but setting up a jail
still takes a little knowledge.

A few notes:

I have no scripts for setting up a jail, don't ask me for them.

The IP number should be an alias on one of the interfaces.

mount a /proc in each jail, it will make ps more useable.

/proc/<pid>/status tells the hostname of the prison for
jailed processes.

Quotas are only sensible if you have a mountpoint per prison.

There are no provisions for stopping resource-hogging.

Some "#ifdef INET" and similar may be missing (send patches!)

If somebody wants to take it from here and develop it into
more of a "virtual machine" they should be most welcome!

Tools, comments, patches & documentation most welcome.

Have fun...

Sponsored by: http://www.rndassociates.com/
Run for almost a year by: http://www.servetheweb.com/
bbc4497adab2d7702eab9a609897b2e5f672289e 15-May-1998 wollman <wollman@FreeBSD.org> Convert socket structures to be type-stable and add a version number.

Define a parameter which indicates the maximum number of sockets in a
system, and use this to size the zone allocators used for sockets and
for certain PCBs.

Convert PF_LOCAL PCB structures to be type-stable and add a version number.

Define an external format for information about socket structures and use
it in several places.

Define a mechanism to get all PF_LOCAL and PF_INET PCB lists through
sysctl(3) without blocking network interrupts for an unreasonable
length of time. This probably still has some bugs and/or race
conditions, but it seems to work well enough on my machines.

It is now possible for `netstat' to get almost all of its information
via the sysctl(3) interface rather than reading kmem (changes to follow).
dee7f44b926e2ae7097a847424fc1f14a017ba0d 28-Mar-1998 bde <bde@FreeBSD.org> Fixed style bugs (mostly) in previous commit.
d43e6115b673fe03cf9272a6c74af0c41177fef9 24-Mar-1998 wollman <wollman@FreeBSD.org> Use the zone allocator to allocate inpcbs and tcpcbs. Each protocol creates
its own zone; this is used particularly by TCP which allocates both inpcb and
tcpcb in a single allocation. (Some hackery ensures that the tcpcb is
reasonably aligned.) Also keep track of the number of pcbs of each type
allocated, and keep a generation count (instance version number) for future
7262ff6e58b1d30213c744786d6687611d4695c7 27-Jan-1998 dg <dg@FreeBSD.org> Improved connection establishment performance by doing local port lookups via
a hashed port list. In the new scheme, in_pcblookup() goes away and is
replaced by a new routine, in_pcblookup_local() for doing the local port
check. Note that this implementation is space inefficient in that the PCB
struct is now too large to fit into 128 bytes. I might deal with this in the
future by using the new zone allocator, but I wanted these changes to be
extensively tested in their current form first.

1) Fixed off-by-one errors in the port lookup loops in in_pcbbind().
2) Got rid of some unneeded rehashing. Adding a new routine, in_pcbinshash()
to do the initial hash insertion.
3) Renamed in_pcblookuphash() to in_pcblookup_hash() for easier readability.
4) Added a new routine, in_pcbremlists() to remove the PCB from the various
hash lists.
5) Added/deleted comments where appropriate.
6) Removed unnecessary splnet() locking. In general, the PCB functions should
be called at splnet()...there are unfortunately a few exceptions, however.
7) Reorganized a few structs for better cache line behavior.
8) Killed my TCP_ACK_HACK kludge. It may come back in a different form in
the future, however.

These changes have been tested on wcarchive for more than a month. In tests
done here, connection establishment overhead is reduced by more than 50
times, thus getting rid of one of the major networking scalability problems.

Still to do: make tcp_fasttimo/tcp_slowtimo scale well for systems with a
large number of connections. tcp_fasttimo is easy; tcp_slowtimo is difficult.

WARNING: Anything that knows about inpcb and tcpcb structs will have to be
recompiled; at the very least, this includes netstat(1).
0506343883d62f6649f7bbaf1a436133cef6261d 11-Jan-1998 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'jb'.
7c6e96080c4fb49bf912942804477d202a53396c 10-Jan-1998 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'JB'.
4542c1cf5d7077caf33d6d9468f5e647cd9d19e5 16-Aug-1997 wollman <wollman@FreeBSD.org> Fix all areas of the system (or at least all those in LINT) to avoid storing
socket addresses in mbufs. (Socket buffers are the one exception.) A number
of kernel APIs needed to get fixed in order to make this happen. Also,
fix three protocol families which kept PCBs in mbufs to now malloc them
instead. Delete some old compatibility cruft while we're at it, and add
some new routines in the in_cksum family.
6afbf203bd570424ecf3f9d9d9ced17f82c81adc 27-Apr-1997 wollman <wollman@FreeBSD.org> The long-awaited mega-massive-network-code- cleanup. Part I.

This commit includes the following changes:
1) Old-style (pr_usrreq()) protocols are no longer supported, the compatibility
glue for them is deleted, and the kernel will panic on boot if any are compiled

2) Certain protocol entry points are modified to take a process structure,
so that they can easily tell whether or not it is possible to sleep, and
also to access credentials.

3) SS_PRIV is no more, and with it goes the SO_PRIVSTATE setsockopt()
call. Protocols should use the process pointer they are now passed.

4) The PF_LOCAL and PF_ROUTE families have been updated to use the new
style, as has the `raw' skeleton family.

5) PF_LOCAL sockets now obey the process's umask when creating a socket
in the filesystem.

As a result, LINT is now broken. I'm hoping that some enterprising hacker
with a bit more time will either make the broken bits work (should be
easy for netipx) or dike them out.
3913f72826062a29e3639b65feb86391e7b95f4e 03-Apr-1997 dg <dg@FreeBSD.org> Reorganize elements of the inpcb struct to take better advantage of
cache lines. Removed the struct ip proto since only a couple of chars
were actually being used in it. Changed the order of compares in the
PCB hash lookup to take advantage of partial cache line fills (on PPro).

Discussed-with: wollman
1e7a910ca151c2606d05de7a8b9fa8d216282613 03-Mar-1997 dg <dg@FreeBSD.org> Improved performance of hash algorithm while (hopefully) not reducing
the quality of the hash distribution. This does not fix a problem dealing
with poor distribution when using lots of IP aliases and listening
on the same port on every one of them...some other day perhaps; fixing
that requires significant code changes.
The use of xor was inspired by David S. Miller <davem@jenolan.rutgers.edu>
94b6d727947e1242356988da003ea702d41a97de 22-Feb-1997 peter <peter@FreeBSD.org> Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.
9c02696981a3e802694006b17a0654f76ffe2c87 18-Feb-1997 wollman <wollman@FreeBSD.org> Convert raw IP from mondo-switch-statement-from-Hell to
pr_usrreqs. Collapse duplicates with udp_usrreq.c and
tcp_usrreq.c (calling the generic routines in uipc_socket2.c and
in_pcb.c). Calling sockaddr() or peeraddr() on a detached
socket now traps, rather than harmlessly returning an error; this
should never happen. Allow the raw IP buffer sizes to be
controlled via sysctl.
808a36ef658c1810327b5d329469bcf5dad24b28 14-Jan-1997 jkh <jkh@FreeBSD.org> Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
51fa6f0e6c383037d81db2731606bf56378e1128 11-Nov-1996 fenner <fenner@FreeBSD.org> Add the IP_RECVIF socket option, which supplies a packet's incoming interface
using a sockaddr_dl.

Fix the other packet-information socket options (SO_TIMESTAMP, IP_RECVDSTADDR)
to work for multicast UDP and raw sockets as well. (They previously only
worked for unicast UDP).
0ddb96bd1e62a8fbdb72a3a0ddaf0584557a132a 30-Oct-1996 peter <peter@FreeBSD.org> Fix braino on my part. When we have three different port ranges (default,
"high" and "secure"), we can't use a single variable to track the most
recently used port in all three ranges.. :-] This caused the next
transient port to be allocated from the start of the range more often than
it should.
00503a161c79ade0719914a9a6702edb1b8cf4ce 07-Oct-1996 dg <dg@FreeBSD.org> Improved in_pcblookuphash() to support wildcarding, and changed relevant
callers of it to take advantage of this. This reduces new connection
request overhead in the face of a large number of PCBs in the system.
Thanks to David Filo <filo@yahoo.com> for suggesting this and providing
a sample implementation (which wasn't used, but showed that it could be

Reviewed by: wollman
b8aebaca2624179f7a6d556675d3c246c8aca35c 23-Aug-1996 phk <phk@FreeBSD.org> Mark sockets where the kernel chose the port# for.
This can be used by netstat to behave more intelligently.
fe35eac01c2144b50535ae23a00660c11524fd22 22-Feb-1996 peter <peter@FreeBSD.org> Make the default behavior of local port assignment match traditional
systems (my last change did not mix well with some firewall
configurations). As much as I dislike firewalls, this is one thing I
was not prepared to break by default.. :-)

Allow the user to nominate one of three ranges of port numbers as
candidates for selecting a local address to replace a zero port number.
The ranges are selected via a setsockopt(s, IPPROTO_IP, IP_PORTRANGE, &arg)
call. The three ranges are: default, high (to bypass firewalls) and
low (to get a port below 1024).

The default and high port ranges are sysctl settable under sysctl

This code also fixes a potential deadlock if the system accidentally ran out
of local port addresses. It'd drop into an infinite while loop.

The secure port selection (for root) should reduce overheads and increase
reliability of rlogin/rlogind/rsh/rshd if they are modified to take
advantage of it.

Partly suggested by: pst
Reviewed by: wollman
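
As an illustration of the setsockopt() call quoted in this commit, a
small wrapper might look like the sketch below. The wrapper name is
hypothetical. The range constants IP_PORTRANGE_DEFAULT, IP_PORTRANGE_HIGH
and IP_PORTRANGE_LOW are assumed to come from <netinet/in.h> on FreeBSD;
on systems that lack IP_PORTRANGE the wrapper simply fails.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>

/*
 * Ask the kernel to pick automatic local ports for socket `s` from the
 * given range (IP_PORTRANGE_DEFAULT, IP_PORTRANGE_HIGH or
 * IP_PORTRANGE_LOW on FreeBSD). Returns 0 on success, or -1 with errno
 * set on failure.
 */
static int
set_port_range(int s, int range)
{
#ifdef IP_PORTRANGE
	return (setsockopt(s, IPPROTO_IP, IP_PORTRANGE, &range,
	    sizeof(range)));
#else
	/* Assumption: platforms without the option report it as unsupported. */
	(void)s;
	(void)range;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

The call would be made before the socket is bound or connected with a
zero port, so that the kernel's automatic port choice uses the requested
range. Per the commit message, the low range is intended for root.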
fafc2e709f36ec2cbf2778f5d75bfcf8f171cee5 05-Dec-1995 bde <bde@FreeBSD.org> Added explicit include of <sys/queue.h>. Currently, some things only
compile because <vm/vm.h> happens to be gratuitously included before
<netinet/in_pcb.h> and <vm/vm.h> happens to include <sys/queue.h>.
db2c71245d8bd7171d58bbd567c7a24804e752e5 14-Nov-1995 phk <phk@FreeBSD.org> New style sysctl & staticize a lot of stuff.
86f1bc4514fdcfd255f37f3218fe234bdc3664fc 05-Nov-1995 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'LINUX'.
eb729114f74ffed2c6e8b7f7dd1ed8cf4584cee1 10-Apr-1995 dg <dg@FreeBSD.org> Backed out Jordan's #include of queue.h
409c5ad6ef29d6e9945433b439e43780609d38d9 09-Apr-1995 jkh <jkh@FreeBSD.org> #include <sys/queue.h> or die horribly.
919fdebd0e121a9cf0b773da1b2886e1d0b05b56 09-Apr-1995 dg <dg@FreeBSD.org> Implemented PCB hashing. Includes new functions in_pcbinshash, in_pcbrehash,
and in_pcblookuphash.
289f11acb49b6dbb3081e09bf94a86f008f55814 16-Mar-1995 bde <bde@FreeBSD.org> Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) and most of the warnings from
`gcc -Wnested-externs'. Fix all the bugs found. There were no serious
2e14d9ebc3d3592c67bdf625af9ebe0dfc386653 14-Mar-1995 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'MATT_THOMAS'.
34cd81d75f398ee455e61969b118639dacbfd7a6 23-Sep-1994 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'MACKERRAS'.
8197ce5e98353ade5c0651b18d741110a142e3c8 21-Aug-1994 paul <paul@FreeBSD.org> Made idempotent.

Submitted by: Paul
e16baf7a5fe7ac1453381d0017ed1dcdeefbc995 07-Aug-1994 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'SUNRPC'.
8d205697aac53476badf354623abd4e1c7bc5aff 02-Aug-1994 dg <dg@FreeBSD.org> Added $Id$
2469c867a164210ce96143517059f21db7f1fd17 25-May-1994 rgrimes <rgrimes@FreeBSD.org> The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman
8fb65ce818b3e3c6f165b583b910af24000768a5 24-May-1994 rgrimes <rgrimes@FreeBSD.org> BSD 4.4 Lite Kernel Sources