History log of /freebsd-head/sys/kern/kern_descrip.c
Revision Date Author Comments
e75280f8644acdaa030b14dbf02b49d5d089fe66 27-Apr-2020 mjg <mjg@FreeBSD.org> pwd: unbreak repeated calls to set_rootvnode

Prior to the change the once set pointer would never be updated.

Unbreaks reboot -r.

Reported by: Ross Gohlke
79165c9642bbe7e5407803bbcb74a2d71e54f2a9 14-Apr-2020 kevans <kevans@FreeBSD.org> Mark closefrom(2) COMPAT12, reimplement in libc to wrap close_range

Include a temporary compatibility shim as well for kernels predating
close_range, since closefrom is used in some critical areas.

Reviewed by: markj (previous version), kib
Differential Revision: https://reviews.freebsd.org/D24399
4045a67bf312de8b554d5fd91c53d13c59664a2a 13-Apr-2020 kevans <kevans@FreeBSD.org> close_range/closefrom: fix regression from close_range introduction

close_range will clamp the range between [0, fdp->fd_lastfile], but failed
to take into account that fdp->fd_lastfile can become -1 if all fds are
closed. =-( In this scenario, just return because there's nothing further we
can do at the moment.

Add a test case for this, fork() and simply closefrom(0) twice in the child;
on the second invocation, fdp->fd_lastfile == -1 and will trigger a panic
before this change.

X-MFC-With: r359836
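
The clamp described above can be sketched roughly as follows. This is a userland model, not the kernel code: model_clamped_high() is a hypothetical name, and 'lastfile' stands in for fdp->fd_lastfile, which is -1 once every descriptor has been closed.

```c
#include <assert.h>

/*
 * Hypothetical model of the range clamp.  Returns the clamped upper
 * bound of the range to close, or -1 when there is nothing to close
 * (including the case where the table is already empty).
 */
int
model_clamped_high(int lowfd, int highfd, int lastfile)
{
	assert(lowfd >= 0);
	if (lastfile < 0)		/* all fds already closed: just return */
		return (-1);
	if (highfd > lastfile)		/* clamp to [0, lastfile] */
		highfd = lastfile;
	if (lowfd > highfd)		/* empty range after clamping */
		return (-1);
	return (highfd);
}
```

A second closefrom(0) in the same process corresponds to calling the model with lastfile == -1, which returns early instead of walking the table.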
6371039d47fcde7da8301b6243867c1bbfe0793e 12-Apr-2020 kevans <kevans@FreeBSD.org> Implement a close_range(2) syscall

close_range(min, max, flags) allows for a range of descriptors to be
closed. The Python folk have indicated that they would much prefer this
interface to closefrom(2), as the case may be that they/someone have special
fds dup'd to higher in the range and they can't necessarily closefrom(min)
because they don't want to hit the upper range, but relocating them to lower
isn't necessarily feasible.

sys_closefrom has been rewritten to use kern_close_range() using ~0U to
indicate closing to the end of the range. This was chosen rather than
requiring callers of kern_close_range() to hold FILEDESC_SLOCK across the
call to kern_close_range for simplicity.

The flags argument of close_range(2) is currently unused, so setting any flag
currently fails with EINVAL. It was added to the interface in Linux so that future
flags could be added for, e.g., "halt on first error" and things of this
nature.

This patch is based on a syscall of the same design that is expected to be
merged into Linux.

Reviewed by: kib, markj, vangyzen (all slightly earlier revisions)
Differential Revision: https://reviews.freebsd.org/D21627
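
A rough userland model of the relationship described above, assuming a descriptor table small enough to fit in a 32-bit mask. All model_* names are hypothetical; only the ~0U "to the end of the range" convention and the EINVAL-on-any-flag behaviour come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define	MODEL_NFDS	32u
static_assert(MODEL_NFDS <= 32, "the open-fd mask is a 32-bit word");

/* No flags are defined yet, so any set bit is rejected. */
int
model_check_flags(int flags)
{
	return (flags == 0 ? 0 : EINVAL);
}

/* Clear bits lowfd..highfd (inclusive) from the open-descriptor mask. */
uint32_t
model_close_range(uint32_t open_mask, unsigned lowfd, unsigned highfd)
{
	unsigned fd;

	for (fd = lowfd; fd < MODEL_NFDS && fd <= highfd; fd++)
		open_mask &= ~(UINT32_C(1) << fd);
	return (open_mask);
}

/* closefrom expressed as a range that runs to the end of the table. */
uint32_t
model_closefrom(uint32_t open_mask, unsigned lowfd)
{
	return (model_close_range(open_mask, lowfd, ~0U));
}
```

The bounded form lets a caller keep descriptors parked above highfd, which is the use case attributed to the Python developers above.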
217fa09bf639260f4fe7c9415d8f42b141637d51 19-Mar-2020 markj <markj@FreeBSD.org> kern_dup(): Call filecaps_free_prep() in a write section.

filecaps_free_prep() bzeros the capabilities structure and we need to be
careful to synchronize with unlocked readers, which expect a consistent
rights structure.

Reviewed by: kib, mjg
Reported by: syzbot+5f30b507f91ddedded21@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D24120
705aabfbf80dcd5fa051dbe8be59f7c27c22bf73 08-Mar-2020 mjg <mjg@FreeBSD.org> fd: use smr for managing struct pwd

This has a side effect of eliminating filedesc slock/sunlock during path
lookup, which in turn removes contention vs concurrent modifications to the fd
table.

Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D23889
17e6de13da30c67328dd108a50e1df85dc18e8b2 01-Mar-2020 mjg <mjg@FreeBSD.org> fd: move vnodes out of filedesc into a dedicated structure

The new structure is copy-on-write. With the assumption that path lookups are
significantly more frequent than chdirs and chrooting this is a win.

This provides stable root and jail root vnodes without the need to reference
them on lookup, which in turn means less work on globally shared structures.
Note this also happens to fix a bug where jail vnode was never referenced,
meaning subsequent access on lookup could run into use-after-free.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23884
f3a026564b69c0ad1f029560d63c9fbcc5448428 01-Mar-2020 mjg <mjg@FreeBSD.org> fd: make fgetvp_rights work without the filedesc lock

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23883
00d3456c7721c4ab632188c1be537f918904f619 15-Feb-2020 dim <dim@FreeBSD.org> Merge ^/head r357931 through r357965.
b25754c877d57ac9c9232af6ca20f400e01b8a69 15-Feb-2020 mjg <mjg@FreeBSD.org> fd: use new capsicum helpers
66db69737099118ee638dfd77a24cede1842671d 14-Feb-2020 mjg <mjg@FreeBSD.org> fd: remove no longer needed atomic_load_ptr casts
2353097db00096647de4613025544acdae368f86 14-Feb-2020 mjg <mjg@FreeBSD.org> fd: annotate finstall with prediction branches
73ba42a4914fc50a6a9e626e4263ef9c4cd4b275 14-Feb-2020 kevans <kevans@FreeBSD.org> u_char -> vm_prot_t in a couple of places, NFC

The latter is a typedef of the former; the typedef exists and these bits are
representing vmprot values, so use the correct type.

Submitted by: sigsys@gmail.com
MFC after: 3 days
adbdb897689b456848bfb9567a2dc8afa3595a5b 05-Feb-2020 mjg <mjg@FreeBSD.org> fd: always nullify *fdp in fget* routines

Some consumers depend on the pointer being NULL if an error is returned.

The guarantee got broken in r357469.

Reported by: https://syzkaller.appspot.com/bug?extid=0c9b05e2b727aae21eef
Noted by: markj
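
The guarantee can be illustrated with a small sketch. The table layout and the model_* names are assumptions made for the example; the point is only that the output pointer is written on every path, so an error never leaves a stale value behind.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define	MODEL_NSLOTS	4

struct model_file { int f_type; };
struct model_fdtable { struct model_file *slots[MODEL_NSLOTS]; };

/* Hypothetical lookup: *fpp is NULL whenever an error is returned. */
int
model_fget(const struct model_fdtable *t, int fd, struct model_file **fpp)
{
	assert(t != NULL && fpp != NULL);
	*fpp = NULL;			/* nullify before any early return */
	if (fd < 0 || fd >= MODEL_NSLOTS || t->slots[fd] == NULL)
		return (EBADF);
	*fpp = t->slots[fd];
	return (0);
}
```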
4fae937212ed9ff941155cc1c3710f29c01c1078 03-Feb-2020 mjg <mjg@FreeBSD.org> fd: partially unengrish the previous commit
e0daf4956ae67ee92d21f534064ae49d97ab08fa 03-Feb-2020 mjg <mjg@FreeBSD.org> fd: streamline fget_unlocked

clang has the unfortunate property of paying little attention to prediction
hints when faced with a loop spanning the majority of the routine.

In particular fget_unlocked has an unlikely corner case where it starts almost
from scratch. Faced with this clang generates a maze of taken jumps, whereas
gcc produces jump-free code (in the expected case).

Work around the problem by providing a variant which only tries once and
resorts to calling the original code if anything goes wrong.

While here, note that the 'seq' parameter is almost never passed, so the
few users that need it are redirected to call the original routine directly.
ecb991e675416271c1bf87cc6f33567fd19f536e 03-Feb-2020 mjg <mjg@FreeBSD.org> fd: remove the seq argument from fget_unlocked

It is almost always NULL.
89d1f1812c43a06a41b9e6db849a060a588181c8 03-Feb-2020 mjg <mjg@FreeBSD.org> fd: remove the seq argument from fget routines

It is almost always NULL.
21df633ceb7c53b43a556030dcd9434c878f344f 03-Feb-2020 mjg <mjg@FreeBSD.org> ktrace: provide ktrstat_error

This eliminates a branch from its consumers trading it for an extra call
if ktrace is enabled for curthread. Given that this is almost never true,
the tradeoff is worth it.
763314a49222307ef8fed4ad551bd23d53ccd59f 03-Feb-2020 mjg <mjg@FreeBSD.org> capsicum: faster cap_rights_contains

Instead of doing a 2 iteration loop (determined at runtime), take advantage
of the fact that the size is already known.

While here provide cap_check_inline so that fget_unlocked does not have to
do a function call.

Verified with the capsicum suite /usr/tests.
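
A simplified sketch of the idea, assuming a rights structure of exactly two 64-bit words. The model_* names are hypothetical, and the real capability code performs additional bookkeeping on the words that is ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	MODEL_NWORDS	2
static_assert(MODEL_NWORDS == 2, "the unrolled variant assumes two words");

struct model_rights { uint64_t words[MODEL_NWORDS]; };

/* Generic form: the loop bound is only known at runtime. */
bool
model_contains_loop(const struct model_rights *big,
    const struct model_rights *little, int n)
{
	for (int i = 0; i < n; i++) {
		if ((big->words[i] & little->words[i]) != little->words[i])
			return (false);
	}
	return (true);
}

/* Same check with the known size spelled out, so no loop is needed. */
bool
model_contains_unrolled(const struct model_rights *big,
    const struct model_rights *little)
{
	return ((big->words[0] & little->words[0]) == little->words[0] &&
	    (big->words[1] & little->words[1]) == little->words[1]);
}
```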
f8e2d90c739aabb7880e845e8b76804864acd920 03-Feb-2020 mjg <mjg@FreeBSD.org> fd: fix f_count acquire in fget_unlocked

The code was using a hand-rolled fcmpset loop, while in other places the same
count is manipulated with the refcount API.

This turned from a stylistic issue into a bug after the API got extended
to support flags. As a result the hand-rolled loop could bump the count high
enough to set the bit flag. Another bump + refcount_release would then free
the file prematurely.

The bug is only present in -CURRENT.
e98538b60126e9b45f983d0a5cc33a6c9eb94840 02-Feb-2020 mjg <mjg@FreeBSD.org> fd: sprinkle some predicts around fget

clang inlines fget -> _fget into kern_fstat and eliminates several checks,
but prior to this change it would assume fget_unlocked was likely to fail
and consequently avoidable jumps got generated.
ebb1f3a14f9393e74175b7848287ea0689dd0bba 02-Feb-2020 mjg <mjg@FreeBSD.org> fd: use atomic_load_ptr instead of hand-rolled cast through volatile

No change in assembly.
fa60d0fc5121c0f566b8f83ab65f428d635cd687 17-Jan-2020 mjg <mjg@FreeBSD.org> vfs: provide F_ISUNIONSTACK as a kludge for libc

Prior to introduction of this op libc's readdir would call fstatfs(2), in
effect unnecessarily copying kilobytes of data just to check fs name and a
mount flag.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D23162
f121d45000fd1c42611ca1e54872bd4545398933 03-Jan-2020 mjg <mjg@FreeBSD.org> vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
9c60b86beb253f0df1c80dbd129b2303219d94e3 11-Dec-2019 mjg <mjg@FreeBSD.org> fd: static-ize and devolatile openfiles

Almost all access is using atomics. The only read is sysctl which should use
a whole-int-at-a-time friendly read internally.
20c5c1bf19b5e55caee0588d92a6f4bd834edea4 02-Oct-2019 markj <markj@FreeBSD.org> Disallow fcntl(F_READAHEAD) when the vnode is not a regular file.

The mountpoint may not have defined an iosize parameter, so an attempt
to configure readahead on a device file can lead to a divide-by-zero
crash.

The sequential heuristic is not applied to I/O to or from device files,
and posix_fadvise(2) returns an error when v_type != VREG, so perform
the same check here.

Reported by: syzbot+e4b682208761aa5bc53a@syzkaller.appspotmail.com
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21864
13d4dfe478f82d11e0c7d467bf4cd4e03a0112b6 25-Sep-2019 kevans <kevans@FreeBSD.org> [1/3] Add mostly Linux-compatible file sealing support

File sealing applies protections against certain actions
(currently: write, growth, shrink) at the inode level. New fileops are added
to accommodate seals - EINVAL is returned by fcntl(2) if they are not
implemented.

Reviewed by: markj, kib
Differential Revision: https://reviews.freebsd.org/D21391
7d29da5483221ca7732e6aa620946fa805283dbb 21-Jul-2019 kib <kib@FreeBSD.org> Check and avoid overflow when incrementing fp->f_count in
fget_unlocked() and fhold().

On a sufficiently large machine, f_count can be legitimately very large,
e.g. malicious code can dup the same fd up to the per-process
file descriptor limit, and then fork as much as it can.
On some smaller machine, I see
kern.maxfilesperproc: 939132
kern.maxprocperuid: 34203
which already overflows u_int. Moreover, the malicious code can create
transient references by sending fds over unix sockets.

I realized that this check was missing after reading
https://secfault-security.com/blog/FreeBSD-SA-1902.fd.html

Reviewed by: markj (previous version), mjg
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20947
9d687d2f39a1ee9d7567d1d12cd4f078c6f921aa 29-Jun-2019 markj <markj@FreeBSD.org> Use a consistent snapshot of the fd's rights in fget_mmap().

fget_mmap() translates rights on the descriptor to a VM protection
mask. It was doing so without holding any locks on the descriptor
table, so a writer could simultaneously be modifying those rights.
Such a situation would be detected using a sequence counter, but
not before an inconsistency could trigger assertion failures in
the capability code.

Fix the problem by copying the fd's rights to a structure on the stack,
and perform the translation only once we know that that snapshot is
consistent.

Reported by: syzbot+ae359438769fda1840f8@syzkaller.appspotmail.com
Reviewed by: brooks, mjg
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20800
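
A single-threaded sketch of the snapshot approach, under the assumption that an odd sequence value means a writer is in progress. The model_* names are hypothetical, and real concurrent code would need atomic loads and memory barriers that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct model_entry {
	unsigned seq;		/* even: stable, odd: writer in progress */
	uint64_t rights;
};

void
model_write_begin(struct model_entry *e)
{
	assert((e->seq & 1) == 0);
	e->seq++;
}

void
model_write_end(struct model_entry *e)
{
	assert((e->seq & 1) == 1);
	e->seq++;
}

/*
 * Copy the rights to the caller's storage and report whether the copy
 * is consistent.  The caller translates the rights only on success.
 */
bool
model_snapshot_rights(const struct model_entry *e, uint64_t *out)
{
	unsigned seq = e->seq;
	uint64_t copy = e->rights;	/* private copy, never re-read */

	if ((seq & 1) != 0 || seq != e->seq)
		return (false);		/* writer active or raced: retry */
	*out = copy;
	return (true);
}
```

Translating from the private copy rather than from the live entry is what keeps a concurrent writer from exposing a half-updated structure to the translation step.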
927d2d494a9cb7f9497c82759439b82ac7e4516e 20-Jun-2019 asomers <asomers@FreeBSD.org> fcntl: fix overflow when setting F_READAHEAD

VOP_READ and VOP_WRITE take the seqcount in blocks in a 16-bit field.
However, fcntl allows you to set the seqcount in bytes to any nonnegative
31-bit value. The result can be a 16-bit overflow, which will be
sign-extended in functions like ffs_read. Fix this by sanitizing the
argument in kern_fcntl. As a matter of policy, limit to IO_SEQMAX rather
than INT16_MAX.

Also, fifos have overloaded the f_seqcount field for a completely different
purpose ever since r238936. Formalize that by using a union type.

Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20710
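
The sanitizing step might look roughly like the sketch below. model_readahead_blocks() and its block-size parameter are hypothetical, and the value 0x7F used for the limit is an assumption standing in for IO_SEQMAX.

```c
#include <assert.h>
#include <stdint.h>

#define	MODEL_IO_SEQMAX	0x7F	/* assumed stand-in for IO_SEQMAX */
static_assert(MODEL_IO_SEQMAX <= INT16_MAX, "must fit the 16-bit field");

/*
 * Convert a byte count to a block count, rounding up, and clamp the
 * result so that it can never overflow a 16-bit field.
 */
int
model_readahead_blocks(int64_t bytes, int64_t bsize)
{
	int64_t blocks;

	assert(bytes >= 0 && bsize > 0);
	blocks = bytes / bsize + (bytes % bsize != 0);
	if (blocks > MODEL_IO_SEQMAX)
		blocks = MODEL_IO_SEQMAX;
	return ((int)blocks);
}
```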
f6d894c8f2afb51786f37ef95eedb365fb1833b4 23-May-2019 kib <kib@FreeBSD.org> Make pack_kinfo() available for external callers.

Reviewed by: jilles, tmunro
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D20258
fb4ce630e036f6b73bef06c3c4b9c7bf363a9b23 25-Mar-2019 markj <markj@FreeBSD.org> Reject F_SETLK_REMOTE commands when sysid == 0.

A sysid of 0 denotes the local system, and some handlers for remote
locking commands do not attempt to deal with local locks. Note that
F_SETLK_REMOTE is only available to privileged users as it is intended
to be used as a testing interface.

Reviewed by: kib
Reported by: syzbot+9c457a6ae014a3281eb8@syzkaller.appspotmail.com
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19702
df6fb642aae895f69350d233b49505578b941949 27-Feb-2019 mjg <mjg@FreeBSD.org> Rename seq to seqc to avoid namespace clashes with Linux

Linux generates the content of procfs files using a mechanism prefixed with
seq_*. This in particular came up with recent gcov import.

Sponsored by: The FreeBSD Foundation
ba1160d69871ac2f2d4dd9cdeba066a6b93b5504 23-Feb-2019 mmacy <mmacy@FreeBSD.org> Change seq_read to seq_load to avoid namespace conflicts with lkpi

MFC after: 1 week
Sponsored by: iX Systems
7354c4db663a9cba79053a06a33487ce98fb4939 20-Feb-2019 markj <markj@FreeBSD.org> Remove an obsolete comment.

MFC after: 3 days
8598ea893e5596ee2ebedcbc0d141b446a31e1e9 14-Dec-2018 mjg <mjg@FreeBSD.org> vfs: mostly depessimize NDINIT_ALL

1) filecaps_init was unnecessarily a function call
2) an assignment at the end was preventing tail calling of cap_rights_init

Sponsored by: The FreeBSD Foundation
ba8523cc7ccc6027be946711bffc2f09c73bd667 11-Dec-2018 mjg <mjg@FreeBSD.org> fd: dedup code in sys_getdtablesize

Sponsored by: The FreeBSD Foundation
45f96abb722ab10494806d7866ce4890f7b57d3b 11-Dec-2018 mjg <mjg@FreeBSD.org> fd: tidy up closing a fd

- avoid a call to knote_close in the common case
- annotate mqueue as unlikely

Sponsored by: The FreeBSD Foundation
78cf9b9e38c0d584da0d9f57994634c1662947a9 11-Dec-2018 mjg <mjg@FreeBSD.org> fd: stop looking for exact freefile after allocation

If a lower fd is closed later, the lookup goes to waste. Allocation
always performs the lookup anyway.

Sponsored by: The FreeBSD Foundation
69fc5b56ceb51c5825d189e80dd94c7f0072635e 07-Dec-2018 mjg <mjg@FreeBSD.org> fd: use racct_set_unlocked

Sponsored by: The FreeBSD Foundation
76d3335601171b4a37a77d07624cc51356fd08dd 07-Dec-2018 mjg <mjg@FreeBSD.org> fd: try to do less work with the lock in dup

Sponsored by: The FreeBSD Foundation
77488e7af0beafa41d679d91eae3b0132f559862 29-Nov-2018 mjg <mjg@FreeBSD.org> fd: unify fd range check across the routines

While here annotate out of range as unlikely.

Sponsored by: The FreeBSD Foundation
468ae8ae63263d99fbacde5728c918708162c380 12-Oct-2018 mjg <mjg@FreeBSD.org> capsicum: provide cap_rights_fde_inline

Reading caps is in the hot path (on each successful fd lookup), but
completely unnecessarily requires a function call.

Approved by: re (gjb)
Sponsored by: The FreeBSD Foundation
4279599452b564d5824480e060105ec8232207e6 20-Sep-2018 mjg <mjg@FreeBSD.org> fd: prevent inlining of _fdrop throughout kern_descrip.c

fdrop is used in several places in the file and almost never has to call
_fdrop. Thus inlining it is a pure waste of space.

Approved by: re (kib)
9a3371344fc75a1ea0f41705181f885ff6b2636e 12-Jul-2018 mjg <mjg@FreeBSD.org> fd: stop passing M_ZERO to uma_zalloc

The optimisation seen with malloc cannot be used here as zone sizes are
not known at compilation. Thus bzero by hand to get the optimisation
instead.
39f527e7eedb9cb4fb142f42f239c78ab536c970 10-Jul-2018 brooks <brooks@FreeBSD.org> Use uintptr_t alone when assigning to kvaddr_t variables.

Suggested by: jhb
8baf738e843a341f722c0f6057c4d2dfbf8ec351 06-Jul-2018 brooks <brooks@FreeBSD.org> Correct breakage on 32-bit platforms from r335979.
6615ed4c6149c9810ba766e318f591d44a0596df 05-Jul-2018 brooks <brooks@FreeBSD.org> Make struct xinpcb and friends word-size independent.

Replace size_t members with ksize_t (uint64_t) and pointer members
(never used as pointers in userspace, but instead as unique
identifiers) with kvaddr_t (uint64_t). This makes the structs
identical between 32-bit and 64-bit ABIs.

On 64-bit systems, the ABI is maintained. On 32-bit systems,
this is an ABI breaking change. The ABI of most of these structs
was previously broken in r315662. This also imposes a small API
change on userspace consumers who must handle kernel pointers
becoming virtual addresses.

PR: 228301 (exp-run by antoine)
Reviewed by: jtl, kib, rwatson (various versions)
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D15386
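
A small sketch of the layout idea, using hypothetical model_* stand-ins for ksize_t and kvaddr_t. Because both members are fixed 64-bit integers, the structure has the same size on 32-bit and 64-bit ABIs.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t model_ksize_t;		/* stand-in for ksize_t */
typedef uint64_t model_kvaddr_t;	/* stand-in for kvaddr_t */

struct model_xrecord {
	model_ksize_t	len;	/* was a size_t */
	model_kvaddr_t	id;	/* was a pointer, used only as an identifier */
};
static_assert(sizeof(struct model_xrecord) == 16,
    "layout must not depend on the word size");

/* Export a pointer as an opaque identifier, going through uintptr_t. */
model_kvaddr_t
model_export_ptr(const void *p)
{
	return ((uintptr_t)p);	/* widened implicitly to 64 bits */
}
```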
98274e3f11ad5ebefe16b45787a648fc2f3626e2 01-Jun-2018 emaste <emaste@FreeBSD.org> ANSIfy sys/kern
3886da5a93cf99e230802a741651fd31e8982370 19-May-2018 mmacy <mmacy@FreeBSD.org> capsicum: propagate const correctness
a0bd5d3d7ffae2d09d6ae3cb12bed3ca80e88928 09-May-2018 mmacy <mmacy@FreeBSD.org> Eliminate the overhead of gratuitous repeated reinitialization of cap_rights

- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.

Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
491580c1f91dc4581b8b75782a8ac05a4827d5da 04-May-2018 mmacy <mmacy@FreeBSD.org> `dup1_processes -t 96 -s 5` on a dual 8160

x dup_before
+ dup_after
[ministat distribution plot: all dup_after samples lie above all dup_before samples]
    N           Min           Max        Median           Avg        Stddev
x   5  1.514954e+08 1.5230351e+08 1.5206157e+08 1.5199371e+08     341205.71
+   5 1.5494336e+08 1.5519569e+08 1.5511982e+08 1.5508323e+08     96232.829
Difference at 95.0% confidence
        3.08952e+06 +/- 365604
        2.03266% +/- 0.245071%
        (Student's t, pooled s = 250681)

Reported by: mjg@
MFC after: 1 week
49cba071c4770743ccd85e6036f76470b245992e 22-Apr-2018 mjg <mjg@FreeBSD.org> lockf: slightly depessimize

1. check if P_ADVLOCK is already set and if so, don't lock to set it
(stolen from DragonFly)
2. when trying for fast path unlock, check that we are doing unlock
first instead of taking the interlock for no reason (e.g. if we want
to *lock*). while here make it more likely that failing fast path will
not take the interlock either by checking for state

Note the code is severely pessimized both single- and multithreaded.
6d687d59193e83dfb04dbbbd5b50ee74e906d857 17-Apr-2018 jhb <jhb@FreeBSD.org> Properly do a deep copy of the ioctls capability array for fget_cap().

fget_cap() tries to do a cheaper snapshot of a file descriptor without
holding the file descriptor lock. This snapshot does not do a deep
copy of the ioctls capability array, but instead uses a different
return value to inform the caller to retry the copy with the lock
held. However, filecaps_copy() was returning 1 to indicate that a
retry was required, and fget_cap() was checking for 0 (actually
'!filecaps_copy()'). As a result, fget_cap() did not do a deep copy
of the ioctls array and just reused the original pointer. This caused
multiple file descriptor entries to think they owned the same pointer
and eventually resulted in duplicate frees.

The only code path that I'm aware of that triggers this is to create a
listen socket that has a restricted list of ioctls and then call
accept() which calls fget_cap() with a valid filecaps structure from
getsock_cap().

To fix, change the return value of filecaps_copy() to return true if
it succeeds in copying the caps and false if it fails because the lock
is required. I find this more intuitive than fixing the caller in
this case. While here, change the return type from 'int' to 'bool'.

Finally, make filecaps_copy() more robust in the failure case by not
copying any of the source filecaps structure over. This avoids the
possibility of leaking a pointer into a structure if a similar future
caller doesn't properly handle the return value from filecaps_copy()
at the expense of one more branch.

I also added a test case that panics before this change and now passes.

Reviewed by: kib
Discussed with: mjg (not a fan of the extra branch)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D15047
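
The corrected contract can be sketched as follows. The structure and the model_filecaps_copy() name are simplified assumptions; the sketch only shows the two properties described above: true means the copy (including a deep copy of the array) succeeded, and on false the destination is left untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct model_filecaps {
	unsigned long	*ioctls;	/* array owned by this structure, or NULL */
	size_t		 nioctls;
};

/*
 * Returns true if dst now holds an independent copy of src, false if
 * the copy needs the lock ('locked' is false and an array is present).
 */
bool
model_filecaps_copy(const struct model_filecaps *src,
    struct model_filecaps *dst, bool locked)
{
	unsigned long *copy = NULL;

	assert(src != NULL && dst != NULL);
	if (src->ioctls != NULL) {
		if (!locked)
			return (false);	/* dst is not modified at all */
		assert(src->nioctls > 0);
		copy = malloc(src->nioctls * sizeof(*copy));
		if (copy == NULL)
			abort();	/* model only: allocation failure is fatal */
		memcpy(copy, src->ioctls, src->nioctls * sizeof(*copy));
	}
	dst->ioctls = copy;
	dst->nioctls = src->nioctls;
	return (true);
}
```

A caller that tests '!model_filecaps_copy(...)' to mean "retry with the lock held" now matches the return value, and each structure frees only its own array.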
9d79658aab1a30f34fee169ce74bdff4ca405c18 06-Apr-2018 brooks <brooks@FreeBSD.org> Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
37f992163fbc8a172756c94e4aa2eee0c4567a3f 28-Mar-2018 mjg <mjg@FreeBSD.org> fd: tighten seq protected areas to not contain malloc/free
0aaf564f9456c645cdd13ad7545d6b84d60bd5d5 17-Jan-2018 jhb <jhb@FreeBSD.org> Use long for the last argument to VOP_PATHCONF rather than a register_t.

pathconf(2) and fpathconf(2) both return a long. The kern_[f]pathconf()
functions now accept a pointer to a long value rather than modifying
td_retval directly. Instead, the system calls explicitly store the
returned long value in td_retval[0].

Requested by: bde
Reviewed by: kib
Sponsored by: Chelsio Communications
b7316728818bbe262e6c8fc44ab37b1ba3bd21a8 19-Dec-2017 jhb <jhb@FreeBSD.org> Add a custom VOP_PATHCONF method for fdescfs.

The method handles NAME_MAX and LINK_MAX explicitly. For all other
pathconf variables, the method passes the request down to the underlying
file descriptor. This requires splitting a kern_fpathconf() syscallsubr
routine out of sys_fpathconf(). Also, to avoid lock order reversals with
vnode locks, the fdescfs vnode is unlocked around the call to
kern_fpathconf(), but with the usecount of the vnode bumped.

MFC after: 1 month
Sponsored by: Chelsio Communications
4736ccfd9c3411d50371d7f21f9450a47c19047e 20-Nov-2017 pfg <pfg@FreeBSD.org> sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
006bffeca3780f48981b8d3293d0019a7dac2372 14-Jun-2017 rlibby <rlibby@FreeBSD.org> ddb show files: fix up file types and whitespace

This makes ddb show files more descriptive and also adjusts the
whitespace to align the columns for non-32-bit architectures.

Reviewed by: cem (previous version), jhb
Approved by: markj (mentor)
Differential Revision: https://reviews.freebsd.org/D11061
6a43deb7a3a4bcfbcd7043b1fb69e35711e54462 05-Jun-2017 kib <kib@FreeBSD.org> Add sysctl vfs.ino64_trunc_error controlling action on truncating
inode number or link count for the ABI compat binaries.

Right now, and by default after the change, too large 64bit values are
silently truncated to 32 bits. Enabling the knob causes the system to
return EOVERFLOW for stat(2) family of compat syscalls when some
values cannot be completely represented by the old structures. For
getdirentries(2), the knob skips the dirents which would cause non-trivial
truncation of d_ino.

EOVERFLOW error is specified by the X/Open 1996 LFS document
('Adding Support for Arbitrary File Sizes to the Single UNIX
Specification').

Based on the discussion with: bde
Sponsored by: The FreeBSD Foundation
e75ba1d5c4c79376a78351c8544388491db49664 23-May-2017 kib <kib@FreeBSD.org> Commit the 64-bit inode project.

Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in a backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
9dc8f703d1518fa11315a2cdcd1669c00f4a2a60 07-Apr-2017 cem <cem@FreeBSD.org> kern_descrip: Move kinfo_ofile size assert under COMPAT_FREEBSD7

The size and structure are not used outside of FreeBSD 7 compatibility ABIs.

Sponsored by: Dell EMC Isilon
c2e2a66783447cad78a03598033c222dd28f7301 05-Feb-2017 ngie <ngie@FreeBSD.org> MFhead@r313266
0d3f5cb833838d06297da1aa2b72fb135e2f139f 05-Feb-2017 mjg <mjg@FreeBSD.org> fd: switch fget_unlocked to atomic_fcmpset
f7bc913c03f72cd7b8e2315e198c8f852d13d30a 30-Jan-2017 mjg <mjg@FreeBSD.org> fd: sprinkle __read_mostly and __exclusive_cache_line
bd0b52fc1f15d6a8a99825302e7ae6163eb9b01e 28-Jan-2017 bapt <bapt@FreeBSD.org> Revert crap accidentally committed
02ac05d57247cac359365e52b87799c38066e31a 28-Jan-2017 bapt <bapt@FreeBSD.org> Revert r312923 a better approach will be taken later
eaea1f53fc3306343a58a9a09856b6a85a9d3fda 13-Jan-2017 glebius <glebius@FreeBSD.org> Remove deprecated fgetsock() and fputsock().
290ab10d4da58c1dcad380bbc7b65adfce4ddd61 01-Jan-2017 mjg <mjg@FreeBSD.org> fd: access openfiles once in falloc_noinstall

This is similar to what's done with nprocs.

Note this is only a band aid.
f4dcd1882ef64086e3a3376512b3cd902f03c109 30-Dec-2016 mjg <mjg@FreeBSD.org> Remove cpu_spinwait after seq_consistent.

It does not add any benefit as the read routine will do it as necessary.
f03b37f3e8c9e424c3a74ed50c1ed0fcf5053530 12-Dec-2016 mjg <mjg@FreeBSD.org> vfs: add vrefact, to be used when the vnode has to be already active

This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active
vnodes.

Reviewed by: kib (previous version)
312591feddbdd5949c3bf821ebaa7f11027c4296 22-Nov-2016 rwatson <rwatson@FreeBSD.org> Audit 'fd' and 'cmd' arguments to fcntl(2), and when generating BSM,
always audit the file-descriptor number and vnode information for all
fcntl(2) commands, not just locking-related ones. This was likely an
oversight in the original adaptation of this code from XNU.

MFC after: 3 days
Sponsored by: DARPA, AFRL
35bdd4b3f9415e7925fee856bef6dc1709de61aa 24-Sep-2016 julian <julian@FreeBSD.org> Give the user a clue as to which process hit maxfiles.

MFC after: 1 week
Sponsored by: Panzura
a50a02f73465e6718ca0c96a9cf79841ad9d490b 23-Sep-2016 oshogbo <oshogbo@FreeBSD.org> fd: fix up fget_cap

If the kernel is not compiled with the CAPABILITIES kernel option,
fget_unlocked doesn't return the sequence number, so fd_modify will
always report modification; in that case we get an infinite loop.

Reported by: br
Reviewed by: mjg
Tested by: br, def
2d588d63f85f3a641cd8ccf1f675d31d27f513ae 23-Sep-2016 mjg <mjg@FreeBSD.org> fd: fix up fgetvp_rights after r306184

fget_cap_locked returns a referenced file, but the fgetvp_rights does
not need it. Instead, due to the filedesc lock being held, it can
ref the vnode after the file was looked up.

Fix up fget_cap_locked to be consistent with other _locked helpers and not
ref the file.

This plugs a leak introduced in r306184.

Pointy hat to: mjg, oshogbo
f8bf539062acdd798382bed10f867a6b0d34bd30 22-Sep-2016 oshogbo <oshogbo@FreeBSD.org> fd: simplify fgetvp_rights by using fget_cap_locked

Reviewed by: mjg
00b67b15b9ffa1019fe84745accba07152c64e44 15-Sep-2016 emaste <emaste@FreeBSD.org> Renumber license clauses in sys/kern to avoid skipping #3
30c9f8790211894de27be0eab37519588cf284f2 12-Sep-2016 oshogbo <oshogbo@FreeBSD.org> fd: add fget_cap and fget_cap_locked primitives

They can be used to obtain capabilities along with a referenced fp.

Reviewed by: mjg@
248ff360f4c721615c93774ffcec24d841322b09 01-Sep-2016 emaste <emaste@FreeBSD.org> allow kern.proc.nfds sysctl in capability mode

Reviewed by: allanjude
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D7733
1a26a9e106760d3cb0b003fb2d0c4b3a03777304 30-Aug-2016 mjg <mjg@FreeBSD.org> fd: add fdeget_locked and use in kern_descrip
0a1afe2a20b29643b9b97bf3b80187acd8888749 10-Aug-2016 mjg <mjg@FreeBSD.org> sigio: do a lockless check in funsetownlist

There is no need to grab the lock first to see if sigio is used, and it
typically is not.
6c02e936c3595a0ebbc386649bd9adcaf3beb6ec 10-Jul-2016 rwatson <rwatson@FreeBSD.org> Audit file-descriptor arguments to I/O system calls such as
read(2), write(2), dup(2), and mmap(2). This auditing is not
required by the Common Criteria (and hence was not being
performed), but is valuable in both contemporary live analysis
and forensic use cases.

MFC after: 3 days
Sponsored by: DARPA, AFRL
2c796ad6408184cf2bb690fa55a4fb5c6a894f35 27-Jun-2016 bdrewery <bdrewery@FreeBSD.org> MFC r298819:

sys/kern: spelling fixes in comments.
3cca53e0a8711f87ce3032fadabbbd3241514ca2 27-May-2016 mjg <mjg@FreeBSD.org> fd: provide a common exit point for unlock in kern_dup

While here assert dropped filedesc lock on return from closefp.
00d578928eca75be320b36d37543a7e2a4f9fbdb 27-May-2016 grehan <grehan@FreeBSD.org> Create branch for bhyve graphics import.
4376e44d3a1b68847c2e83ec9dd34a601e9dd791 08-May-2016 mjg <mjg@FreeBSD.org> fd: assert dropped filedesc lock in fdcloseexec
28823d06561e2e9911180b17a57e05ff19d7cbf6 29-Apr-2016 pfg <pfg@FreeBSD.org> sys/kern: spelling fixes in comments.

No functional change.
e05176a63dbba4794d3d611cf9072885b3cf1eb3 29-Mar-2016 glebius <glebius@FreeBSD.org> sendfile(2) allows sending extra data (headers) from userspace before the
file data. Historically the size of the headers was not checked
against the socket buffer space, so an application could easily overcommit
the socket buffer space.

With the new sendfile (r293439) the problem remained, but a KASSERT was
inserted that checks that the amount of data written to the socket matches
its space. When the size of the headers is bigger than the socket space, the
KASSERT fires. Without INVARIANTS the new sendfile won't panic, but
would report an incorrect number of bytes sent.

o With this change, the headers copyin is moved down into the cycle, after
the sbspace() check. The uio size is trimmed by socket space there,
which fixes the overcommit problem and its consequences.
o The compatibility handling for FreeBSD 4 sendfile headers API is pushed
up the stack to syscall wrappers. This required copying and pasting some
code, but in turn allowed removing an extra stack-carried parameter
from fo_sendfile_t and enclosing the entire compat code in #ifdef. If
more fo_sendfile_t functions are added in the future, the amount of copy
and paste will shrink further.

Reviewed by: emax, gallatin, Maxim Dounin <mdounin mdounin.ru>
Tested by: Vitalij Satanivskij <satan ukr.net>
Sponsored by: Netflix
1b87e4306ee815e729858821fecb0e8826c836fc 09-Mar-2016 jhb <jhb@FreeBSD.org> Simplify AIO initialization now that it is standard.

- Mark AIO system calls as STD and remove the helpers to dynamically
register them.
- Use COMPAT6 for the old system calls with the older sigevent instead of
an 'o' prefix.
- Simplify the POSIX configuration to note that AIO is always available.
- Handle AIO in the default VOP_PATHCONF instead of special casing it in
the pathconf() system call. fpathconf() is still hackish.
- Remove freebsd32_aio_cancel() as it just called the native one directly.

Reviewed by: kib
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5589
cb56e836d9a9db908a5f1e021e6483e1e8aab048 10-Feb-2016 jhb <jhb@FreeBSD.org> MFC 287442,287537,288944:
Fix corruption of coredumps due to procstat notes changing size during
coredump generation. The changes in r287442 required some reworking
since the 'fo_fill_kinfo' file op does not exist in stable/10.

287442:
Detect badly behaved coredump note helpers

Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.

When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.

NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' fd table
via vn_fullpath. As vnodes may move around during dump, this is racy.

So:

- Detect badly behaved notes in putnote() and pad underfilled notes.

- Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to
exercise the NT_PROCSTAT_FILES corruption. It simply picks random
lengths to expand or truncate paths to in fo_fill_kinfo_vnode().

- Add a sysctl, kern.coredump_pack_fileinfo, to allow users to
disable kinfo packing for PROCSTAT_FILES notes. This should avoid
both FILES note corruption and truncation, even if filenames change,
at the cost of about 1 kiB in padding bloat per open fd. Document
the new sysctl in core.5.

- Fix note_procstat_files to self-limit in the 2nd pass. Since
sometimes this will result in a short write, pad up to our advertised
size. This addresses note corruption, at the risk of sometimes
truncating the last several fd info entries.

- Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the
zero padding.

287537:
Follow-up to r287442: Move sysctl to compiled-once file

Avoid duplicate sysctl nodes.

288944:
Fix core corruption caused by race in note_procstat_vmmap

This fix is spiritually similar to r287442 and was discovered thanks to
the KASSERT added in that revision.

NT_PROCSTAT_VMMAP output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' vm map
via vn_fullpath. As vnodes may move during coredump, this is racy.

We do not remove the race, only prevent it from causing coredump
corruption.

- Add a sysctl, kern.coredump_pack_vmmapinfo, to allow users to disable
kinfo packing for PROCSTAT_VMMAP notes. This avoids VMMAP corruption
and truncation, even if names change, at the cost of up to PATH_MAX
bytes per mapped object. The new sysctl is documented in core.5.

- Fix note_procstat_vmmap to self-limit in the second pass. This
addresses corruption, at the cost of sometimes producing a truncated
result.

- Fix PROCSTAT_VMMAP consumers libutil (and libprocstat, via copy-paste)
to grok the new zero padding.

Approved by: re (gjb)
bc4e60465f09836d24e92a50cb8e5273c32421d5 07-Nov-2015 mjg <mjg@FreeBSD.org> fd: implement kern.proc.nfds sysctl

Intended purpose is to provide an equivalent of OpenBSD's getdtablecount
syscall for the compat library.
519b1a72110a3fcabd50992789ac60e739783c6f 07-Sep-2015 mjg <mjg@FreeBSD.org> fd: make rights a mandatory argument to fgetvp_rights

The only caller already always passes rights.
cc8534cb73a97c9ee7e5e8868797861b114dfaa8 07-Sep-2015 mjg <mjg@FreeBSD.org> fd: make the common case in filecaps_copy work lockless

The filedesc lock is only needed if ioctls caps are present, which is a
rare situation. This is a step towards reducing the scope of the filedesc
lock.
f96df638b82d62592ace30c48021cadb525e0095 03-Sep-2015 cem <cem@FreeBSD.org> Detect badly behaved coredump note helpers

Coredump notes depend on being able to invoke dump routines twice; once
in a dry-run mode to get the size of the note, and another to actually
emit the note to the corefile.

When a note helper emits a different length section the second time
around than the length it requested the first time, the kernel produces
a corrupt coredump.

NT_PROCSTAT_FILES output length, when packing kinfo structs, is tied to
the length of filenames corresponding to vnodes in the process' fd table
via vn_fullpath. As vnodes may move around during dump, this is racy.

So:

- Detect badly behaved notes in putnote() and pad underfilled notes.

- Add a fail point, debug.fail_point.fill_kinfo_vnode__random_path to
exercise the NT_PROCSTAT_FILES corruption. It simply picks random
lengths to expand or truncate paths to in fo_fill_kinfo_vnode().

- Add a sysctl, kern.coredump_pack_fileinfo, to allow users to
disable kinfo packing for PROCSTAT_FILES notes. This should avoid
both FILES note corruption and truncation, even if filenames change,
at the cost of about 1 kiB in padding bloat per open fd. Document
the new sysctl in core.5.

- Fix note_procstat_files to self-limit in the 2nd pass. Since
sometimes this will result in a short write, pad up to our advertised
size. This addresses note corruption, at the risk of sometimes
truncating the last several fd info entries.

- Fix NT_PROCSTAT_FILES consumers libutil and libprocstat to grok the
zero padding.

With suggestions from: bjk, jhb, kib, wblock
Approved by: markj (mentor)
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3548
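
The two-pass scheme described in this commit can be sketched in userspace C. This is only an illustration under stated assumptions: note_fn, put_note, demo_note and struct demo_state are hypothetical names and do not reproduce the kernel's putnote() interface. The sketch shows a sizing pass that fixes the advertised length, a writing pass that is limited to that length, and zero padding of any shortfall.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical note helper: with out == NULL it only reports the length it
 * would like to emit; otherwise it writes at most cap bytes and returns the
 * number of bytes actually written.
 */
typedef size_t (*note_fn)(char *out, size_t cap, void *arg);

/*
 * Emit one note into dst. The first (dry-run) call fixes the advertised
 * size; the second call is limited to that size and any shortfall is
 * zero-padded, so the note always occupies exactly the advertised length.
 * Returns the advertised length, or 0 if dst is too small.
 */
static size_t
put_note(note_fn fn, void *arg, char *dst, size_t dstcap)
{
	size_t advertised, written;

	advertised = fn(NULL, 0, arg);
	if (advertised > dstcap)
		return (0);
	written = fn(dst, advertised, arg);
	assert(written <= advertised);	/* helper must self-limit */
	if (written < advertised)
		memset(dst + written, 0, advertised - written);
	return (advertised);
}

/* Demo helper whose output changes between the two passes. */
struct demo_state {
	const char *first;	/* path seen during the sizing pass */
	const char *second;	/* path seen during the writing pass */
	int calls;
};

static size_t
demo_note(char *out, size_t cap, void *arg)
{
	struct demo_state *st = arg;
	const char *s = (st->calls++ == 0) ? st->first : st->second;
	size_t len = strlen(s);

	if (out == NULL)
		return (len);
	if (len > cap)		/* truncate to the advertised size */
		len = cap;
	memcpy(out, s, len);
	return (len);
}
```

If the path shrinks between passes the tail of the note is zero-filled; if it grows, the output is truncated. Those are the two trade-offs the commit message names.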
0c1fc3bcd233287a005584c770d900a6d4a650cd 02-Sep-2015 mjg <mjg@FreeBSD.org> fd: remove UMA_ZONE_ZINIT argument from Files zone

Originally it was added in order to prevent trashing of objects with
INVARIANTS enabled. The same effect is now provided with mere UMA_ZONE_NOFREE.

This reverts r286921.

Discussed with: kib
730b59f53ec753ae0d0e38da33598a8106bbc339 19-Aug-2015 kib <kib@FreeBSD.org> fget_unlocked() depends on the freed struct file f_count field being
zero. The file_zone is no-free, but r284861 added trashing of the
freed memory. The most visible manifestations of the issue were 'memory
modified after free' panics for the file zone, triggered from
falloc_noinstall().

Add the UMA_ZONE_ZINIT flag to turn off trashing. mjg noted that it makes
sense to not trash freed memory for any no-free zone, which will be
done later.

Reported and tested by: pho
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
4072f1cf769d51d0ca066d83b0e921cf3ec74bb6 29-Jul-2015 ed <ed@FreeBSD.org> Introduce falloc_caps() to create descriptors with capabilities in place.

falloc_noinstall() followed by finstall() allows you to create and
install file descriptors with custom capabilities. Add falloc_caps()
that can do both of these actions in one go.

This will be used by CloudABI to create pipes with custom capabilities.

Reviewed by: mjg
8a59bb0b0a335464001e96a099f58def1ab97a42 28-Jul-2015 kib <kib@FreeBSD.org> MFC r285134 (by mjg):
fd: de-k&r-ify functions + some whitespace fixes

MFC r285269:
Handle copyout for the fcntl(F_OGETLK) using oflock structure.
adb070c566016d743098c1caf4b71760ca04d0fa 16-Jul-2015 mjg <mjg@FreeBSD.org> fd: partially deduplicate fdescfree and fdescfree_remapped

This also moves vrele of cdir/rdir/jdir vnodes earlier, which should not
matter.
a85d617da64df13a78f7bad95ba6c0a6be6da840 16-Jul-2015 ed <ed@FreeBSD.org> Implement CloudABI's exec() call.

Summary:
In a runtime that is purely based on capability-based security, there is
a strong emphasis on how programs start their execution. We need to make
sure that we execute a new program with an exact set of file
descriptors, ensuring that credentials are not leaked into the process
accidentally.

Providing the right file descriptors is just half the problem. There
also needs to be a framework in place that gives meaning to these file
descriptors. How does a CloudABI mail server know which of the file
descriptors corresponds to the socket that receives incoming emails?
Furthermore, how will this mail server acquire its configuration
parameters, as it cannot open a configuration file from a global path on
disk?

CloudABI solves this problem by replacing traditional string command
line arguments by a tree-like data structure consisting of scalars,
sequences and mappings (similar to YAML/JSON). In this structure, file
descriptors are treated as a first-class citizen. When calling exec(),
file descriptors are passed on to the new executable if and only if they
are referenced from this tree structure. See the cloudabi-run(1) man
page for more details and examples (sysutils/cloudabi-utils).

Fortunately, the kernel does not need to care about this tree structure
at all. The C library is responsible for serializing and deserializing,
but also for extracting the list of referenced file descriptors. The
system call only receives a copy of the serialized data and a layout of
what the new file descriptor table should look like:

int proc_exec(int execfd, const void *data, size_t datalen, const int *fds,
size_t fdslen);

This change introduces a set of fd*_remapped() functions:

- fdcopy_remapped() pulls a copy of a file descriptor table, remapping
all of the file descriptors according to the provided mapping table.
- fdinstall_remapped() replaces the file descriptor table of the process
by the copy created by fdcopy_remapped().
- fdescfree_remapped() frees the table in case we aborted before
fdinstall_remapped().

We then add a function exec_copyin_data_fds() that builds on top of these
functions. It copies in the data and constructs a new remapped file
descriptor table. This is used by cloudabi_sys_proc_exec().

Test Plan:
cloudabi-run(1) is capable of spawning processes successfully, providing
it data and file descriptors. procstat -f seems to confirm all is good.
Regular FreeBSD processes also work properly.

Reviewers: kib, mjg

Reviewed By: mjg

Subscribers: imp

Differential Revision: https://reviews.freebsd.org/D3079
a85ed5531d84d12ab5ad8b50190028c228d7d861 11-Jul-2015 mjg <mjg@FreeBSD.org> Create a dedicated function for ensuring that cdir and rdir are populated.

Previously several places were doing it on their own, some of them
incorrectly (e.g. without the filedesc locked) or even in an actively harmful
way, by populating jdir or assigning rootvnode without vrefing it.

Reviewed by: kib
c71e9ab8634afefe13ecbf6d9f4d812ed55d78fb 11-Jul-2015 mjg <mjg@FreeBSD.org> Move chdir/chroot-related fdp manipulation to kern_descrip.c

Prefix exported functions with pwd_.

Deduplicate some code by adding a helper for setting fd_cdir.

Reviewed by: kib
2efe5a9a7a0d0640addd5cc9da5d44dff0d43e05 10-Jul-2015 mjg <mjg@FreeBSD.org> fd: further cleanup of kern_dup

- make mode enum start from 0 so that the assertion covers all cases [1]
- rename prefix _CLOEXEC flag with _FLAG
- postpone fhold on the old file descriptor, which eliminates the need to fdrop
in error cases.
- fixup FDDUP_FCNTL check missed in the previous commit

This removes 'fp == oldfde->fde_file' assertion which had little value. kern_dup
only calls fd-related functions which cannot drop the lock or a whole lot of
races would be introduced.

Noted by: kib [1]
b3aa72d2a3afd8802add40beef8863175fbb206e 10-Jul-2015 mjg <mjg@FreeBSD.org> fd: split kern_dup flags argument into actual flags and a mode

Tidy up the code inside to switch on the mode.
8cbb0879ba9efc6e4b370f1241205968342b2a05 09-Jul-2015 ed <ed@FreeBSD.org> Add implementations for some of the CloudABI file descriptor system calls.

All of the CloudABI system calls that operate on file descriptors of an
arbitrary type are prefixed with fd_. This change adds wrappers for
most of these system calls around their FreeBSD equivalents.

The dup2() system call present on CloudABI deviates from POSIX, in the
sense that it can only be used to replace an existing file descriptor. It
cannot be used to create new ones. The reason for this is that this is
inherently thread-unsafe. Furthermore, there is no need on CloudABI to
use fixed file descriptor numbers. File descriptors 0, 1 and 2 have no
special meaning.

This change exposes the kern_dup() through <sys/syscallsubr.h> and puts
the FDDUP_* flags in <sys/filedesc.h>. It then adds a new flag,
FDDUP_MUSTREPLACE to force that file descriptors are replaced -- not
allocated.

Differential Revision: https://reviews.freebsd.org/D3035
Reviewed by: mjg
d4c928e3e77b95956aa4d990feb5861421ec3db4 09-Jul-2015 mjg <mjg@FreeBSD.org> fd: prepare do_dup for being exported

- rename it to kern_dup.
- prefix flags with FD
- assert that correct flags were passed
79ef161338c558af0dbe0928d4315be38a814695 08-Jul-2015 kib <kib@FreeBSD.org> Handle copyout for the fcntl(F_OGETLK) using oflock structure.
Otherwise, kernel overwrites a word past the destination.

Submitted by: walter@pelissero.de
PR: 196718
MFC after: 1 week
feeee4c707ab28bab6cd1180144cef39832a1026 05-Jul-2015 mjg <mjg@FreeBSD.org> fd: make 'rights' a mandatory argument to fget* functions
a0dcef47a21a3f97f54486d540a1c1c4ef230c3f 04-Jul-2015 mjg <mjg@FreeBSD.org> fd: de-k&r-ify functions + some whitespace fixes

No functional changes.
e1055c772b450db60c53109ac296de1c9322438d 21-Jun-2015 trasz <trasz@FreeBSD.org> MFC r282213:

Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

MFC r282901:

Build GENERIC with RACCT/RCTL support by default. Note that it still
needs to be enabled by adding "kern.racct.enable=1" to /boot/loader.conf.

Note those two are MFC-ed together, because the latter one changes the
name of RACCT_DISABLED option to RACCT_DEFAULT_TO_DISABLED. Should have
committed the renaming separately...

Relnotes: yes
Sponsored by: The FreeBSD Foundation
a5a3a94b02f2b62a521d49f600b762567d4f8451 16-Jun-2015 mjg <mjg@FreeBSD.org> fd: make rights a mandatory argument to fget_unlocked
1b5e46102cb321470ac57b25aeac1c8febd34655 16-Jun-2015 mjg <mjg@FreeBSD.org> fd: don't unnecessarily copy capabilities in _fget
1c9277463fcfb21d0d28621797c0f088ca64994c 14-Jun-2015 mjg <mjg@FreeBSD.org> fd: reduce excessive zeroing on fd close

fde_file as NULL is already an indicator of an unused fd. All other
fields are populated when fp is installed.
98e752b84d7e81cbd426946b8caf6bdc96493ce4 14-Jun-2015 mjg <mjg@FreeBSD.org> fd: move out actual fp installation to _finstall

Use it in fd passing functions as the first step towards fd code cleanup.
b6be1c5ace6f352e4f44e0ef31ed744b19658c5e 10-Jun-2015 mjg <mjg@FreeBSD.org> Fixup the build after r284215.

Submitted by: Ivan Klymenko <fidaj ukr.net> [slightly modified]
d7bc9285a673d676370f95a84ce93ef553c8688c 10-Jun-2015 mjg <mjg@FreeBSD.org> Implement lockless resource limits.

Use the same scheme implemented to manage credentials.

Code needing to look at process's credentials (as opposed to thread's) is
provided with *_proc variants of relevant functions.

Places which possibly had to take the proc lock anyway still use the proc
pointer to access limits.
59391cba660199284969090c9993ba04ba5846c7 10-Jun-2015 mjg <mjg@FreeBSD.org> fd: remove fdesc_mtx
34814bbfee78c6ce02f8dbfa7c2a287f529f832f 10-Jun-2015 mjg <mjg@FreeBSD.org> fd: use atomics to manage fd_refcnt and fd_holdcnt

This gets rid of fdesc_mtx.
3dafd57ac77c8c05972132e9c4996e10cec248a8 18-May-2015 mjg <mjg@FreeBSD.org> fd: fix imbalanced fdp unlock in F_SETLK and F_GETLK

MFC after: 3 days
802017a04b7fb1bc31576aa2de108cf67083c42c 29-Apr-2015 trasz <trasz@FreeBSD.org> Add kern.racct.enable tunable and RACCT_DISABLED config option.
The point of this is to be able to add RACCT (with RACCT_DISABLED)
to GENERIC, to avoid having to rebuild the kernel to use rctl(8).

Differential Revision: https://reviews.freebsd.org/D2369
Reviewed by: kib@
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
c892ebeb5ce275cd0d5b28f1a2406ec478c4657b 26-Apr-2015 mjg <mjg@FreeBSD.org> fd: plug an always overwritten initialization in fdalloc
22da590f1183b3e1c50dcae3b7b86f7d0f98f847 11-Apr-2015 mjg <mjg@FreeBSD.org> fd: remove filedesc argument from fdclose

Just accept a thread instead. This makes it consistent with fdalloc.

No functional changes.
ce00fb8c102a2d4ec56ee97773d623128731e3f2 24-Mar-2015 mjg <mjg@FreeBSD.org> filedesc: microoptimize fget_unlocked by getting rid of fd < 0 branch

Casting fd to an unsigned type simplifies fd range comparison to mere checking
if the result is bigger than the table.
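
The unsigned-cast trick can be shown in a few lines of standalone C. This is a sketch, not the kernel code: fd_valid_two_branches and fd_valid_one_branch are hypothetical names, and the table size is assumed to be non-negative.

```c
#include <assert.h>
#include <stdbool.h>

/* Two comparisons: reject negative fds, then fds past the table. */
static bool
fd_valid_two_branches(int fd, int nfiles)
{
	return (fd >= 0 && fd < nfiles);
}

/*
 * One comparison: a negative fd converts to a very large unsigned value,
 * so it fails the same test as an fd that is past the end of the table.
 * Assumes nfiles is never negative.
 */
static bool
fd_valid_one_branch(int fd, int nfiles)
{
	assert(nfiles >= 0);
	return ((unsigned int)fd < (unsigned int)nfiles);
}
```

Both functions agree for every fd as long as nfiles is non-negative, which is why the explicit fd < 0 branch can be dropped.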
6102a34d3875e6b3f22e0245d7698fe549c674a1 19-Mar-2015 rwatson <rwatson@FreeBSD.org> Merge r263233 from HEAD to stable/10:

Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

Sponsored by: Google, Inc.
35d6abc9b429d82132df4ea09cb15630d41a5a8d 14-Mar-2015 dim <dim@FreeBSD.org> Merge ^/head r279985 through r279994.
0dd684d23f5fa183acbeba8ba19a2dcd92d01c7d 14-Mar-2015 ian <ian@FreeBSD.org> Set the SBUF_INCLUDENUL flag in sbuf_new_for_sysctl() so that sysctl
strings returned to userland include the nulterm byte.

Some uses of sbuf_new_for_sysctl() write binary data rather than strings;
clear the SBUF_INCLUDENUL flag after calling sbuf_new_for_sysctl() in
those cases. (Note that the sbuf code still automatically adds a nulterm
byte in sbuf_finish(), but since it's not included in the length it won't
get copied to userland along with the binary data.)

Remove explicit adding of a nulterm byte in a couple places now that it
gets done automatically by the sbuf drain code.

PR: 195668
df6775dbed4dc2d519fc9c29871f761a9cfa370e 12-Mar-2015 kib <kib@FreeBSD.org> MFC r272566:
Convert -1 from sbuf_bcat() to ENOMEM.
2f36a264640b2de955efad3cc496186c55dbd4dc 18-Feb-2015 mjg <mjg@FreeBSD.org> filedesc: obtain a stable copy of credentials in fget_unlocked

This was broken in r278930.

While here tidy up fget_mmap to use fdp from local var instead of obtaining
the same pointer from td.
0a219ba739034ca95b28aa25c7cba77221c8a5b0 17-Feb-2015 mjg <mjg@FreeBSD.org> filedesc: simplify fget_unlocked & friends

Introduce fget_fcntl which performs appropriate checks when needed.
This removes a branch from fget_unlocked.

Introduce fget_mmap dealing with cap_rights_to_vmprot conversion.
This removes a branch from _fget.

Modify fget_unlocked to pass sequence counter to interested callers so
that they can perform their own checks and make sure the result was
obtained from stable & current state.

Reviewed by: silence on -hackers
9bc86796d313e438f3a218fdb300d416f6300f23 21-Jan-2015 mjg <mjg@FreeBSD.org> filedesc: avoid spurious copying of capabilities in fget_unlocked

We obtain a stable copy and store it in local 'fde' variable. Storing another
copy (based on aforementioned variable) does not serve any purpose.

No functional changes.
e15a87cc6afbdc8c09e8b31228795a62e99888c0 21-Jan-2015 mjg <mjg@FreeBSD.org> filedesc: return 0 from badfo_close

The only potential in-tree consumer (_fdrop) special-cased it and returned 0
on its own instead of calling badfo_close.

Remove the special case since it is not needed and very unlikely to encounter
anyway.

No objections from: kib
4b90cc79eee81c2c8350ef5dfecc1ea5892360e8 21-Jan-2015 mjg <mjg@FreeBSD.org> filedesc: fix whitespace nits in fget and fget_read

No functional changes.
03fe27a77383de96ee608557018620836fe5cb12 21-Jan-2015 mjg <mjg@FreeBSD.org> filedesc: plug a test for impossible condition in _fget
58529d92bdbb7dcb7b249197a57fa24ce121ceea 24-Nov-2014 dim <dim@FreeBSD.org> Merge ^/head r274961 through r274978.
c5e82d754f77fcaf7a68c60ff32dec3bdd556f1b 24-Nov-2014 jhb <jhb@FreeBSD.org> Properly initialize the capability rights for vnodes exported to procstat
that aren't for file descriptors (cwd, jdir, tracevp, etc.).

Submitted by: Mikhail <mp@lenta.ru>
e31a493d7e30916835bfeb0f6c490b22bac757bd 23-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: plug a test for impossible condition in fgetvp_rights
077a8b14ec3f0d00ec056b20dc578f3345fedb21 13-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: fixup fdinit to lock fdp and prepare files conditionally

Not all consumers providing fdp to copy from want files.

Perhaps these functions should be reorganized to better express the outcome.

This fixes up panics after r273895.

Reported by: markj
b4ef709604332a259f2a08f546cceec6ab3ecace 13-Nov-2014 kib <kib@FreeBSD.org> Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by: mckusick
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
53273c84d07812ce5a86db41d425a50e20397b64 11-Nov-2014 glebius <glebius@FreeBSD.org> Remove SF_KQUEUE code. This code was developed at Netflix, but was not
ever used. It neither went into stable/10 nor was it documented.
It might be useful, but we collectively decided to remove it rather than
leave it abandoned and unmaintained. It is removed in a single
commit, so restoring it should be easy if anyone wants to reopen
this idea.

Sponsored by: Netflix
7e57127b4607f638b116e0253fbba4e182a03d7b 06-Nov-2014 mjg <mjg@FreeBSD.org> Add sysctl kern.proc.cwd

It returns only the current working directory of the given process, which saves a lot of
overhead over kern.proc.filedesc if the process has a lot of open fds.

Submitted by: Tiwei Bie <btw mail.ustc.edu.cn> (slightly modified)
X-Additional: JuniorJobs project
48a19ff17aa0a4efbd8420705119a8c2ec40b1d6 06-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: avoid taking fdesc_mtx when not necessary in fddrop

No functional changes.
355e7bb0055af5cdc352b748130558416b1e9e7d 06-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: just free old tables without altering the list which is freed anyway

No functional changes.
0983cfdba142006d551e5067810585f2bf67f4e0 03-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: plug sys/kdb.h include which crept in with r274007
04a088dde4e41f9dc15a66768e025e2f4ab2dcbd 03-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: plug unnecessary fdp NULL checks in fdescfree and fdcopy

Anything reaching these functions has fd table.
120816c07fe9c9926d7d5a42e22ad3af2377a078 03-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: create a dedicated zone for struct filedesc0

Currently sizeof(struct filedesc0) is 1096 bytes, which means allocations from
malloc use 2048 bytes.

There is no easy way to shrink the structure to <= 1024 and it is likely to grow in
the future.
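
The 1096-to-2048 figure follows from rounding a request up to the next bucket size. The sketch below assumes plain power-of-two buckets, which is what the numbers in this commit imply; round_up_pow2 is a hypothetical helper and real allocator size classes may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Round n up to the next power of two, as a stand-in for picking a
 * power-of-two allocation bucket.
 */
static size_t
round_up_pow2(size_t n)
{
	size_t bucket = 1;

	/* Keep the shift below from overflowing. */
	assert(n != 0 && n <= SIZE_MAX / 2 + 1);
	while (bucket < n)
		bucket <<= 1;
	return (bucket);
}
```

Under that assumption a 1096-byte structure occupies a 2048-byte bucket and wastes 952 bytes per allocation, while a dedicated zone can be sized to the structure itself.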
d8d7f263db13447a4115367740022849e8c1f88d 02-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: move freeing old tables to fdescfree

They cannot be accessed by anyone and hold count only protects the structure
from being freed.
31183326d58ddfc64a2482d58f96b366c399b3a3 02-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: factor out some code out of fdescfree

Previously it had a huge self-contained chunk dedicated to dealing with shared
tables.

No functional changes.
79f817d7d7f04a098d568f2871bc2de40a8f867c 02-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: tidy up fdcheckstd

No functional changes.
22a53e3b5ace7a690b1f0bb73f790f6d348f9b24 02-Nov-2014 mjg <mjg@FreeBSD.org> filedesc: lock filedesc lock in fdcloseexec only when needed
5b231323b27a7f3aa66ccfd3b009539a117e6ec1 31-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: drop retval argument from do_dup

It was almost always td_retval anyway.

For the one case where it is not, preserve the old value across the call.
6b53d30f115a091f8d1e52ee34c6cb4e5a5a5448 31-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: fix missed comments about fdsetugidsafety

While here just note that both fdsetugidsafety and fdcheckstd take sleepable
locks.
efbe4d69c81daa28d9d9b872462adb13aea4510f 31-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: make fdinit return with source filedesc locked and new one sized
appropriately

Assert FILEDESC_XLOCK_ASSERT only for already used tables in fdgrowtable.
We don't have to call it with the lock held if we are just creating new
filedesc.

As a side note, strictly speaking processes can have fdtables with
fd_lastfile = -1, but then they cannot enter fdgrowtable. The very first file
descriptor they get will be 0, and the only syscall that allows choosing the fd number
requires an active file descriptor. Should this ever change, we can add an 'init'
(or similar) parameter to fdgrowtable.
9772964585c87a3d304974442e63ba84dc775710 31-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: iterate over fd table only once in fdcopy

While here add 'fdused_init' which does not perform unnecessary work.

Drop FILEDESC_LOCK_ASSERT from fdisused and rely on callers to hold
it when appropriate. This function is only used with INVARIANTS.

No functional changes intended.
94f45340d928db4648ba2327caf5ae979a6eacfe 31-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: tidy up fdfree

Implement fdefree_last variant and get rid of 'last' parameter.

No functional changes.
02363563c8ca3a71033a91a65698f16c1cb90ec7 31-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: tidy up fdcopy a little bit

Test for file availability by fde_file != NULL instead of fdisused, this is
consistent with similar checks later.

Drop badfileops check. badfileops don't have DFLAG_PASSABLE set, so it was never
reached in practice.

fdisused is now only used in some KASSERTS, so ifdef it under INVARIANTS.

No functional changes.
cda1078a58af7470bae0d4534c76c588f4f79277 30-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: make sure to force table reload in fget_unlocked when count == 0

This is a fixup to r273843.
569cf8ac16dba31f925cd8cda2c15ea8f3df4333 30-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: microoptimize fget_unlocked by retrying obtaining reference count
without restarting whole lookup

Restart is only needed when fp was closed by current process, which is a much
rarer event than ref/deref by some other thread.
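
The retry strategy can be sketched with C11 atomics. The names ref_acquire_if_not_zero and ref_release are hypothetical and the kernel uses its own atomic primitives; the point is only that a failed compare-and-swap caused by another thread's ref/deref is retried locally, and only a count of zero tells the caller to restart the whole lookup.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference unless the count has already dropped to zero.
 * Returns false only in the zero case, which is the caller's signal to
 * restart the full lookup.
 */
static bool
ref_acquire_if_not_zero(atomic_uint *count)
{
	unsigned int old = atomic_load_explicit(count, memory_order_relaxed);

	while (old != 0) {
		/* On failure old is reloaded with the current value; retry. */
		if (atomic_compare_exchange_weak_explicit(count, &old, old + 1,
		    memory_order_acquire, memory_order_relaxed))
			return (true);
	}
	return (false);
}

/* Drop a reference; the caller must hold one. */
static void
ref_release(atomic_uint *count)
{
	unsigned int old;

	old = atomic_fetch_sub_explicit(count, 1, memory_order_release);
	assert(old > 0);
	(void)old;
}
```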
5bb6a8bca1bb93742c27552b40fa4d271db0beb7 30-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: get rid of atomic_load_acq_int from fget_unlocked

A read barrier was necessary because fd table pointer and table size were
updated separately, opening a window where fget_unlocked could read new size
and old pointer.

This patch puts both these fields into one dedicated structure, pointer to which
is later atomically updated. As such, fget_unlocked only needs a data dependency
barrier which is a noop on all supported architectures.

Reviewed by: kib (previous version)
MFC after: 2 weeks
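
The idea of publishing the size and the slots through one pointer can be sketched with C11 atomics. Everything here is an illustrative assumption: struct table, struct desc and the desc_* functions are hypothetical names, writers are assumed to be serialized by a lock that is not shown, and the reader uses an acquire load where the kernel relies on a data dependency barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Size and slots live in one allocation, so they can never disagree. */
struct table {
	size_t nfiles;
	_Atomic(void *) files[];
};

struct desc {
	_Atomic(struct table *) tbl;
};

static void
desc_init(struct desc *d)
{
	atomic_init(&d->tbl, NULL);
}

/*
 * Writer: build a bigger table, copy the old slots, then publish it with
 * a single release store. The old table is deliberately not freed here,
 * because lockless readers may still be using it.
 */
static int
desc_grow(struct desc *d, size_t n)
{
	struct table *old, *nt;
	void *fp;
	size_t i;

	old = atomic_load_explicit(&d->tbl, memory_order_relaxed);
	assert(old == NULL || n >= old->nfiles);
	nt = malloc(sizeof(*nt) + n * sizeof(nt->files[0]));
	if (nt == NULL)
		return (-1);
	nt->nfiles = n;
	for (i = 0; i < n; i++) {
		fp = NULL;
		if (old != NULL && i < old->nfiles)
			fp = atomic_load_explicit(&old->files[i],
			    memory_order_relaxed);
		atomic_init(&nt->files[i], fp);
	}
	atomic_store_explicit(&d->tbl, nt, memory_order_release);
	return (0);
}

/* Writer: install a file pointer in an existing slot. */
static void
desc_set(struct desc *d, size_t fd, void *fp)
{
	struct table *t = atomic_load_explicit(&d->tbl, memory_order_relaxed);

	assert(t != NULL && fd < t->nfiles);
	atomic_store_explicit(&t->files[fd], fp, memory_order_release);
}

/* Lockless reader: one pointer load yields a matching size and slots. */
static void *
desc_lookup(struct desc *d, size_t fd)
{
	struct table *t = atomic_load_explicit(&d->tbl, memory_order_acquire);

	if (t == NULL || fd >= t->nfiles)
		return (NULL);
	return (atomic_load_explicit(&t->files[fd], memory_order_acquire));
}
```

Because the reader never loads the size and the array pointer separately, it cannot pair a new size with an old array, which is the window the commit describes.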
1f41d295fba2d409098181fe1212b0bbad862311 27-Oct-2014 hselasky <hselasky@FreeBSD.org> MFC r263710, r273377, r273378, r273423 and r273455:

- De-vnet hash sizes and hash masks.
- Fix multiple issues related to arguments passed to SYSCTL macros.

Sponsored by: Mellanox Technologies
04223abe342610734aaa84917dc413327296e4b9 22-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: assert that table size is at least 3 in fdsetugidsafety

Requested by: kib
2ebe66c2905234e9e3b5057b47542a9b61d7144c 22-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: cleanup setugidsafety a little

Rename it to fdsetugidsafety for consistency with other functions.

There is no need to take filedesc lock if not closing any files.

The loop has to verify each file and we are guaranteed fdtable has space
for at least 20 fds. As such there is no need to check fd_lastfile.

While here tidy up is_unsafe.
49c137f7be5791eee8102395257cdf48b40c81f7 21-Oct-2014 hselasky <hselasky@FreeBSD.org> Fix multiple incorrect SYSCTL arguments in the kernel:

- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, since
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after: 3 days
Sponsored by: Mellanox Technologies
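
The "logical OR where binary OR was expected" item is easy to reproduce in isolation. The flag names below are hypothetical stand-ins rather than the real SYSCTL access flags; the sketch only shows why || silently collapses a flag mask to 1.

```c
#include <assert.h>

/* Hypothetical flag bits, for illustration only. */
#define DEMO_FLAG_RD	0x80000000u
#define DEMO_FLAG_WR	0x40000000u

static_assert((DEMO_FLAG_RD & DEMO_FLAG_WR) == 0, "flags must not overlap");

/* Bug: || yields the truth value 1, not a combination of the bits. */
static unsigned int
access_logical_or(void)
{
	return (DEMO_FLAG_RD || DEMO_FLAG_WR);
}

/* Intended: | sets both bits in the resulting mask. */
static unsigned int
access_bitwise_or(void)
{
	return (DEMO_FLAG_RD | DEMO_FLAG_WR);
}
```

A compile-time check on the access argument, as this commit adds with CTASSERT, is one way to catch the first form before it reaches a running system.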
f1a57b3826491cc58fd220f317ce0e5bbaec9edf 20-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: plug 2 write-only variables

Reported by: Coverity
CID: 1245745, 1245746
ece6d4cf1c1c4c0fdde05aab65c2b7e3c96cf53f 15-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: plug 2 assignments to M_ZERO-ed pointers in falloc_noinstall

No functional changes.
6eb4db1c33a22c087b64c3d0d15b9d5e0b54e7fd 14-Oct-2014 mjg <mjg@FreeBSD.org> MFC r269023,r272503,r272505,r272523,r272567,r272569,r272574

Prepare fget_unlocked for reading fd table only once.

Some capsicum functions accept fdp + fd and lookup fde based on that.
Add variants which accept fde.

===============================

Add sequence counters with memory barriers.

Current implementation is somewhat simplistic and hackish,
will be improved later after possible memory barrier overhaul.

===============================

Plug capability races.

fp and appropriate capability lookups were not atomic, which could result in
improper capabilities being checked.

This could result either in protection bypass or in a spurious ENOTCAPABLE.

Make fp + capability check atomic with the help of sequence counters.

===============================

Put an #ifdef _KERNEL around the #include for opt_capsicum.h to
hopefully allow the build to finish after r272505.

===============================

filedesc: fix up breakage introduced in 272505

Include sequence counter support unconditionally [1]. This fixes reported build
problems with e.g. nvidia driver due to missing opt_capsicum.h.

Replace fishy looking sizeof with offsetof. Make fde_seq the last member in
order to simplify calculations.

===============================

Keep struct filedescent comments within 80-char limit.

===============================

seq_t needs to be visible to userspace
98fa5f5d8bbc4ed8f0651db2df217dd5478f289f 05-Oct-2014 mjg <mjg@FreeBSD.org> filedesc: fix up breakage introduced in 272505

Include sequence counter support unconditionally [1]. This fixes reported build
problems with e.g. nvidia driver due to missing opt_capsicum.h.

Replace fishy looking sizeof with offsetof. Make fde_seq the last member in
order to simplify calculations.
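
A minimal sketch of why a trailing member simplifies the size
calculation: with the counter last, offsetof() gives the number of bytes
in front of it, so a copy can skip the destination's own counter. The
struct demo_entry layout and the DEMO_/demo_ names are hypothetical and
are not the kernel's struct filedescent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical entry layout; the counter is deliberately the last member. */
struct demo_entry {
	void		*file;
	unsigned long	 rights;
	unsigned int	 seq;	/* must stay last */
};

/* Number of bytes that precede the counter. */
#define DEMO_COPY_SIZE	offsetof(struct demo_entry, seq)

/* Copy everything except the destination's own sequence counter. */
static void
demo_entry_copy(struct demo_entry *dst, const struct demo_entry *src)
{
	memcpy(dst, src, DEMO_COPY_SIZE);
}
```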

Suggested by: kib [1]
X-MFC: with 272505
2ad09fbf89ce51a0a513c5cc0f55e34513fd72b7 05-Oct-2014 kib <kib@FreeBSD.org> On error, sbuf_bcat() returns -1. Some callers returned this -1 to
the upper layers, which interpret it as errno value, which happens to
be ERESTART. The result was spurious restarts of the sysctls in loop,
e.g. kern.proc.proc, instead of returning ENOMEM to caller.

Convert -1 from sbuf_bcat() to ENOMEM, when returning to the callers
expecting errno.
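
The conversion can be sketched as follows. demo_append() is a
hypothetical stand-in for a helper that, like sbuf_bcat(), reports
failure as -1; the point is only that the -1 is translated before it
reaches a caller that expects an errno value.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper: returns 0 on success and -1 on failure. */
static int
demo_append(int simulate_failure)
{
	return (simulate_failure ? -1 : 0);
}

/* Caller-facing routine: must return 0 or a real errno value. */
static int
demo_export_record(int simulate_failure)
{
	/* Never leak the raw -1 upwards; map it to ENOMEM instead. */
	if (demo_append(simulate_failure) != 0)
		return (ENOMEM);
	return (0);
}
```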

In collaboration with: pho
Sponsored by: The FreeBSD Foundation (kib)
MFC after: 1 week
c0fe514f041daa8ac93204d67c4261bb177c2668 04-Oct-2014 mjg <mjg@FreeBSD.org> Plug capability races.

fp and appropriate capability lookups were not atomic, which could result in
improper capabilities being checked.

This could result either in protection bypass or in a spurious ENOTCAPABLE.

Make fp + capability check atomic with the help of sequence counters.
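
The general sequence-counter technique can be sketched in portable C11.
This is not the kernel's seq(9) code: struct demo_slot and the demo_
functions are hypothetical, the "file" is reduced to an int, and writers
are assumed to be serialized by some outer lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct demo_slot {
	atomic_uint		seq;	/* odd while a write is in progress */
	_Atomic(int)		file;	/* stand-in for the file pointer */
	_Atomic(uint64_t)	rights;	/* stand-in for capability rights */
};

/* Writer side; assumes only one writer runs at a time. */
static void
demo_slot_write(struct demo_slot *s, int file, uint64_t rights)
{
	unsigned int seq;

	seq = atomic_load_explicit(&s->seq, memory_order_relaxed);
	/* Mark the slot as "being modified" (odd value). */
	atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->file, file, memory_order_relaxed);
	atomic_store_explicit(&s->rights, rights, memory_order_relaxed);
	/* Publish the new pair and return to an even value. */
	atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader side; retries until file and rights come from one write. */
static void
demo_slot_read(struct demo_slot *s, int *file, uint64_t *rights)
{
	unsigned int before, after;

	do {
		before = atomic_load_explicit(&s->seq, memory_order_acquire);
		*file = atomic_load_explicit(&s->file, memory_order_relaxed);
		*rights = atomic_load_explicit(&s->rights,
		    memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		after = atomic_load_explicit(&s->seq, memory_order_relaxed);
	} while (before != after || (before & 1) != 0);
}
```

A reader that observes an unchanged, even counter knows that the file
and the rights it read belong together, which is the property the
capability check needs.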

Reviewed by: kib
MFC after: 3 weeks
2a73c68cd0a6c420152d56ec65fd38b319af3f39 28-Sep-2014 kib <kib@FreeBSD.org> MFC r272132:
Fix fcntl(2) compat32 after r270691.

Approved by: re (glebius)
c60179fc3c7373df6621acbd14a4a663a7d913c6 26-Sep-2014 mjg <mjg@FreeBSD.org> MFC r270993:

Fix up proc_realparent to always return correct process.

Prior to the change it would always return initproc for non-traced processes.

This fixes a regression in inferior().

Approved by: re (marius)
54f38c8738077e829c5528976a3269afc01ff7ee 26-Sep-2014 mjg <mjg@FreeBSD.org> Make do_dup() static and move relevant macros to kern_descrip.c

No functional changes.
d972eee1e7d12a787bf64c545caa54bbde6c780a 25-Sep-2014 kib <kib@FreeBSD.org> Fix fcntl(2) compat32 after r270691. The copyin and copyout of the
struct flock are done in sys_fcntl(), which means that compat32 used
direct access to userland pointers.

Move code from sys_fcntl() to new wrapper, kern_fcntl_freebsd(), which
performs the necessary userland memory accesses, and use it from both
native and compat32 fcntl syscalls.

Reported by: jhibbits
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
8f082668d04c6a91668059ff9bdebcfa153839f0 22-Sep-2014 jhb <jhb@FreeBSD.org> Add a new fo_fill_kinfo fileops method to add type-specific information to
struct kinfo_file.
- Move the various fill_*_info() methods out of kern_descrip.c and into the
various file type implementations.
- Rework the support for kinfo_ofile to generate a suitable kinfo_file object
for each file and then convert that to a kinfo_ofile structure rather than
keeping a second, different set of code that directly manipulates
type-specific file information.
- Remove the shm_path() and ksem_info() layering violations.

Differential Revision: https://reviews.freebsd.org/D775
Reviewed by: kib, glebius (earlier version)
4cd91e9d81f8eee5a5ab7b6250d49c03383d1b96 12-Sep-2014 jhb <jhb@FreeBSD.org> Fix various issues with invalid file operations:
- Add invfo_rdwr() (for read and write), invfo_ioctl(), invfo_poll(),
and invfo_kqfilter() for use by file types that do not support the
respective operations. Home-grown versions of invfo_poll() were
universally broken (they returned an errno value, invfo_poll()
uses poll_no_poll() to return an appropriate event mask). Home-grown
ioctl routines also tended to return an incorrect errno (invfo_ioctl
returns ENOTTY).
- Use the invfo_*() functions instead of local versions for
unsupported file operations.
- Reorder fileops members to match the order in the structure definition
to make it easier to spot missing members.
- Add several missing methods to linuxfileops used by the OFED shim
layer: fo_write(), fo_truncate(), fo_kqfilter(), and fo_stat(). Most
of these used invfo_*(), but a dummy fo_stat() implementation was
added.
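
A minimal sketch of the shared "unsupported operation" pattern, assuming
a much reduced operations table with only an ioctl slot. All demo_ names
are hypothetical; the only detail taken from the log is that the ioctl
fallback returns ENOTTY.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_file;

typedef int demo_ioctl_t(struct demo_file *fp, unsigned long cmd, void *data);

/* Reduced operations table; a real one has many more members. */
struct demo_fileops {
	demo_ioctl_t	*fo_ioctl;
};

struct demo_file {
	const struct demo_fileops *f_ops;
};

/* One shared fallback instead of a home-grown copy per file type. */
static int
demo_invfo_ioctl(struct demo_file *fp, unsigned long cmd, void *data)
{
	(void)fp;
	(void)cmd;
	(void)data;
	return (ENOTTY);
}

/* A file type without ioctl support points its slot at the fallback. */
static const struct demo_fileops demo_noioctl_ops = {
	.fo_ioctl = demo_invfo_ioctl,
};

/* Dispatch through the table, as callers of a file operation would. */
static int
demo_fo_ioctl(struct demo_file *fp, unsigned long cmd, void *data)
{
	return (fp->f_ops->fo_ioctl(fp, cmd, data));
}
```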
a17a2d515687e5cf2389f6b4e46ce2b248aa2fe6 12-Sep-2014 jhb <jhb@FreeBSD.org> Simplify vntype_to_kinfo() by returning when the desired value is found
instead of breaking out of the loop and then immediately checking the loop
index so that if it was broken out of the proper value can be returned.

While here, use nitems().
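
The shape of that simplification, sketched with a hypothetical mapping
table. DEMO_NITEMS is a local stand-in for nitems(), and the table
contents are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Local equivalent of an element-count macro such as nitems(). */
#define DEMO_NITEMS(x)	(sizeof(x) / sizeof((x)[0]))
#define DEMO_UNKNOWN	(-1)

/* Hypothetical translation table. */
static const struct {
	int	from;
	int	to;
} demo_map[] = {
	{ 1, 10 },
	{ 2, 20 },
	{ 3, 30 },
};

static int
demo_translate(int from)
{
	size_t i;

	/* Return as soon as a match is found; no index check afterwards. */
	for (i = 0; i < DEMO_NITEMS(demo_map); i++)
		if (demo_map[i].from == from)
			return (demo_map[i].to);
	return (DEMO_UNKNOWN);
}
```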
64b244d971739cfa64d1e65431e44177f595d116 05-Sep-2014 mjg <mjg@FreeBSD.org> Plug unnecessary fp assignments in kern_fcntl.

No functional changes.
1ac724b05e688dc4ddb61cf32a78e861ceab2383 26-Aug-2014 glebius <glebius@FreeBSD.org> - Remove socket file operations declaration from sys/file.h.
- Make them static in sys_socket.c.
- Provide generic invfo_truncate() instead of soo_truncate().

Sponsored by: Netflix
Sponsored by: Nginx, Inc.
c5c0a26f761859a952b782533f0d2ec25a1ec748 26-Aug-2014 mjg <mjg@FreeBSD.org> Fix up races with f_seqcount handling.

It was possible that the kernel would overwrite user-supplied hint.

Abuse vnode lock for this purpose.

In collaboration with: kib
MFC after: 1 week
ce59684e4d3b1070bd191b87caba0fd4cd628ce0 17-Aug-2014 mjg <mjg@FreeBSD.org> MFC r268505, r268507:

Avoid relocking filedesc lock when closing fds during fdp destruction.

Don't call bzero nor fdunused from fdfree for such cases. It would do
unnecessary work and complain that the lock is not taken.

=======

Don't zero fd_nfiles during fdp destruction.

Code trying to take a look has to check fd_refcnt and it is 0 by that time.

This is a follow up to r268505, without this the code would leak memory for
tables bigger than the default.
cc95000af61296b985c1c678b6bc27bd2793810e 23-Jul-2014 mjg <mjg@FreeBSD.org> Prepare fget_unlocked for reading fd table only once.

Some capsicum functions accept fdp + fd and lookup fde based on that.
Add variants which accept fde.

Reviewed by: pjd
MFC after: 1 week
99b5ccae7ca41a5f7e0999be7e2020ead11b5c37 10-Jul-2014 mjg <mjg@FreeBSD.org> Don't zero fd_nfiles during fdp destruction.

Code trying to take a look has to check fd_refcnt and it is 0 by that time.

This is a follow up to r268505, without this the code would leak memory for
tables bigger than the default.

MFC after: 1 week
a97cff3c13480569ea1b205849c693e784adb730 10-Jul-2014 mjg <mjg@FreeBSD.org> Avoid relocking filedesc lock when closing fds during fdp destruction.

Don't call bzero nor fdunused from fdfree for such cases. It would do
unnecessary work and complain that the lock is not taken.

MFC after: 1 week
302b3764d6b807bf0f5ea610145ac630d9621afd 06-Jul-2014 mjg <mjg@FreeBSD.org> MFC r268001:
Make fdunshare accept only td parameter.

Proc had to match the thread anyway and 2 parameters were inconsistent
with the rest.
63523f9ee102fe01f49d27f93b1c7826f21279cb 06-Jul-2014 mjg <mjg@FreeBSD.org> MFC r268000:

Make sure to always clear p_fd for process getting rid of its filetable.

Filetable can be shared with other processes. Previous code failed to
clear the pointer for all but the last process getting rid of the table.
This is mostly cosmetics.

Get rid of 'This should happen earlier' comment. Clearing the pointer in
this place is fine as consumers can reliably check for files availability
by inspecting fd_refcnt and vnodes availability by NULL-checking them.
134eeca7553ded4c64041465105cebf1c444c9ca 06-Jul-2014 mjg <mjg@FreeBSD.org> MFC r267760:
Tidy up fd-related functions called by do_execve

o assert in each one that fdp is not shared
o remove unnecessary NULL checks - all userspace processes have fdtables
and kernel processes cannot execve
o remove comments about the danger of fd_ofiles getting reallocated - fdtable
is not shared and fd_ofiles could be only reallocated if new fd was about to be
added, but if that was possible the code would already be buggy as setugidsafety
work could be undone
c9bb8da01167a7b5ceaa16259e6b6bab943b7ea9 06-Jul-2014 mjg <mjg@FreeBSD.org> MFC r267755:

Don't take filedesc lock in fdunshare().

We can read refcnt safely and only care if it is equal to 1.

If it could suddenly change from 1 to something bigger the code would be
buggy even in the previous form and transitions from > 1 to 1 are equally racy
and harmless (we copy even though there is no need).
4ec4a6585547ae29000dc768cf108796f523d42a 06-Jul-2014 mjg <mjg@FreeBSD.org> MFC r267710:

fd: replace fd_nfiles with fd_lastfile where appropriate

fd_lastfile is guaranteed to be the biggest open fd, so when the intent
is to iterate over active fds or lookup one, there is no point in looking
beyond that limit.

Few places are left unpatched for now.
bfa18e46636c505178365dff658104704cd4e1c8 06-Jul-2014 mjg <mjg@FreeBSD.org> MFC r267708:

do_dup: plug redundant adjustment of fd_lastfile

By that time it was already set by fdalloc, or was there in the first place
if fd is replaced.
aa97fde7f4ab5f1a73e154d32d43dd35a67eae06 06-Jul-2014 mjg <mjg@FreeBSD.org> MFC r265247:
Request a non-exiting process in sysctl_kern_proc_{o,}filedesc

This fixes a race with exit1 freeing p_textvp.
0954f0fb37ce9da4bad57c8b87c2cd58331cff7c 28-Jun-2014 mjg <mjg@FreeBSD.org> Make fdunshare accept only td parameter.

Proc had to match the thread anyway and 2 parameters were inconsistent
with the rest.

MFC after: 1 week
16033c07939486c314de84b0cda1b2e869eccbd0 28-Jun-2014 mjg <mjg@FreeBSD.org> Make sure to always clear p_fd for process getting rid of its filetable.

Filetable can be shared with other processes. Previous code failed to
clear the pointer for all but the last process getting rid of the table.
This is mostly cosmetics.

Get rid of 'This should happen earlier' comment. Clearing the pointer in
this place is fine as consumers can reliably check for files availability
by inspecting fd_refcnt and vnodes availability by NULL-checking them.

MFC after: 1 week
dc769e3b99fb91d7c0744365e48470b2e288aa23 23-Jun-2014 mjg <mjg@FreeBSD.org> Tidy up fd-related functions called by do_execve

o assert in each one that fdp is not shared
o remove unnecessary NULL checks - all userspace processes have fdtables
and kernel processes cannot execve
o remove comments about the danger of fd_ofiles getting reallocated - fdtable
is not shared and fd_ofiles could be only reallocated if new fd was about to be
added, but if that was possible the code would already be buggy as setugidsafety
work could be undone

MFC after: 1 week
202339afcf6357683f0e37b4ac2145796d6d2fc5 22-Jun-2014 mjg <mjg@FreeBSD.org> Don't take filedesc lock in fdunshare().

We can read refcnt safely and only care if it is equal to 1.

If it could suddenly change from 1 to something bigger the code would be
buggy even in the previous form and transitions from > 1 to 1 are equally racy
and harmless (we copy even though there is no need).

MFC after: 1 week
d74326bc91b75e4a8f54bcc1fdb0a83ab223e560 22-Jun-2014 mjg <mjg@FreeBSD.org> fd: replace fd_nfiles with fd_lastfile where appropriate

fd_lastfile is guaranteed to be the biggest open fd, so when the intent
is to iterate over active fds or lookup one, there is no point in looking
beyond that limit.

Few places are left unpatched for now.
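
A minimal sketch of the difference, assuming a simplified table with
hypothetical field names: the scan stops at the highest descriptor in
use instead of walking the whole allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical descriptor table. */
struct demo_table {
	void	**files;	/* slot i is NULL when descriptor i is unused */
	int	  nfiles;	/* allocated size of the array */
	int	  lastfile;	/* highest slot in use, or -1 if none */
};

/* Count open descriptors without looking past the last one in use. */
static int
demo_count_open(const struct demo_table *t)
{
	int count, i;

	count = 0;
	for (i = 0; i <= t->lastfile; i++)
		if (t->files[i] != NULL)
			count++;
	return (count);
}
```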

MFC after: 1 week
38cd838637eab022e13231e78d0727fb8ac4da12 22-Jun-2014 mjg <mjg@FreeBSD.org> do_dup: plug redundant adjustment of fd_lastfile

By that time it was already set by fdalloc, or was there in the first place
if fd is replaced.

MFC after: 1 week
1b83ce15243b4ce6c21f6e2b1bde734b589fc453 02-May-2014 mjg <mjg@FreeBSD.org> Request a non-exiting process in sysctl_kern_proc_{o,}filedesc

This fixes a race with exit1 freeing p_textvp.

Suggested by: kib
MFC after: 1 week
439611d0ad97d5b6e4c10a5d2772f493ee8513ec 04-Apr-2014 mjg <mjg@FreeBSD.org> Garbage collect fdavail.

It rarely returns an error and fdallocn handles the failure of fdalloc
just fine.
ae2823466fe35f50ff85af2bd94e6c0084781b88 31-Mar-2014 mjg <mjg@FreeBSD.org> MFC r263530:
Mark the following sysctls as MPSAFE:
kern.file
kern.proc.filedesc
kern.proc.ofiledesc
11fbc59f9b71fd5d059d03fd72cf18b52ab38ac9 31-Mar-2014 mjg <mjg@FreeBSD.org> MFC r263460:
Take filedesc lock only for reading when allocating new fdtable.

Code populating the table does this already.
df8e97fc8b389a343721dbd7f227ed43f43a0038 21-Mar-2014 mjg <mjg@FreeBSD.org> Mark the following sysctls as MPSAFE:
kern.file
kern.proc.filedesc
kern.proc.ofiledesc

MFC after: 7 days
103a66d7d0553e822a2a6ca43656f23d0065a126 21-Mar-2014 mjg <mjg@FreeBSD.org> Take filedesc lock only for reading when allocating new fdtable.

Code populating the table does this already.

MFC after: 1 week
33fdc14c0cd663baae9fad419e3f9cfe12578196 16-Mar-2014 rwatson <rwatson@FreeBSD.org> Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after: 3 weeks
ee7a8407e1a8f9666ede2bfd3eaefecff7cc0d90 02-Mar-2014 bdrewery <bdrewery@FreeBSD.org> MFC r262006,r262328:

r262006:
Fix M_FILEDESC leak in fdgrowtable() introduced in r244510.
fdgrowtable() now only reallocates fd_map when necessary.
r262328:
Style.

Approved by: bapt (mentor, implicit)
1e9cbfb2bdf6972b7d65cbf64a616630bb5ab07e 02-Mar-2014 bdrewery <bdrewery@FreeBSD.org> MFC r262005:

Remove redundant memcpy of fd_ofiles in fdgrowtable()

Approved by: bapt (mentor, implicit)
3948f93fc83c9323caf1f8d15a2737efcd9f4381 24-Feb-2014 mjg <mjg@FreeBSD.org> MFC r262309:

Fix a race between kern_proc_{o,}filedesc_out and fdescfree leading
to use-after-free.

fdescfree proceeds to free file pointers once fd_refcnt reaches 0, but
kern_proc_{o,}filedesc_out only checked for hold count.
e6b40423035d77338ab4182e05eee3fb8a20ef2e 22-Feb-2014 bdrewery <bdrewery@FreeBSD.org> Fix style of comment blocks.

Reported by: peter
Approved by: bapt (mentor, implicit)
X-MFC with: r262006
1c3ca2a367546ecfe8f196c7c9978e80a0f9e566 21-Feb-2014 mjg <mjg@FreeBSD.org> Fix a race between kern_proc_{o,}filedesc_out and fdescfree leading
to use-after-free.

fdescfree proceeds to free file pointers once fd_refcnt reaches 0, but
kern_proc_{o,}filedesc_out only checked for hold count.

MFC after: 3 days
eb1a5f8de9f7ea602c373a710f531abbf81141c4 21-Feb-2014 gjb <gjb@FreeBSD.org> Move ^/user/gjb/hacking/release-embedded up one directory, and remove
^/user/gjb/hacking since this is likely to be merged to head/ soon.

Sponsored by: The FreeBSD Foundation
c581d5764a887d5655bd1705cfd23b1e1400b826 20-Feb-2014 mjg <mjg@FreeBSD.org> MFC r260233:

Plug a memory leak in dup2 when both old and new fd have ioctl caps.
0db3f6b736bd00b9bb0338d5c6a2a1f2c8db8093 17-Feb-2014 bdrewery <bdrewery@FreeBSD.org> Fix M_FILEDESC leak in fdgrowtable() introduced in r244510.

fdgrowtable() now only reallocates fd_map when necessary.

This fixes fdgrowtable() to use the same logic as fdescfree() for
when to free the fd_map. The logic in fdescfree() is intended to
not free the initial static allocation, however the fd_map grows
at a slower rate than the table does. The table is intended to hold
20 fd, but its initial map has many more slots than 20. The slot
sizing causes NDSLOTS(20) through NDSLOTS(63) to be 1 which matches
NDSLOTS(20), so fdescfree() was assuming that the fd_map was still
the initial allocation and not freeing it.
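
The slot arithmetic can be checked with a small worked example. It
assumes 64-bit bitmap words and a round-up division; the DEMO_ macros
are hypothetical stand-ins and not the kernel's definitions.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumption: each bitmap word tracks 64 descriptors. */
#define DEMO_NDENTRIES	(sizeof(uint64_t) * CHAR_BIT)
/* Number of bitmap words needed to cover n descriptors (rounded up). */
#define DEMO_NDSLOTS(n)	(((n) + DEMO_NDENTRIES - 1) / DEMO_NDENTRIES)

static unsigned long
demo_ndslots(unsigned long nfiles)
{
	return ((unsigned long)DEMO_NDSLOTS(nfiles));
}
```

Under these assumptions a table of 20 descriptors and a table of 63
descriptors both need one bitmap word, so comparing slot counts cannot
tell a grown table from the initial one.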

This partially reverts r244510 by reintroducing some of the logic
it removed in fdgrowtable().

Reviewed by: mjg
Approved by: bapt (mentor)
MFC after: 2 weeks
d8cb95cb17021344559f9414e80768f023fce14d 16-Feb-2014 bdrewery <bdrewery@FreeBSD.org> Remove redundant memcpy of fd_ofiles in fdgrowtable() added in r247602

Discussed with: mjg
Approved by: bapt (mentor)
MFC after: 2 weeks
759bcf6814b3dd4f0739a000f5b2f399804eb5b1 07-Jan-2014 mjg <mjg@FreeBSD.org> MFC r260232:
Don't check for fd limits in fdgrowtable_exp.

Callers do that already and additional check races with process
decreasing limits and can result in not growing the table at all, which
is currently not handled.
83ac68548d8e58cbbddde3a08004050d31c24b27 03-Jan-2014 mjg <mjg@FreeBSD.org> Plug a memory leak in dup2 when both old and new fd have ioctl caps.

Reviewed by: pjd
MFC after: 3 days
3e6a8a9133e3b323fb08f2f8c578da80d0e573f0 03-Jan-2014 mjg <mjg@FreeBSD.org> Don't check for fd limits in fdgrowtable_exp.

Callers do that already and additional check races with process
decreasing limits and can result in not growing the table at all, which
is currently not handled.

MFC after: 3 days
6b01bbf146ab195243a8e7d43bb11f8835c76af8 27-Dec-2013 gjb <gjb@FreeBSD.org> Copy head@r259933 -> user/gjb/hacking/release-embedded for initial
inclusion of (at least) arm builds with the release.

Sponsored by: The FreeBSD Foundation
86274dd213a33a0a9012f50b3209adc1d5a20bed 01-Dec-2013 adrian <adrian@FreeBSD.org> Migrate the sendfile_sync structure into a public(ish) API in preparation
for extending and reusing it.

The sendfile_sync wrapper is mostly just a "mbuf transaction" wrapper,
used to indicate that the backing store for a group of mbufs has completed.
It's only being used by sendfile for now and it's only implementing a
sleep/wakeup rendezvous. However, there are other potential signaling
paths (kqueue) and other potential uses (socket zero-copy write) where the
same mechanism would also be useful.

So, with that in mind:

* extract the sendfile_sync code out into sf_sync_*() methods
* teach the sf_sync_alloc method about the current config flag -
it will eventually know about kqueue.
* move the sendfile_sync code out of do_sendfile() - the only thing
it now knows about is the sfs pointer. The guts of the sync
rendezvous (setup, rendezvous/wait, free) is now done in the
syscall wrapper.
* .. and teach the 32-bit compat sendfile call the same.

This should be a no-op. It's primarily preparation work for teaching
the sendfile_sync about kqueue notification.

Tested:

* Peter Holm's sendfile stress / regression scripts

Sponsored by: Netflix, Inc.
4ac2e7d8d9f36a1e48c8bbb46cfcb1997e166d68 30-Nov-2013 pjd <pjd@FreeBSD.org> Make process descriptors standard part of the kernel. rwhod(8) already
requires process descriptors to work and having PROCDESC in GENERIC
seems not enough, especially that we hope to have more and more consumers
in the base.

MFC after: 3 days
7ff487b3a2f97b08f82ffdc0b157adf5e886b4f7 09-Oct-2013 kib <kib@FreeBSD.org> When growing the file descriptor table, new larger memory chunk is
allocated, but the old table is kept around to handle the case of
threads still performing unlocked accesses to it.

Grow the table exponentially instead of increasing its size by
sizeof(long) * 8 chunks when overflowing. This mode significantly
reduces the total memory use for the processes consuming large numbers
of the file descriptors which open them one by one.
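
A sketch of the growth policy under stated assumptions: the new size is
at least double the current one, or the requested size if that is
larger. The demo_ functions are hypothetical and ignore integer
overflow; they only illustrate why opening descriptors one by one
triggers few reallocations, and therefore leaves few old tables behind.

```c
#include <assert.h>

/* Pick the next table size: double, or jump straight to the need. */
static int
demo_next_size(int cur, int need)
{
	int next;

	next = cur * 2;
	if (next < need)
		next = need;
	return (next);
}

/* Count reallocations while growing one descriptor at a time. */
static int
demo_count_grows(int start, int target)
{
	int grows, size;

	grows = 0;
	for (size = start; size < target; grows++)
		size = demo_next_size(size, size + 1);
	return (grows);
}
```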

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (marius)
9375280e4a91021ced232911b9805672cfb89824 09-Oct-2013 kib <kib@FreeBSD.org> Reduce code duplication, introduce the getmaxfd() helper to calculate
the max filedescriptor index.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (marius)
1dc85910546bedced4f5caca6da78dd02e326f4f 26-Sep-2013 jmg <jmg@FreeBSD.org> it must be the last member, not might...

Reviewed by: attilio
Approved by: re (delphij, gjb)
29d161240ea3c6786da8eaac6f6e04b60a3b2316 25-Sep-2013 attilio <attilio@FreeBSD.org> Avoid memory accesses reordering which can result in fget_unlocked()
seeing a stale fd_ofiles table once fd_nfiles is already updated,
resulting in OOB accesses.

Approved by: re (kib)
Sponsored by: EMC / Isilon storage division
Reported and tested by: pho
Reviewed by: benno
1c7defb76e774a54509fd6d99f3b65eec088b4e9 05-Sep-2013 pjd <pjd@FreeBSD.org> Handle cases where capability rights are not provided.

Reported by: kib
029a6f5d92dc57925b5f155d94d6e01fdab7a45d 05-Sep-2013 pjd <pjd@FreeBSD.org> Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

struct cap_rights {
uint64_t cr_rights[CAP_RIGHTS_VERSION + 2];
};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define a new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine a few rights, but the rights have to belong
to the same array element, eg:

#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)

#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
void cap_rights_set(cap_rights_t *rights, ...);
void cap_rights_clear(cap_rights_t *rights, ...);
bool cap_rights_is_set(const cap_rights_t *rights, ...);

bool cap_rights_is_valid(const cap_rights_t *rights);
void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

cap_rights_t rights;

cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

#define cap_rights_set(rights, ...) \
__cap_rights_set((rights), __VA_ARGS__, 0ULL)
void __cap_rights_set(cap_rights_t *rights, ...);
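
The same termination trick, sketched as a self-contained example. The
demo_or_rights names are hypothetical, and every argument is assumed to
be a non-zero unsigned long long so that the appended 0ULL is an
unambiguous end marker.

```c
#include <assert.h>
#include <stdarg.h>

/* Consume arguments until the 0ULL terminator and OR them together. */
static unsigned long long
demo_or_rights_impl(unsigned long long first, ...)
{
	va_list ap;
	unsigned long long acc, right;

	acc = 0;
	va_start(ap, first);
	for (right = first; right != 0;
	    right = va_arg(ap, unsigned long long))
		acc |= right;
	va_end(ap);
	return (acc);
}

/* The macro appends the terminator so that callers never have to. */
#define demo_or_rights(...) \
	demo_or_rights_impl(__VA_ARGS__, 0ULL)
```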

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array element this way is
correct, but is not advised. It should only be used for aliases definition.
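
A sketch of how the index bits make such mixing detectable. It assumes
the layout described above (two version bits at the top, then five index
bits) with index 0 at bit 57; the DEMO_ macros and demo_right_index()
are hypothetical and only mirror that description.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: bits 63-62 size field, bits 61-57 one-hot index. */
#define DEMO_CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define DEMO_IDX_MASK		(0x1FULL << 57)

#define DEMO_CAP_LOOKUP	DEMO_CAPRIGHT(0, 0x0000000000000400ULL)
#define DEMO_CAP_FCHMOD	DEMO_CAPRIGHT(0, 0x0000000000002000ULL)
#define DEMO_CAP_PDKILL	DEMO_CAPRIGHT(1, 0x0000000000000800ULL)

/*
 * Return the array index encoded in a right, or -1 if the value mixes
 * rights from different array elements (or carries no index at all).
 */
static int
demo_right_index(uint64_t right)
{
	uint64_t idx;
	int i;

	idx = (right & DEMO_IDX_MASK) >> 57;
	/* Exactly one index bit must be set. */
	if (idx == 0 || (idx & (idx - 1)) != 0)
		return (-1);
	for (i = 0; (idx & 1) == 0; i++)
		idx >>= 1;
	return (i);
}
```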

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by: The FreeBSD Foundation
722a1a5e5d54a4935a4136368f443f6c88ca0d71 15-Aug-2013 glebius <glebius@FreeBSD.org> Make sendfile() a method in the struct fileops. Currently only
vnode backed file descriptors have this method implemented.

Reviewed by: kib
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
68a112b157c9bb17835202440fd3abcad66b520b 01-Jul-2013 trociny <trociny@FreeBSD.org> Plug up the lock leakage when exporting to a short buffer.

Reported by: Alexander Leidinger
Submitted by: mjg
MFC after: 1 week
19b2af52622e036204809e1bbe24aaaedf35061a 28-Jun-2013 mjg <mjg@FreeBSD.org> Remove duplicate NULL check in kern_proc_filedesc_out.
No functional changes.

MFC after: 1 week
9888acd231f181b7e0579f8e0c29ebd374f7c1ff 28-Jun-2013 trociny <trociny@FreeBSD.org> Rework r252313:

The filedesc lock may not be dropped unconditionally before exporting
fd to sbuf: fd might go away during execution. While it is ok for
DTYPE_VNODE and DTYPE_FIFO because the export is from a vrefed vnode
here, for other types it is unsafe.

Instead, drop the lock in export_fd_to_sb(), after preparing data in
memory and before writing to sbuf.

Spotted by: mjg
Suggested by: kib
Review by: kib
MFC after: 1 week
1e5776a323dbd3816b4bd1f6feed36906154f8d8 27-Jun-2013 trociny <trociny@FreeBSD.org> To avoid LOR, always drop the filedesc lock before exporting fd to sbuf.

Reviewed by: kib
MFC after: 3 days
679fa5ed4e91b1d9e2002d8da43d9c31056fe7a4 03-May-2013 jhb <jhb@FreeBSD.org> Similar to 233760 and 236717, export some more useful info about the
kernel-based POSIX semaphore descriptors to userland via procstat(1) and
fstat(1):
- Change sem file descriptors to track the pathname they are associated
with and add a ksem_info() method to copy the path out to a
caller-supplied buffer.
- Use the fo_stat() method of shared memory objects and ksem_info() to
export the path, mode, and value of a semaphore via struct kinfo_file.
- Add a struct semstat to the libprocstat(3) interface along with a
procstat_get_sem_info() to export the mode and value of a semaphore.
- Teach fstat about semaphores and to display their path, mode, and value.

MFC after: 2 weeks
335f3dbd9136ccf61c363fee13dc4a90c89cce50 14-Apr-2013 trociny <trociny@FreeBSD.org> Re-factor the code to provide kern_proc_filedesc_out(), kern_proc_out(),
and kern_proc_vmmap_out() functions to output process kinfo structures
to sbuf, to make the code reusable.

The functions are going to be used in the coredump routine to store
procstat info in the core program header notes.

Reviewed by: kib
MFC after: 3 weeks
1798a915c42b5d2350b14301401678655b36dc60 14-Apr-2013 mjg <mjg@FreeBSD.org> Add fdallocn function and use it when passing fds over unix socket.

This gets rid of "unp_externalize fdalloc failed" panic.

Reviewed by: pjd
MFC after: 1 week
dc5f593dd87ffeed3afc63abd751ce1284f40f95 07-Apr-2013 trociny <trociny@FreeBSD.org> Use pget(9) to reduce code duplication.

MFC after: 1 week
386f382f2ddeaeb0e83bfb511fd8db942023f01a 03-Mar-2013 pjd <pjd@FreeBSD.org> Use dedicated malloc type for filecaps-related data, so we can detect any
memory leaks easier.
1df614f5db4cff750015f3ddbe56c83f7ed799d4 03-Mar-2013 pjd <pjd@FreeBSD.org> Plug memory leaks in file descriptors passing.
f07ebb8888ea42f744890a727e8f6799a1086915 02-Mar-2013 pjd <pjd@FreeBSD.org> Merge Capsicum overhaul:

- Capability is no longer separate descriptor type. Now every descriptor
has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
cap_new(2), which limits capability rights of the given descriptor
without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
ioctls can be retrieved with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
that can be used with the new cap_fcntls_limit(2) syscall and retrieve
them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
heavily modified.

- The audit subsystem, kdump and procstat tools were updated to
recognize new syscalls.

- Capability rights were revised and even though I tried hard to provide
backward API and ABI compatibility there are some incompatible changes
that are described in detail below:

CAP_CREATE old behaviour:
- Allow for openat(2)+O_CREAT.
- Allow for linkat(2).
- Allow for symlinkat(2).
CAP_CREATE new behaviour:
- Allow for openat(2)+O_CREAT.

Added CAP_LINKAT:
- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
- Allow to be target for renameat(2).

Added CAP_SYMLINKAT:
- Allow for symlinkat(2).

Removed CAP_DELETE. Old behaviour:
- Allow for unlinkat(2) when removing non-directory object.
- Allow to be source for renameat(2).

Removed CAP_RMDIR. Old behaviour:
- Allow for unlinkat(2) when removing directory.

Added CAP_RENAMEAT:
- Required for source directory for the renameat(2) syscall.

Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
- Allow for unlinkat(2) on any object.
- Required if target of renameat(2) exists and will be removed by this
call.

Removed CAP_MAPEXEC.

CAP_MMAP old behaviour:
- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
PROT_WRITE.
CAP_MMAP new behaviour:
- Allow for mmap(2)+PROT_NONE.

Added CAP_MMAP_R:
- Allow for mmap(PROT_READ).
Added CAP_MMAP_W:
- Allow for mmap(PROT_WRITE).
Added CAP_MMAP_X:
- Allow for mmap(PROT_EXEC).
Added CAP_MMAP_RW:
- Allow for mmap(PROT_READ | PROT_WRITE).
Added CAP_MMAP_RX:
- Allow for mmap(PROT_READ | PROT_EXEC).
Added CAP_MMAP_WX:
- Allow for mmap(PROT_WRITE | PROT_EXEC).
Added CAP_MMAP_RWX:
- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

Renamed CAP_MKDIR to CAP_MKDIRAT.
Renamed CAP_MKFIFO to CAP_MKFIFOAT.
Renamed CAP_MKNOD to CAP_MKNODAT.

CAP_READ old behaviour:
- Allow pread(2).
- Disallow read(2), readv(2) (if there is no CAP_SEEK).
CAP_READ new behaviour:
- Allow read(2), readv(2).
- Disallow pread(2) (CAP_SEEK was also required).

CAP_WRITE old behaviour:
- Allow pwrite(2).
- Disallow write(2), writev(2) (if there is no CAP_SEEK).
CAP_WRITE new behaviour:
- Allow write(2), writev(2).
- Disallow pwrite(2) (CAP_SEEK was also required).
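
The new read/write rules amount to a subset test on the rights mask. A
minimal sketch, with hypothetical bit values and DEMO_ names; only the
way SEEK combines with READ and WRITE follows the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for the illustration. */
#define DEMO_CAP_READ	0x1ULL
#define DEMO_CAP_WRITE	0x2ULL
#define DEMO_CAP_SEEK	0x4ULL
/* Positional I/O needs the seek right on top of plain read or write. */
#define DEMO_CAP_PREAD	(DEMO_CAP_SEEK | DEMO_CAP_READ)
#define DEMO_CAP_PWRITE	(DEMO_CAP_SEEK | DEMO_CAP_WRITE)

/* An operation is allowed only if every needed right is held. */
static bool
demo_rights_allow(uint64_t have, uint64_t need)
{
	return ((have & need) == need);
}
```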

Added convenient defines:

#define CAP_PREAD (CAP_SEEK | CAP_READ)
#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
#define CAP_RECV CAP_READ
#define CAP_SEND CAP_WRITE

#define CAP_SOCK_CLIENT \
(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
#define CAP_SOCK_SERVER \
(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
CAP_SETSOCKOPT | CAP_SHUTDOWN)

Added defines for backward API compatibility:

#define CAP_MAPEXEC CAP_MMAP_X
#define CAP_DELETE CAP_UNLINKAT
#define CAP_MKDIR CAP_MKDIRAT
#define CAP_RMDIR CAP_UNLINKAT
#define CAP_MKFIFO CAP_MKFIFOAT
#define CAP_MKNOD CAP_MKNODAT
#define CAP_SOCK_ALL (CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by: The FreeBSD Foundation
Reviewed by: Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with: rwatson, benl, jonathan
ABI compatibility discussed with: kib
9dc77e1662fd8adcb9f240934bb34cb5a041e783 25-Feb-2013 pjd <pjd@FreeBSD.org> Style.

Suggested by: kib
d632a8a17e630c79d1e1879d5cf51703ab1b3545 25-Feb-2013 pjd <pjd@FreeBSD.org> After r237012, the fdgrowtable() doesn't drop the filedesc lock anymore,
so update a stale comment.

Reviewed by: kib, keramida
3a0f30d9ae32247e8b8e0ddd25526fb22b5aa4df 17-Feb-2013 pjd <pjd@FreeBSD.org> Don't treat pointers as booleans.
cfdf5de93930131e16cdced2079ea1053227da4e 13-Feb-2013 ian <ian@FreeBSD.org> Make the F_READAHEAD option to fcntl(2) work as documented: a value of zero
now disables read-ahead. It used to effectively restore the system default
readahead heuristic if it had been changed; a negative value now restores
the default.

Reviewed by: kib
b04cb3ac24c8af0c1b3416bdc47bc70c95319744 31-Jan-2013 pjd <pjd@FreeBSD.org> Remove label that was accidentally moved during Giant removal from VFS.
425c0645c422cfd7e6f73f95ff2a906d75b08460 20-Dec-2012 des <des@FreeBSD.org> Rewrite fdgrowtable() so common mortals can actually understand what
it does and how, and add comments describing the data structures and
explaining how they are managed.
560aa751e0f5cfef868bdf3fab01cdbc5169ef82 22-Oct-2012 kib <kib@FreeBSD.org> Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho
5a2e16924f3798ee7f3bf930c0d43f38f18a69dd 27-Jul-2012 kib <kib@FreeBSD.org> Add F_DUP2FD_CLOEXEC. Apparently Solaris 11 already did this.

Submitted by: Jukka A. Ukkonen <jau iki fi>
PR: standards/169962
MFC after: 1 week
e04825c920221bef72b0adc4386073fe8e20a946 21-Jul-2012 kib <kib@FreeBSD.org> (Incomplete) fixes for symbols visibility issues and style in fcntl.h.

Append '__' prefix to the tag of struct oflock, and put it under BSD
namespace. Structure is needed both by libc and kernel, thus cannot be
hidden under #ifdef _KERNEL.

Move a set of non-standard F_* and O_* constants into BSD namespace.
SUSv4 explicitly allows an implementation to pollute F_* and O_* names
after fcntl.h is included, but it costs us nothing to adhere
to the specification if exact POSIX compliance level is requested by
user code.

Change some spaces after #define to tabs.

Noted by and discussed with: bde
MFC after: 1 week
44af867a120548f41d9a4cd0de2d25b2033bb7ab 19-Jul-2012 kib <kib@FreeBSD.org> Remove line which was accidentally kept in r238614.

Submitted by: pjd
Pointy hat to: kib
MFC after: 1 week
fb0ee769bd9aa180616ea0ffdb18d9c7bd7a7dfa 19-Jul-2012 kib <kib@FreeBSD.org> Implement F_DUPFD_CLOEXEC command for fcntl(2), specified by SUSv4.

PR: standards/169962
Submitted by: Jukka A. Ukkonen <jau iki fi>
MFC after: 1 week
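
From userland, the new command might be exercised roughly as follows. This is a sketch only; dup_cloexec() and has_cloexec() are hypothetical helper names, not part of the commit.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Duplicate 'fd' onto the lowest free descriptor >= 'minfd', with
 * FD_CLOEXEC set on the copy in the same call.  Returns the new
 * descriptor, or -1 with errno set.
 */
static int
dup_cloexec(int fd, int minfd)
{
	return (fcntl(fd, F_DUPFD_CLOEXEC, minfd));
}

/* Returns 1 if FD_CLOEXEC is set on 'fd', 0 if not, -1 on error. */
static int
has_cloexec(int fd)
{
	int flags;

	flags = fcntl(fd, F_GETFD);
	if (flags == -1)
		return (-1);
	return ((flags & FD_CLOEXEC) != 0);
}
```

Setting the flag in the same call avoids the window between F_DUPFD and a later F_SETFD in which another thread could fork and exec.
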
24d9f5c7d649df6299434ef62e52d3f53708225d 09-Jul-2012 mjg <mjg@FreeBSD.org> Follow-up commit to r238220:

Pass only FEXEC (instead of FREAD|FEXEC) in fgetvp_exec. _fget has to check for
!FWRITE anyway and may as well know about FREAD.

Make _fget code a bit more readable by converting permission checking from if()
to switch(). Assert that correct permission flags are passed.

In collaboration with: kib
Approved by: trasz (mentor)
MFC after: 6 days
X-MFC: with r238220
9cc8fb25a6c747a4ce4eff0392c16c81d2135af8 08-Jul-2012 attilio <attilio@FreeBSD.org> MFC
1e0d3fae01871e2dc80f05ead912d9a1d4f56f15 08-Jul-2012 mjg <mjg@FreeBSD.org> Unbreak handling of descriptors opened with O_EXEC by fexecve(2).

While here return EBADF for descriptors opened for writing (previously it was ETXTBSY).

Add fgetvp_exec function which performs appropriate checks.

PR: kern/169651
In collaboration with: kib
Approved by: trasz (mentor)
MFC after: 1 week
53224f018aac13056c11af9d1233317b1754149c 02-Jul-2012 kib <kib@FreeBSD.org> Extend the KPI to lock and unlock f_offset member of struct file. It
now fully encapsulates all accesses to f_offset, and extends f_offset
locking to other consumers that need it, in particular, to lseek() and
variants of getdirentries().

Ensure that on 32bit architectures f_offset, which is a 64bit quantity,
is always read and written under the mtxpool protection. This fixes
apparently easy to trigger race when parallel lseek()s or lseek() and
read/write could destroy file offset.

The already broken ABI emulations, including iBCS and SysV, are not
converted (yet).

Tested by: pho
No objections from: jhb
MFC after: 3 weeks
0118c8606277dc863307564505e6b7a9e2400e79 17-Jun-2012 pjd <pjd@FreeBSD.org> Don't check for race with close on advisory unlock (there is nothing smart we
can do when such a race occurs). This saves lock/unlock cycle for the filedesc
lock for every advisory unlock operation.

MFC after: 1 month
32ff81e94ffc0093230891906286153e73046786 17-Jun-2012 pjd <pjd@FreeBSD.org> Extend the comment about checking for a race with close to explain why
it is done and why we don't return an error in such case.

Discussed with: kib
MFC after: 1 month
9a81d01ee060edcfe2d98f6c8d5b7863f3120ce0 17-Jun-2012 pjd <pjd@FreeBSD.org> If VOP_ADVLOCK() call or earlier checks failed don't check for a race with
close, because even if we had a race there is nothing to unlock.

Discussed with: kib
MFC after: 1 month
9719a38d39bd5b70947eca50816a9719adeda5c6 16-Jun-2012 pjd <pjd@FreeBSD.org> Revert r237073. 'td' can be NULL here.

MFC after: 1 month
144a7f643e622e83909cec2f65fbffbfd1365293 15-Jun-2012 pjd <pjd@FreeBSD.org> One more attempt to make prototypes formatted according to style(9), which
hopefully recovers from the "worse than useless" state.

Reported by: bde
MFC after: 1 month
c2fe03ba67f45b969b3449e0afad35e3490be051 14-Jun-2012 pjd <pjd@FreeBSD.org> Remove fdtofp() function and use fget_locked(), which works exactly the same.

MFC after: 1 month
0984458a799973509f49effefcd1c6ed21736777 14-Jun-2012 pjd <pjd@FreeBSD.org> Assert that the filedesc lock is being held when the fdunwrap() function
is called.

MFC after: 1 month
f84f6132c816f638e7839c37cfa905434b0dab08 14-Jun-2012 pjd <pjd@FreeBSD.org> Simplify the code by making more use of the fdtofp() function.

MFC after: 1 month
4a9c37500e595195522b7382715df72b5dd15669 14-Jun-2012 pjd <pjd@FreeBSD.org> - Assert that the filedesc lock is being held when fdisused() is called.
- Fix white spaces.

MFC after: 1 month
7b02ff91719409c8e8110701c1aaba1982b273a1 14-Jun-2012 pjd <pjd@FreeBSD.org> Style fixes and assertions improvements.

MFC after: 1 month
32b7d4b1494f830b495e37e72fe31be77bef62b6 14-Jun-2012 pjd <pjd@FreeBSD.org> Assert that the filedesc lock is not held when closef() is called.

MFC after: 1 month
e1c12932a73c9237a1ecd19487c5f1e439ffb770 14-Jun-2012 pjd <pjd@FreeBSD.org> Style fixes.

Reported by: bde
MFC after: 1 month
2014b8defb40e6ae0712da32f5a4d8b95447c392 14-Jun-2012 pjd <pjd@FreeBSD.org> Remove code duplication from fdclosexec(), which was the reason of the bug
fixed in r237065.

MFC after: 1 month
6634e42976a701b01eb967c1b46dc3f27ac64ef0 14-Jun-2012 pjd <pjd@FreeBSD.org> When we are closing capabilities during exec, we want to call mq_fdclose()
on the underlying object and not on the capability itself.

Similar bug was fixed in r236853.

MFC after: 1 month
841890f62ae87ce2a3baa7dbdf49dc8494db620a 14-Jun-2012 pjd <pjd@FreeBSD.org> Style.

MFC after: 1 month
0ca632f7e9e457afd6f19baf33eb7cb22413134d 13-Jun-2012 pjd <pjd@FreeBSD.org> When checking if file descriptor number is valid, explicitly check for 'fd'
being less than 0 instead of using cast-to-unsigned hack.

Today's commit was brought to you by the letters 'B', 'D' and 'E' :)
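
The two styles of range check can be compared side by side. The function names and the 'nfiles' parameter below are illustrative assumptions, not the kernel's actual code:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Cast-to-unsigned idiom: a negative 'fd' wraps to a very large
 * unsigned value, so one comparison rejects both negative and
 * too-large descriptors (assuming nfiles >= 0).
 */
static bool
fd_valid_cast(int fd, int nfiles)
{
	return ((unsigned int)fd < (unsigned int)nfiles);
}

/* Explicit form: the lower bound is spelled out. */
static bool
fd_valid_explicit(int fd, int nfiles)
{
	return (fd >= 0 && fd < nfiles);
}
```

Both return the same answer for any non-negative 'nfiles'; the explicit form only makes the intent visible to the reader.
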
f695b590b4c9789422bbd4036eb6bc2792dcfbac 13-Jun-2012 pjd <pjd@FreeBSD.org> Allocate descriptor number in dupfdopen() itself instead of depending on
the caller using finstall().
This saves us the filedesc lock/unlock cycle, fhold()/fdrop() cycle and closes
a race between finstall() and dupfdopen().

MFC after: 1 month
b836448bf31196deb9b7c15c6666ffb60713ab38 13-Jun-2012 pjd <pjd@FreeBSD.org> There is only one caller of the dupfdopen() function, so we can simplify
it a bit:
- We can assert that only ENODEV and ENXIO errors are passed instead of
handling other errors.
- The caller always calls finstall() for indx descriptor, so we can assume
it is set. Actually the filedesc lock is dropped between finstall() and
dupfdopen(), so there is a window there for another thread to close the
indx descriptor, but it will be closed in next commit.

Reviewed by: mjg
MFC after: 1 month
29bd2f6d4632d1ccdcb5f08bc6150a1e3047a731 13-Jun-2012 mjg <mjg@FreeBSD.org> Remove 'low' argument from fd_last_used().

This function is static and the only caller always passes 0 as low.

While here update note about return values in comment.

Reviewed by: pjd
Approved by: trasz (mentor)
MFC after: 1 month
1ca4c8cbf909810e0a057b149ac0af3e776a050e 13-Jun-2012 mjg <mjg@FreeBSD.org> Re-apply reverted parts of r236935 by pjd with some changes.

If fdalloc() decides to grow fdtable it does it once and at most doubles
the size. This still may be not enough for sufficiently large fd. Use fd
in calculations of new size in order to fix this.

When growing the table, fd is already equal to first free descriptor >= minfd,
also fdgrowtable() no longer drops the filedesc lock. As a result of this there
is no need to retry allocation nor lookup.

Fix description of fd_first_free to note all return values.

In co-operation with: pjd
Approved by: trasz (mentor)
MFC after: 1 month
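
The sizing problem can be illustrated with a small sketch. The helper grow_target() and its formula are assumptions made for illustration (overflow handling omitted), not the computation fdalloc() actually performs:

```c
#include <assert.h>

/*
 * Hypothetical sizing rule: at least double the table, but never
 * return a size too small to hold index 'fd'.  Doubling alone is not
 * enough when 'fd' lies far beyond the current table.
 */
static int
grow_target(int nfiles, int fd)
{
	int want;

	assert(nfiles > 0 && fd >= 0);
	want = nfiles * 2;
	if (want < fd + 1)
		want = fd + 1;
	return (want);
}
```

For a 20-entry table and fd 1000, doubling yields 40, which still cannot hold the descriptor; taking fd into account yields 1001.
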
bcf3f4263d67f83311b4d5db42c220f7fa151990 12-Jun-2012 pjd <pjd@FreeBSD.org> Revert part of the r236935 for now, until I figure out why it doesn't
work properly.

Reported by: davidxu
ea4cd345da2829316a13c218d8bffbb0e5df06f9 11-Jun-2012 pjd <pjd@FreeBSD.org> fdgrowtable() no longer drops the filedesc lock so it is enough to
retry finding free file descriptor only once after fdgrowtable().

Spotted by: pluknet
MFC after: 1 month
b7902b949c74ec688d4db4bea24c77f6057c9ba3 11-Jun-2012 pjd <pjd@FreeBSD.org> Use consistent way of checking if descriptor number is valid.

MFC after: 1 month
00ef5a8d828c8772798b2815af2b191e9f5ab752 11-Jun-2012 pjd <pjd@FreeBSD.org> Be consistent with white spaces.

MFC after: 1 month
d698b8f8521a6e0f34b5a9c499cf45941387d178 11-Jun-2012 pjd <pjd@FreeBSD.org> Remove code duplicated in kern_close() and do_dup() and use closefp() function
introduced a minute ago.

This code duplication was responsible for the bug fixed in r236853.

Discussed with: kib
Tested by: pho
MFC after: 1 month
c8465e01a12c87cc250fef1a3c3e50a9c5f8b8d5 11-Jun-2012 pjd <pjd@FreeBSD.org> Introduce closefp() function that we will be able to use to eliminate
code duplication in kern_close() and do_dup().

This is committed separately from the actual removal of the duplicated
code, as the combined diff was very hard to read.

Discussed with: kib
Tested by: pho
MFC after: 1 month
cab8c2dc3aae7f1ca63d0c4fe8b5e0c87d02dad7 11-Jun-2012 pjd <pjd@FreeBSD.org> Merge two ifs into one to make the code almost identical to the code in
kern_close().

Discussed with: kib
Tested by: pho
MFC after: 1 month
b903b5753dfa4750266fd0e1dd06ba451ad7076d 11-Jun-2012 pjd <pjd@FreeBSD.org> Move the code around a bit to move two parts of code duplicated from
kern_close() close together.

Discussed with: kib
Tested by: pho
MFC after: 1 month
2042e99ed8ccdef6450180d8e17ea548b55394d3 11-Jun-2012 pjd <pjd@FreeBSD.org> Now that fdgrowtable() doesn't drop the filedesc lock we don't need to
check if descriptor changed from under us. Replace the check with an
assert.

Discussed with: kib
Tested by: pho
MFC after: 1 month
23c7c80ef5754d3744c0e192b9e37c1761cc0305 10-Jun-2012 pjd <pjd@FreeBSD.org> When we are closing capability during dup2(), we want to call mq_fdclose()
on the underlying object and not on the capability itself.

Discussed with: rwatson
Sponsored by: FreeBSD Foundation
MFC after: 1 month
0da1a674198b2c75a90d9f2e0eb63bddc23e73d9 10-Jun-2012 pjd <pjd@FreeBSD.org> Merge two ifs into one. Other minor style fixes.

MFC after: 1 month
67f6f356fc8b6f3b8ea8a77976d91073ab2f1f65 10-Jun-2012 pjd <pjd@FreeBSD.org> Simplify fdtofp().

MFC after: 1 month
0311d1f4cc0be72cf22862e3b1d6f02c40030dae 09-Jun-2012 pjd <pjd@FreeBSD.org> There is no need to drop the FILEDESC lock around malloc(M_WAITOK) anymore, as
we now use sx lock for filedesc structure protection.

Reviewed by: kib
MFC after: 1 month
468d011a0d791a06abe4ff6e3ce9539e05e20e3b 09-Jun-2012 pjd <pjd@FreeBSD.org> Remove now unused variable.

MFC after: 1 month
MFC with: r236820
b9def82bd780b376d5d8385bbcf875c9fa0a6bfd 09-Jun-2012 pjd <pjd@FreeBSD.org> Make some of the loops more readable.

Reviewed by: tegge
MFC after: 1 month
b1dc458d22eda63a1f015bcea5e4235cf78b847d 09-Jun-2012 pjd <pjd@FreeBSD.org> Correct panic message.

MFC after: 1 month
MFC with: r236731
b738e3d5243ec99fe11dc6e44f305d7e7a6e66a0 07-Jun-2012 pjd <pjd@FreeBSD.org> In fdalloc() f_ofileflags for the newly allocated descriptor has to be 0.
Assert that instead of setting it to 0.

Sponsored by: FreeBSD Foundation
MFC after: 1 month
2a42c5c4e9a67b8056503efa70123b09f23811a4 11-Apr-2012 eadler <eadler@FreeBSD.org> Return EBADF instead of EMFILE from dup2 when the second argument is
outside the range of valid file descriptors

PR: kern/164970
Submitted by: Peter Jeremy <peterjeremy@acm.org>
Reviewed by: jilles
Approved by: cperciva
MFC after: 1 week
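
The behaviour can be observed from userland along these lines. The helpers dup2_errno() and first_invalid_fd() are hypothetical names, and the sketch assumes the soft RLIMIT_NOFILE value marks the first out-of-range descriptor:

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <sys/resource.h>
#include <unistd.h>

/* Returns 0 if dup2() succeeds, otherwise the errno it reported. */
static int
dup2_errno(int oldfd, int newfd)
{
	if (dup2(oldfd, newfd) == -1)
		return (errno);
	return (0);
}

/*
 * First descriptor number past the soft RLIMIT_NOFILE limit, clamped
 * to INT_MAX if the limit is unavailable or does not fit in an int.
 */
static int
first_invalid_fd(void)
{
	struct rlimit lim;

	if (getrlimit(RLIMIT_NOFILE, &lim) != 0 ||
	    lim.rlim_cur >= (rlim_t)INT_MAX)
		return (INT_MAX);
	return ((int)lim.rlim_cur);
}
```

With a valid 'oldfd', dup2_errno(oldfd, first_invalid_fd()) is expected to report EBADF after this change.
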
506e2f15b93a1584a9103782c48037c858a30609 01-Apr-2012 jhb <jhb@FreeBSD.org> Export some more useful info about shared memory objects to userland
via procstat(1) and fstat(1):
- Change shm file descriptors to track the pathname they are associated
with and add a shm_path() method to copy the path out to a caller-supplied
buffer.
- Use the fo_stat() method of shared memory objects and shm_path() to
export the path, mode, and size of a shared memory object via
struct kinfo_file.
- Add a struct shmstat to the libprocstat(3) interface along with a
procstat_get_shm_info() to export the mode and size of a shared memory
object.
- Change procstat to always print out the path for a given object if it
is valid.
- Teach fstat about shared memory objects and to display their path,
mode, and size.

MFC after: 2 weeks
81cae127b038c1d057cde65bd0457d72894c7d86 08-Mar-2012 pho <pho@FreeBSD.org> Free up allocated memory used by posix_fadvise(2).
def2613c2b5ba17fb5dd1cdae01aaf1256abcb50 15-Nov-2011 obrien <obrien@FreeBSD.org> Reformat comment to be more readable in standard Xterm.
(while I'm here, wrap other long lines)
1e2d8c9d67bc3fa3bf3a560b9b8eac1745104048 04-Nov-2011 jhb <jhb@FreeBSD.org> Move the cleanup of f_cdevpriv when the reference count of a devfs
file descriptor drops to zero out of _fdrop() and into devfs_close_f()
as it is only relevant for devfs file descriptors.

Reviewed by: kib
MFC after: 1 week
0a6da59b6136e874b57c98c59fd355951f60b93d 12-Oct-2011 rwatson <rwatson@FreeBSD.org> Correct a bug in export of capability-related information from the sysctls
supporting procstat -f: properly provide capability rights information to
userspace. The bug resulted from a merge-o during upstreaming (or rather,
a failure to properly merge FreeBSD-side changes downstream).

Spotted by: des, kibab
MFC after: 3 days
99851f359e6f006b3223bb37dbc49e751ca8c13a 16-Sep-2011 kmacy <kmacy@FreeBSD.org> In order to maximize the re-usability of kernel code in user space this
patch modifies makesyscalls.sh to prefix all of the non-compatibility
calls (e.g. not linux_, freebsd32_) with sys_ and updates the kernel
entry points and all places in the code that use them. It also
fixes an additional name space collision between the kernel function
psignal and the libc function of the same name by renaming the kernel
psignal kern_psignal(). By introducing this change now we will ease future
MFCs that change syscalls.

Reviewed by: rwatson
Approved by: re (bz)
5ecd1c9d4080f3ae8a48c02523542b308b562160 18-Aug-2011 jonathan <jonathan@FreeBSD.org> Add experimental support for process descriptors

A "process descriptor" file descriptor is used to manage processes
without using the PID namespace. This is required for Capsicum's
Capability Mode, where the PID namespace is unavailable.

New system calls pdfork(2) and pdkill(2) offer the functional equivalents
of fork(2) and kill(2). pdgetpid(2) allows querying the PID of the remote
process for debugging purposes. The currently-unimplemented pdwait(2) will,
in the future, allow querying rusage/exit status. In the interim, poll(2)
may be used to check (and wait for) process termination.

When a process is referenced by a process descriptor, it does not issue
SIGCHLD to the parent, making it suitable for use in libraries---a common
scenario when using library compartmentalisation from within large
applications (such as web browsers). Some observers may note a similarity
to Mach task ports; process descriptors provide a subset of this behaviour,
but in a UNIX style.

This feature is enabled by "options PROCDESC", but as with several other
Capsicum kernel features, is not enabled by default in GENERIC 9.0.

Reviewed by: jhb, kib
Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
011f42054d1f861cd2435866ba646fa0cf752103 16-Aug-2011 kib <kib@FreeBSD.org> Add the fo_chown and fo_chmod methods to struct fileops and use them
to implement fchown(2) and fchmod(2) support for several file types
that previously lacked it. Add MAC entries for chown/chmod done on
posix shared memory and (old) in-kernel posix semaphores.

Based on the submission by: glebius
Reviewed by: rwatson
Approved by: re (bz)
f63d2e920584a3d403a07e765a61eeac57210332 13-Aug-2011 jonathan <jonathan@FreeBSD.org> Allow Capsicum capabilities to delegate constrained
access to file system subtrees to sandboxed processes.

- Use of absolute paths and '..' are limited in capability mode.
- Use of absolute paths and '..' are limited when looking up relative
to a capability.
- When a name lookup is performed, identify what operation is to be
performed (such as CAP_MKDIR) as well as check for CAP_LOOKUP.

With these constraints, openat() and friends are now safe in capability
mode, and can then be used by code such as the capability-mode runtime
linker.

Approved by: re (bz), mentor (rwatson)
Sponsored by: Google Inc
4af919b491560ff051b65cdf1ec730bdeb820b2e 11-Aug-2011 rwatson <rwatson@FreeBSD.org> Second-to-last commit implementing Capsicum capabilities in the FreeBSD
kernel for FreeBSD 9.0:

Add a new capability mask argument to fget(9) and friends, allowing system
call code to declare what capabilities are required when an integer file
descriptor is converted into an in-kernel struct file *. With options
CAPABILITIES compiled into the kernel, this enforces capability
protection; without, this change is effectively a no-op.

Some cases require special handling, such as mmap(2), which must preserve
information about the maximum rights at the time of mapping in the memory
map so that they can later be enforced in mprotect(2) -- this is done by
narrowing the rights in the existing max_protection field used for similar
purposes with file permissions.

In namei(9), we assert that the code is not reached from within capability
mode, as we're not yet ready to enforce namespace capabilities there.
This will follow in a later commit.

Update two capability names: CAP_EVENT and CAP_KEVENT become
CAP_POST_KEVENT and CAP_POLL_KEVENT to more accurately indicate what they
represent.

Approved by: re (bz)
Submitted by: jonathan
Sponsored by: Google Inc
8bec41f41d53e6b4bc7b841dac70034c913d9ec9 20-Jul-2011 jonathan <jonathan@FreeBSD.org> Export capability information via sysctls.

When reporting on a capability, flag the fact that it is a capability,
but also unwrap to report all of the usual information about the
underlying file.

Approved by: re (kib), mentor (rwatson)
Sponsored by: Google Inc
70f535313aa1ca4af3ed671bb68989fe80d925ae 15-Jul-2011 jonathan <jonathan@FreeBSD.org> Add implementation for capabilities.

Code to actually implement Capsicum capabilities, including fileops and
kern_capwrap(), which creates a capability to wrap an existing file
descriptor.

We also modify kern_close() and closef() to handle capabilities.

Finally, remove cap_filelist from struct capability, since we don't
actually need it.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
7c2c72616768c7ffbdaafd9240b96c6ec37b1349 08-Jul-2011 jonathan <jonathan@FreeBSD.org> Fix the "passability" test in fdcopy().

Rather than checking to see if a descriptor is a kqueue, check to see if
its fileops flags include DFLAG_PASSABLE.

At the moment, these two tests are equivalent, but this will change with
the addition of capabilities that wrap kqueues but are themselves of type
DTYPE_CAPABILITY. We already have the DFLAG_PASSABLE abstraction, so let's
use it.

This change has been tested with [the newly improved] tools/regression/kqueue.

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
4a17b24427e55ec7e1862b08a0a9247f0717d010 06-Jul-2011 trasz <trasz@FreeBSD.org> All the racct_*() calls need to happen with the proc locked. Fixing this
won't happen before 9.0. This commit adds "#ifdef RACCT" around all the
"PROC_LOCK(p); racct_whatever(p, ...); PROC_UNLOCK(p)" instances, in order
to avoid useless locking/unlocking in kernels built without "options RACCT".
6abbb93d5fb70390974d5b0bb73e75616bd9c39a 05-Jul-2011 jonathan <jonathan@FreeBSD.org> Rework _fget to accept capability parameters.

This new version of _fget() requires new parameters:
- cap_rights_t needrights
the rights that we expect the capability's rights mask to include
(e.g. CAP_READ if we are going to read from the file)

- cap_rights_t *haverights
used to return the capability's rights mask (ignored if NULL)

- u_char *maxprotp
the maximum mmap() rights (e.g. VM_PROT_READ) that can be permitted
(only used if we are going to mmap the file; ignored if NULL)

- int fget_flags
FGET_GETCAP if we want to return the capability itself, rather than
the underlying object which it wraps

Approved by: mentor (rwatson), re (Capsicum blanket)
Sponsored by: Google Inc
4d4c5b3285343962855e4ac2e891fc6711595b64 30-Jun-2011 jonathan <jonathan@FreeBSD.org> When Capsicum starts creating capabilities to wrap existing file
descriptors, we will want to allocate a new descriptor without installing
it in the FD array.

Split falloc() into falloc_noinstall() and finstall(), and rewrite
falloc() to call them with appropriate atomicity.

Approved by: mentor (rwatson), re (bz)
fc10099a3dc07cf2fd04e9b2d50ffbd560ac70e3 12-May-2011 stas <stas@FreeBSD.org> - Do not try to drop a NULL filedesc pointer.
5f9f79547658271f3f469b6423a176831fef7683 12-May-2011 stas <stas@FreeBSD.org> - Commit work from libprocstat project. These patches add support for runtime
file and processes information retrieval from the running kernel via sysctl
in the form of new library, libprocstat. The library also supports KVM backend
for analyzing memory crash dumps. Both procstat(1) and fstat(1) utilities have
been modified to take advantage of the library (as the bonus point the fstat(1)
utility no longer needs superuser privileges to operate), and the procstat(1)
utility is now able to display information from memory dumps as well.

The newly introduced fuser(1) utility also uses this library and is able to operate
via sysctl and kvm backends.

The library is by no means complete (e.g. KVM backend is missing vnode name
resolution routines, and there're no manpages for the library itself) so I
plan to improve it further. I'm committing it so it will get wider exposure
and review.

We won't be able to MFC this work as it relies on changes in HEAD, which
were introduced some time ago, that break kernel ABI. OTOH we may be able
to merge the library with KVM backend if we really need it there.

Discussed with: rwatson
a0192d37e69b36c66e920e1050038e46084d28ee 06-Apr-2011 trasz <trasz@FreeBSD.org> Add RACCT_NOFILE accounting.

Sponsored by: The FreeBSD Foundation
Reviewed by: kib (earlier version)
eb730d92e49e2ade0bd124e5d3b8506b02a768cb 01-Apr-2011 kib <kib@FreeBSD.org> After the r219999 is merged to stable/8, rename fallocf(9) to falloc(9)
and remove the falloc() version that lacks flag argument. This is done
to reduce the KPI bloat.

Requested by: jhb
X-MFC-note: do not
fc2bd01611a5db99f84b5d7f3109b9f9274f548d 25-Mar-2011 kib <kib@FreeBSD.org> Add O_CLOEXEC flag to open(2) and fhopen(2).
The new function fallocf(9), that is renamed falloc(9) with added
flag argument, is provided to facilitate the merge to stable branch.

Reviewed by: jhb
MFC after: 1 week
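
A userland check of the new flag could look like the sketch below. open_cloexec() and fd_is_cloexec() are hypothetical helper names introduced only for this example:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open 'path' read-only with close-on-exec set at open time. */
static int
open_cloexec(const char *path)
{
	return (open(path, O_RDONLY | O_CLOEXEC));
}

/* Returns 1 if FD_CLOEXEC is set on 'fd', 0 if not, -1 on error. */
static int
fd_is_cloexec(int fd)
{
	int flags;

	flags = fcntl(fd, F_GETFD);
	if (flags == -1)
		return (-1);
	return ((flags & FD_CLOEXEC) != 0);
}
```

The descriptor is never visible without the flag, unlike an open() followed by a separate fcntl(F_SETFD) call.
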
c7ac62aecd199084355eed0c34e006b1a886e37e 24-Mar-2011 jhb <jhb@FreeBSD.org> Fix some locking nits with the p_state field of struct proc:
- Hold the proc lock while changing the state from PRS_NEW to PRS_NORMAL
in fork to honor the locking requirements. While here, expand the scope
of the PROC_LOCK() on the new process (p2) to avoid some LORs. Previously
the code was locking the new child process (p2) after it had locked the
parent process (p1). However, when locking two processes, the safe order
is to lock the child first, then the parent.
- Fix various places that were checking p_state against PRS_NEW without
having the process locked to use PROC_LOCK(). Every place was already
locking the process, just after the PRS_NEW check.
- Remove or reduce the use of PROC_SLOCK() for places that were checking
p_state against PRS_NEW. The PROC_LOCK() alone is sufficient for reading
the current state.
- Reorder fill_kinfo_proc() slightly so it only acquires PROC_SLOCK() once.

MFC after: 1 week
b9b7d3e93a251adacd363eaf9d23963ec7467075 16-Feb-2011 bz <bz@FreeBSD.org> Mfp4 CH=177274,177280,177284-177285,177297,177324-177325

VNET socket push back:
try to minimize the number of places where we have to switch vnets
and narrow down the time we stay switched. Add assertions to the
socket code to catch possibly unset vnets as seen in r204147.

While this reduces the number of vnet recursion in some places like
NFS, POSIX local sockets and some netgraph, .. recursions are
impossible to fix.

The current expectations are documented at the beginning of
uipc_socket.c along with the other information there.

Sponsored by: The FreeBSD Foundation
Sponsored by: CK Software GmbH
Reviewed by: jhb
Tested by: zec

Tested by: Mikolaj Golub (to.my.trociny gmail.com)
MFC after: 2 weeks
f43c10ba8c6f7334b043a441d0a2c449309da1b0 28-Jan-2011 jilles <jilles@FreeBSD.org> Do not trip a KASSERT if /dev/null cannot be opened for a setuid program.

The fdcheckstd() function makes sure fds 0, 1 and 2 are open by opening
/dev/null. If this fails (e.g. missing devfs or wrong permissions),
fdcheckstd() will return failure and the process will exit as if it received
SIGABRT. The KASSERT is only to check that kern_open() returns the expected
fd, given that it succeeded.

Tripping the KASSERT is most likely if fd 0 is open but fd 1 or 2 are not.

MFC after: 2 weeks
a6922e1e8c1ff68412b1e16fd32603720f4c8e71 04-Jan-2011 kib <kib@FreeBSD.org> Finish r210923, 210926. Mark some devices as eternal.

MFC after: 2 weeks
09f9c897d33c41618ada06fbbcf1a9b3812dee53 19-Oct-2010 jamie <jamie@FreeBSD.org> A new jail(8) with a configuration file, to replace the work currently done
by /etc/rc.d/jail.
2bea36895da2f87e483fbd0c7e59d744a668dd11 02-May-2010 bz <bz@FreeBSD.org> MFC r207116:
Remove one zero from the double-0.
This code doesn't have a license to kill.
8b9f8a673542755de61861a6fcfcc7f5215e3526 23-Apr-2010 bz <bz@FreeBSD.org> Remove one zero from the double-0.
This code doesn't have a license to kill.

MFC after: 3 days
f1216d1f0ade038907195fc114b7e630623b402c 19-Mar-2010 delphij <delphij@FreeBSD.org> Create a custom branch where I will be able to do the merge.
3c2bafe65e8458fc5abe508bf9e336768eda01bb 07-Dec-2009 delphij <delphij@FreeBSD.org> MFC revision 197579 and 199617:

Add two new fcntls to enable/disable read-ahead:

- F_READAHEAD: specify the amount for sequential access. The amount is
specified in bytes and is rounded up to nearest block size.
- F_RDAHEAD: Darwin compatible version that uses 128KB as the sequential
access size.

A third argument of zero disables the read-ahead behavior.

Please note that the read-ahead amount is also constrained by sysctl
variable, vfs.read_max, which may need to be raised in order to better
utilize this feature.

Thanks Igor Sysoev for proposing the feature and submitting the original
version, and kib@ for his valuable comments.
66176c69689f18f787b94036e05758d211ccbd02 20-Nov-2009 kib <kib@FreeBSD.org> On the return path from F_RDAHEAD and F_READAHEAD fcntls, do not
unlock Giant twice.

While there, bring conditions in the do/while loops closer to style,
that also makes the lines fit into 80 columns.

Reported and tested by: dougb
79f2f8c774b48d4ef62f92d4a09e80af3ceaca7c 28-Sep-2009 delphij <delphij@FreeBSD.org> Add two new fcntls to enable/disable read-ahead:

- F_READAHEAD: specify the amount for sequential access. The amount is
specified in bytes and is rounded up to nearest block size.
- F_RDAHEAD: Darwin compatible version that uses 128KB as the sequential
access size.

A third argument of zero disables the read-ahead behavior.

Please note that the read-ahead amount is also constrained by sysctl
variable, vfs.read_max, which may need to be raised in order to better
utilize this feature.

Thanks Igor Sysoev for proposing the feature and submitting the original
version, and kib@ for his valuable comments.

Submitted by: Igor Sysoev <is rambler-co ru>
Reviewed by: kib@
MFC after: 1 month
da78c9e4a2e1689a4d400553bb5f6aa0537c5f49 27-Jun-2009 rwatson <rwatson@FreeBSD.org> Replace AUDIT_ARG() with variable argument macros with a set of more
specific macros for each audit argument type. This makes it easier to
follow call-graphs, especially for automated analysis tools (such as
fxr).

In MFC, we should leave the existing AUDIT_ARG() macros as they may be
used by third-party kernel modules.

Suggested by: brooks
Approved by: re (kib)
Obtained from: TrustedBSD Project
MFC after: 1 week
4208ef996734cc86a5d2151d45f22c675eb837ea 24-Jun-2009 lulf <lulf@FreeBSD.org> - Similar to the previous commit, but for CURRENT: Fix a bug where a FIFO vnode
use count was increased twice, but only decreased once.
6c60345b340131c66e68bbef66500d1f0eefb556 24-Jun-2009 lulf <lulf@FreeBSD.org> - Fix a bug where a FIFO vnode use count was increased twice, but only
decreased once.

MFC after: 1 week
447d980cd05483b9af4e91000999997f5ba018e7 15-Jun-2009 jhb <jhb@FreeBSD.org> Add a new 'void closefrom(int lowfd)' system call. When called, it closes
any open file descriptors >= 'lowfd'. It is largely identical to the same
function on other operating systems such as Solaris, DFly, NetBSD, and
OpenBSD. One difference from other *BSD is that this closefrom() does not
fail with any errors. In practice, while the manpages for NetBSD and
OpenBSD claim that they return EINTR, they ignore internal errors from
close() and never return EINTR. DFly does return EINTR, but for the common
use case (closing fd's prior to execve()), the caller really wants all
fd's closed and returning EINTR just forces callers to call closefrom() in
a loop until it stops failing.

Note that this implementation of closefrom(2) does not make any effort to
resolve userland races with open(2) in other threads. As such, it is not
multithread safe.

Submitted by: rwatson (initial version)
Reviewed by: rwatson
MFC after: 2 weeks
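
The "never fails" semantics described above can be approximated in userland by a loop that ignores per-descriptor errors. This is only a sketch: close_fds_between() is a hypothetical helper over a bounded half-open range, whereas the system call works on the kernel's descriptor table directly.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Close every descriptor in [lowfd, endfd), ignoring errors such as
 * EBADF (slot not open) or EINTR, and leave errno as it was found.
 * Like the system call, this does nothing about other threads that
 * open descriptors concurrently.
 */
static void
close_fds_between(int lowfd, int endfd)
{
	int fd, saved_errno;

	saved_errno = errno;
	for (fd = lowfd; fd < endfd; fd++)
		(void)close(fd);
	errno = saved_errno;
}
```

A caller preparing for execve() would typically pass the first descriptor it does not want to keep as 'lowfd'.
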
7bd92180e74fab8260b697c4de1f4bfe36a17a4a 02-Jun-2009 jeff <jeff@FreeBSD.org> - Use an acquire barrier to increment f_count in fget_unlocked and
remove the volatile cast. Describe the reason in detail in a comment.

Discussed with: bde, jhb
a013e0afcbb44052a86a7977277d669d8883b7e7 27-May-2009 jamie <jamie@FreeBSD.org> Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system
call.

Approved by: bz (mentor)
ebdd571432523104dcdd5ff6b619fc12082420f7 20-May-2009 jhb <jhb@FreeBSD.org> Set the umask in a new file descriptor table earlier in fdcopy() to remove
two lock operations.
b8162aa0c9949012533f0436d057c8cc5bea571d 15-May-2009 kib <kib@FreeBSD.org> Revert r192094. The revision caused problems for sysctl(3) consumers
that expect that oldlen is filled with required buffer length even when
supplied buffer is too short and returned error is ENOMEM.

Redo the fix for kern.proc.filedesc, by reverting the req->oldidx when
remaining buffer space is too short for the current kinfo_file structure.
Also, only ignore ENOMEM. We have to convert ENOMEM to no error condition
to keep existing interface for the sysctl, though.

Reported by: ed, Florian Smeets <flo kasimir com>
Tested by: pho
20397e643153b90263768cb71928b488cab2c91e 14-May-2009 jeff <jeff@FreeBSD.org> - Implement a lockless file descriptor lookup algorithm in
fget_unlocked().
- Save old file descriptor tables created on expansion until
the entire descriptor table is freed so that pointers may be
followed without regard for expanders.
- Mark the file zone as NOFREE so we may attempt to reference
potentially freed files.
- Convert several fget_locked() users to fget_unlocked(). This
requires us to manage reference counts explicitly but reduces
locking overhead in the common case.
a66490b8892e7ed1030646b903c3e463e2005e61 15-Apr-2009 jhb <jhb@FreeBSD.org> Update comment above _fget() for earlier change to FWRITE failures return
EBADF rather than EINVAL.

Submitted by: Jaakko Heinonen jh saunalahti fi
MFC after: 1 month
60038f21cf78de5ad20b51c2f0953cc1801a827f 14-Feb-2009 marcus <marcus@FreeBSD.org> Remove the printf's when the vnode to be exported for procstat is not a VDIR.
If the file system backing a process' cwd is removed, and procstat -f PID
is called, then these messages would have been printed. The extra verbosity is
not required in this situation.

Requested by: kib
Approved by: kib
130b8c14ad2ecacb715fe63397e5e867a7a7c7ec 14-Feb-2009 marcus <marcus@FreeBSD.org> Change two KASSERTS to printfs and simple returns. Stress testing has
revealed that a process' current working directory can be VBAD if the
directory is removed. This can trigger a panic when procstat -f PID is
run.

Tested by: pho
Discovered by: phobot
Reviewed by: kib
Approved by: kib
ced47d0a8e4610843f05ea493da7c712e17f9eb2 11-Feb-2009 rwatson <rwatson@FreeBSD.org> Modify fdcopy() so that, during fork(2), it won't copy file descriptors
from the parent to the child process if they have an operation vector
of &badfileops. This narrows a set of races involving system calls that
allocate a new file descriptor, potentially block for some extended
period, and then return the file descriptor, when invoked by a threaded
program that concurrently invokes fork(2). Similar approaches are used
in both Solaris and Linux, and the wideness of this race was introduced
in FreeBSD when we moved to a more optimistic implementation of
accept(2) in order to simplify locking.

A small race necessarily remains because the fork(2) might occur after
the finit() in accept(2) but before the system call has returned, but
that appears unavoidable using current APIs. However, this race is
vastly narrower.

The fix can be validated using the newfileops_on_fork regression test.

PR: kern/130348
Reported by: Ivan Shcheklein <shcheklein at gmail dot com>
Reviewed by: jhb, kib
MFC after: 1 week
2349a65923842226ae7c1ed630f1d87991af065f 30-Dec-2008 kib <kib@FreeBSD.org> Clear the pointers to the file in the struct filedesc before file is closed
in fdfree. Otherwise, sysctl_kern_proc_filedesc may dereference stale
struct file * values.

Reported and tested by: pho
MFC after: 1 month
6f7ed797d320c33ec8c210b24fb80ef01c5e5f13 10-Dec-2008 kmacy <kmacy@FreeBSD.org> IF_RELENG7 184527:185849
0cd59a18e366e541876a23e59b53ce6246f0cec8 02-Dec-2008 peter <peter@FreeBSD.org> Prune some whining.
cd7b78c33f9eb6fc2730afe6d09252f28cf9996e 01-Dec-2008 peter <peter@FreeBSD.org> Duplicate another few hundred lines of code in order to be compatible
with unreleased binaries.
2b1f03929a1b2aede23598bc1b54f8061c22fd69 30-Nov-2008 peter <peter@FreeBSD.org> Properly wrap this giant block of duplicate code inside COMPAT_FREEBSD7
343bde97065f4c35293558594e5b36627ef8ccb7 30-Nov-2008 peter <peter@FreeBSD.org> Implement copyout packing more along the lines of what I had in mind.
Create a temporary duplicate implementation of old filedesc struct for
pre-7.1 libgtop package.
Todo: specific fd or addr request
83dc2280cebd626d5a4fda7c4268a94d329c4e78 29-Nov-2008 peter <peter@FreeBSD.org> WIP kinfo_file/kinfo_vmmentry tweaks. The idea:
1) to get the 32 and 64 bit versions in sync so that no shims are needed,
Valgrind in particular exercises this. and:
2) reduce the size of the copyout. On large processes this turns out to
be a huge problem. Valgrind also suffers from this since it needs to do
this in a context that can't malloc. I want to pack the records.
3) Add new types.. 'tell me about fd N' and 'tell me about addr N'.
19b6af98ec71398e77874582eb84ec5310c7156f 22-Nov-2008 dfr <dfr@FreeBSD.org> Clone Kip's Xen on stable/6 tree so that I can work on improving FreeBSD/amd64
performance in Xen's HVM mode.
6c6f8c89e867271ab3baf15171419147ba85e088 04-Nov-2008 jhb <jhb@FreeBSD.org> Remove unnecessary locking around vn_fullpath(). The vnode lock for the
vnode in question does not need to be held. All the data structures used
during the name lookup are protected by the global name cache lock.
Instead, the caller merely needs to ensure a reference is held on the
vnode (such as vhold()) to keep it from being freed.

In the case of procfs' <pid>/file entry, grab the process lock while we
gain a new reference (via vhold()) on p_textvp to fully close races with
execve(2).

For the kern.proc.vmmap sysctl handler, use a shared vnode lock around
the call to VOP_GETATTR() rather than an exclusive lock.

MFC after: 1 month
ee8312c8bb0864cea61a7e69692721d017fad2b0 03-Nov-2008 jhb <jhb@FreeBSD.org> Use shared vnode locks instead of exclusive vnode locks for the access(),
chdir(), chroot(), eaccess(), fpathconf(), fstat(), fstatfs(), lseek()
(when figuring out the current size of the file in the SEEK_END case),
pathconf(), readlink(), and statfs() system calls.

Submitted by: ups (mostly)
Tested by: pho
MFC after: 1 month
27c09cdcec9ef5edfa7f22d40d6925ab4e1b7f75 01-Nov-2008 kmacy <kmacy@FreeBSD.org> IF_RELENG7 183757:184526
a1e1ad22e07d384a9609e60cdf00daf7cac902cf 23-Oct-2008 des <des@FreeBSD.org> Fix a number of style issues in the MALLOC / FREE commit. I've tried to
be careful not to fix anything that was already broken; the NFSv4 code is
particularly bad in this respect.
66f807ed8b3634dc73d9f7526c484e43f094c0ee 23-Oct-2008 des <des@FreeBSD.org> Retire the MALLOC and FREE macros. They are an abomination unto style(9).

MFC after: 3 months
cf5320822f93810742e3d4a1ac8202db8482e633 19-Oct-2008 lulf <lulf@FreeBSD.org> - Import the HEAD csup code which is the basis for the cvsmode work.
ef6dfc27c47cc39216c7fd950dfc0b2caeba8e19 12-Oct-2008 rwatson <rwatson@FreeBSD.org> Downgrade XXX to a Note for fgetsock() and fputsock().

MFC after: 3 days
cc3116a9380fe32a751b584f3d8083698ccfba15 20-Aug-2008 ed <ed@FreeBSD.org> Integrate the new MPSAFE TTY layer to the FreeBSD operating system.

The last half year I've been working on a replacement TTY layer for the
FreeBSD kernel. The new TTY layer was designed to improve the following:

- Improved driver model:

The old TTY layer has a driver model that is not abstract enough to
make it friendly to use. A good example is the output path, where the
device drivers directly access the output buffers. This means that an
in-kernel PPP implementation must always convert network buffers into
TTY buffers.

If a PPP implementation would be built on top of the new TTY layer
(still needs a hooks layer, though), it would allow the PPP
implementation to directly hand the data to the TTY driver.

- Improved hotplugging:

With the old TTY layer, it isn't entirely safe to destroy TTY's from
the system. This implementation has a two-step destructing design,
where the driver first abandons the TTY. After all threads have left
the TTY, the TTY layer calls a routine in the driver, which can be
used to free resources (unit numbers, etc).

The pts(4) driver also implements this feature, which means
posix_openpt() will now return PTY's that are created on the fly.

- Improved performance:

One of the major improvements is the per-TTY mutex, which is expected
to improve scalability when compared to the old Giant locking.
Another change is the unbuffered copying to userspace, which is both
used on TTY device nodes and PTY masters.

Upgrading should be quite straightforward. Unlike previous versions,
existing kernel configuration files do not need to be changed, except
when they reference device drivers that are listed in UPDATING.

Obtained from: //depot/projects/mpsafetty/...
Approved by: philip (ex-mentor)
Discussed: on the lists, at BSDCan, at the DevSummit
Sponsored by: Snow B.V., the Netherlands
dcons(4) fixed by: kan
22ff03f33782b6e03626ec22f322eddfdf99999d 19-Aug-2008 kib <kib@FreeBSD.org> MFC r179175:
Implement the per-open file data for the cdev.
The td_fpop member of the struct thread is appended to the end, and cleared
in the thread allocator to keep struct thread KBI-compatible on RELENG_7.

MFC r181635:
Remove unnecessary locking around pointer fetch.
4fd52c514c0d34bbccb492315965a79a9d09853d 12-Aug-2008 des <des@FreeBSD.org> MFH r176471 (KTR_STRUCT, support for struct stat and struct sockaddr)
746d949d895abb79b7e7f29d4c945d54e6108be0 09-Aug-2008 ed <ed@FreeBSD.org> Remove unneeded D_NEEDGIANT from /dev/fd/{0,1,2}.

There is no reason the fdopen() routine needs Giant. It only sets
curthread->td_dupfd, based on the device unit number of the cdev.

I guess we won't get massive performance improvements here, but still, I
assume we eventually want to get rid of Giant.
97ddbbd5772f2e80c12239cbf03628740879ab24 22-Jul-2008 rwatson <rwatson@FreeBSD.org> Merge r177253, r177255 from head to stable/7:

In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

Discussed with: imp, rink

The requirement to place a ; after each SYSINIT definition has not been
MFC'd, as this might break the compile of third-party modules, but merging
the actual ; additions reduces diffs against 8.x making it easier to
merge other changes.
411d06839511bd58b0df613c30e0ff8b09022ed1 27-Jun-2008 jhb <jhb@FreeBSD.org> Rework the lifetime management of the kernel implementation of POSIX
semaphores. Specifically, semaphores are now represented as new file
descriptor type that is set to close on exec. This removes the need for
all of the manual process reference counting (and fork, exec, and exit
event handlers) as the normal file descriptor operations handle all of
that for us nicely. It is also suggested as one possible implementation
in the spec and at least one other OS (OS X) uses this approach.

Some bugs that were fixed as a result include:
- References to a named semaphore whose name is removed still work after
the sem_unlink() operation. Prior to this patch, if a semaphore's name
was removed, valid handles from sem_open() would get EINVAL errors from
sem_getvalue(), sem_post(), etc. This fixes that.
- Unnamed semaphores created with sem_init() were not cleaned up when a
process exited or exec'd. They were only cleaned up if the process
did an explicit sem_destroy(). This could result in a leak of semaphore
objects that could never be cleaned up.
- On the other hand, if another process guessed the id (kernel pointer to
'struct ksem') of an unnamed semaphore (created via sem_init) and had
write access to the semaphore based on UID/GID checks, then that other
process could manipulate the semaphore via sem_destroy(), sem_post(),
sem_wait(), etc.
- As part of the permission check (UID/GID), the umask of the process
creating the semaphore was not honored. Thus if your umask denied group
read/write access but the explicit mode in the sem_init() call allowed
it, the semaphore would be readable/writable by other users in the
same group, for example. This includes access via the previous bug.
- If the module refused to unload because there were active semaphores,
then it might have deregistered one or more of the semaphore system
calls before it noticed that there was a problem. I'm not sure if
this actually happened as the order that modules are discovered by the
kernel linker depends on how the actual .ko file is linked. One can
make the order deterministic by using a single module with a mod_event
handler that explicitly registers syscalls (and deregisters during
unload after any checks). This also fixes a race where even if the
sem_module unloaded first it would have destroyed locks that the
syscalls might be trying to access if they are still executing when
they are unloaded.

XXX: By the way, deregistering system calls doesn't do any blocking
to drain any threads from the calls.
- Some minor fixes to errno values on error. For example, sem_init()
isn't documented to return ENFILE or EMFILE if we run out of semaphores
the way that sem_open() can. Instead, it should return ENOSPC in that
case.

Other changes:
- Kernel semaphores now use a hash table to manage the namespace of
named semaphores in nearly the same fashion as the POSIX shared memory
object file descriptors. Kernel semaphores can now also have names
longer than 14 chars (up to MAXPATHLEN) and can include subdirectories
in their pathname.
- The UID/GID permission checks for access to a named semaphore are now
done via vaccess() rather than a home-rolled set of checks.
- Now that kernel semaphores have an associated file object, the various
MAC checks for POSIX semaphores accept both a file credential and an
active credential. There is also a new posixsem_check_stat() since it
is possible to fstat() a semaphore file descriptor.
- A small set of regression tests (using the ksem API directly) is present
in src/tools/regression/posixsem.

Reported by: kris (1)
Tested by: kris
Reviewed by: rwatson (lightly)
MFC after: 1 month
83304da0e8894a62450630d692b89fe85958885a 28-May-2008 ed <ed@FreeBSD.org> Remove redundant checks from fcntl()'s F_DUPFD.

Right now we perform some of the checks inside the fcntl()'s F_DUPFD
operation twice. We first validate the `fd' argument. When finished,
we validate the `arg' argument. These checks are also performed inside
do_dup().

The reason we need to do this, is because fcntl() should return different
errno's when the `arg' argument is out of bounds (EINVAL instead of
EBADF). To prevent the redundant locking of the PROC_LOCK and
FILEDESC_SLOCK, patch do_dup() to support the error semantics required
by fcntl().

Approved by: philip (mentor)
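
Illustration (not part of the commit): the two errno cases that fcntl() has to keep apart can be observed from userland. This is a minimal sketch assuming a POSIX system. The helper dupfd_errno() is hypothetical and only reports which error, if any, F_DUPFD produced.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical probe: try fcntl(fd, F_DUPFD, minfd).  On success the
 * duplicate is closed again and 0 is returned; on failure the errno
 * value set by fcntl() is returned.
 */
int
dupfd_errno(int fd, int minfd)
{
	int newfd = fcntl(fd, F_DUPFD, minfd);

	if (newfd == -1)
		return (errno);
	(void)close(newfd);
	return (0);
}
```

With a valid descriptor fd, dupfd_errno(fd, -1) is expected to report EINVAL (bad `arg'), while dupfd_errno(-1, 0) is expected to report EBADF (bad `fd').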
4d240aa98e28b74ea9808f1ed3cd852399f4c1ba 25-May-2008 attilio <attilio@FreeBSD.org> Replace direct atomic operation for the file refcount with the
refcount interface.
It also introduces the correct usage of memory barriers, as sometimes
fdrop() and fhold() are used with shared locks, which don't use any
release barrier.
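
Illustration (not part of the commit): a userland analogue of a reference count that carries its own memory ordering, so that it does not rely on a surrounding lock for barriers. This is a sketch using C11 atomics, not the kernel's refcount(9) code. The names struct obj_ref, ref_init(), ref_acquire() and ref_release(), and the specific ordering choices, are assumptions based on the common pattern.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object header. */
struct obj_ref {
	atomic_uint count;
};

void
ref_init(struct obj_ref *r, unsigned int value)
{
	atomic_init(&r->count, value);
}

/* Take a reference; the caller must already hold one. */
void
ref_acquire(struct obj_ref *r)
{
	atomic_fetch_add_explicit(&r->count, 1, memory_order_relaxed);
}

/*
 * Drop a reference.  The decrement uses release ordering so earlier
 * writes to the object are visible to whichever thread drops the last
 * reference; that thread then issues an acquire fence before it tears
 * the object down.  Returns true when the caller dropped the last
 * reference.
 */
bool
ref_release(struct obj_ref *r)
{
	if (atomic_fetch_sub_explicit(&r->count, 1,
	    memory_order_release) == 1) {
		atomic_thread_fence(memory_order_acquire);
		return (true);
	}
	return (false);
}
```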
5971791c189c2ddb0bee3aadd934e064d60bf299 21-May-2008 kib <kib@FreeBSD.org> Implement the per-open file data for the cdev.

The patch does not change the cdevsw KBI. Management of the data is
provided by the functions
int devfs_set_cdevpriv(void *priv, cdevpriv_dtr_t dtr);
int devfs_get_cdevpriv(void **datap);
void devfs_clear_cdevpriv(void);
All of the functions are supposed to be called from the cdevsw method
contexts.

- devfs_set_cdevpriv assigns the priv as private data for the file
descriptor which is used to initiate currently performed driver
operation. dtr is the function that will be called when either the
last reference to the file goes away, the device is destroyed or
devfs_clear_cdevpriv is called.
- devfs_get_cdevpriv is the obvious accessor.
- devfs_clear_cdevpriv allows to clear the private data for the still
open file.

Implementation keeps the driver-supplied pointers in the struct
cdev_privdata, that is referenced both from the struct file and struct
cdev, and cannot outlive any of the referee.

Man pages will be provided after the KPI stabilizes.

Reviewed by: jhb
Useful suggestions from: jeff, antoine
Debugging help and tested by: pho
MFC after: 1 month
cfe6bd34110d22e97cd47c71710c01cb13d0e664 16-May-2008 kris <kris@FreeBSD.org> MFC 1.330, 1.331:

fdhold can return NULL, so add the one remaining missing check for this
condition.
150f1de0cf1e2342ec39b72b7c3969d252db3032 26-Apr-2008 kris <kris@FreeBSD.org> * Correct a mis-merge that leaked the PROC_LOCK [1]
* Return ENOENT on error instead of 0 [2]

Submitted by: rdivacky [1], kib [2]
d6c5faf2cc345f9eb5037795770d15418450f08b 24-Apr-2008 kris <kris@FreeBSD.org> fdhold can return NULL, so add the one remaining missing check for this
condition.

Reviewed by: attilio
MFC after: 1 week
0536363c8541b26238166e5c92aa1e95b5301de5 24-Apr-2008 dfr <dfr@FreeBSD.org> MFC: kernel-mode NFS lock manager.
bdc8481556f5026bed1bb54a4a39707388a41901 20-Apr-2008 antoine <antoine@FreeBSD.org> MFC to RELENG_7:
Introduce a new F_DUP2FD command to fcntl(2), for compatibility with
Solaris and AIX.
fcntl(fd, F_DUP2FD, arg) and dup2(fd, arg) are functionally equivalent.
Document it.
Add some regression tests (identical to the dup2(2) regression tests).

PR: 120233
Submitted by: Jukka Ukkonen
Approved by: rwaston (mentor)
MFC after: 1 month
41ab4e51db1c9de65d35fdf40c98f653148784e5 11-Apr-2008 rwatson <rwatson@FreeBSD.org> Merge kern_descrip.c:1.322, user.h:1.74, procstat_files.c:1.5:

Add support for displaying a process' current working directory, root
directory, and jail directory within procstat. While this functionality
is available already in fstat, encapsulating it in the kern.proc.filedesc
sysctl makes it accessible without using kvm and thus without needing
elevated permissions.

The new procstat output looks like:

PID COMM FD T V FLAGS REF OFFSET PRO NAME
76792 tcsh cwd v d -------- - - - /usr/src
76792 tcsh root v d -------- - - - /
76792 tcsh 15 v c rw------ 16 9130 - -
76792 tcsh 16 v c rw------ 16 9130 - -
76792 tcsh 17 v c rw------ 16 9130 - -
76792 tcsh 18 v c rw------ 16 9130 - -
76792 tcsh 19 v c rw------ 16 9130 - -

I am also bumping __FreeBSD_version for this as this new feature will be
used in at least one port.

Reviewed by: rwatson
Approved by: rwatson

Note that in the MFC, __FreeBSD_version is not bumped as we will bump it
once (shortly) for all procstat(1) MFC changes together.
9be8410be1950146fe071ef9680d4a8c69258a81 10-Apr-2008 rwatson <rwatson@FreeBSD.org> Merge kern_descrip.c:1.314, kern_proc.c:1.256, sysctl.h:1.153,
user.h:1.71 from HEAD to RELENG_7:

Add two new sysctls in support of the forthcoming procstat(1) to support
its -f and -v arguments:

kern.proc.filedesc - dump file descriptor information for a process, if
debugging is permitted, including socket addresses, open flags, file
offsets, file paths, etc.

kern.proc.vmmap - dump virtual memory mapping information for a process,
if debugging is permitted, including layout and information on
underlying objects, such as the type of object and path.

These provide a superset of the information historically available
through the now-deprecated procfs(4), and are intended to be exported
in an ABI-robust form.
693c703a12fbd6080805760eae94c2e13566cdd6 10-Apr-2008 dfr <dfr@FreeBSD.org> MFC: Kernel mode Network Lock Manager.
79d2dfdaa69db38c43daed9744a6dbd0568189b5 26-Mar-2008 dfr <dfr@FreeBSD.org> Add the new kernel-mode NFS Lock Manager. To use it instead of the
user-mode lock manager, build a kernel with the NFSLOCKD option and
add '-k' to 'rpc_lockd_flags' in rc.conf.

Highlights include:

* Thread-safe kernel RPC client - many threads can use the same RPC
client handle safely with replies being de-multiplexed at the socket
upcall (typically driven directly by the NIC interrupt) and handed
off to whichever thread matches the reply. For UDP sockets, many RPC
clients can share the same socket. This allows the use of a single
privileged UDP port number to talk to an arbitrary number of remote
hosts.

* Single-threaded kernel RPC server. Adding support for multi-threaded
server would be relatively straightforward and would follow
approximately the Solaris KPI. A single thread should be sufficient
for the NLM since it should rarely block in normal operation.

* Kernel mode NLM server supporting cancel requests and granted
callbacks. I've tested the NLM server reasonably extensively - it
passes both my own tests and the NFS Connectathon locking tests
running on Solaris, Mac OS X and Ubuntu Linux.

* Userland NLM client supported. While the NLM server doesn't have
support for the local NFS client's locking needs, it does have to
field async replies and granted callbacks from remote NLMs that the
local client has contacted. We relay these replies to the userland
rpc.lockd over a local domain RPC socket.

* Robust deadlock detection for the local lock manager. In particular
it will detect deadlocks caused by a lock request that covers more
than one blocking request. As required by the NLM protocol, all
deadlock detection happens synchronously - a user is guaranteed that
if a lock request isn't rejected immediately, the lock will
eventually be granted. The old system allowed for a 'deferred
deadlock' condition where a blocked lock request could wake up and
find that some other deadlock-causing lock owner had beaten them to
the lock.

* Since both local and remote locks are managed by the same kernel
locking code, local and remote processes can safely use file locks
for mutual exclusion. Local processes have no fairness advantage
compared to remote processes when contending to lock a region that
has just been unlocked - the local lock manager enforces a strict
first-come first-served model for both local and remote lockers.

Sponsored by: Isilon Systems
PR: 95247 107555 115524 116679
MFC after: 2 weeks
d818a8db6839fadf3aba5d6d4a2bb79483200ba3 19-Mar-2008 sobomax <sobomax@FreeBSD.org> Revert previous change - it appears that the limit I was hitting was a
maxsockets limit, not maxfiles limit. The question remains why those
limits are handled differently (with error code for maxfiles but with
sleep for maxsockets), but those would be addressed in a separate commit
if necessary.

Requested by: rwhatson, jeff
877d7c65ba9b74233df6c9197fc39c770e809d02 16-Mar-2008 rwatson <rwatson@FreeBSD.org> In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink
1560402d31fc795d7c3f43a5af671bc3a1c5993b 16-Mar-2008 sobomax <sobomax@FreeBSD.org> Properly set size of the file_zone to match kern.maxfiles parameter.
Otherwise the parameter is no-op, since zone by default limits number
of descriptors to some 12K entries. Attempt to allocate more ends up
sleeping on zonelimit.

MFC after: 2 weeks
514f31f40ed28fea8fdc190c743417debb0d03b3 08-Mar-2008 antoine <antoine@FreeBSD.org> Introduce a new F_DUP2FD command to fcntl(2), for compatibility with
Solaris and AIX.
fcntl(fd, F_DUP2FD, arg) and dup2(fd, arg) are functionally equivalent.
Document it.
Add some regression tests (identical to the dup2(2) regression tests).

PR: 120233
Submitted by: Jukka Ukkonen
Approved by: rwaston (mentor)
MFC after: 1 month
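
Illustration (not part of the commit): the stated equivalence between F_DUP2FD and dup2(2) can be wrapped in a single helper. This is a sketch only. The name dup_onto() is hypothetical, and the #ifdef is an assumption that keeps the example building on systems whose <fcntl.h> does not define F_DUP2FD.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical wrapper: make `target' refer to the same open file as
 * `fd'.  Uses the fcntl() command where the platform defines it and
 * falls back to dup2() otherwise.  Returns `target' on success and
 * -1 on failure.
 */
int
dup_onto(int fd, int target)
{
#ifdef F_DUP2FD
	return (fcntl(fd, F_DUP2FD, target));
#else
	return (dup2(fd, target));
#endif
}
```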
df26e399aa077b14fb965be866012bccf2847bae 23-Feb-2008 des <des@FreeBSD.org> This patch adds a new ktrace(2) record type, KTR_STRUCT, whose payload
consists of the null-terminated name and the contents of any structure
you wish to record. A new ktrstruct() function constructs and emits a
KTR_STRUCT record. It is accompanied by convenience macros for struct
stat and struct sockaddr.

In kdump(1), KTR_STRUCT records are handled by a dispatcher function
that runs stringent sanity checks on its contents before handing it
over to individual decoding functions for each type of structure.
Currently supported structures are struct stat and struct sockaddr for
the AF_INET, AF_INET6 and AF_UNIX families; support for AF_APPLETALK
and AF_IPX is present but disabled, as I am unable to test it properly.

Since 's' was already taken, the letter 't' is used by ktrace(1) to
enable KTR_STRUCT trace points, and in kdump(1) to enable their
decoding.

Derived from patches by Andrew Li <andrew2.li@citi.com>.

PR: kern/117836
MFC after: 3 weeks
926347d060a99a23b99bb2ac3ecf69bc4d1c9fc5 14-Feb-2008 simon <simon@FreeBSD.org> Fix sendfile(2) write-only file permission bypass.

Security: FreeBSD-SA-08:03.sendfile
145bc4340e30c161495d0b83803d1672dc63ab0a 14-Feb-2008 simon <simon@FreeBSD.org> Fix sendfile(2) write-only file permission bypass.

Security: FreeBSD-SA-08:03.sendfile
49aa39283b5da2ce1669bb252c6544ed9383fd5d 14-Feb-2008 simon <simon@FreeBSD.org> Fix sendfile(2) write-only file permission bypass.

Security: FreeBSD-SA-08:03.sendfile
Submitted by: kib
7e24637c24d89a152b59a841be37492eb89f6306 09-Feb-2008 marcus <marcus@FreeBSD.org> Add support for displaying a process' current working directory, root
directory, and jail directory within procstat. While this functionality
is available already in fstat, encapsulating it in the kern.proc.filedesc
sysctl makes it accessible without using kvm and thus without needing
elevated permissions.

The new procstat output looks like:

PID COMM FD T V FLAGS REF OFFSET PRO NAME
76792 tcsh cwd v d -------- - - - /usr/src
76792 tcsh root v d -------- - - - /
76792 tcsh 15 v c rw------ 16 9130 - -
76792 tcsh 16 v c rw------ 16 9130 - -
76792 tcsh 17 v c rw------ 16 9130 - -
76792 tcsh 18 v c rw------ 16 9130 - -
76792 tcsh 19 v c rw------ 16 9130 - -

I am also bumping __FreeBSD_version for this as this new feature will be
used in at least one port.

Reviewed by: rwatson
Approved by: rwatson
ff397597d9783b1619f900eb485292ded54f46a9 20-Jan-2008 rwatson <rwatson@FreeBSD.org> Export a type for POSIX SHM file descriptors via kern.proc.filedesc as
used by procstat, or SHM descriptors will show up as type unknown in
userspace.
71b7824213151e91b40ee4afa9fa7f100c90ed0b 13-Jan-2008 attilio <attilio@FreeBSD.org> VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with 'thread' argument passing which is always curthread.
Remove the unuseful extra-argument and pass explicitly curthread to lower
layer functions, when necessary.

KPI results broken by this change, which should affect several ports, so
version bumping and manpage update will be further committed.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
18d0a0dd51c7995ce9e549616f78ef724096b1bd 10-Jan-2008 attilio <attilio@FreeBSD.org> vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modification makes the code cleaner and in
particular removes an annoying dependency, helping the next lockmgr() cleanup.
KPI results, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and
vop_unlock.

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>
8cd9437636744162d1427275b2fe66cf8ccef25c 08-Jan-2008 jhb <jhb@FreeBSD.org> Add a new file descriptor type for IPC shared memory objects and use it to
implement shm_open(2) and shm_unlink(2) in the kernel:
- Each shared memory file descriptor is associated with a swap-backed vm
object which provides the backing store. Each descriptor starts off with
a size of zero, but the size can be altered via ftruncate(2). The shared
memory file descriptors also support fstat(2). read(2), write(2),
ioctl(2), select(2), poll(2), and kevent(2) are not supported on shared
memory file descriptors.
- shm_open(2) and shm_unlink(2) are now implemented as system calls that
manage shared memory file descriptors. The virtual namespace that maps
pathnames to shared memory file descriptors is implemented as a hash
table where the hash key is generated via the 32-bit Fowler/Noll/Vo hash
of the pathname.
- As an extension, the constant 'SHM_ANON' may be specified in place of the
path argument to shm_open(2). In this case, an unnamed shared memory
file descriptor will be created similar to the IPC_PRIVATE key for
shmget(2). Note that the shared memory object can still be shared among
processes by sharing the file descriptor via fork(2) or sendmsg(2), but
it is unnamed. This effectively serves to implement the getmemfd() idea
bandied about the lists several times over the years.
- The backing store for shared memory file descriptors is garbage
collected when they are not referenced by any open file descriptors or
the shm_open(2) virtual namespace.

Submitted by: dillon, peter (previous versions)
Submitted by: rwatson (I based this on his version)
Reviewed by: alc (suggested converting getmemfd() to shm_open())
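
Illustration (not part of the commit): hashing a pathname into a bucket index with a 32-bit Fowler/Noll/Vo hash. The commit does not say which FNV variant or constants the kernel uses, so this sketch assumes the published FNV-1a parameters (offset basis 2166136261, prime 16777619). The names fnv1a_32() and shm_bucket() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * 32-bit FNV-1a over a NUL-terminated string: XOR each byte into the
 * hash, then multiply by the FNV prime (arithmetic wraps modulo 2^32).
 */
uint32_t
fnv1a_32(const char *s)
{
	uint32_t hash = 2166136261u;	/* assumed offset basis */

	for (; *s != '\0'; s++) {
		hash ^= (unsigned char)*s;
		hash *= 16777619u;	/* assumed FNV prime */
	}
	return (hash);
}

/*
 * Hypothetical bucket selection for a table with nbuckets entries;
 * nbuckets must be nonzero.
 */
size_t
shm_bucket(const char *path, size_t nbuckets)
{
	return (fnv1a_32(path) % nbuckets);
}
```

Two different paths can land in the same bucket, so a table built this way still has to compare the full pathname within each bucket.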
f8a246b9791d1450cf4945cc7b38f651a3a456ee 07-Jan-2008 jhb <jhb@FreeBSD.org> Make ftruncate a 'struct file' operation rather than a vnode operation.
This makes it possible to support ftruncate() on non-vnode file types in
the future.
- 'struct fileops' grows a 'fo_truncate' method to handle an ftruncate() on
a given file descriptor.
- ftruncate() moves to kern/sys_generic.c and now just fetches a file
object and invokes fo_truncate().
- The vnode-specific portions of ftruncate() move to vn_truncate() in
vfs_vnops.c which implements fo_truncate() for vnode file types.
- Non-vnode file types return EINVAL in their fo_truncate() method.

Submitted by: rwatson
811f1d35ef471cadfb6d935bdf932ac76df39940 03-Jan-2008 jeff <jeff@FreeBSD.org> - In sysctl_kern_file skip fdps with negative lastfiles. This can
happen if there are no files open. Accounting for these can
eventually return a negative value for olenp causing sysctl to
crash with a bad malloc.

Reported by: Pawel Worach <pawel.worach@gmail.com>
ce1863880500c459eb1395c1d6f81819e02e6608 30-Dec-2007 jeff <jeff@FreeBSD.org> Remove explicit locking of struct file.
- Introduce a finit() which is used to initialize the fields of struct file
in such a way that the ops vector is only valid after the data, type,
and flags are valid.
- Protect f_flag and f_count with atomic operations.
- Remove the global list of all files and associated accounting.
- Rewrite the unp garbage collection such that it no longer requires
the global list of all files and instead uses a list of all unp sockets.
- Mark sockets in the accept queue so we don't incorrectly gc them.

Tested by: kris, pho
c25458da37ba171afe76f0f5e3ba48ce24ad7769 02-Dec-2007 rwatson <rwatson@FreeBSD.org> Add two new sysctls in support of the forthcoming procstat(1) to support
its -f and -v arguments:

kern.proc.filedesc - dump file descriptor information for a process, if
debugging is permitted, including socket addresses, open flags, file
offsets, file paths, etc.

kern.proc.vmmap - dump virtual memory mapping information for a process,
if debugging is permitted, including layout and information on
underlying objects, such as the type of object and path.

These provide a superset of the information historically available
through the now-deprecated procfs(4), and are intended to be exported
in an ABI-robust form.
23574c86734ab5cb088584d30345e698cbbeaef2 06-Aug-2007 rwatson <rwatson@FreeBSD.org> Remove the now-unused NET_{LOCK,UNLOCK,ASSERT}_GIANT() macros, which
previously conditionally acquired Giant based on debug.mpsafenet. As that
has now been removed, they are no longer required. Removing them
significantly simplifies error-handling in the socket layer, eliminating
quite a bit of unwinding of locking in error cases.

While here clean up the now unneeded opt_net.h, which previously was used
for the NET_WITH_GIANT kernel option. Clean up some related gotos for
consistency.

Reviewed by: bz, csjp
Tested by: kris
Approved by: re (kensmith)
47a37c32747a7665c553994b2b3cb401c6abacd2 08-Jul-2007 netchild <netchild@FreeBSD.org> MFC (3 of X):
- In preparation of further linuxulator fixes MFC kern_descrip.c rev 1.296 and
syscallsubr.h rev 1.41 by jhb:
Add a kern_close() so that the ABIs can close a file descriptor w/o having
to populate a close_args struct and change some of the places that do.

Tested by: scf (i386, as part of a mega-MFC-patch),
Arno J. Klaassen <arno@heho.snv.jussieu.fr> (amd64)
af5bbfbc7b6a610f0c44a56a7b2d81b96be9b1b5 03-Jul-2007 jeff <jeff@FreeBSD.org> - Use explicit locking in the various fcntl case statements so that we
can acquire shared filedescriptor locks in the appropriate cases.
- Remove Giant from calls that issue ioctls. The ioctl path has been
mpsafe for some time now.
- Only acquire giant for VOP_ADVLOCK when the filesystem requires giant.
advlock is now mpsafe.

Reviewed by: rwatson
Approved by: re
5956b5bc21c96b25c05bcdb8b76e1fd590072f14 16-Jun-2007 rwatson <rwatson@FreeBSD.org> Rather than passing SUSER_RUID into priv_check_cred() to specify when
a privilege is checked against the real uid rather than the effective
uid, instead decide which uid to use in priv_check_cred() based on the
privilege passed in. We use the real uid for PRIV_MAXFILES,
PRIV_MAXPROC, and PRIV_PROC_LIMIT. Remove the definition of
SUSER_RUID; there are now no flags defined for priv_check_cred().

Obtained from: TrustedBSD Project
f13486a2227b9165fce30aa40d12b728f327a909 31-May-2007 kib <kib@FreeBSD.org> Revert UF_OPENING workaround for CURRENT.
Change the VOP_OPEN(), vn_open() vnode operation and d_fdopen() cdev operation
argument from a file descriptor index to a pointer to struct file.

Proposed and reviewed by: jhb
Reviewed by: daichi (unionfs)
Approved by: re (kensmith)
70815715b0a763b8f10937d0eec2b60bb26e305b 29-May-2007 kib <kib@FreeBSD.org> MFC rev. 1.309 of sys/kern/kern_descrip.c,
rev. 1.438 of sys/kern/vfs_syscalls.c,
rev. 1.77 of sys/sys/filedesc.h:
Mark the filedescriptor table entries with VOP_OPEN being performed for them
as UF_OPENING. Disable closing of those entries. This should fix the crashes
caused by devfs_open() (and fifo_open()) dereferencing struct file * by
index, while the filedescriptor is closed by a parallel thread.

RELENG_6 testing by: Mark Kane <mark at mkproductions org>
63a2b7e3f065e213c09a692c380615064dd324f8 23-May-2007 jhb <jhb@FreeBSD.org> MFC 1.308: Use kern_open() to open /dev/null in fdcheckstd().
348f936c25b54c1b916ac2d92897b390c009e8b5 23-May-2007 jhb <jhb@FreeBSD.org> Revert previous commit, was part of a different change.

Reported by: kib
3464b119cd45b5c5fb66f1cd969bfbb15f9f77e5 23-May-2007 jhb <jhb@FreeBSD.org> MFC: Rework the support used by ABIs to override resource limits so that
a 64-bit process exec'd by a 32-bit process doesn't end up with 32-bit
limits.

This doesn't break the ABI as neither of the 32-bit ABIs (COMPAT_LINUX32
and COMPAT_IA32) are buildable as modules on 6.x/amd64 and none of the
other ABIs use this hook.
cef02547601ab1f28f733c9db913c8d052984fb4 04-May-2007 kib <kib@FreeBSD.org> Mark the filedescriptor table entries with VOP_OPEN being performed for them
as UF_OPENING. Disable closing of those entries. This should fix the crashes
caused by devfs_open() (and fifo_open()) dereferencing struct file * by
index, while the filedescriptor is closed by a parallel thread.

Idea by: tegge
Reviewed by: tegge (previous version of patch)
Tested by: Peter Holm
Approved by: re (kensmith)
MFC after: 3 weeks
8d6f49b7317a0f199de175e20f65022d269553dd 26-Apr-2007 jhb <jhb@FreeBSD.org> Avoid a lot of code duplication by using kern_open() to open /dev/null
in fdcheckstd() instead of a stripped down version of kern_open()'s code.

MFC after: 1 week
Reviewed by: cperciva
765a83fd795f79d2911ebf3b158ddc368ea0a0f6 04-Apr-2007 rwatson <rwatson@FreeBSD.org> Replace custom file descriptor array sleep lock constructed using a mutex
and flags with an sxlock. This leads to a significant and measurable
performance improvement as a result of access to shared locking for
frequent lookup operations, reduced general overhead, and reduced overhead
in the event of contention. All of these are important for threaded
applications where simultaneous access to a shared file descriptor array
occurs frequently. Kris has reported 2x-4x transaction rate improvements
on 8-core MySQL benchmarks; smaller improvements can be expected for many
workloads as a result of reduced overhead.

- Generally eliminate the distinction between "fast" and regular
acquisition of the filedesc lock; the plan is that they will now all
be fast. Change all locking instances to either shared or exclusive
locks.

- Correct a bug (pointed out by kib) in fdfree() where previously msleep()
was called without the mutex held; sx_sleep() is now always called with
the sxlock held exclusively.

- Universally hold the struct file lock over changes to struct file,
rather than the filedesc lock or no lock. Always update the f_ops
field last. A further memory barrier is required here in the future
(discussed with jhb).

- Improve locking and reference management in linux_at(), which fails to
properly acquire vnode references before using vnode pointers. Annotate
improper use of vn_fullpath(), which will be replaced at a future date.

In fcntl(), we conservatively acquire an exclusive lock, even though in
some cases a shared lock may be sufficient, which should be revisited.
The dropping of the filedesc lock in fdgrowtable() is no longer required
as the sxlock can be held over the sleep operation; we should consider
removing that (pointed out by attilio).

Tested by: kris
Discussed with: jhb, kris, attilio, jeff
cbfbccd2212ab1f4583b0b44881cb991088c238b 15-Mar-2007 jhb <jhb@FreeBSD.org> Just use 'fdrop()' instead of 'FILE_LOCK(); fdrop_locked()' in
dupfdopen(). While I'm at it, move the second fdrop() out from under the
filedesc lock.
69938bd19626ef8e176215880689a282503f8eca 05-Mar-2007 rwatson <rwatson@FreeBSD.org> Further system call comment cleanup:

- Remove also "MP SAFE" after prior "MPSAFE" pass. (suggested by bde)
- Remove extra blank lines in some cases.
- Add extra blank lines in some cases.
- Remove no-op comments consisting solely of the function name, the word
"syscall", or the system call name.
- Add punctuation.
- Re-wrap some comments.
300d4098cfd89ed7f8fca1a3333256308124f41b 04-Mar-2007 rwatson <rwatson@FreeBSD.org> Remove 'MPSAFE' annotations from the comments above most system calls: all
system calls now enter without Giant held, and then in some cases, acquire
Giant explicitly.

Remove a number of other MPSAFE annotations in the credential code and
tweak one or two other adjacent comments.
7cbf0c292cd1b619002096225fc4299e70aff267 15-Feb-2007 rwatson <rwatson@FreeBSD.org> Catch up file descriptor printing function in DDB to the addition of kqueues
and POSIX message queues.
8ae276c86f35fec6aafa3cbb38d716f46f8b6054 15-Feb-2007 rwatson <rwatson@FreeBSD.org> Break file descriptor printing logic out of db_show_files() into
db_print_file(), and add a new "show file <ptr>" DDB command, which can
be used to print out file descriptors referenced in stack traces.
2e20bff54b86c33ebb25166239b010f241410789 17-Jan-2007 delphij <delphij@FreeBSD.org> Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.
82609b5afe99447dbb5bccaca2572497a1bb4c88 12-Jan-2007 jhb <jhb@FreeBSD.org> MFC: Close a race between UNIX domain pcb garbage collection (unp_gc()) and
file descriptor teardown (fdrop()) by adding a new garbage collection flag
FWAIT.
256d3cdbaf7ae58daa236a7354b03d5b94182a94 05-Jan-2007 jhb <jhb@FreeBSD.org> - Close a race between enumerating UNIX domain socket pcb structures via
sysctl and socket teardown by adding a reference count to the UNIX domain
pcb object and fixing the sysctl that enumerates unpcbs to grab a
reference on each unpcb while it builds the list to copy out to userland.
- Close a race between UNIX domain pcb garbage collection (unp_gc()) and
file descriptor teardown (fdrop()) by adding a new garbage collection
flag FWAIT. unp_gc() sets FWAIT while it walks the message buffers
in a UNIX domain socket looking for nested file descriptor references
and clears the flag when it is finished. fdrop() checks to see if the
flag is set on a file descriptor whose refcount just dropped to 0 and
waits for unp_gc() to clear the flag before completely destroying the
file descriptor.

MFC after: 1 week
Reviewed by: rwatson
Submitted by: ups
Hopefully makes the panics go away: mx1
10d0d9cf473dc5f0ce1bf263ead445ffe7819154 06-Nov-2006 rwatson <rwatson@FreeBSD.org> Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
fc9b5516606a049d1b5a0758c0ef6467a2c47529 29-Sep-2006 jmg <jmg@FreeBSD.org> MFC: v1.298 kern_descrip.c
> return EBADF instead of successfully attaching (and then panicking) when
> an fd is dying.

Approved by: re (kensmith)
08d150cea64d52d18374d306967feb39d63c4e64 24-Sep-2006 jmg <jmg@FreeBSD.org> return EBADF instead of successfully attaching (and then panicking) when
an fd is dying.

Convinced by: jhb
PR: 103127
c69598a52c6961f93b5908bd0c3f7f3743e11a4c 04-Sep-2006 pjd <pjd@FreeBSD.org> MFC: sys/kern/kern_descrip.c 1.295

Compress direct cr_ruid comparison and jailed() call to suser_cred(9).

Reviewed by: rwatson
5ee0fb3fe3bc06c3cb462245f622e03698aaadf6 02-Sep-2006 rwatson <rwatson@FreeBSD.org> Merge kern_descrip.c:1.291, kern_exit.c:1.280, kern_fork.c:1.255,
kern_prot.c:1.202 from HEAD to RELENG_6:

Add auditing of arguments to the close() and fstat() system calls.

Audit the pid being requested in wait4().

Audit the args to rfork(), and the child PID for all fork system calls.

Audit the arguments (user/group IDs) for the system calls that set these
IDs.

Obtained from: TrustedBSD Project
480dbd17c4a6d6e059cb591497f3fd19bcec978d 21-Jul-2006 jhb <jhb@FreeBSD.org> Add a comment to explain what fdclose() does and what its purpose is
since the subtlety eluded me when I looked at it last week.
e09e5b52dbb8914136f6708a8042007a16277dde 08-Jul-2006 jhb <jhb@FreeBSD.org> Add a kern_close() so that the ABIs can close a file descriptor w/o having
to populate a close_args struct and change some of the places that do.
97382ba992b2246952e73a36460bcdfbb32f1c02 27-Jun-2006 pjd <pjd@FreeBSD.org> Compress direct cr_ruid comparison and jailed() call to suser_cred(9).

Reviewed by: rwatson
53d8847cf38e5f7265c8e830a4dff1c7f888c359 01-Apr-2006 rwatson <rwatson@FreeBSD.org> Mark fgetsock() and fputsock() as deprecated: callers should rely on
the file descriptor reference, rather than paying additional lock
operations to acquire a socket reference from the file descriptor.
This will also help to ensure that file descriptor based socket
requests are not delivered to a socket after close. Most consumers
have already been converted to this model.

MFC after: 3 months
f3b5ccdb543e0f595fd7d7b2a9432591ed621d81 23-Mar-2006 csjp <csjp@FreeBSD.org> MFC descriptor fixes in hopes of killing the "dup(2) regression on 6.x" show
stopper item on the 6.1-RELEASE TODO list.

Approved by: re (scottl)
7448676f59fe1294f9b5ec654564614a5c1edbe1 20-Mar-2006 csjp <csjp@FreeBSD.org> Restore fd optimization with a few minor tweaks, to quote tegge:

"fdinit() fails to initialize newfdp->fd_fd.fd_lastfile to -1. This breaks
fdcopy() which will incorrectly set newfdp->fd_freefile to 1 if no files are
open and the last file descriptor marked as unused for fdp was 0. This later
causes descriptor 0 to be unavailable in newfdp when the optimization is
enabled.

When the last file descriptor previously marked as used is nonzero and marked
as unused, fdunused() incorrectly sets fdp->fd_lastfile to fd - 1 due to
fd_last_used() returning (size - 1). This hides the problem that breaks the
optimization."

This allows us to keep the optimization, while un-breaking it.

This is a RELENG_6 candidate.

PR: kern/87208
MFC after: 1 week
Submitted by: tegge
6b22256534eec8a346cc20c6006473e1d4e85d80 18-Mar-2006 csjp <csjp@FreeBSD.org> Back out fd optimization introduced in revision 1.280 as it appears to be
really breaking things. Simple "close(0); dup(fd)" does not return descriptor
"0" in some cases. Further, this change also breaks some MAC interactions with
mac_execve_will_transition(). Under certain circumstances, fdcheckstd() can
be called in execve(2) causing an assertion that checks to make sure that
stdin, stdout and stderr reside at indexes 0, 1 and 2 in the process fd table
to fail, resulting in a kernel panic when INVARIANTS is on.

This should also kill the "dup(2) regression on 6.x" show stopper item on the
6.1-RELEASE TODO list.

This is a RELENG_6 candidate.

PR: kern/87208
Silence from: des
MFC after: 1 week
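
The "close(0); dup(fd)" behaviour this back-out restores follows from the
POSIX rule that dup() returns the lowest-numbered free descriptor. A
minimal userland check of that rule, assuming a single-threaded caller
and leaving descriptor 0 alone (the function name is hypothetical):

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if dup() reuses the lowest free descriptor slot, 0 if it
 * does not, and -1 if the test descriptors could not be opened.
 */
static int
dup_reuses_lowest_slot(void)
{
	int a, b, c, ok;

	a = open("/dev/null", O_RDONLY);	/* lowest free slot right now */
	if (a < 0)
		return (-1);
	b = open("/dev/null", O_RDONLY);	/* next free slot above a */
	if (b < 0) {
		close(a);
		return (-1);
	}
	assert(b > a);

	close(a);		/* slot a is the lowest free slot again */
	c = dup(b);		/* must land in slot a */
	ok = (c == a);

	if (c >= 0)
		close(c);
	close(b);
	return (ok);
}
```

With the broken optimization described above, the equivalent sequence on
descriptor 0 could return a higher-numbered descriptor instead.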
645740538d7ab2cfbd5a29d9edaf08a1abd9371b 05-Feb-2006 wsalamon <wsalamon@FreeBSD.org> Add auditing of arguments to the close() and fstat() system calls. Much more
argument auditing yet to come, for remaining system calls in this file.

Obtained from: TrustedBSD Project
Approved by: rwatson (mentor)
e851a1b52aa81242bee0ce1da1d3d2172f0e670d 06-Jan-2006 jhb <jhb@FreeBSD.org> Return EBADF rather than EINVAL for FWRITE failure as per POSIX.

MFC after: 1 week
5d50adf57dd3eb3f8996480fa4dd26e4756e268e 30-Nov-2005 davidxu <davidxu@FreeBSD.org> Last step to make mq_notify conform to POSIX standard, If the process
has successfully attached a notification request to the message queue
via a queue descriptor, file closing should remove the attachment.
c3f82714cb2f5d600bde9aff9bf67f417781f3ed 17-Nov-2005 rwatson <rwatson@FreeBSD.org> Merge kern_descrip.c:1.286, 1.287, 1.288 from HEAD to RELENG_6:

Add a DDB "show files" command to list the current open file list, some
state about each open file, and identify the first process in the process
table that references the file. This is helpful in debugging leaks of
file descriptors.

Expand the set of details printed for each file descriptor to include
its garbage collection flags. Reformat generally to make this fit and
leave some room for future expansion.

Add the f_msgcount field to the set of struct file fields printed in show
files.
6b9e4e50dc5ee788f09b8510e405eb6546ed8e42 16-Nov-2005 rwatson <rwatson@FreeBSD.org> Merge kern_descrip.c:1.284 from HEAD to RELENG_6:

In closef(), remove the assumption that there is a thread associated
with the file descriptor. When a file descriptor is closed as a result
of garbage collecting a UNIX domain socket, the file descriptor will
not have any associated thread, so the logic to identify advisory locks
held by that thread is not appropriate. Check the thread for NULL to
avoid this scenario. Expand an existing comment to say a bit more about
this.
3153d02ada4911492510b3ea6df3677b7125047e 10-Nov-2005 rwatson <rwatson@FreeBSD.org> Add the f_msgcount field to the set of struct file fields printed in show
files.

MFC after: 1 week
dcccc2e254b65c0d294c8ab60f302ee656ad0e55 10-Nov-2005 rwatson <rwatson@FreeBSD.org> Expand the set of details printed for each file descriptor to include its
garbage collection flags. Reformat generally to make this fit and
leave some room for future expansion.

MFC after: 1 week
20a1214886ea883bdcce327b9722855ed3637aab 10-Nov-2005 rwatson <rwatson@FreeBSD.org> Add a DDB "show files" command to list the current open file list, some
state about each open file, and identify the first process in the process
table that references the file. This is helpful in debugging leaks of
file descriptors.

MFC after: 1 week
fc360a564f423f6153219166750b4335eee05f51 09-Nov-2005 rwatson <rwatson@FreeBSD.org> Fix typo in recent comment tweak.

Submitted by: jkim
MFC after: 1 week
6b8f490b777875eac382dcd10325ccfef5a77fe8 09-Nov-2005 rwatson <rwatson@FreeBSD.org> In closef(), remove the assumption that there is a thread associated
with the file descriptor. When a file descriptor is closed as a result
of garbage collecting a UNIX domain socket, the file descriptor will
not have any associated thread, so the logic to identify advisory locks
held by that thread is not appropriate. Check the thread for NULL to
avoid this scenario. Expand an existing comment to say a bit more about
this.

MFC after: 1 week
f9c4f174146de11d284a7a5d766359e28dac821b 08-Nov-2005 jhb <jhb@FreeBSD.org> MFC: Push down Giant into fdfree() and remove it from two of the callers.
2eddf38ca6603759a7a97d99379fa80288915826 01-Nov-2005 jhb <jhb@FreeBSD.org> Push down Giant into fdfree() and remove it from two of the callers.
Other callers such as some rfork() cases weren't locking Giant anyway.

Reviewed by: csjp
MFC after: 1 week
be4f357149ecc68e1bf349f69f702cad430aec97 31-Oct-2005 rwatson <rwatson@FreeBSD.org> Normalize a significant number of kernel malloc type names:

- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.
47f527c6ee55428a48656a7994d1b42667086f88 09-Oct-2005 rik <rik@FreeBSD.org> MFC:
----------------------------
revision 1.281
date: 2005/10/04 16:27:54; author: rik; state: Exp; lines: +3 -1
Use FILEDESC_UNLOCK(fdp) after FILE_UNLOCK(p), not before to avoid LOR.
Slightly discussed on current@.

LOR #055

Approved by: re(scottl)
Requested by: delphij
231bdbf078237839215dcb3c0c21031245e39687 06-Oct-2005 delphij <delphij@FreeBSD.org> MFC 1.280 (by des):

| Two minor optimizations of fdalloc():
|
| - if minfd < fd_freefile (as is most often the case, since minfd is
| usually 0), set it to fd_freefile.
|
| - remove a call to fd_first_free() which duplicates work already done
| by fdused().
|
| This change results in a small but measurable speedup for processes
| with large numbers (several thousands) of open files.
|
| PR: kern/85176
| Submitted by: Divacky Roman <xdivac02@stud.fit.vutbr.cz>
|
| Revision Changes Path
| 1.280 +3 -1 src/sys/kern/kern_descrip.c

Approved by: re
947a0d1c6633d30c4078023f04e9187aa24eef29 04-Oct-2005 rik <rik@FreeBSD.org> Use FILEDESC_UNLOCK(fdp) after FILE_UNLOCK(p), not before to avoid LOR.
Slightly discussed on current@.

LOR #055

MFC after: 14 days
6222a26a3b2248379acb5ba8ed9923029ccd8a9a 26-Aug-2005 des <des@FreeBSD.org> Two minor optimizations of fdalloc():

- if minfd < fd_freefile (as is most often the case, since minfd is
usually 0), set it to fd_freefile.

- remove a call to fd_first_free() which duplicates work already done
by fdused().

This change results in a small but measurable speedup for processes
with large numbers (several thousands) of open files.

PR: kern/85176
Submitted by: Divacky Roman <xdivac02@stud.fit.vutbr.cz>
MFC after: 3 weeks
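
The first optimization above can be sketched with a toy descriptor
table. This is an assumption-laden illustration: the real fdalloc() works
on a bitmap inside struct filedesc, whereas the demo_* names and the
fixed-size boolean array here are hypothetical. The hint keeps the
invariant that no free slot exists below freefile.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NFILES 64

struct demo_fdtable {
	bool used[DEMO_NFILES];
	int freefile;	/* hint: every slot below this index is in use */
};

/* Allocate the lowest free descriptor >= minfd, or -1 if none is left. */
static int
demo_fdalloc(struct demo_fdtable *t, int minfd)
{
	int fd;

	assert(minfd >= 0);
	/* Everything below the hint is in use, so start the scan there. */
	if (minfd < t->freefile)
		minfd = t->freefile;
	for (fd = minfd; fd < DEMO_NFILES; fd++) {
		if (!t->used[fd]) {
			t->used[fd] = true;
			/* Taking the hinted slot moves the hint up by one. */
			if (fd == t->freefile)
				t->freefile = fd + 1;
			return (fd);
		}
	}
	return (-1);
}

/* Release a descriptor and lower the hint if it now points too high. */
static void
demo_fdfree(struct demo_fdtable *t, int fd)
{
	assert(fd >= 0 && fd < DEMO_NFILES && t->used[fd]);
	t->used[fd] = false;
	if (fd < t->freefile)
		t->freefile = fd;
}
```

Starting from a zeroed table, three allocations with minfd 0 yield 0, 1
and 2; freeing 0 and allocating again yields 0, so the hint must never
hide a lower free slot.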
1d30c90da0046704aaa5d352be8fdfaeaaa182f6 25-Jun-2005 dd <dd@FreeBSD.org> Fix fdcheckstd to pass the file descriptor along through vn_open. When
opening a device, devfs_open needs the file descriptor to install its
own fileops. Failing to pass the file descriptor causes the vnode to
be returned with the regular vnops, which will cause a panic on the
first read or write because devfs_specops is not meant to support
those operations.

This bug caused a panic after exec'ing any set[ug]id program with
fds 0..2 closed (i.e., if any action had to be taken by fdcheckstd, we
would panic if the exec'd program ever tried to use any of those
descriptors).

Reviewed by: phk
Approved by: re (scottl)
451e14446f914bff6d118da3d71fa31501ba406c 03-May-2005 jeff <jeff@FreeBSD.org> - Use NAMEI to pick up Giant if we need it in fdcheckstd().
29c09959986c1ae26b219b240e9e00ab79e89524 08-Mar-2005 keramida <keramida@FreeBSD.org> Remove redundant initialization that is repeated in the for() loop
right below it.

Approved by: jhb
eb0e1a66a0bacca0bf86be5c6666da35fd8d28dc 08-Mar-2005 keramida <keramida@FreeBSD.org> Typo & grammar fixes in comments.
dc9f809dd574faccecacee96f8fafcefeb7151aa 10-Feb-2005 phk <phk@FreeBSD.org> Make some file/filedesc related functions static
71c05d27c0fe7676964592e1791abed816be1a00 07-Feb-2005 jhb <jhb@FreeBSD.org> - Tweak kern_msgctl() to return a copy of the requested message queue id
structure in the struct pointed to by the 3rd argument for IPC_STAT and
get rid of the 4th argument. The old way returned a pointer into the
kernel array that the calling function would then access afterwards
without holding the appropriate locks and doing non-lock-safe things like
copyout() with the data anyways. This change removes that unsafeness and
resulting race conditions as well as simplifying the interface.
- Implement kern_foo wrappers for stat(), lstat(), fstat(), statfs(),
fstatfs(), and fhstatfs(). Use these wrappers to cut out a lot of
code duplication for freebsd4 and netbsd compatibility system calls.
- Add a new lookup function kern_alternate_path() that looks up a filename
under an alternate prefix and determines which filename should be used.
This is basically a more general version of linux_emul_convpath() that
can be shared by all the ABIs thus allowing for further reduction of
code duplication.
796d435574629a3a293e13d786e313d9d473a134 25-Jan-2005 phk <phk@FreeBSD.org> Don't use VOP_GETVOBJECT, use vp->v_object directly.
8be2f1a91ea892b47b1426e8ec1dc62b26b083cb 24-Jan-2005 jeff <jeff@FreeBSD.org> - Use VFS_LOCK_GIANT() in place of mtx_lock(&giant), etc.

Sponsored By: Isilon Systems, Inc.
20280f143170ee08a1e2cbd8871550105b276674 06-Jan-2005 imp <imp@FreeBSD.org> /* -> /*- for copyright notices, minor format tweaks as necessary
eab55e158905b42bd76bba4164b30ab5db8d0d86 14-Dec-2004 phk <phk@FreeBSD.org> Fix a deadlock I introduced this morning.

Mostly from: tegge
8a94ce5ea17343a8b3d2cee98003de0983f20f6c 14-Dec-2004 phk <phk@FreeBSD.org> Add a new kind of reference count (fd_holdcnt) to struct filedesc
which holds on to just the data structure and the mutex. (The
existing refcount (fd_refcnt) holds onto the open files in the
descriptor.)

The fd_holdcnt is protected by fdesc_mtx, fd_refcnt by FILEDESC_LOCK.

Add fdhold(struct proc *) which gets a hold on the filedescriptors of
the specified proc.

Add fddrop(struct filedesc *) which drops the fd_holdcnt and if zero
destroys the mutex and frees the memory.

Initialize the fd_holdcnt to one in fdinit(). Normal operations on
the filedesc structure will not change it.

In fdfree() use fddrop() to dispose of the mutex and structure. Hold
the FILEDESC_LOCK() until we have cleaned out the contents and carefully
set the fields to null values during cleanup.

Use fdhold()/fddrop() in mountcheckdirs() and sysctl_kern_file().
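
The split between the two counts can be sketched as follows. This is a
simplified, single-threaded model: the demo_* names are hypothetical, the
locking (fdesc_mtx and FILEDESC_LOCK) is omitted, and demo_fdhold() takes
the structure directly rather than a struct proc.

```c
#include <assert.h>
#include <stdlib.h>

struct demo_filedesc {
	int refcnt;		/* holds on to the open files */
	int holdcnt;		/* holds on to the structure itself */
	int nfiles_open;	/* stand-in for the descriptor table */
};

/* Both counts start at one; returns NULL if allocation fails. */
static struct demo_filedesc *
demo_fdinit(int nfiles)
{
	struct demo_filedesc *fdp;

	fdp = calloc(1, sizeof(*fdp));
	if (fdp == NULL)
		return (NULL);
	fdp->refcnt = 1;
	fdp->holdcnt = 1;
	fdp->nfiles_open = nfiles;
	return (fdp);
}

/* Keep the structure alive without keeping its files open. */
static void
demo_fdhold(struct demo_filedesc *fdp)
{
	fdp->holdcnt++;
}

/* Drop a hold; returns 1 if this freed the structure. */
static int
demo_fddrop(struct demo_filedesc *fdp)
{
	assert(fdp->holdcnt > 0);
	if (--fdp->holdcnt > 0)
		return (0);
	free(fdp);
	return (1);
}

/*
 * Drop a file reference. The last reference empties the table and
 * releases the initial hold. Returns 1 if the structure was freed.
 */
static int
demo_fdfree(struct demo_filedesc *fdp)
{
	assert(fdp->refcnt > 0);
	if (--fdp->refcnt > 0)
		return (0);
	fdp->nfiles_open = 0;
	return (demo_fddrop(fdp));
}
```

A walker in the style of sysctl_kern_file() would call demo_fdhold()
before inspecting the structure and demo_fddrop() afterwards, so the
memory stays valid even if the owning process exits in between.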
8986ab0ecfc691abb411612122098f7ac558fe2b 14-Dec-2004 phk <phk@FreeBSD.org> Make fdesc_mtx private to kern_descrip.c now that the flock has come home.
b124f6c510c48e1ce8a71ea17a354f49c71fc9a0 14-Dec-2004 phk <phk@FreeBSD.org> Move the checkdirs() function from vfs_mount.c to kern_descrip.c and
call it mountcheckdirs().
e622b9bcba58c68642dfd3b9c7654dce32990367 14-Dec-2004 phk <phk@FreeBSD.org> Add new function fdunshare() which encapsulates the necessary light magic
for ensuring that a process' filedesc is not shared with anybody.

Use it in the two places which previously had private implementations.

This collects all fd_refcnt handling in kern_descrip.c
8bc17baa39a9e45682509f80aa1438f3b6682809 03-Dec-2004 phk <phk@FreeBSD.org> Sort and wash #includes.
62476e023e538a6a64b6dd8280e221c107590046 02-Dec-2004 phk <phk@FreeBSD.org> Drop ffree() as a separate function and incorporate the only place used.
660e2d8605afb1066f3e1adf766ca5e89081d3e6 02-Dec-2004 phk <phk@FreeBSD.org> Style polishing.

Use grepable functions
Other minor nitpickings.
a50f0bcbfd79a523bed7c5e00964a797edad9ff7 01-Dec-2004 phk <phk@FreeBSD.org> We already have a lock initialization function, use that for fdesc_mtx
also.

Polish badfo stuff.
ea3f471ee5f1f698d5a2e1fe0076a5e5354b947c 01-Dec-2004 phk <phk@FreeBSD.org> Collect the stuff for the /dev/fd/{%d,std{in,out,err}} pseudo-device
driver at the bottom of the file.
9b4cd725f10562da76638ddca6a4eebdb665764c 01-Dec-2004 phk <phk@FreeBSD.org> "nfiles" is a bad name for a global variable. Call it "openfiles" instead
as this is more correct and matches the sysctl variable.
31e045eaae3af67634f7d76dd3927307359e3cd6 01-Dec-2004 phk <phk@FreeBSD.org> Style: move data to top of file.
b9b9205b8f5f3d577106a937dee83095d52d5b66 28-Nov-2004 rwatson <rwatson@FreeBSD.org> Don't acquire Giant before calling closef() in close() (and elsewhere);
instead acquire it conditionally in closef() if it is required for
advisory locking. This removes Giant from the close() path of sockets
and pipes (and any other objects that don't acquire Giant in their
fo_close path, such as kqueues). Giant will still be acquired twice for
vnodes -- once for advisory lock teardown, and a second time in the
fo_close method. Both Poul-Henning and I believe that the advisory lock
teardown code can be moved into the vn_closefile path shortly.

This trims a percent or two off the cost of most non-vnode close
operations on SMP, but has a fairly minimal impact on UP where the cost
of a single mutex operation is pretty low.
f5fa0d9a9a2e502d3a8241dc25c21273999dd88b 26-Nov-2004 phk <phk@FreeBSD.org> Fix LOR.

Solution pointed out by: jhb
e764a6edec01abc86207ace169f3f4eb726a2b2c 21-Nov-2004 das <das@FreeBSD.org> Neither of the arguments to closef() can be NULL anymore, so don't
check for that.
205866145eff75c19e34cdcdeafc112d85d87f16 16-Nov-2004 phk <phk@FreeBSD.org> Move a FILEDESC_UNLOCK up to maintain correct nesting of FILEDESC/FILE
locking.
b9a3a171cc981341d15cfec685507aabefe0f7a9 15-Nov-2004 phk <phk@FreeBSD.org> Make FILE_LOCK and FILEDESC_LOCK nest properly by postponing the
release of FILEDESC_LOCK a few more lines.
c61620e47ea2c3b67a511394b021fb6d9ba9e3c7 14-Nov-2004 phk <phk@FreeBSD.org> Move #define up.
216166ee0de39b10ba8e60f4115d65e1251ff29f 13-Nov-2004 phk <phk@FreeBSD.org> Introduce an alias for FILEDESC_{UN}LOCK() with the suffix _FAST.

Use this in all the places where sleeping with the lock held is not
an issue.

The distinction will become significant once we finalize the exact
lock-type to use for this kind of case.
d24107be6b63ca9ccbc6bca190ef874651886c49 08-Nov-2004 phk <phk@FreeBSD.org> Use more intuitive pointer for fdinit() and fdcopy().

Change fdcopy() to take unlocked filedesc.
52da2f8e345a7de465e7604b8712493ea0fedebd 07-Nov-2004 phk <phk@FreeBSD.org> Introduce fdclose() which will clean an entry in a filedesc.

Replace homerolled versions with call to fdclose().

Make fdunused() static to kern_descrip.c
02a22fcc9e339cebd74af8233fc2949356c09fda 07-Nov-2004 phk <phk@FreeBSD.org> Move fdinit() related stuff from .h to .c
418eb9d34be916bd06097f00f2a07385d45875ca 07-Nov-2004 phk <phk@FreeBSD.org> Allow fdinit() to be called with a NULL fdp argument so we can use
it when setting up init.

Make fdinit() lock the fdp argument as needed.
296b8c7706561fbc5d08c6b5fd2333a5acf5054a 06-Nov-2004 phk <phk@FreeBSD.org> When we open /dev/null for stdin/out/err for safety reasons, do it right:
we should preserve f_data and f_ops if they are already set.
4b81ce6dd2658abba782e835143f8008092c1c6f 18-Oct-2004 rwatson <rwatson@FreeBSD.org> Push acquisition of the accept mutex out of sofree() into the caller
(sorele()/sotryfree()):

- This permits the caller to acquire the accept mutex before the socket
mutex, avoiding sofree() having to drop the socket mutex and re-order,
which could lead to races permitting more than one thread to enter
sofree() after a socket is ready to be free'd.

- This also covers clearing of the so_pcb weak socket reference from
the protocol to the socket, preventing races in clearing and
evaluation of the reference such that sofree() might be called more
than once on the same socket.

This appears to close a race I was able to easily trigger by repeatedly
opening and resetting TCP connections to a host, in which the
tcp_close() code called as a result of the RST raced with the close()
of the accepted socket in the user process resulting in simultaneous
attempts to de-allocate the same socket. The new locking increases
the overhead for operations that may potentially free the socket, so we
will want to revise the synchronization strategy here as we normalize
the reference counting model for sockets. The use of the accept mutex
in freeing of sockets that are not listen sockets is primarily
motivated by the potential need to remove the socket from the
incomplete connection queue on its parent (listen) socket, so cleaning
up the reference model here may allow us to substantially weaken the
synchronization requirements.

RELENG_5_3 candidate.

MFC after: 3 days
Reviewed by: dwhite
Discussed with: gnn, dwhite, green
Reported by: Marc UBM Bocklet <ubm at u-boot-man dot de>
Reported by: Vlad <marchenko at gmail dot com>
395c906e95c3dd63b5994b9fd2a3800cf70771c0 04-Oct-2004 julian <julian@FreeBSD.org> Another case where we need to guard against a partially
constructed process.

Submitted by: Stephan Uphoff ( ups at tree.com )
MFC after: 3 days
477ea1ed67ee4ee12f0fa7a922f4501e4e503fad 19-Aug-2004 rwatson <rwatson@FreeBSD.org> Remove GIANT_REQUIRED from setugidsafety() as knote_fdclose() no longer
requires Giant.
1de6d6df05dc75c48597f345ef9a8372bc906d67 16-Aug-2004 green <green@FreeBSD.org> Add the missing knote_fdclose().
bc1805c6e871c178d0b6516c3baa774ffd77224a 15-Aug-2004 jmg <jmg@FreeBSD.org> Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowledge of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using its filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
acquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)
656f43381358b8935c39a52b4ab704da7dc8ec66 07-Aug-2004 rwatson <rwatson@FreeBSD.org> We're not yet ready to assert !Giant in kern_fcntl(), as it's called
with Giant from ABI wrappers such as Linux emulation.

Foot shoot off: phk
8de3afda37da2a4436323e1948674d9362292258 06-Aug-2004 rwatson <rwatson@FreeBSD.org> Avoid acquiring Giant for some common light-weight or already MPSAFE
fcntl() operations, including:

F_DUPFD dup() alias
F_GETFD retrieve close-on-exec flag
F_SETFD set close-on-exec flag
F_GETFL retrieve file descriptor flags

For the remaining fcntl() operations, do acquire Giant, especially
where we call into fo_ioctl() as a result. We're not yet ready to
push Giant into fo_ioctl(). Once we do, this can all become quite a
bit prettier.
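
For reference, the lightweight operations listed above look like this
from userland. The helper name set_cloexec() is hypothetical; the fcntl()
commands and FD_CLOEXEC are the standard POSIX ones.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Set or clear close-on-exec on fd; returns 0 on success, -1 on error. */
static int
set_cloexec(int fd, int on)
{
	int flags;

	assert(fd >= 0);
	flags = fcntl(fd, F_GETFD);		/* read descriptor flags */
	if (flags == -1)
		return (-1);
	if (on)
		flags |= FD_CLOEXEC;
	else
		flags &= ~FD_CLOEXEC;
	/* Write the updated descriptor flags back. */
	return (fcntl(fd, F_SETFD, flags) == -1 ? -1 : 0);
}
```

F_GETFD and F_SETFD only touch per-descriptor state, and F_GETFL only
reads the open file's status flags, which is consistent with the commit
treating them as cheap operations that never call into fo_ioctl().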
243f24944e31bf8264984db84728ef9b479c4a91 04-Aug-2004 rwatson <rwatson@FreeBSD.org> Assert Giant in the following file descriptor-related functions:

Function          Reason
--------          ------
fdfree()          VFS
setugidsafety()   KQueue
fdcheckstd()      VFS
_fgetvp()         VFS
fgetsock()        Conditional assertion based on debug.mpsafenet
92f30976fe9d08c23ba40cb7e6d74d64d73af87f 22-Jul-2004 rwatson <rwatson@FreeBSD.org> Push Giant acquisition down into fo_stat() from most callers. Acquire
Giant conditional on debug.mpsafenet in the socket soo_stat() routine,
unconditionally in vn_statfile() for VFS, and otherwise don't acquire
Giant. Accept an unlocked read in kqueue_stat(), and cryptof_stat() is
a no-op. Don't acquire Giant in fstat() system call.

Note: in fdescfs, fo_stat() is called while holding Giant due to the VFS
stack sitting on top, and therefore there will still be Giant recursion
in this case.
861b3c44169cf75066afaaa69d2ec46483eac929 22-Jul-2004 rwatson <rwatson@FreeBSD.org> Push acquisition of Giant from fdrop_closed() into fo_close() so that
individual file object implementations can optionally acquire Giant if
they require it:

- soo_close(): depends on debug.mpsafenet
- pipe_close(): Giant not acquired
- kqueue_close(): Giant required
- vn_close(): Giant required
- cryptof_close(): Giant required (conservative)

Notes:

Giant is still acquired in close() even when closing MPSAFE objects
due to kqueue requiring Giant in the calling closef() code.
Microbenchmarks indicate that this removal of Giant cuts 3%-3% off
of pipe create/destroy pairs from user space with SMP compiled into
the kernel.

The cryptodev and opencrypto code appears MPSAFE, but I'm unable to
test it extensively and so have left Giant over fo_close(). It can
probably be removed given some testing and review.
7b09b25ecb2d9e5be9c06e81ce89a7c5d347bb6f 14-Jul-2004 csjp <csjp@FreeBSD.org> In addition to the real user ID check, do an explicit jail
check to ensure that the caller is not prison root.

The intention is to fix file descriptor creation so that
prison root can not use the last remaining file descriptors.
This privilege should be reserved for non-jailed root users.

Approved by: bmilekic (mentor)
1ce305fbfd151957d1877c533da0a0578ab04938 19-Jun-2004 phk <phk@FreeBSD.org> Explicitly initialize f_data and f_vnode to NULL.

Report f_vnode to userland in struct xfile.
dfd1f7fd50fffaf75541921fcf86454cd8eb3614 16-Jun-2004 phk <phk@FreeBSD.org> Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.
82295697cd4bae93852c3a10a939f20227018fbd 12-Jun-2004 rwatson <rwatson@FreeBSD.org> Extend coverage of SOCK_LOCK(so) to include so_count, the socket
reference count:

- Assert SOCK_LOCK(so) macros that directly manipulate so_count:
soref(), sorele().

- Assert SOCK_LOCK(so) in macros/functions that rely on the state of
so_count: sofree(), sotryfree().

- Acquire SOCK_LOCK(so) before calling these functions or macros in
various contexts in the stack, both at the socket and protocol
layers.

- In some cases, perform soisdisconnected() before sotryfree(), as
this could result in frobbing of a non-present socket if
sotryfree() actually frees the socket.

- Note that sofree()/sotryfree() will release the socket lock even if
they don't free the socket.

Submitted by: sam
Sponsored by: FreeBSD Foundation
Obtained from: BSD/OS
86602fc06c6eef73f48ce541f6b8d2b6af993629 11-Jun-2004 phk <phk@FreeBSD.org> Deorbit COMPAT_SUNOS.

We inherited this from the sparc32 port of BSD4.4-Lite1. We have neither
a sparc32 port nor a SunOS4.x compatibility desire these days.
1e76056c09514635a9a2a057476724f42303910a 01-Jun-2004 rwatson <rwatson@FreeBSD.org> Push the VOP_ADVLOCK() call to release advisory locks on vnode file
descriptors out of fdrop_locked() and into vn_closefile(). This
removes all knowledge of vnodes from fdrop_locked(), since the lock
behavior was specific to vnodes. This also removes the specific
requirement for Giant in fdrop_locked(), it's now only required by
code that it calls into.

Add GIANT_REQUIRED to vn_closefile() since VFS requires Giant.
74cf37bd00b1e09a0b991b7b1edd335d8e0c2355 05-Apr-2004 imp <imp@FreeBSD.org> Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
0feec337577c7f4d84f9da68b5294750de293ab7 29-Mar-2004 rwatson <rwatson@FreeBSD.org> Conditionally assert Giant in fputsock() based on the value of
debug.mpsafenet.
1de257deb3229812024de5861eb0aaa41e471448 26-Feb-2004 truckman <truckman@FreeBSD.org> Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms
ad925439e08646e188eb1c0e0be355f0685c8739 21-Feb-2004 phk <phk@FreeBSD.org> Device megapatch 4/6:

Introduce d_version field in struct cdevsw, this must always be
initialized to D_VERSION.

Flip sense of D_NOGIANT flag to D_NEEDGIANT, this involves removing
four D_NOGIANT flags and adding 145 D_NEEDGIANT flags.
177497ee8500027f0130f7ab6bbe20f34bae7da4 16-Feb-2004 des <des@FreeBSD.org> Don't bother storing a result when all you need are the side effects.
824c230543eae3766e5a429669e3b8fe19fa8e73 15-Feb-2004 dwmalone <dwmalone@FreeBSD.org> In fdcheckstd the descriptor table should never be shared, so just
KASSERT this rather than trying to deal with what happens when file
descriptors change out from under us.
279b2b827810d149b5b8453900cdea57874ae234 04-Feb-2004 jhb <jhb@FreeBSD.org> Locking for the per-process resource limits structure.
- struct plimit includes a mutex to protect a reference count. The plimit
structure is treated similarly to struct ucred in that is is always copy
on write, so having a reference to a structure is sufficient to read from
it without needing a further lock.
- The proc lock protects the p_limit pointer and must be held while reading
limits from a process to keep the limit structure from changing out from
under you while reading from it.
- Various global limits that are ints are not protected by a lock since
int writes are atomic on all the archs we support and thus a lock
wouldn't buy us anything.
- All accesses to individual resource limits from a process are abstracted
behind a simple lim_rlimit(), lim_max(), and lim_cur() API that return
either an rlimit, or the current or max individual limit of the specified
resource from a process.
- dosetrlimit() was renamed to kern_setrlimit() to match existing style of
other similar syscall helper functions.
- The alpha OSF/1 compat layer no longer calls getrlimit() and setrlimit()
(it didn't use the stackgap when it should have) but uses lim_rlimit()
and kern_setrlimit() instead.
- The svr4 compat no longer uses the stackgap for resource limits calls,
but uses lim_rlimit() and kern_setrlimit() instead.
- The ibcs2 compat no longer uses the stackgap for resource limits. It
also no longer uses the stackgap for accessing sysctl's for the
ibcs2_sysconf() syscall but uses kernel_sysctl() instead. As a result,
ibcs2_sysconf() no longer needs Giant.
- The p_rlimit macro no longer exists.

Submitted by: mtm (mostly, I only did a few cleanups and catchups)
Tested on: i386
Compiled on: alpha, amd64
3e90922798b18f5f48cf18a71637ff86bdd8bd93 17-Jan-2004 des <des@FreeBSD.org> Restore correct semantics for F_DUPFD fcntl. This should fix the errors
people have been getting with configure scripts.
e0a75877b59aae7157172d050fd9405dc26aa8a6 16-Jan-2004 des <des@FreeBSD.org> WITNESS won't let us hold two filedesc locks at the same time, so juggle
fdp and newfdp around a bit.
5af038f21added63f08bd826d3d748c044404156 16-Jan-2004 des <des@FreeBSD.org> Remove two KASSERTs which were overly paranoid.
9cd32448094b8ed411b6c2cdc99b6f011975c32f 15-Jan-2004 des <des@FreeBSD.org> Take care to drop locks when calling malloc()
f270054cd314ee2f822fb6cb1e04452749d4c214 15-Jan-2004 des <des@FreeBSD.org> New file descriptor allocation code, derived from similar code introduced
in OpenBSD by Niels Provos. The patch introduces a bitmap of allocated
file descriptors which is used to locate available descriptors when a new
one is needed. It also moves the task of growing the file descriptor table
out of fdalloc(), reducing complexity in both fdalloc() and do_dup().

Debts of gratitude are owed to tjr@ (who provided the original patch on
which this work is based), grog@ (for the gdb(4) man page) and rwatson@
(for assistance with pxeboot(8)).
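
The bitmap idea in the entry above can be illustrated with a small user-space sketch. This is not the kernel code: struct fdmap, FDMAP_WORDS and the fdmap_*() functions are hypothetical names, and the fixed-size map is an assumption made to keep the example short.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical names throughout; the kernel's identifiers differ. */
#define FDMAP_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define FDMAP_WORDS	4	/* fixed capacity, for illustration only */

struct fdmap {
	unsigned long words[FDMAP_WORDS];	/* bit set = descriptor in use */
};

/* Mark descriptor fd as allocated. */
static void
fdmap_set(struct fdmap *m, int fd)
{
	m->words[fd / FDMAP_BITS] |= 1UL << (fd % FDMAP_BITS);
}

/* Mark descriptor fd as free. */
static void
fdmap_clear(struct fdmap *m, int fd)
{
	m->words[fd / FDMAP_BITS] &= ~(1UL << (fd % FDMAP_BITS));
}

/* Return the lowest free descriptor >= minfd, or -1 if none is left. */
static int
fdmap_first_free(const struct fdmap *m, int minfd)
{
	size_t nbits = FDMAP_WORDS * FDMAP_BITS;
	size_t fd;

	assert(minfd >= 0);
	fd = (size_t)minfd;
	while (fd < nbits) {
		unsigned long w = m->words[fd / FDMAP_BITS];

		if (w == ~0UL) {
			/* Every bit in this word is taken: skip the word. */
			fd = (fd / FDMAP_BITS + 1) * FDMAP_BITS;
			continue;
		}
		if ((w & (1UL << (fd % FDMAP_BITS))) == 0)
			return ((int)fd);
		fd++;
	}
	return (-1);
}
```

The point of the bitmap is that fully allocated words can be skipped in one step, so finding a free slot does not require touching every table entry.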
69ce6641b65cd97aa24d82408f22bb45edd7dccc 11-Jan-2004 des <des@FreeBSD.org> Mechanical whitespace cleanup.
5752042760d09cb610e99eba21ace91ccfade729 11-Jan-2004 alc <alc@FreeBSD.org> Remove long dead code, specifically, code related to munmapfd().
(See also vm/vm_mmap.c revision 1.173.)
1e69eeaeee11196731ec5e555aa60c432a26c56b 28-Dec-2003 dwmalone <dwmalone@FreeBSD.org> Plug a leak of open files that happens when you exec a suid program
with one of std{in,out,err} open. This helps with the file descriptor
leaks reported on -current. This should probably be merged into 5.2.

Reviewed by: ru
Tested by: Bjoern A. Zeeb <bzeeb-lists@lists.zabbadoz.net>
be405a4cbded7cdf17be85da15ed406cd32d25cf 19-Oct-2003 dwmalone <dwmalone@FreeBSD.org> falloc allocates a file structure and adds it to the file descriptor
table, acquiring the necessary locks as it works. It usually returns
two references to the new descriptor: one in the descriptor table
and one via a pointer argument.

As falloc releases the FILEDESC lock before returning, there is a
potential for a process to close the reference in the file descriptor
table before falloc's caller gets to use the file. I don't think this
can happen in practice at the moment, because Giant indirectly protects
closes.

To stop the file being completely closed in this situation, this change
makes falloc set the refcount to two when both references are returned.
This makes life easier for several of falloc's callers, because the
first thing they previously did was grab an extra reference on the
file.

Reviewed by: iedowse
Idea run past: jhb
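
The reference-count reasoning in the falloc entry above can be modelled in a few lines. This is a simplified sketch, not kernel code: struct demo_file, demo_falloc() and demo_fdrop() are hypothetical stand-ins, and all locking is omitted.

```c
#include <assert.h>

/* Hypothetical stand-in for struct file; only the fields needed here. */
struct demo_file {
	int f_count;	/* number of outstanding references */
	int closed;	/* set once the last reference is dropped */
};

/*
 * Model of the allocation described above: one reference always lives
 * in the descriptor table; a second is handed out when the caller also
 * asks for a pointer to the file.
 */
static void
demo_falloc(struct demo_file *fp, int want_pointer)
{
	fp->f_count = want_pointer ? 2 : 1;
	fp->closed = 0;
}

/* Drop one reference; the file is torn down only when none remain. */
static void
demo_fdrop(struct demo_file *fp)
{
	assert(fp->f_count > 0);
	if (--fp->f_count == 0)
		fp->closed = 1;
}
```

With the count starting at two, a close() racing in through the descriptor table drops only the table's reference, so the caller's pointer stays valid until the caller drops its own.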
1c522512fd8b855dc3d6dbd26fed59bda6e9f0e4 02-Oct-2003 rwatson <rwatson@FreeBSD.org> Remove the global variable 'cmask', which was used to initialize the
fd_cmask field in the file descriptor structure for the first process
indirectly from CMASK, and when an fd structure is initialized before
being filled in, and instead just use CMASK. This appears to be an
artifact left over from the initial integration of quotas into BSD.

Suggested by: peter
cb188056e62a5c06bc394877c8f8de9d7aaab12a 04-Aug-2003 dwmalone <dwmalone@FreeBSD.org> Do some minor Giant pushdown made possible by copyin, fget, fdrop,
malloc and mbuf allocation all not requiring Giant.

1) ostat, fstat and nfstat don't need Giant until they call fo_stat.
2) accept can copyin the address length without grabbing Giant.
3) sendit doesn't need Giant, so don't bother grabbing it until kern_sendit.
4) move Giant grabbing from each individual recv* syscall to recvit.
bbf702f5b5e1d9fcbca53b2c1676dea5d448a26e 29-Jul-2003 alc <alc@FreeBSD.org> Revision 1.51 of vm/uma_core.c modified uma_large_free() to acquire Giant
when needed. So, don't do it here.
9bfbf98f8a8fcc1607b07e0109f31468a6e9fef3 28-Jul-2003 rwatson <rwatson@FreeBSD.org> When exporting file descriptor data for threads invoking the
kern.file sysctl, don't return information about processes that
fail p_cansee(td, p). This prevents sockstat and related
programs from seeing file descriptors owned by processes not
in the same jail as the thread, as well as having implications
for MAC, etc.

This is a partial solution: it permits an information leak about
the number of descriptors in the sizing calculation (but this is
not new information, you can also get it from kern.openfiles),
and doesn't attempt to mask file descriptors based on the
properties of the descriptor, only the process referencing it.
However, it provides most of what you want under most
circumstances, without complicating the locking.

PR: 54211
Based on a patch submitted by: Pawel Jakub Dawidek <nick@garage.freebsd.pl>
d4d7ca154aba6a0f8370fe818bb79bd7685b9fdc 27-Jul-2003 phk <phk@FreeBSD.org> Add fdidx argument to vn_open() and vn_open_cred() and pass -1 throughout.
90dbdc0a7e3841c1ba9d25c597611f08c6384c20 25-Jul-2003 alc <alc@FreeBSD.org> revision 1.51 of vm/uma_core.c modified uma_large_malloc() to acquire
Giant when needed.
f98596512043a2b2e8207da073f085954bc223a9 13-Jul-2003 truckman <truckman@FreeBSD.org> Extend the mutex pool implementation to permit the creation and use of
multiple mutex pools with different options and sizes. Mutex pools can
be created with either the default sleep mutexes or with spin mutexes.
A dynamically created mutex pool can now be destroyed if it is no longer
needed.

Create two pools by default, one that matches the existing pool that
uses the MTX_NOWITNESS option that should be used for building higher
level locks, and a new pool with witness checking enabled.

Modify the users of the existing mutex pool to use the appropriate pool
in the new implementation.

Reviewed by: jhb
1e9246857202241772eba9f3150f907fb21e9458 04-Jul-2003 phk <phk@FreeBSD.org> Use the f_vnode field to tell which file descriptors have a vnode.
c81c59299bd255eb5ab7510c9e84e9c54b8003d0 22-Jun-2003 phk <phk@FreeBSD.org> Add a f_vnode field to struct file.

Several of the subtypes have an associated vnode which is used for
stuff like the f*() functions.

By giving the vnode a separate field, a number of checks for the specific
subtype can be replaced simply with a check for f_vnode != NULL, and
we can later free f_data up to subtype specific use.

At this point in time, f_data still points to the vnode, so any code I
might have overlooked will still work.
b64d71d8c882365e1edc1d44eb28353a342e59e1 20-Jun-2003 phk <phk@FreeBSD.org> Don't (re)initialize f_gcflag to zero.

Move initialization of DTYPE_VNODE specific field f_seqcount into
the DTYPE_VNODE specific code.
c618ac833871991c8f6b2717bee6d9773f396ea3 19-Jun-2003 alfred <alfred@FreeBSD.org> Unlock the struct file lock before acquiring Giant, otherwise
we can deadlock because of lock order reversals. This was not
caught because Witness ignores pool mutexes right now.

Diagnosis and help: truckman
Noticed by: pho
82d03c66d5d36a6bc9db1738c71edd00b2d4e48c 19-Jun-2003 silby <silby@FreeBSD.org> Add a rate limited message reporting when kern.maxfiles is exceeded,
reporting who did it.

Also, fix a style bug introduced in the previous change.

MFC after: 1 week
0d0a45a41b57e28acaf0d2182e18da074cb188d1 18-Jun-2003 silby <silby@FreeBSD.org> Reserve the last 5% of file descriptors for root use. This should allow
systems to fail more gracefully when a file descriptor exhaustion situation
occurs.

Original patch by: David G. Andersen <dga@lcs.mit.edu>
PR: 45353
MFC after: 1 week
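
The 5% reservation described in the entry above amounts to a simple threshold check. The sketch below is an assumption about the arithmetic, not a copy of the kernel's test; open_allowed() and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical check: non-root callers may only use the first 95% of
 * the table, while root may use all of it. The exact rounding and
 * comparison used by the kernel are not stated in the log; integer
 * division by 20 is assumed here.
 */
static bool
open_allowed(int openfiles, int maxfiles, bool is_root)
{
	int limit;

	limit = is_root ? maxfiles : maxfiles - maxfiles / 20;
	return (openfiles < limit);
}
```

With maxfiles set to 1000, an unprivileged caller is refused once 950 files are open, leaving 50 slots that root can still use to recover the system.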
591f399cfea86c008e5908349bbd5137d370f450 18-Jun-2003 phk <phk@FreeBSD.org> Initialize struct fileops with C99 sparse initialization.
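
As a reminder of what C99 "sparse" (designated) initialization looks like, here is a minimal sketch. struct demo_fileops and the demo_*() functions are hypothetical; only the fo_read, fo_write and fo_close member names are borrowed from this log.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical function types and ops table, for illustration only. */
typedef int demo_rdwr_t(void *obj, char *buf, size_t len);
typedef int demo_close_t(void *obj);

struct demo_fileops {
	demo_rdwr_t	*fo_read;
	demo_rdwr_t	*fo_write;
	demo_close_t	*fo_close;
};

/* Pretend to read len bytes. */
static int
demo_read(void *obj, char *buf, size_t len)
{
	(void)obj;
	(void)buf;
	return ((int)len);
}

/* Pretend to close the object. */
static int
demo_close(void *obj)
{
	(void)obj;
	return (0);
}

/*
 * Members are named explicitly, so their order in the struct does not
 * matter, and any member not mentioned (fo_write here) is implicitly
 * initialized to NULL.
 */
static const struct demo_fileops demo_ops = {
	.fo_close = demo_close,
	.fo_read = demo_read,
};
```

Because each member is named, adding or reordering fields in the structure does not silently shift the initializers, which is the main benefit over positional initialization.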
3b8fff9e4cedc4d9df3fb1ff39f5b668abdb9676 11-Jun-2003 obrien <obrien@FreeBSD.org> Use __FBSDID().
e41badac0ac3cc9aef1d142d0fa7cd6fd8524008 02-Jun-2003 tegge <tegge@FreeBSD.org> Add tracking of process leaders sharing a file descriptor table and
allow a file descriptor table to be shared between multiple process
leaders.

PR: 50923
11faebdb1af7128503b8b90390486634ee32bdb0 31-May-2003 phk <phk@FreeBSD.org> Remove needless return

Found by: FlexeLint
1db54a2d45e200e5c772193676e29fdc69f88759 15-May-2003 rwatson <rwatson@FreeBSD.org> VOP_PATHCONF() requires a vnode lock; this patch adds locking to
fpathconf(). The lock is held for direct calls to VOP_PATHCONF() in
pathconf() already.

Approved by: re (jhb)
Pointed out by: DEBUG_VFS_LOCKS
6cc289554b8533c3a4ccee449df82dd25964011a 30-Apr-2003 markm <markm@FreeBSD.org> Fix some easy, global, lint warnings. In most cases, this means
making some local variables static. In a couple of cases, this means
removing an unused variable.
9468fdaf14ab3e5212aac4e764e4616b726ec850 29-Apr-2003 kan <kan@FreeBSD.org> Deprecate machine/limits.h in favor of new sys/limits.h.
Change all in-tree consumers to include <sys/limits.h>

Discussed on: standards@
Partially submitted by: Craig Rodrigues <rodrigc@attbi.com>
0ae911eb0e57eebb61c50c02ddf69aa67ed19599 03-Mar-2003 phk <phk@FreeBSD.org> Gigacommit to improve device-driver source compatibility between
branches:

Initialize struct cdevsw using C99 sparse initialization and remove
all initializations to default values.

This patch is automatically generated and has been tested by compiling
LINT with all the fields in struct cdevsw in reverse order on alpha,
sparc64 and i386.

Approved by: re(scottl)
24eb4f5a923deecee5dfa7a2adeccd87d46dadc2 01-Mar-2003 tegge <tegge@FreeBSD.org> Remove unneeded code added in revision 1.188.
d3806508a829dfeb0d5fac145c9cfe336522d32a 24-Feb-2003 scottl <scottl@FreeBSD.org> Don't NULL out p_fd until after closefd() has been called. This isn't
totally correct, but it has caused breakage for too long. I welcome
someone with more fd fu to fix it correctly.
8ccfd800bd5c63b6d8c475e3e310baa626166219 22-Feb-2003 mtm <mtm@FreeBSD.org> Remove a comment which hasn't been true since rev. 1.158

Approved by: jhb, markm (mentor)(implicit)
cf874b345d0f766fb64cf4737e1c85ccc78d2bee 19-Feb-2003 imp <imp@FreeBSD.org> Back out M_* changes, per decision of the TRB.

Approved by: trb
f360480a6898bcaa82e7208ccaada23eed75ec60 15-Feb-2003 tegge <tegge@FreeBSD.org> Avoid file lock leakage when linuxthreads port or rfork is used:
- Mark the process leader as having an advisory lock
- Check if process leader is marked as having advisory lock when
closing file
- Check that file is still open after lock has been obtained
- Don't allow file descriptor table sharing between processes
with different leaders

PR: 10265
Reviewed by: alfred
29fb7c2bce463e61580d0653a141df43ac982c4f 15-Feb-2003 alfred <alfred@FreeBSD.org> Do not allow kqueues to be passed via unix domain sockets.
d9a7e5d6275ad9bb5fb49ed6879def1058777294 15-Feb-2003 alfred <alfred@FreeBSD.org> Fix LOR with PROC/filedesc. Introduce fdesc_mtx that will be used as a
barrier between free'ing filedesc structures. Basically if you want to
access another process's filedesc, you want to hold this mutex over the
entire operation.
2cadfd181f08386b144b11c0ef669916a2b33608 11-Feb-2003 alfred <alfred@FreeBSD.org> Don't lock FILEDESC under PROC.

The locking here needs to be revisited, but this ought to get rid of the
LOR messages that people are complaining about for now. I imagine either
I or someone else interested with smp will eventually clear this up.
04c4fc0788d1f6b517b910724ae0fc1caf5f0503 30-Jan-2003 phk <phk@FreeBSD.org> NODEVFS cleanup: remove #ifdefs
95173465f516a70d32ad373849f753432ebd94dd 21-Jan-2003 hsu <hsu@FreeBSD.org> Add missing SMP file locks around read-modify-write operations on
the flag field.

Reviewed by: rwatson
bf8e8a6e8f0bd9165109f0a258730dd242299815 21-Jan-2003 alfred <alfred@FreeBSD.org> Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
24596ddb76743273c5d39821727e4e00d025cb02 19-Jan-2003 phk <phk@FreeBSD.org> Originally when DEVFS was added, a global variable "devfs_present"
was used to control code which was conditional on DEVFS' presence
since this avoided the need for large-scale source pollution with
#include "opt_geom.h"

Now that we approach making DEVFS standard, replace these tests
with an #ifdef to facilitate mechanical removal once DEVFS becomes
non-optional.

No functional change by this commit.
ccd5574cc6e61b8fbf6b5ed907375f42e19b54f8 13-Jan-2003 dillon <dillon@FreeBSD.org> Bow to the whining masses and change a union back into void *. Retain
removal of unnecessary casts and throw in some minor cleanups to see if
anyone complains, just for the hell of it.
ddf9ef103e0a611c9a01425a28baf8a612b0d114 12-Jan-2003 dillon <dillon@FreeBSD.org> Change struct file f_data to un_data, a union of the correct struct
pointer types, and remove a huge number of casts from code using it.

Change struct xfile xf_data to xun_data (ABI is still compatible).

If we need to add a #define for f_data and xf_data we can, but I don't
think it will be necessary. There are no operational changes in this
commit.
266526442d75736b6f9ba0af867862398750359a 06-Jan-2003 nectar <nectar@FreeBSD.org> Correct file descriptor leaks in lseek and do_dup.
The leak in lseek was introduced in vfs_syscalls.c revision 1.218.
The leak in do_dup was introduced in kern_descrip.c revision 1.158.

Submitted by: iedowse
dd68501eb2ffc038a7b58abdbdfab424e80271b3 01-Jan-2003 alfred <alfred@FreeBSD.org> fdcopy() only needs a filedesc pointer.
9c85c6f62556907879069023dd22c05b2589074e 01-Jan-2003 alfred <alfred@FreeBSD.org> purge 'register'.
11118a80282743ed594d627e26c4007bcf1013bf 01-Jan-2003 alfred <alfred@FreeBSD.org> Since fdshare() and fdinit() only operate on filedescs, make them
take pointers to filedesc structures instead of threads. This makes
it more clear that they do not do any voodoo with the thread/proc
or anything other than the filedesc passed in or returned.

Remove some XXX KSE's as this resolves the issue.
b3883074cb4bda7291c2a017dae9f062be0f7b57 01-Jan-2003 alfred <alfred@FreeBSD.org> fdinit() does not need to lock the filedesc it is creating as no one
besides itself has access until the function returns.
4dcc4f2f7e46cbb7043c619cfe0adb6a78b936d6 27-Dec-2002 rwatson <rwatson@FreeBSD.org> Improve consistency between devfs and MAKEDEV: use UID_ROOT and
GID_WHEEL instead of UID_BIN and GID_BIN for /dev/fd/* entries.

Submitted by: kris
22ca3b530e0ece1ccb72f0d5bd2b07d9356e9eb0 24-Dec-2002 phk <phk@FreeBSD.org> White-space changes.
b9e78196906ce4e1ddcc9226147152ae434299af 23-Dec-2002 phk <phk@FreeBSD.org> Detediousficate declaration of fileops array members by introducing
typedefs for them.
bcd327d9a44f1ca6c358929cb93dd40acf3d7295 13-Dec-2002 tjr <tjr@FreeBSD.org> Drop filedesc lock and acquire Giant around calls to malloc() and free().
These call uma_large_malloc() and uma_large_free() which require Giant.
Fixes panic when descriptor table is larger than KMEM_ZMAX bytes
noticed by kkenn.

Reviewed by: jhb
390d1b415b4ef343dc89c751254861f0a69afeef 26-Nov-2002 jhb <jhb@FreeBSD.org> If the file descriptors passed into do_dup() are negative, return EBADF
instead of panicking. Also, perform some of the simpler sanity checks on
the fds before acquiring the filedesc lock.

Approved by: re
Reported by: Dan Nelson <dan@emsphone.com> and others
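
The user-visible behaviour described in the entry above (EBADF for negative descriptors) can be observed from user space with a sketch like the following. dup2_errno() is a hypothetical helper; dup2() and EBADF are standard POSIX.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: call dup2() and return 0 on success or the
 * resulting errno value on failure.
 */
static int
dup2_errno(int from, int to)
{
	errno = 0;
	if (dup2(from, to) == -1)
		return (errno);
	return (0);
}
```

Passing a negative value for either argument, for example dup2_errno(-1, 5), is expected to yield EBADF rather than any other outcome.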
7e9d4df21f95c33c9f6ee36ceffeb83ea7bcbe61 27-Oct-2002 wollman <wollman@FreeBSD.org> Change the way support for asynchronous I/O is indicated to applications
to conform to 1003.1-2001. Make it possible for applications to actually
tell whether or not asynchronous I/O is supported.

Since FreeBSD's aio implementation works on all descriptor types, don't
call down into file or vnode ops when [f]pathconf() is asked about
_PC_ASYNC_IO; this avoids the need for every file and vnode op to know about
it.
22c558ff8ce2596bab92131cb2f4cf36b666ce3d 18-Oct-2002 jhb <jhb@FreeBSD.org> Don't lock the proc lock to clear p_fd. p_fd isn't protected by the proc
lock.
00200a73ce61312fb429c1a652ce3762dfb3f512 16-Oct-2002 jhb <jhb@FreeBSD.org> Many style and whitespace fixes.

Submitted by: bde (mostly)
397b9f8a22d3afeecd2fdf4b01125c730ca590de 16-Oct-2002 jhb <jhb@FreeBSD.org> Sort includes a bit.

Submitted by: bde
28b39014f922f14aa60f52a085ff3b255a3726cb 15-Oct-2002 jhb <jhb@FreeBSD.org> Argh. Put back setting of P_ADVLOCK for the F_WRLCK case that was
accidentally lost in the previous revision.

Submitted by: bde
Pointy hat to: jhb
db8a406c24025647789a4d21484a3b79c09714f9 15-Oct-2002 jhb <jhb@FreeBSD.org> Remove the leaderp variable and just access p_leader directly. The
p_leader field is not protected by the proc lock but is only set during
fork1() by the parent process and never changes.
da2757cbc5b4e67753f56890f45f5f687cc298ae 03-Oct-2002 truckman <truckman@FreeBSD.org> In an SMP environment post-Giant it is no longer safe to blindly
dereference the struct sigio pointer without any locking. Change
fgetown() to take a reference to the pointer instead of a copy of the
pointer and call SIGIO_LOCK() before copying the pointer and
dereferencing it.

Reviewed by: rwatson
69375b2688581a4adcc2e1535fdd48fd696433c0 16-Sep-2002 tmm <tmm@FreeBSD.org> fcntl(..., F_SETLKW, ...) takes a pointer to a struct flock just like
F_SETLK does, so it also needs this structure copied in in fcntl() before
calling kern_fcntl().
0590c43070aac7fb636a1f4c4b94469046a317a0 14-Sep-2002 njl <njl@FreeBSD.org> Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)
3103a45bbfcd593503595c5717f40de30d985210 13-Sep-2002 tmm <tmm@FreeBSD.org> Fix fcntl(..., F_GETOWN, ...) and fcntl(..., F_SETOWN, ...) on sparc64
by not passing a pointer to a register_t or intptr_t when the code in
the lower layers expects one to an int.
b0aee047fb440026a1abd35d4e56a79307456d49 03-Sep-2002 jhb <jhb@FreeBSD.org> - Change falloc() to acquire an fd from the process table last so that
it can do it w/o needing to hold the filelist_lock sx lock.
- fdalloc() doesn't need Giant to call free() anymore. It also doesn't
need to drop and reacquire the filedesc lock around free() now as a
result.
- Try to make the code that copies fd tables when extending the fd table in
fdalloc() a bit more readable by performing assignments in separate
statements. This is still a bit ugly though.
- Use max() instead of an if statement to figure out the starting point
in the search-for-a-free-fd loop in fdalloc() so it reads better next to
the min() in the previous line.
- Don't grow nfiles in steps up to the size needed if we dup2() to some
really large number. Go ahead and double 'nfiles' in a loop prior
to doing the malloc().
- malloc() doesn't need Giant now.
- Use malloc() and free() instead of MALLOC() and FREE() in fdalloc().
- Check to see if the size we are going to grow to is too big, not if the
current size of the fd table is too big in the loop in fdalloc(). This
means if we are out of space or if dup2() requests too high of a fd,
then we will return an error before we go off and try to allocate some
huge table and copy the existing table into it.
- Move all of the logic for dup'ing a file descriptor into do_dup() instead
of putting some of it in do_dup() and duplicating other parts in four
different places. This makes dup(), dup2(), and fcntl(F_DUPFD) basically
wrappers of do_dup now. fcntl() still has an extra check since it uses
a different error return value in one case than the other functions.
- Add a KASSERT() for an assertion that may not always be true where the
fdcheckstd() function assumes that falloc() returns the fd requested and
not some other fd. I think that the assertion is always true because we
are always single-threaded when we get to this point, but if one was
using rfork() and another process sharing the fd table were playing with
the fd table, there could be a problem.
- To handle the problem of a file descriptor we are dup()'ing being closed
out from under us in dup() in general, do_dup() now obtains a reference
on the file in question before calling fdalloc(). If after the call to
fdalloc() the file for the fd we are dup'ing is a different file, then
we drop our reference on the original file and return EBADF. This
race was only handled in the dup2() case before and would just retry
the operation. The error return allows the user to know they are being
stupid since they have a locking bug in their app instead of dup'ing
some other descriptor and returning it to them.

Tested on: i386, alpha, sparc64
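
Two of the fdalloc() points in the entry above (double the table size in a loop before allocating, and reject a target that is too large before doing any work) can be sketched as follows. grow_target() and its parameters are hypothetical, and the clamping rule is an assumption, not the kernel's exact logic.

```c
#include <assert.h>

/*
 * Hypothetical sizing helper. Given the current table size nfiles, the
 * descriptor number want_fd that must fit, and the limit maxfiles:
 *  - return -1 up front if want_fd can never fit (error before any
 *    allocation or copying is attempted);
 *  - otherwise double nfiles until want_fd fits, clamping at maxfiles.
 */
static int
grow_target(int nfiles, int want_fd, int maxfiles)
{
	assert(nfiles > 0 && want_fd >= 0 && maxfiles > 0);
	if (want_fd >= maxfiles)
		return (-1);
	while (nfiles <= want_fd) {
		if (nfiles > maxfiles / 2)
			return (maxfiles);	/* clamp; also avoids overflow */
		nfiles *= 2;
	}
	return (nfiles);
}
```

Computing the final size first means a single allocation and a single copy of the old table, instead of one allocation per doubling step.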
62f75e87a487b86c51a8b2ad6a92a74221098916 02-Sep-2002 iedowse <iedowse@FreeBSD.org> Split fcntl() into a wrapper and a kernel-callable kern_fcntl()
implementation. The wrapper is responsible for copying additional
structure arguments (struct flock) to and from userland.
7dd9d470599f145845572ac1f0d4b621c19c1cdb 25-Aug-2002 charnier <charnier@FreeBSD.org> Replace various spelling with FALLTHROUGH which is lint()able
3246fbf45f089a96288563f2d5071bfbde5f99df 17-Aug-2002 rwatson <rwatson@FreeBSD.org> In continuation of early fileop credential changes, modify fo_ioctl() to
accept an 'active_cred' argument reflecting the credential of the thread
initiating the ioctl operation.

- Change fo_ioctl() to accept active_cred; change consumers of the
fo_ioctl() interface to generally pass active_cred from td->td_ucred.
- In fifofs, initialize filetmp.f_cred to ap->a_cred so that the
invocations of soo_ioctl() are provided access to the calling f_cred.
Pass ap->a_td->td_ucred as the active_cred, but note that this is
required because we don't yet distinguish file_cred and active_cred
in invoking VOP's.
- Update kqueue_ioctl() for its new argument.
- Update pipe_ioctl() for its new argument, pass active_cred rather
than td_ucred to MAC for authorization.
- Update soo_ioctl() for its new argument.
- Update vn_ioctl() for its new argument, use active_cred rather than
td->td_ucred to authorize VOP_IOCTL() and the associated VOP_GETATTR().

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
2b82cd24f10c789221e2b4edc59b96a7733b9e71 16-Aug-2002 rwatson <rwatson@FreeBSD.org> Make similar changes to fo_stat() and fo_poll() as made earlier to
fo_read() and fo_write(): explicitly use the cred argument to fo_poll()
as "active_cred" using the passed file descriptor's f_cred reference
to provide access to the file credential. Add an active_cred
argument to fo_stat() so that implementers have access to the active
credential as well as the file credential. Generally modify callers
of fo_stat() to pass in td->td_ucred rather than fp->f_cred, which
was redundantly provided via the fp argument. This set of modifications
also permits threads to perform these operations on behalf of another
thread without modifying their credential.

Trickle this change down into fo_stat/poll() implementations:

- badfo_poll(), badfo_stat(): modify/add arguments.
- kqueue_poll(), kqueue_stat(): modify arguments.
- pipe_poll(), pipe_stat(): modify/add arguments, pass active_cred to
MAC checks rather than td->td_ucred.
- soo_poll(), soo_stat(): modify/add arguments, pass fp->f_cred rather
than cred to pru_sopoll() to maintain current semantics.
- sopoll(): modify arguments.
- vn_poll(), vn_statfile(): modify/add arguments, pass new arguments
to vn_stat(). Pass active_cred to MAC and fp->f_cred to VOP_POLL()
to maintain current semantics.
- vn_close(): rename cred to file_cred to reflect reality while I'm here.
- vn_stat(): Add active_cred and file_cred arguments to vn_stat()
and consumers so that this distinction is maintained at the VFS
as well as 'struct file' layer. Pass active_cred instead of
td->td_ucred to MAC and to VOP_GETATTR() to maintain current semantics.

- fifofs: modify the creation of a "filetemp" so that the file
credential is properly initialized and can be used in the socket
code if desired. Pass ap->a_td->td_ucred as the active
credential to soo_poll(). If we teach the vnop interface about
the distinction between file and active credentials, we would use
the active credential here.

Note that current inconsistent passing of active_cred vs. file_cred to
VOP's is maintained. It's not clear why GETATTR would be authorized
using active_cred while POLL would be authorized using file_cred at
the file system level.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
44404e4547aee87b255582d4e6395551869e29b1 15-Aug-2002 rwatson <rwatson@FreeBSD.org> In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to clarify which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
"cred", and change the semantics of consumers of fo_read() and
fo_write() to pass the active credential of the thread requesting
an operation rather than the cached file cred. The cached file
cred is still available in fo_read() and fo_write() consumers
via fp->f_cred. These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
- pipe_read/write() now authorize MAC using active_cred rather
than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred. Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not. If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
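
The vn_rdwr() credential selection described in the commit above can be sketched in userland C as follows. This is a minimal model, not the kernel code: struct ucred_model, NOCRED_MODEL and select_vop_cred() are hypothetical names introduced for illustration only.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct ucred. */
struct ucred_model {
	unsigned int cr_uid;
};

/* Models the "no credential supplied" convention (assumed to be NULL). */
#define	NOCRED_MODEL	((struct ucred_model *)NULL)

/*
 * Pick the credential used to authorize the VOP: the file credential
 * when one was supplied, otherwise the active credential of the
 * requesting thread.
 */
static const struct ucred_model *
select_vop_cred(const struct ucred_model *active_cred,
    const struct ucred_model *file_cred)
{
	return (file_cred != NOCRED_MODEL ? file_cred : active_cred);
}
```

In this model the MAC check would always see active_cred, while only the VOP authorization goes through select_vop_cred().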
5e9bc3c12a5ff59cda5edb46ea46bc9a43a902c6 31-Jul-2002 des <des@FreeBSD.org> Have the kern.file sysctl export xfiles rather than files. The truth is
out there!

Sponsored by: DARPA, NAI Labs
b1555a27432152cc7817d0c592bbcebf95fdfd19 28-Jul-2002 truckman <truckman@FreeBSD.org> Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.
3af47b505779529ef058e22b9ae31366e60ffa70 17-Jul-2002 jhb <jhb@FreeBSD.org> Preallocate a struct file as the first thing in falloc() before we lock
the filelist_lock and check nfiles. This closes a race where we had to
unlock the filedesc to re-lock the filelist_lock.

Reported by: David Xu
Reviewed by: bde (mostly)
d1cbf6a1d1f96288005329dfaca2aaffbd884d81 29-Jun-2002 alfred <alfred@FreeBSD.org> More caddr_t removal, make fo_ioctl take a void * instead of a caddr_t.
e6fa9b9e922913444c2e6b2b58bf3de5eaed868d 31-May-2002 tanimura <tanimura@FreeBSD.org> Back out my last commit of locking down a socket, it conflicts with hsu's work.

Requested by: hsu
92d8381dd544a8237b3fd68c4e7fce9bd0903fb2 20-May-2002 tanimura <tanimura@FreeBSD.org> Lock down a socket, milestone 1.

o Add a mutex (sb_mtx) to struct sockbuf. This protects the data in a
socket buffer. The mutex in the receive buffer also protects the data
in struct socket.

o Determine the lock strategy for each member in struct socket.

o Lock down the following members:

- so_count
- so_options
- so_linger
- so_state

o Remove *_locked() socket APIs. Make the following socket APIs
touching the members above now require a locked socket:

- sodisconnect()
- soisconnected()
- soisconnecting()
- soisdisconnected()
- soisdisconnecting()
- sofree()
- soref()
- sorele()
- sorwakeup()
- sotryfree()
- sowakeup()
- sowwakeup()

Reviewed by: alfred
28d42899b766c395e5a6476f5bfa88b1481a08c0 16-May-2002 trhodes <trhodes@FreeBSD.org> More s/file system/filesystem/g
d1e340364b9883eebdfb4928891b438172fe13fb 06-May-2002 alfred <alfred@FreeBSD.org> Make funsetown() take a 'struct sigio **' so that the locking can
be done internally.

Ensure that no one can fsetown() to a dying process/pgrp. We need
to check the process for P_WEXIT to see if it's exiting. Process
groups are already safe because there is no such thing as a pgrp
zombie, therefore the proctree lock completely protects the pgrp
from having sigio structures associated with it after it runs
funsetownlst.

Add sigio lock to witness list under proctree and allproc, but over
proc and pgrp.

Seigo Tanimura helped with this.
101b936bbcbf5df4649ff52f53a4e12fc2b27ef1 03-May-2002 tanimura <tanimura@FreeBSD.org> As malloc(9) and free(9) are now Giant-free, remove the Giant lock
across malloc(9) and free(9) of a pgrp or a session.
58f1f5c532ae9a81c60b6a1a7e54ad895e064a72 03-May-2002 tanimura <tanimura@FreeBSD.org> Fix the lock order reversal between the sigio lock and a process/pgrp lock in
funsetownlst() by locking the sigio lock across funsetownlst().
798c53d495a4eb1c10dc65a1d2ca87e2cb12f8df 01-May-2002 alfred <alfred@FreeBSD.org> Redo the sigio locking.

Turn the sigio sx into a mutex.

Sigio lock is really only needed to protect interrupts from dereferencing
the sigio pointer in an object when the sigio itself is being destroyed.

In order to do this in the most unintrusive manner change pgsigio's
sigio * argument into a **, that way we can lock internally to the
function.
dafd57693bd94b06b4a40b4b56ac73814e22a01f 29-Apr-2002 asmodai <asmodai@FreeBSD.org> Fix indention which I did wrong in a previous commit.

Submitted by: bde
dbb4756491715a06ce4578841f6eba43fc62fa70 27-Apr-2002 tanimura <tanimura@FreeBSD.org> Add a global sx sigio_lock to protect the pointer to the sigio object
of a socket. This avoids lock order reversal caused by locking a
process in pgsigio().

sowakeup() and the callers of it (sowwakeup, soisconnected, etc.) now
require sigio_lock to be locked. Provide sowwakeup_locked(),
soisconnected_locked(), and so on in case where we have to modify a
socket and wake up a process atomically.
d4c507ea29a089a017b39ff7d76624177ec4bb5b 22-Apr-2002 alfred <alfred@FreeBSD.org> Don't FILEDESC_LOCK around calls to falloc().
e2acd5cecf61f060255d467b8d38cd35f941fb1c 20-Apr-2002 tanimura <tanimura@FreeBSD.org> Push down Giant for setpgid(), setsid() and aio_daemon(). Giant protects only
malloc(9) and free(9).
fcc5ad0935b7186f4b02cdb9e21265c4b22f7bdc 19-Apr-2002 nectar <nectar@FreeBSD.org> When exec'ing a set[ug]id program, make sure that the stdio file descriptors
(0, 1, 2) are allocated by opening /dev/null for any which are not already
open.

Reviewed by: alfred, phk
MFC after: 2 days
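
A userland analogue of the check described in the commit above might look like the sketch below. It is an illustration under assumptions, not the kernel implementation: ensure_stdio_open() is a hypothetical helper, and it relies only on the standard fcntl(2), open(2) and dup2(2) calls.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Make sure descriptors 0, 1 and 2 are open, pointing any that are
 * closed at /dev/null.  Returns 0 on success, -1 on error.
 */
static int
ensure_stdio_open(void)
{
	int fd, nfd;

	for (fd = 0; fd <= 2; fd++) {
		/* Skip descriptors that are open (or fail for other reasons). */
		if (fcntl(fd, F_GETFD) != -1 || errno != EBADF)
			continue;
		nfd = open("/dev/null", O_RDWR);
		if (nfd == -1)
			return (-1);
		if (nfd != fd) {
			/* Move the new descriptor onto the missing slot. */
			if (dup2(nfd, fd) == -1) {
				close(nfd);
				return (-1);
			}
			close(nfd);
		}
	}
	return (0);
}
```

Because open(2) returns the lowest unused descriptor and the loop walks upward from 0, the dup2() branch is normally not taken; it is kept as a safeguard.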
dba04cd736d55f53d6db22e89a37b13ba56eb759 16-Apr-2002 jhb <jhb@FreeBSD.org> Lock proctree_lock instead of pgrpsess_lock.
4d94ee39e68fce6f6387a3cfc3b8ef91a146d899 13-Apr-2002 asmodai <asmodai@FreeBSD.org> Use the correct macros for F_SETFD/F_GETFD instead of magic numbers.
Reflect that fact in the manual page.

PR: 12723
Submitted by: Peter Jeremy <peter.jeremy@alcatel.com.au>
Approved by: bde
MFC after: 2 weeks
db9aa81e239bb1c46b3b7ba560474cd954b78bf3 04-Apr-2002 jhb <jhb@FreeBSD.org> Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64
9ae6d1242c5a218cbb27f1a3ed8a9014412bbf80 29-Mar-2002 tanimura <tanimura@FreeBSD.org> The description of fd_mtx is "filedesc structure."
318cbeeecf54d416eb936f4bb65c00b18aab686b 20-Mar-2002 jeff <jeff@FreeBSD.org> Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.
357e37e023059920b1f80494e489797e2f69a3dd 19-Mar-2002 alfred <alfred@FreeBSD.org> Remove __P.
2923687da3c046deea227e675d5af075b9fa52d4 19-Mar-2002 jeff <jeff@FreeBSD.org> This is the first part of the new kernel memory allocator. This replaces
malloc(9) and vm_zone with a slab like allocator.

Reviewed by: arch@
21fc25cfdf4a00506690f4163a8a4391797d9c62 19-Mar-2002 alfred <alfred@FreeBSD.org> Close a race when vfs_syscalls.c:checkdirs() runs.

To do this protect the filedesc pointer in the proc with PROC_LOCK
in both checkdirs() and kern_descrip.c:fdfree().
b0fd50345a3289e9bb8d558d427d688f80e93c0b 15-Mar-2002 alfred <alfred@FreeBSD.org> Giant pushdown for read/write/pread/pwrite syscalls.

kern/kern_descrip.c:
Acquire Giant in fdrop_locked when file refcount hits zero, this removes
the requirement for the caller to own Giant for the most part.

kern/kern_ktrace.c:
Acquire Giant in ktrgenio, simplifies locking in upper read/write syscalls.

kern/vfs_bio.c:
Acquire Giant in bwillwrite if needed.

kern/sys_generic.c
Giant pushdown, remove Giant for:
read, pread, write and pwrite.
readv and writev aren't done yet because of the possible malloc calls
for iov to uio processing.

kern/sys_socket.c
Grab giant in the socket fo_read/write functions.

kern/vfs_vnops.c
Grab giant in the vnode fo_read/write functions.
3706cd350927f08fa8742cce9448c9ba8e4d6b2c 27-Feb-2002 jhb <jhb@FreeBSD.org> Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.
a09da298590e8c11ebafa37f79e0046814665237 23-Feb-2002 tanimura <tanimura@FreeBSD.org> Lock struct pgrp, session and sigio.

New locks are:

- pgrpsess_lock which locks the whole pgrps and sessions,
- pg_mtx which protects the pgrp members, and
- s_mtx which protects the session members.

Please refer to sys/proc.h for the coverage of these locks.

Changes on the pgrp/session interface:

- pgfind() needs the pgrpsess_lock held.

- The caller of enterpgrp() is responsible to allocate a new pgrp and
session.

- Call enterthispgrp() in order to enter an existing pgrp.

- pgsignal() requires a pgrp lock held.

Reviewed by: jhb, alfred
Tested on: cvsup.jp.FreeBSD.org
(which is a quad-CPU machine running -current)
0c6681c43533865b76871ae8cc3e499803e46d29 08-Feb-2002 peter <peter@FreeBSD.org> Fix broken Giant locking protocol introduced in rev 1.114. You cannot
unlock Giant if it is not locked in the first place. This makes the
nfstat(2) syscall (#278) a nice panic(2) implementation.
b616625325513bceaa693d49b5fc523887b0cefd 01-Feb-2002 alfred <alfred@FreeBSD.org> Remove bogus assertion in dup2 that can lead to panics when kernel
threads race for a file slot.

dup2(2) incorrectly assumes that if it needs to grow the ofiles
array that it will get what it wants. This assertion was valid
before we allowed shared filedescriptor tables but is now incorrect.

The assertion can trigger superfluous panics if the thread doing a
dup2 loses a race with another thread while possibly blocked in
the MALLOC call in fdalloc. Another thread may grab the slot we
are requesting, which makes fdalloc return something other than what
we asked for; this will trigger the bogus assertion.

MFC after: 2 weeks
Reviewed by: phk
74f44e9118c6128d583e17eb9c7f6c28741e03b6 01-Feb-2002 alfred <alfred@FreeBSD.org> Avoid lock order reversal filedesc/Giant when calling FREE() in fdalloc
by unlocking the filedesc before calling FREE().

Submitted by: bde
b0fc10702ad9c73c2baea40625f42e4564b923d4 29-Jan-2002 alfred <alfred@FreeBSD.org> Attempt to fixup select(2) and poll(2), this should fix some races with
other threads as well as speed up the interfaces.

To fix the race and accomplish the speedup, remove selholddrop and
pollholddrop. The entire concept is somewhat bogus because holding
the individual struct file pointers offers us no guarantees that
another thread context won't close it on us thereby removing our
access to our own reference.

Selholddrop and pollholddrop also would do multiple locks and unlocks
of mutexes _per-file_ in the fd arrays to be scanned, this needed to
be sped up.

Instead of using selholddrop and pollholddrop, simply hold the
filedesc lock over the selscan and pollscan functions. This should
protect us against close(2)'s on the files as well as reduce the multiple
lock/unlock pairs per fd into a single lock over the filedesc.
b969e5c19835642b62dfd3f5a5493d5225af218b 29-Jan-2002 alfred <alfred@FreeBSD.org> Backout 1.120, EINVAL isn't a proper error return when the passed fd is
negative, the 'pointer' referred to by the manpage is actually the
struct file's f_offset field.

Pointed out by: bde
53eeef7678d4eb2327aa3aa6f05b464964b83aaf 23-Jan-2002 alfred <alfred@FreeBSD.org> in fget() return EINVAL when the descriptor requested is negative.
1d6432ede3d9347bff05575e2e59a970c49ba052 20-Jan-2002 alfred <alfred@FreeBSD.org> use mutex pools for "struct file" locking.
fix indentation of FILE_LOCK/UNLOCK macros while I'm here.
18fa15ac4c6280e328ae83dd05ba035a378fef34 15-Jan-2002 alfred <alfred@FreeBSD.org> Push down Giant in dup(2) and dup2(2), Giant is only needed when
calling closef() in the case of dup2(2) duping over a descriptor
and when fdalloc must grow or free a filedesc.
1f82bc18d1d1e906cd9ed68039acb051fa6e11cf 14-Jan-2002 alfred <alfred@FreeBSD.org> Replace ffind_* with fget calls.

Make fget MPsafe.

Make fgetvp and fgetsock use the fget subsystem to reduce code bloat.

Push giant down in fpathconf().
f720362ae2b64985ec0266a24e489369e15041df 13-Jan-2002 alfred <alfred@FreeBSD.org> Comment fdrop and fdrop_locked functions.
b0764e3d9ae39199dd6b60e4dd916f0406c2eed2 13-Jan-2002 alfred <alfred@FreeBSD.org> Implement ffind_hold using ffind_lock.

Recommended by: jhb
844237b3960bfbf49070d6371a84f67f9e3366f6 13-Jan-2002 alfred <alfred@FreeBSD.org> SMP Lock struct file, filedesc and the global file list.

Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
protects all the fields.
protects "struct file" initialization, while a struct file
is being changed from &badfileops -> &pipeops or something
the filedesc should be locked.

1 mutex in each struct file
protects the refcount fields.
doesn't protect anything else.
the flags used for garbage collection have been moved to
f_gcflag which was the FILLER short, this doesn't need
locking because the garbage collection is a single threaded
container.
could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file * fhold(struct file *fp);
/* increments reference count on a file */

struct file * fhold_locked(struct file *fp);
/* like fhold but expects file to be locked */

struct file * ffind_hold(struct thread *, int fd);
/* finds the struct file in thread, adds one reference and
returns it unlocked */

struct file * ffind_lock(struct thread *, int fd);
/* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.
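
The per-file mutex and the fhold()/fhold_locked() pair listed in the commit above can be modelled in userland roughly as follows. This is a sketch under assumptions: struct file_model and the *_model functions are hypothetical, a pthread mutex stands in for the kernel mutex, and fdrop_model() is an assumed counterpart added only to show the release side.

```c
#include <pthread.h>

/* Hypothetical userland model of a struct file with its own mutex. */
struct file_model {
	pthread_mutex_t	f_mtx;		/* protects f_count only */
	int		f_count;	/* reference count */
};

/* Like fhold_model(), but the caller already holds fp->f_mtx. */
static struct file_model *
fhold_locked_model(struct file_model *fp)
{
	fp->f_count++;
	return (fp);
}

/* Take the per-file mutex, bump the count, return fp unlocked. */
static struct file_model *
fhold_model(struct file_model *fp)
{
	pthread_mutex_lock(&fp->f_mtx);
	fhold_locked_model(fp);
	pthread_mutex_unlock(&fp->f_mtx);
	return (fp);
}

/* Drop one reference; returns 1 when the last reference went away. */
static int
fdrop_model(struct file_model *fp)
{
	int last;

	pthread_mutex_lock(&fp->f_mtx);
	last = (--fp->f_count == 0);
	pthread_mutex_unlock(&fp->f_mtx);
	return (last);
}
```

Returning the pointer from the hold functions lets a caller write the hold and the assignment in a single expression.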
f7af5c6f927c1530abc7ebc578a2dc820dcdfe7e 14-Dec-2001 jlemon <jlemon@FreeBSD.org> When removing kqueue descriptors from the descriptor table during a fork,
update fd_freefile and fd_lastfile as well, to keep things in sync.

Pointed out by: Debbie Chu <dchu@juniper.net>
86ed17d675cb503ddb3f71f8b6f7c3af530bb29a 17-Nov-2001 dillon <dillon@FreeBSD.org> Give struct socket structures a ref counting interface similar to
vnodes. This will hopefully serve as a base from which we can
expand the MP code. We currently do not attempt to obtain any
mutex or SX locks, but the door is open to add them when we nail
down exactly how that part of it is going to work.
e3b965f7d57557c7273b062793ee6de6ff40223d 14-Nov-2001 dillon <dillon@FreeBSD.org> remove holdfp()

Replace uses of holdfp() with fget*() or fgetvp*() calls as appropriate

introduce fget(), fget_read(), fget_write() - these functions will take
a thread and file descriptor and return a file pointer with its ref
count bumped.

introduce fgetvp(), fgetvp_read(), fgetvp_write() - these functions will
take a thread and file descriptor and return a vref()'d vnode.

*_read() requires that the file pointer be FREAD, *_write that it be
FWRITE.

This continues the cleanup of struct filedesc and struct file access
routines which, when we are all through with it, will allow us to then
make the API calls MP safe and be able to move Giant down into the fo_*
functions.
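
The fget(), fget_read() and fget_write() behaviour described in the commit above can be sketched as a single lookup with a required-mode mask. This is a userland model under assumptions: the struct and function names are hypothetical, the flag values are stand-ins for FREAD and FWRITE, and returning EBADF for every failure is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>

#define	FREAD_MODEL	0x0001	/* stand-in for FREAD */
#define	FWRITE_MODEL	0x0002	/* stand-in for FWRITE */

/* Hypothetical models of struct file and the descriptor table. */
struct fget_file {
	int	f_flag;		/* open mode bits */
	int	f_count;	/* reference count */
};

struct fget_table {
	struct fget_file **ofiles;	/* indexed by descriptor */
	int		   nfiles;	/* number of slots */
};

/*
 * Look up fd, require the mode bits in "need" (0 for no requirement),
 * and return the file with its reference count bumped.
 */
static int
fget_model(struct fget_table *fdp, int fd, int need, struct fget_file **fpp)
{
	struct fget_file *fp;

	*fpp = NULL;
	if (fd < 0 || fd >= fdp->nfiles || (fp = fdp->ofiles[fd]) == NULL)
		return (EBADF);
	if (need != 0 && (fp->f_flag & need) == 0)
		return (EBADF);
	fp->f_count++;
	*fpp = fp;
	return (0);
}
```

In this model fget_read() corresponds to calling fget_model() with FREAD_MODEL and fget_write() to calling it with FWRITE_MODEL.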
4806d88677d2a333254f9fc33d4e119cc56bb002 11-Oct-2001 jhb <jhb@FreeBSD.org> Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
refcount is > 1 and false (0) otherwise.
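
The three ucred primitives described in the commit above can be modelled in userland as below. This is a sketch, not the kernel code: struct ucred_rc and the *_model functions are hypothetical, and the model carries a single uid field where the real credential has many.

```c
/* Hypothetical model of a reference-counted credential. */
struct ucred_rc {
	unsigned int cr_ref;	/* reference count */
	unsigned int cr_uid;	/* the only credential field modelled */
};

/* Bump the refcount and hand back the same credential. */
static struct ucred_rc *
crhold_model(struct ucred_rc *cr)
{
	cr->cr_ref++;
	return (cr);
}

/* Copy credential contents only; dest keeps its own refcount. */
static void
crcopy_model(struct ucred_rc *dest, const struct ucred_rc *src)
{
	dest->cr_uid = src->cr_uid;
}

/* True (1) if more than one reference exists, false (0) otherwise. */
static int
crshared_model(const struct ucred_rc *cr)
{
	return (cr->cr_ref > 1);
}
```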
a5d12801498b0c470fdcbec08e0ed4726c3fae4f 30-Sep-2001 jlemon <jlemon@FreeBSD.org> When FREE()ing kqueue related structures, charge them to the correct bucket.

Submitted by: iedowse
Forgotten by: jlemon
1bda6a607a245d198d36f589a6d8d42d9ea45b86 12-Sep-2001 julian <julian@FreeBSD.org> If an incoming struct proc could have been NULL before, then don't
automatically change the code to add

struct proc *p = td->td_proc;

because now 'td' is probably capable of being NULL too.
I expect to see more of this kind of error during the 'weeding'
process. It's too easy to make. (junior hacker project.. look for these :-)

Submitted by: Mark Peek <mp@freebsd.org>
5596676e6c6c1e81e899cd0531f9b1c28a292669 12-Sep-2001 julian <julian@FreeBSD.org> KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha
ab65c87802bf5b2499357a7731a9d082ee00ea48 01-Sep-2001 dillon <dillon@FreeBSD.org> Giant Pushdown. Saved the worst P4 tree breakage for last.

reboot() getpriority() setpriority() rtprio() osetrlimit() ogetrlimit()
setrlimit() getrlimit() getrusage() getpid() getppid() getpgrp()
getpgid() getsid() getgid() getegid() getgroups() setsid() setpgid()
setuid() seteuid() setgid() setegid() setgroups() setreuid() setregid()
setresuid() setresgid() getresuid() getresgid() __setugid() getlogin()
setlogin() modnext() modfnext() modstat() modfind() kldload() kldunload()
kldfind() kldnext() kldstat() kldfirstmod() kldsym() getdtablesize()
dup2() dup() fcntl() close() ofstat() fstat() nfstat() fpathconf()
flock()
e4aa34eee032faef51918fdfa9d9f0758d999654 29-Aug-2001 ache <ache@FreeBSD.org> advlock: simplify overflow checks
428194d97d283c1a8d5ebe8ab066339b67e211eb 23-Aug-2001 ache <ache@FreeBSD.org> Move <machine/*> after <sys/*>
Add missing fdrop() before EOVERFLOW

Pointed by: bde
1905060cac646380cb34d8c69b70c28b52278798 23-Aug-2001 ache <ache@FreeBSD.org> Detect off_t EOVERFLOW of start/end offsets calculations for adv. lock,
as POSIX requires.
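
The kind of overflow detection described in the commit above can be sketched as follows. This is a userland model under assumptions: int64_t stands in for off_t, off_add_checked() and lock_range_model() are hypothetical helpers, and only positive lock lengths are handled.

```c
#include <errno.h>
#include <stdint.h>

/* Add two 64-bit offsets, reporting EOVERFLOW instead of wrapping. */
static int
off_add_checked(int64_t a, int64_t b, int64_t *sum)
{
	if ((b > 0 && a > INT64_MAX - b) || (b < 0 && a < INT64_MIN - b))
		return (EOVERFLOW);
	*sum = a + b;
	return (0);
}

/*
 * Compute the inclusive byte range [*start, *end] of a lock from a
 * base offset, a relative start and a positive length.
 */
static int
lock_range_model(int64_t base, int64_t l_start, int64_t l_len,
    int64_t *start, int64_t *end)
{
	int error;

	if (l_len <= 0)
		return (EINVAL);	/* sketch handles positive lengths only */
	error = off_add_checked(base, l_start, start);
	if (error != 0)
		return (error);
	if (*start < 0)
		return (EINVAL);
	/* The last byte covered is start + l_len - 1. */
	return (off_add_checked(*start, l_len - 1, end));
}
```

Checking against INT64_MAX - b before adding avoids relying on signed wraparound, which is undefined behaviour in C.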
81b95242db35dc7073e271d0822c12bda409ae37 06-Aug-2001 chris <chris@FreeBSD.org> Remove the fildesc_clone() function and its associated unnecessary code.
It didn't implement the proper /dev/fd functionality (which would be to
include in the directory listing /dev/fd/n if the process has fd n open)
anyway.

Anything needing access to /dev/fd/n where n > 2 can use the optional
fdescfs module, which implements this properly and does not cause any
trouble with devfs.

Discussed with: phk
f504530d9fa3bcd6613f6051a68db5da74c627ce 25-May-2001 rwatson <rwatson@FreeBSD.org> o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
pcred->pc_uidinfo, which was associated with the real uid, only rename
it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
the original macro that pointed p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
cr_ruidinfo there.
o Restructure many credential modification chunks to always crdup while
we figure out locking and optimizations; generally speaking, this
means moving to a structure like this:
newcred = crdup(oldcred);
...
p->p_ucred = newcred;
crfree(oldcred);
It's not race-free, but better than nothing. There are also races
in sys_process.c, all inter-process authorization, fork, exec, and
exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
allocation.
o Clean up ktrcanset() to take into account changes, and move to using
suser_xxx() instead of performing a direct uid==0 comparison.
o Improve commenting in various kern_prot.c credential modification
calls to better document current behavior. In a couple of places,
current behavior is a little questionable and we need to check
POSIX.1 to make sure it's "right". More commenting work still
remains to be done.
o Update credential management calls, such as crfree(), to take into
account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
change_euid()
change_egid()
change_ruid()
change_rgid()
change_svuid()
change_svgid()
In each case, the call now acts on a credential not a process, and as
such no longer requires more complicated process locking/etc. They
now assume the caller will do any necessary allocation of an
exclusive credential reference. Each is commented to document its
reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
processes and pcreds. Note that this authorization, as well as
CANSIGIO(), needs to be updated to use the p_cansignal() and
p_cansched() centralized authorization routines, as they currently
do not take into account some desirable restrictions that are handled
by the centralized routines, as well as being inconsistent with other
similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from: TrustedBSD Project
Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit
bcca5847d5e7a197302d7689cd358f5ce6316d0a 01-May-2001 markm <markm@FreeBSD.org> Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h from kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by: bde (with reservations)
9c03a8ae91e06e47f0c59996ef0e2300e231e101 24-Apr-2001 jhb <jhb@FreeBSD.org> Change the pfind() and zpfind() functions to lock the process that they
find before releasing the allproc lock and returning.

Reviewed by: -smp, dfr, jake
c47745e97713190e3da533b9d29b74b2ceee96f1 26-Mar-2001 phk <phk@FreeBSD.org> Send the remains (such as I have located) of "block major numbers" to
the bit-bucket.
d458a37885714323658f12d0da555473beca7c4c 20-Mar-2001 phk <phk@FreeBSD.org> Make the pseudo-driver for "/dev/fd/*" handle fd's larger than 255.

PR: 25936
11781a7431fab609cd00058a63ac09ccddb16854 15-Feb-2001 jlemon <jlemon@FreeBSD.org> Extend kqueue down to the device layer.

Backwards compatible approach suggested by: peter
dd75d1d73b4f3034c1d9f621a49fff58b1d71eb1 08-Dec-2000 dwmalone <dwmalone@FreeBSD.org> Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>
15a44d16ca10bf52da55462560c345940cd19b38 18-Nov-2000 dillon <dillon@FreeBSD.org> This patchset fixes a large number of file descriptor race conditions.
Pre-rfork code assumed inherent locking of a process's file descriptor
array. However, with the advent of rfork() the file descriptor table
could be shared between processes. This patch closes over a dozen
serious race conditions related to one thread manipulating the table
(e.g. closing or dup()ing a descriptor) while another is blocked in
an open(), close(), fcntl(), read(), write(), etc...

PR: kern/11629
Discussed with: Alexander Viro <viro@math.psu.edu>
8b68576c1658300e9ba6239ddab7cc0da01384ac 28-Oct-2000 alc <alc@FreeBSD.org> Add missing call to knote_fdclose() in setugidsafety() and fdcloseexec().

Reviewed by: jlemon
e47f61e18396b6e5f61ee91b9f9f832976ee96cf 02-Sep-2000 phk <phk@FreeBSD.org> Avoid the modules madness I inadvertently introduced by making the
cloning infrastructure standard in kern_conf. Modules are now
the same with or without devfs support.

If you need to detect if devfs is present, in modules or elsewhere,
check the integer variable "devfs_present".

This happily removes an ugly hack from kern/vfs_conf.c.

This forces a rename of the eventhandler and the standard clone
helper function.

Include <sys/eventhandler.h> in <sys/conf.h>: it's a helper #include
like <sys/queue.h>

Remove all #includes of opt_devfs.h they no longer matter.
f507f35db734e4ec15038f724a6d984563eab1f0 26-Aug-2000 alfred <alfred@FreeBSD.org> new sysctl 'kern.openfiles' (exports nfiles to userland)
f3d31d465280d599585a26de3181eaae9ee83f69 24-Aug-2000 phk <phk@FreeBSD.org> Dang, a _clone routine escaped #ifdef DEVFS containment.
ec761116e25ef0a9e43ec5670c7c6565a4848a0b 24-Aug-2000 phk <phk@FreeBSD.org> Fix panic when removing open device (found by bp@)
Implement subdirs.
Build the full "devicename" for cloning functions.
Fix panic when deleted device goes away.
Collapse devfs_dir and devfs_dirent structures.
Add proper cloning to the /dev/fd* "device-"driver.
Fix a bug in make_dev_alias() handling which made aliases appear
multiple times.
Use devfs_clone to implement getdiskbyname()
Make specfs maintain the stat(2) timestamps per dev_t
85c9a2ddc16cd13cfb2434396af3929dc95adaa7 11-Aug-2000 peter <peter@FreeBSD.org> Clean up some low level bootstrap code:

- stop using the evil 'struct trapframe' argument for mi_startup()
(formerly main()). There are much better ways of doing it.
- do not use prepare_usermode() - setregs() in execve() will do it
all for us as long as the p_md.md_regs pointer is set. (which is
now done in machdep.c rather than init_main.c. The Alpha port did it
this way all along and is much cleaner).
- collect all the magic %cr0 etc register settings into one place and
have the AP's call that instead of using magic numbers (!!) that keep
changing over and over again.
- Make it safe to call kthread_create() earlier, including during the
device probe sequence. It doesn't need the callback mechanism that
NetBSD's version uses.
- kthreads created this way are root-less as they exist before the root
filesystem is mounted. init(1) is set up so that it acquires the root
pointers prior to running. If other kthreads want filesystem access
we can make this code more generic.
- set all threads start times once we have decided what time it is.
- init uses a trampoline rather than the evil prepare_usermode() hack.
- kern_descrip.c has a couple of tweaks to deal with forking when there
is no rootdir or cwd etc.
- adjust the early SYSINIT() sequence so that a few prerequisites are in
place. eg: make sure the run queue is initialized before doing forks.

With this, the USB code can easily create a kthread to do the device
tree discovery. (I have tested it, it works nicely).

There are still some open issues before this is truly useful.
- tsleep() does not like working before the clock is running. It
sort-of tries to spin wait, but it can do more useful things now.
- stopping a kthread in kld code at unload time is "interesting" but
we have a solution for that.

The Alpha code needs no changes for this. It already uses pretty much the
same strategies, but a little cleaner.
e5de271d472634538e30a52ae173ebe1213162fd 04-Jul-2000 phk <phk@FreeBSD.org> Previous commit changing SYSCTL_HANDLER_ARGS violated KNF.

Pointed out by: bde
61ff05be253ab1a6d0939338ce307aece558f308 03-Jul-2000 phk <phk@FreeBSD.org> Style police catches up with rev 1.26 of src/sys/sys/sysctl.h:

Sanitize SYSCTL_HANDLER_ARGS so that simplistic tools can grok our
sources:

-sysctl_vm_zone SYSCTL_HANDLER_ARGS
+sysctl_vm_zone (SYSCTL_HANDLER_ARGS)
1e48e18a710b517238a0acb9033bc02eb52b9b99 27-Jun-2000 alfred <alfred@FreeBSD.org> don't panic the system when fpathconf is called on an unsupported filetype.
961b97d43458f3c57241940cabebb3bedf7e4c00 26-May-2000 jake <jake@FreeBSD.org> Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by: msmith and others
d93fbc99166053b75c2eeb69b5cb603cfaf79ec0 23-May-2000 jake <jake@FreeBSD.org> Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by: phk
Reviewed by: phk
Approved by: mdodd
c41c876463ee8c302f06554537e0fb22a3fcdca4 16-Apr-2000 jlemon <jlemon@FreeBSD.org> Introduce kqueue() and kevent(), a kernel event notification facility.
b42951578188c5aab5c9f8cbcde4a743f8092cdc 02-Apr-2000 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'ALSA'.
72c8ff7d8a5f214ddae6f08cf6a1cefbe35fddc2 21-Jan-2000 imp <imp@FreeBSD.org> Fix the style bugs in the style bugs fix. The style bug fix made the
new function inconsistent with the rest of this file. The spelling
and grammar fixes were good and remain.
c6da76a1a6b1b60435d8df410d11a7a497db504a 21-Jan-2000 green <green@FreeBSD.org> Fix style bugs in the last commit.
f6db7985c40e24ed7b40b2a88bed96538ecea466 20-Jan-2000 imp <imp@FreeBSD.org> bdeize last commit:
o Remove opt_dontuse.h and ifdef PROCFS

Submitted by: bde, peter
4e884c480a19cc6a6050ee1e47f54fbc3cab11a0 20-Jan-2000 imp <imp@FreeBSD.org> When we are execing a setugid program, and we have a procfs filesystem
file open in one of the special file descriptors (0, 1, or 2), close
it before completing the exec.

Submitted by: nergal@idea.avet.com.pl
Constructive comments: deraadt@openbsd.org, sef, peter, jkh
81b3aed1d181b6e482aedd1c33dee744ef152e24 26-Dec-1999 bde <bde@FreeBSD.org> Removed unused includes.

Removed unused compatibility cruft for dup(). Using it today would just
break dup() on fd's >= 64.

Fixed some style bugs.
bd95a150acc1fc88cede6f39fc0807cdc57a1382 18-Nov-1999 dillon <dillon@FreeBSD.org> Only bother converting the stat structure if we intend to return it,
when no error occurs.

PR: kern/14966
Reviewed by: dillon@freebsd.org
Submitted by: Kelly Yancey kbyanc@posi.net
6670f38e3fdabc62da97ed3938edb220a61b96eb 18-Nov-1999 peter <peter@FreeBSD.org> Remove cdevsw_add() - the necessary make_dev() calls appear to be there
already.
8fca18de89986d6275d0502af100b97274c4bab5 16-Nov-1999 phk <phk@FreeBSD.org> This is a partial commit of the patch from PR 14914:

A lot of the code in sys/kern directly accesses the *Q_HEAD and *Q_ENTRY
structures for list operations. This patch makes all list operations
in sys/kern use the queue(3) macros, rather than directly accessing the
*Q_{HEAD,ENTRY} structures.

This batch of changes compile to the same object files.

Reviewed by: phk
Submitted by: Jake Burkholder <jake@checker.org>
PR: 14914
3dd33a3e0f26ff05eb6d05ab539d40228ef676e5 08-Nov-1999 peter <peter@FreeBSD.org> Use fo_stat() rather than duplicating knowledge of file type internals
in here for stat(2) and friends. Update the badops entries accordingly.
9e596ffa36410bb5383ea949a67b1c3f2d79b7e3 07-Nov-1999 green <green@FreeBSD.org> Fix the advisory file locking by restoring previous ordering in closef()/
fdrop(). This only showed up when a file descriptor was duplicated
and then closed once, where the lock would be released on the first close().
787140aa421dd3c40fee10d21f80fadda98981ca 11-Oct-1999 peter <peter@FreeBSD.org> Trim unused options (or #ifdef for undoc options).

Submitted by: phk
e9e05122106077f828fbbd0964036c22503d1633 25-Sep-1999 phk <phk@FreeBSD.org> Remove five now unused fields from struct cdevsw. They should never
have been there in the first place. A GENERIC kernel shrinks almost 1k.

Add a slightly different safetybelt under nostop for tty drivers.

Add some missing FreeBSD tags
3588b9beb9bff6f857172a7d8fc19057371a291b 25-Sep-1999 phk <phk@FreeBSD.org> Fix a hole in jail(2).

Noticed by: Alexander Bezroutchko <abb@zenon.net>
140cb4ff83b0061eeba0756f708f3f7c117e76e5 19-Sep-1999 green <green@FreeBSD.org> This is what was "fdfix2.patch," a fix for fd sharing. It's pretty
far-reaching in fd-land, so you'll want to consult the code for
changes. The biggest change is that now, you don't use
fp->f_ops->fo_foo(fp, bar)
but instead
fo_foo(fp, bar),
which increments and decrements the fp refcount upon entry and exit.
Two new calls, fhold() and fdrop(), are provided. Each does what it
seems like it should, and if fdrop() brings the refcount to zero, the
fd is freed as well.

Thanks to peter ("to hell with it, it looks ok to me.") for his review.
Thanks to msmith for keeping me from putting locks everywhere :)

Reviewed by: peter
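
The hold/drop pattern described in this entry can be sketched in userland C.
This is only an illustration, not the kernel code: struct xfile, xfhold(),
xfdrop() and xfo_read() are hypothetical stand-ins for struct file, fhold(),
fdrop() and the fo_foo() wrappers, and "freeing" is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userland stand-ins; not the kernel's struct file or fileops. */
struct xfile;
struct xfileops {
	int (*fo_read)(struct xfile *fp, int arg);
};
struct xfile {
	int f_count;			/* reference count */
	int f_freed;			/* set when the last reference goes away */
	const struct xfileops *f_ops;
};

/* Take an extra reference. */
static void
xfhold(struct xfile *fp)
{
	fp->f_count++;
}

/* Drop a reference; mark the file freed when the count reaches zero. */
static void
xfdrop(struct xfile *fp)
{
	if (--fp->f_count == 0)
		fp->f_freed = 1;	/* real code would release fp here */
}

/* Wrapper: hold across the indirect call, then drop. */
static int
xfo_read(struct xfile *fp, int arg)
{
	int error;

	xfhold(fp);
	error = fp->f_ops->fo_read(fp, arg);
	xfdrop(fp);
	return (error);
}

/* Example operation: returns the reference count it observed. */
static int
demo_read(struct xfile *fp, int arg)
{
	(void)arg;
	return (fp->f_count);
}
```

A caller that owns one reference sees the count raised to two while the
operation runs and restored afterwards; the final xfdrop() by the owner is
what marks the file as freed.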
3b842d34e82312a8004a7ecd65ccdb837ef72ac1 28-Aug-1999 peter <peter@FreeBSD.org> $Id$ -> $FreeBSD$
663cbe4fc26065f7af7d10faaee492a626156145 23-Aug-1999 phk <phk@FreeBSD.org> Convert DEVFS hooks in (most) drivers to make_dev().

Diskslice/label code not yet handled.

Vinum, i4b, alpha, pc98 not dealt with (left to respective Maintainers)

Add the correct hook for devfs to kern_conf.c

The net result of this exercise is that far fewer files depend on DEVFS,
and devtoname() gets more sensible output in many cases.

A few drivers had minor additional cleanups performed relating to cdevsw
registration.

A few drivers don't register a cdevsw{} anymore, but only use make_dev().
c03366a55d4ace981b016ae999ae67675c486cdd 04-Aug-1999 green <green@FreeBSD.org> Fix fd race conditions (during shared fd table usage.) Badfileops is
now used in f_ops in place of NULL, and modifications to the files
are more carefully ordered. f_ops should also be set to &badfileops
upon "close" of a file.

This does not fix the problems mentioned in this PR other than the first
one.

PR: 11629
Reviewed by: peter
1abbd13bc2610a86a1eef7c8a73b314b2041b08e 07-Jun-1999 msmith <msmith@FreeBSD.org> From the submitter:

- this causes POSIX locking to use the thread group leader
(p->p_leader) as the locking thread for all advisory locks.
In non-kernel-threaded code p->p_leader == p, so this will have
no effect.

This results in (more) correct POSIX threaded flock-ing semantics.

It also prevents the leader from exiting before any of the children
in exit1() (so that p->p_leader will never be stale).

We have been running this patch for over a month now in our lab
under load and at customer sites.

Submitted by: John Plevyak <jplevyak@inktomi.com>
6a5dc97620c08ad609e1b3c3c042f150feb46dd3 31-May-1999 phk <phk@FreeBSD.org> Simplify cdevsw registration.

The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it. cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print a message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables. Most places they were used
bogusly. Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
72 bogus makedev() calls
26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed. Patches emailed to authors. LINT
probably broken until they catch up.
7e4a9dced9acd97789b37c32063ee7a8aa133f6d 30-May-1999 phk <phk@FreeBSD.org> This commit should be an extensive NO-OP:

Reformat and initialize correctly all "struct cdevsw".

Initialize the d_maj and d_bmaj fields.

The d_reset field was not removed, although it is never used.

I used a program to do most of this, so all the files now use the
same consistent format. Please keep it that way.

Vinum and i4b not modified, patches emailed to respective authors.
7e26ca1d1a4bb6507cb6f9241f2d35b1048eaecd 11-May-1999 phk <phk@FreeBSD.org> Divorce "dev_t" from the "major|minor" bitmap, which is now called
udev_t in the kernel but still called dev_t in userland.

Provide functions to manipulate both types:
major() umajor()
minor() uminor()
makedev() umakedev()
dev2udev() udev2dev()

For now they're functions, they will become in-line functions
after one of the next two steps in this process.

Return major/minor/makedev to macro-hood for userland.

Register a name in cdevsw[] for the "filedescriptor" driver.

In the kernel the udev_t appears in places where we have the
major/minor number combination, (ie: a potential device: we
may not have the driver nor the device), like in inodes, vattr,
cdevsw registration and so on, whereas the dev_t appears where
we carry around a reference to an actual device.

In the future the cdevsw and the aliased-from vnode will be hung
directly from the dev_t, along with up to two softc pointers for
the device driver and a few housekeeping bits. This will essentially
replace the current "alias" check code (same buck, bigger bang).

A little stunt has been provided to try to catch places where the
wrong type is being used (dev_t vs udev_t), if you see something
not working, #undef DEVT_FASCIST in kern/kern_conf.c and see if
it makes a difference. If it does, please try to track it down
(many hands make light work) or at least try to reproduce it
as simply as possible, and describe how to do that.

Without DEVT_FASCIST I believe this patch is a no-op.

Stylistic/posixoid comments about the userland view of the <sys/*.h>
files welcome now, from userland they now contain the end result.

Next planned step: make all dev_t's refer to the same devsw[] which
means convert BLK's to CHR's at the perimeter of the vnodes and
other places where they enter the game (bootdev, mknod, sysctl).
dd35516544a379a6c23755ba8ea52e0cb126c095 03-May-1999 billf <billf@FreeBSD.org> Add sysctl descriptions to many SYSCTL_XXXs

PR: kern/11197
Submitted by: Adrian Chadd <adrian@FreeBSD.org>
Reviewed by: billf(spelling/style/minor nits)
Looked at by: bde(style)
ba8c622703d205067a8d74af1a1b2d95844349b2 28-Apr-1999 dt <dt@FreeBSD.org> s/static foo_devsw_installed = 0;/static int foo_devsw_installed;/.
(Edited automatically)
a8dc66f457be84eefbe16e70c901ceb11137ba65 08-Jan-1999 eivind <eivind@FreeBSD.org> Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as
discussed on -hackers.

Introduce 'KASSERT(assertion, ("panic message", args))' for simple
check + panic.

Reviewed by: msmith
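
The doubled parentheses in 'KASSERT(assertion, ("panic message", args))' let
a plain C macro forward a variable-length argument list. A minimal userland
sketch of that shape follows; X_KASSERT and x_panic() are hypothetical names,
and x_panic() only records the message instead of halting, so the example
can be exercised. It is not the kernel's definition.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

static int  x_panicked;		/* set once x_panic() has been called */
static char x_panicmsg[128];	/* last formatted panic message */

/* Hypothetical stand-in for panic(): format and record, do not halt. */
static void
x_panic(const char *fmt, ...)
{
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(x_panicmsg, sizeof(x_panicmsg), fmt, ap);
	va_end(ap);
	x_panicked = 1;
}

/* msg is a parenthesised argument list, pasted directly after x_panic. */
#define X_KASSERT(exp, msg) do {	\
	if (!(exp))			\
		x_panic msg;		\
} while (0)

/* Example use: assert that a descriptor number is non-negative. */
static int
check_fd(int fd)
{
	X_KASSERT(fd >= 0, ("check_fd: negative fd %d", fd));
	return (fd);
}
```

Calling check_fd(3) leaves x_panicked at zero; check_fd(-1) records the
formatted message in x_panicmsg.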
d869e3568091433f9d3beb11c9f29c4fdd08f39a 11-Nov-1998 truckman <truckman@FreeBSD.org> I got another batch of suggestions for cosmetic changes from bde.
de184682fa22833c7b18a96a136bc031ae786434 11-Nov-1998 truckman <truckman@FreeBSD.org> Installed the second patch attached to kern/7899 with some changes suggested
by bde, a few other tweaks to get the patch to apply cleanly again and
some improvements to the comments.

This change closes some fairly minor security holes associated with
F_SETOWN, fixes a few bugs, and removes some limitations that F_SETOWN
had on tty devices. For more details, see the description on the PR.

Because this patch increases the size of the proc and pgrp structures,
it is necessary to re-install the includes and recompile libkvm,
the vinum lkm, fstat, gcore, gdb, ipfilter, ps, top, and w.

PR: kern/7899
Reviewed by: bde, eivind
57b661836aae1698e2f0388e6462e8f1521b801d 29-Jul-1998 bde <bde@FreeBSD.org> Fixed printf format errors.
e006ba669f17c0c9aa8d728ab0024243dc5095fd 15-Jul-1998 bde <bde@FreeBSD.org> Cast longs to intptr_t before casting them to pointers.

Fixed bitrot in pseudo-declaration of `struct fcntl_args'. fcntl()
is now broken in some cases when ints are larger than longs.
9c3644d05ebf093315b1da64f2129e45980d2010 10-Jun-1998 dfr <dfr@FreeBSD.org> 64bit fixes: p->p_retval is a register_t[] not an int[].
ee396db7d3a6e8672a75bb02cff258ca0c155aca 11-May-1998 dyson <dyson@FreeBSD.org> Fix the futimes/undelete/utrace conflict with other BSD's. Note that
the only common usage of utrace (the possible problem with this
commit) is with malloc, so this should not be a real problem. Add
the various NetBSD syscalls that allow full emulation of their
development environment.
fbe6fe8df622ec2f41d4c28cf3cb1849007a8239 15-Feb-1998 dyson <dyson@FreeBSD.org> Make the rootdir handling more consistent. Now, processes always
have a root vnode associated with them, and no special checks for
the null case are needed.
Submitted by: terry@freebsd.org
4547a09753662d6525ae498b0da796738fa1bb22 06-Feb-1998 eivind <eivind@FreeBSD.org> Back out DIAGNOSTIC changes.
c552a9a1c3362d37fc1aaf3a9ba4231225b1f13a 04-Feb-1998 eivind <eivind@FreeBSD.org> Turn DIAGNOSTIC into a new-style option.
71ddd313905625e7a30459009de6972aa2fd42a2 24-Jan-1998 eivind <eivind@FreeBSD.org> Make all file-system (MFS, FFS, NFS, LFS, DEVFS) related option new-style.

This introduces an xxxFS_BOOT for each of the rootable filesystems.
(Presently not required, but encouraged to allow a smooth move of option *FS
to opt_dontuse.h later.)

LFS is temporarily disabled, and will be re-enabled tomorrow.
0506343883d62f6649f7bbaf1a436133cef6261d 11-Jan-1998 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'jb'.
7c6e96080c4fb49bf912942804477d202a53396c 10-Jan-1998 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'JB'.
01dd6091edaa3e5d6ce972956bdaff5e8575d53f 16-Dec-1997 eivind <eivind@FreeBSD.org> Make COMPAT_43 and COMPAT_SUNOS new-style options.
6781e003d564b1396bfc29a0570740018cadaddc 29-Nov-1997 dyson <dyson@FreeBSD.org> Fix and complete the AIO syscalls. There are some performance enhancements
coming up soon, but the code is functional. Docs will be forthcoming.
608663aedf97b58a40a7c2f93d69b5269d384759 23-Nov-1997 bde <bde@FreeBSD.org> Fixed a missing conversion of retval to p_retval in disabled code.

Fixed overflow of FFLAGS() in fcntl(F_SETFL, ...). This was not
a security hole, but gave wrong results for silly flags values.
E.g., it made fcntl(F_SETFL, -1) equivalent to fcntl(F_SETFL, 0).
POSIX requires ignoring the open mode bits in fcntl() (even if
they would be invalid for open()).
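
The arithmetic behind this can be sketched as follows, assuming FFLAGS()
converts open flags to file flags by adding one, and that the fix amounts to
masking off the open-mode bits before the conversion. Both points are
assumptions about the code of the time, not taken from this log. The X_ names
and the X_FCNTLFLAGS value are illustrative only.

```c
#include <assert.h>

#define X_ACCMODE	0x0003	/* open-mode bits */
#define X_FCNTLFLAGS	0x00cc	/* illustrative mask of F_SETFL-settable bits */
#define X_FFLAGS(oflags) ((oflags) + 1)	/* assumed historical conversion */

/* Unmasked: the +1 carries out of the mode bits, or wraps -1 to 0. */
static int
setfl_naive(int arg)
{
	return (X_FFLAGS(arg) & X_FCNTLFLAGS);
}

/* Masked: ignore the open-mode bits first, as POSIX requires. */
static int
setfl_masked(int arg)
{
	return (X_FFLAGS(arg & ~X_ACCMODE) & X_FCNTLFLAGS);
}
```

With these definitions setfl_naive(-1) yields 0, the same as
setfl_naive(0), while setfl_masked(-1) keeps every settable bit. For an
argument such as 0x0b (one flag bit plus both mode bits) the unmasked form
also picks up a spurious carry into the next flag bit.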
6c224bdd6e4fe8f6e79f250fa8e78e3f70ec5ed5 23-Nov-1997 bde <bde@FreeBSD.org> Fixed duplicate definitions of M_FILE (one static).
4c8218a5c7d132b8ae0bd2a5a677455d69fabaab 06-Nov-1997 phk <phk@FreeBSD.org> Move the "retval" (3rd) parameter from all syscall functions and put
it in struct proc instead.

This fixes a boatload of compiler warnings, and removes a lot of cruft
from the sources.

I have not removed the /*ARGSUSED*/, they will require some looking at.

libkvm, ps and other userland struct proc frobbing programs will need
to be recompiled.
36e7a51ea1dedf0fc860ff3106aee1db1ab3b1f5 12-Oct-1997 phk <phk@FreeBSD.org> Last major round (Unless Bruce thinks of something :-) of malloc changes.

Distribute all but the most fundamental malloc types. This time I also
remembered the trick to making things static: Put "static" in front of
them.

A couple of finer points by: bde
645e7b2ab6676a2a3a05a59a053929d3b7732b4d 11-Oct-1997 phk <phk@FreeBSD.org> Distribute and staticize a lot of the malloc M_* types.

Substantial input from: bde
13141f4b232b76b274f95166bd837fba480bf0ee 14-Sep-1997 peter <peter@FreeBSD.org> Various select -> poll changes
6fcfc89b8c1c3972921e2a3f60ac5b5891f7e05d 26-Aug-1997 bde <bde@FreeBSD.org> Removed some stale comments.

Fixed a gratuitous ANSIism.
8c5b669d734fddf7f8d8e0ab0ff8dbda2c5faafc 09-Apr-1997 bde <bde@FreeBSD.org> Removed support for OLD_PIPE. <sys/stat.h> is now missing the hack that
supported nameless pipes being indistinguishable from fifos. We're not
going back.
94b6d727947e1242356988da003ea702d41a97de 22-Feb-1997 peter <peter@FreeBSD.org> Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not
ready for it yet.
808a36ef658c1810327b5d329469bcf5dad24b28 14-Jan-1997 jkh <jkh@FreeBSD.org> Make the long-awaited change from $Id$ to $FreeBSD$

This will make a number of things easier in the future, as well as (finally!)
avoiding the Id-smashing problem which has plagued developers for so long.

Boy, I'm glad we're not using sup anymore. This update would have been
insane otherwise.
04a0ee9a2a23ea1925b6625e3268da60ccc7944b 29-Dec-1996 dyson <dyson@FreeBSD.org> This commit is the embodiment of some VFS read clustering improvements.
Firstly, now our read-ahead clustering is on a file descriptor basis and not
on a per-vnode basis. This will allow multiple processes reading the
same file to take advantage of read-ahead clustering. Secondly, there
previously was a problem with large reads still using the ramp-up
algorithm. Of course, that was bogus, and now we read the entire
"chunk" off of the disk in one operation. The read-ahead clustering
algorithm should use less CPU than the previous also (I hope :-)).

NOTE: THAT LKMS MUST BE REBUILT!!!
fd7150cc218c3746b6dfad8cbe103d6bb390f3d1 19-Dec-1996 bde <bde@FreeBSD.org> Fixed nonexistent checking of lock types for F_GETLK.

Found by: NIST-PCTS
d8370232258e75a57361cafc2243862df35d18f1 19-Dec-1996 bde <bde@FreeBSD.org> Fixed lseek() on named pipes. It always succeeded but should always fail.
Broke locking on named pipes in the same way as locking on non-vnodes
(wrong errno). This will be fixed later.

The fix involves negative logic. Named pipes are now distinguished from
other types of files with vnodes, and there is additional code to handle
vnodes and named pipes in the same way only where that makes sense (not
for lseek, locking or TIOCSCTTY).
7c345d3f407095182f2d21cb5459fc87d1627e41 28-Sep-1996 bde <bde@FreeBSD.org> Fixed bitrot in the read-only attribute:
- kern.maxfilesperproc was read-only (and thus essentially useless).

Removed unused #includes. Strength-reduced used #includes.
6c8326977df550a2a016b180753415ee446ee145 15-Aug-1996 smpatel <smpatel@FreeBSD.org> Fix fdavail() so that it correctly pays attention to the rlimit.

Fixes unp_externalize panic which occurs when a process is at its
ulimit for file descriptors and tries to receive a file descriptor from
another process.

Reviewed by: wollman
1f5c4ea267df107922aca7248971ab314a648fee 17-Jun-1996 wpaul <wpaul@FreeBSD.org> Add a couple of #ifdef DEVFS/#endif clauses to silence the following
compiler warnings which occur if you don't have 'options DEVFS' in
your kernel config file:

../../kern/kern_descrip.c: In function `fildesc_drvinit':
../../kern/kern_descrip.c:1103: warning: unused variable `fd'
../../kern/kern_descrip.c: At top level:
../../kern/kern_descrip.c:1095: warning: `devfs_token_stdin' defined but not used
../../kern/kern_descrip.c:1096: warning: `devfs_token_stdout' defined but not used
../../kern/kern_descrip.c:1097: warning: `devfs_token_stderr' defined but not used
../../kern/kern_descrip.c:1098: warning: `devfs_token_fildesc' defined but not used
57c3ebc617f6ed31240847c6fce74931a372824c 12-Jun-1996 gpalmer <gpalmer@FreeBSD.org> Clean up -Wunused warnings.

Reviewed by: bde
d88e11ffe23817617af35a2c6767dac68d2b66f8 27-Mar-1996 bde <bde@FreeBSD.org> Fixed the unit numbers of the devfs `fd' devices.

Made the devfs `fd' devices bug for bug compatible with the ones created
by MAKEDEV:
- ownership is bin.bin, not root.wheel, except for std*. The devfsext
interface doesn't seem to allow specifying the ownership of /devfs/fd,
so it's still incompatible.
- std* aren't links to fd/[0-2].
072464b1987f27964703650957bbf91f8350ace1 11-Mar-1996 peter <peter@FreeBSD.org> Import 4.4BSD-Lite2 onto the vendor branch, note that in the kernel, all
files are off the vendor branch, so this should not change anything.

A "U" marker generally means that the file was not changed in between
the 4.4Lite and Lite-2 releases, and does not need a merge. "C" generally
means that there was a change.
[note new unused (in this form) syscalls.conf, to be 'cvs rm'ed]
4ce83d9416d326466a861477103572379108d59d 11-Mar-1996 hsu <hsu@FreeBSD.org> Merge in Lite2: LIST replacement for f_filef, f_fileb, and filehead.
Did not accept change of second argument to ioctl from int to u_long.
Reviewed by: davidg & bde
5239b23b5dd3a758a93e0c2186e188e829e7ba19 23-Feb-1996 peter <peter@FreeBSD.org> kern_descrip.c: add fdshare()/fdcopy()
kern_fork.c: add the tiny bit of code for rfork operation.
kern/sysv_*: shmfork() takes one less arg, it was never used.
sys/shm.h: drop "isvfork" arg from shmfork() prototype
sys/param.h: declare rfork args.. (this is where OpenBSD put it..)
sys/filedesc.h: protos for fdshare/fdcopy.
vm/vm_mmap.c: add minherit code, add rounding to mmap() type args where
it makes sense.
vm/*: drop unused isvfork arg.

Note: this rfork() implementation copies the address space mappings,
it does not connect the mappings together. ie: once the two processes
have split, the pages may be shared, but the address space is not. If one
does a mmap() etc, it does not appear in the other. This makes it not
useful for pthreads, but it is useful in its own right for having
light-weight threads in a static shared address space.

Obtained from: Original by Ron Minnich, extended by OpenBSD
276899d730ed09c8d2362c150aa0908e69928c36 04-Feb-1996 dyson <dyson@FreeBSD.org> Improve the performance for pipe(2) again. Also include some
fixes for previous version of new pipes from Bruce Evans. This
new version:

Supports more properly the semantics of select (BDE).
Supports "OLD_PIPE" correctly (kern_descrip.c, BDE).
Eliminates incorrect EPIPE returns (bash 'pipe broken' messages.)
Much faster yet, currently tuned relatively conservatively -- but now
gives approx 50% more perf than the new pipes code did originally.
(That was about 50% more perf than the original BSD pipe code.)

Known bugs outstanding:
No support for async io (SIGIO). Will be included soon.

Next to do:
Merge support for FIFOs.

Submitted by: bde
894b801eee9d8be5ef1af76d3e74a844a61ff35d 28-Jan-1996 dyson <dyson@FreeBSD.org> Enable the new fast pipe code. The old pipes can be used with the
"OLD_PIPE" config option.
63ec2c0ae9b44c5394bae5d6ee7fea5be9659585 14-Dec-1995 phk <phk@FreeBSD.org> A Major staticize sweep. Generates a couple of warnings that I'll deal
with later.
A number of unused vars removed.
A number of unused procs removed or #ifdefed.
a69d1dfbcb7bc187703b1767f289aba2bd4edbb6 08-Dec-1995 phk <phk@FreeBSD.org> Julian forgot to make the *devsw structures static.
1900eea896e2aaeae8a9fa8affa5fded2068c9b4 08-Dec-1995 julian <julian@FreeBSD.org> Pass 3 of the great devsw changes
most devsw referenced functions are now static, as they are
in the same file as their devsw structure. I've also added DEVFS
support for nearly every device in the system, however
many of the devices have 'incorrect' names under DEVFS
because I couldn't quickly work out the correct naming conventions.
(but devfs won't be coming on line for a month or so anyhow so that doesn't
matter)

If you "OWN" a device which would normally have an entry in /dev
then search for the devfs_add_devsw() entries and munge to make them right..
check out similar devices to see what I might have done in them if you
can't see what's going on..
for a laugh compare conf.c conf.h before and after... :)
I have not done DEVFS entries for any DISKSLICE devices yet as that will be
a much more complicated job.. (pass 5 :)

pass 4 will be to make the devsw tables of type (cdevsw * )
rather than (cdevsw)
seems to work here..
complaints to the usual places.. :)
c30f46c534617c688a4773ed830c44daa04853ee 07-Dec-1995 dg <dg@FreeBSD.org> Untangled the vm.h include file spaghetti.
699ff01f6608c66c7dd030106cfd033cd1661935 05-Dec-1995 bde <bde@FreeBSD.org> Include <vm/vm.h> or <vm/vm_page.h> explicitly to avoid breaking when
vnode_if.h doesn't include vm stuff.
6b7609f9092e6573753d281b9aba2b72653ea42c 04-Dec-1995 phk <phk@FreeBSD.org> A major sweep over the sysctl stuff.

Move a lot of variables home to their own code (In good time before xmas :-)

Introduce the string description of format.

Add a couple more functions to poke into these marvels, while I try to
decide what the correct interface should look like.

Next is adding vars on the fly, and sysctl looking at them too.

Removed a tiny bit of defunct and #ifdefed notused code in swapgeneric.
3089ef081608c63c7ab47c632cfbc98efdc0e815 02-Dec-1995 bde <bde@FreeBSD.org> Completed function declarations and/or added prototypes.
198d88e0ae0e12a2ddafb80a60372116b0b1c0c6 29-Nov-1995 julian <julian@FreeBSD.org> If you're going to mechanically replicate something in 50 files
it's best to not have a (compiles cleanly) typo in it! (sigh)
f2f63c6ece7d25485976323df6d684743fe14bb6 29-Nov-1995 julian <julian@FreeBSD.org> OK, that's it..
That's EVERY SINGLE driver that has an entry in conf.c..
my next trick will be to define cdevsw[] and bdevsw[]
as empty arrays and remove all those DAMNED defines as well..

Each of these drivers has a SYSINIT linker set entry
that comes in very early.. and asks the driver to add its own
entry to the two devsw[] tables.

some slight reworking of the commits from yesterday (added the SYSINIT
stuff and some usually wrong but token DEVFS entries to all these
devices).

BTW does anyone know where the 'ata' entries in conf.c actually reside?
seems we don't actually have an ataopen() etc...

If you want to add a new device in conf.c
please make sure I know
so I can keep it up to date too..

as before, this is all dependent on #if defined(JREMOD)
(and #ifdef DEVFS in parts)
926091b331d408f07de3ee207b119e1ee4cf360f 14-Nov-1995 phk <phk@FreeBSD.org> Add new-style sysctl for KERN_FILE here.
aa9a60640e2c942769c3a8f506c8cb6317bb1eaf 12-Nov-1995 bde <bde@FreeBSD.org> Included <sys/sysproto.h> to get central declarations for syscall args
structs and prototypes for syscalls.

Ifdefed duplicated decentralized declarations of args structs. It's
convenient to have this visible but they are hard to maintain. Some
are already different from the central declarations. 4.4lite2 puts
them in comments in the function headers but I wanted to avoid the
large changes for that.
86f1bc4514fdcfd255f37f3218fe234bdc3664fc 05-Nov-1995 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'LINUX'.
50bd49a6eddb182f85a73a737d18da26fc5acdc1 21-Oct-1995 dg <dg@FreeBSD.org> Killed a few gratuitous #include's.
f89a2d225944046c5c8362675c54642d17b2ab38 08-Oct-1995 swallace <swallace@FreeBSD.org> Remove prototype definitions from <sys/systm.h>.
Prototypes are located in <sys/sysproto.h>.

Add appropriate #include <sys/sysproto.h> to files that needed
protos from systm.h.

Add structure definitions to appropriate files that relied on sys/systm.h,
right before system call definition, as in the rest of the kernel source.

In kern_prot.c, instead of using the dummy structure "args", create
individual dummy structures named <syscall>_args. This makes
life easier for prototype generation.
c86f0c7a71e7ade3e38b325c186a9cf374e0411e 30-May-1995 rgrimes <rgrimes@FreeBSD.org> Remove trailing whitespace.
4f64fe43e7186660d972f9eae509424f1addaa8a 28-Mar-1995 bde <bde@FreeBSD.org> Add and move declarations to fix all of the warnings from `gcc -Wimplicit'
(except in netccitt, netiso and netns) that I didn't notice when I fixed
"all" such warnings before.
2e14d9ebc3d3592c67bdf625af9ebe0dfc386653 14-Mar-1995 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'MATT_THOMAS'.
a5eaebecd08159cf578453b271353bd78121b55e 20-Feb-1995 guido <guido@FreeBSD.org> Implement maxprocperuid and maxfilesperproc. They are tunable
via sysctl(8). The initial value of maxprocperuid is maxproc-1,
that of maxfilesperproc is maxfiles (until maxfile disappears).

Now it is at least possible to prohibit one user opening maxfiles

-Guido

5e1c91854979f54637a24def6160e2296d5ff499 12-Dec-1994 bde <bde@FreeBSD.org> Obtained from: my fix for 1.1.5

Remove compatibility hack so that dup(fd) isn't interpreted as
dup2(fd & 0x3f, random_junk_on_stack_fd) when (fd & 0x3f) != 0.
c3e49455410fee43dec92514e04dfed13eb8c587 02-Oct-1994 phk <phk@FreeBSD.org> All of this is cosmetic. prototypes, #includes, printfs and so on. Makes
GCC a lot more silent.
f73f35898343587c73fd60422f7c2b15d42bae85 25-Sep-1994 phk <phk@FreeBSD.org> While in the real world, I had a bad case of being swapped out for a lot of
cycles. While waiting there I added a lot of the extra ()'s I have, (I have
never used LISP to any extent). So I compiled the kernel with -Wall and
shut up a lot of "suggest you add ()'s", removed a bunch of unused var's
and added a couple of declarations here and there. Having a lap-top is
highly recommended. My kernel still runs, yell at me if your kernel breaks.
34cd81d75f398ee455e61969b118639dacbfd7a6 23-Sep-1994 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'MACKERRAS'.
28c9f84238cb7b612ca7daf91875168e7453dbd6 02-Sep-1994 dg <dg@FreeBSD.org> munmapfd() was being called with one too few params - bug introduced
during my initial kernel port.
e16baf7a5fe7ac1453381d0017ed1dcdeefbc995 07-Aug-1994 cvs2svn <cvs2svn@FreeBSD.org> This commit was manufactured by cvs2svn to create branch 'SUNRPC'.
8d205697aac53476badf354623abd4e1c7bc5aff 02-Aug-1994 dg <dg@FreeBSD.org> Added $Id$
2469c867a164210ce96143517059f21db7f1fd17 25-May-1994 rgrimes <rgrimes@FreeBSD.org> The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch.

Reviewed by: Rodney W. Grimes
Submitted by: John Dyson and David Greenman
8fb65ce818b3e3c6f165b583b910af24000768a5 24-May-1994 rgrimes <rgrimes@FreeBSD.org> BSD 4.4 Lite Kernel Sources