History log of /freebsd-head/sys/kern/vfs_subr.c
Revision Date Author Comments
dfad2c95f9e9a0a22003e07c314474ead321b824 06-Jul-2020 mjg <mjg@FreeBSD.org> vfs: expand on vhold_smr comment
4074673a8170783adc612fe7b120b52856220924 01-Jul-2020 mjg <mjg@FreeBSD.org> vfs: protect vnodes with smr

vget_prep_smr and vhold_smr can be used to ref a vnode while within vfs_smr
section, allowing consumers to get away without locking.

See vhold_smr and vdropl for comments explaining caveats.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23913
30596480c966b8d8bc7436df3467ab2eac145751 21-May-2020 freqlabs <freqlabs@FreeBSD.org> Deduplicate fsid comparisons

Comparing fsid_t objects requires internal knowledge of the fsid structure
and yet this is duplicated across a number of places in the code.

Simplify by creating a fsidcmp function (macro).

Reviewed by: mjg, rmacklem
Approved by: mav (mentor)
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D24749
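
As an illustration of the idea in the fsid change above, here is a minimal sketch of a comparison macro that hides the structure layout from callers. The names demo_fsid_t, demo_fsidcmp and demo_same_fs are hypothetical stand-ins, and the two-word layout is an assumption made for the example, not taken from the commit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for fsid_t: two 32-bit words, no padding. */
typedef struct {
    int32_t val[2];
} demo_fsid_t;

/* Evaluates to 0 when both identifiers are equal, like memcmp(). */
#define demo_fsidcmp(a, b) memcmp((a), (b), sizeof(demo_fsid_t))

/* Example caller: no knowledge of the struct layout is needed here. */
static int
demo_same_fs(const demo_fsid_t *a, const demo_fsid_t *b)
{
    return (demo_fsidcmp(a, b) == 0);
}
```

Callers then compare through the macro only, so a later change to the structure touches a single definition.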
3c58c3daec32d24e6fee11a38c5e9d9fae312bf5 06-Mar-2020 chs <chs@FreeBSD.org> Add a new "mntfs" pseudo file system which provides private device vnodes for
file systems to safely access their disk devices, and adapt FFS to use it.
Also add a new BO_NOBUFS flag to allow enforcing that file systems using
mntfs vnodes do not accidentally use the original devfs vnode to create buffers.

Reviewed by: kib, mckusick
Approved by: imp (mentor)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D23787
1bd9aa5542b3f46c21bde7aab43e8bac42e453d1 23-Feb-2020 rlibby <rlibby@FreeBSD.org> vfs: quiet -Wwrite-strings

Reviewed by: kib, markj
Differential Revision: https://reviews.freebsd.org/D23797
2c8ef8f230769fc4f455d482afc8fad672b0f5c5 19-Feb-2020 jeff <jeff@FreeBSD.org> Eliminate some unnecessary uses of UMA_ZONE_VM. Only zones involved in
virtual address or physical page allocation need to be marked with this

Reviewed by: markj
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23712
7f6f346db07eeb4f9b00674678f1b085acfdbfe0 16-Feb-2020 mjg <mjg@FreeBSD.org> vfs: fix vlrureclaim ->v_object access

The routine was checking for ->v_type == VBAD. Since vgone drops the interlock
early and sets this type at the end of the process of dooming a vnode, this opens
a time window where it can clear the pointer while the interlock holder is
accessing it.

Another note is that the code was:
(vp->v_object != NULL &&
vp->v_object->resident_page_count > trigger)

With the compiler being fully allowed to emit another read to get the pointer,
and in fact it did on the kernel used by pho.

Use atomic_load_ptr and remember the result.

Note that this depends on type-safety of vm_object.

Reported by: pho
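
The "load once and remember the result" pattern from the vlrureclaim fix can be sketched in portable C11 as follows. This is not the kernel code: demo_object, demo_vnode and demo_over_trigger are hypothetical names, and C11 atomic_load_explicit stands in for the kernel's atomic_load_ptr.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for vm_object and vnode. */
struct demo_object {
    int resident_page_count;
};

struct demo_vnode {
    _Atomic(struct demo_object *) v_object;
};

/*
 * Load the pointer exactly once and reuse the local copy, so a concurrent
 * store of NULL between the check and the dereference cannot be observed.
 * This is only safe if the pointed-to memory stays valid, which is the
 * type-safety assumption the commit mentions.
 */
static int
demo_over_trigger(struct demo_vnode *vp, int trigger)
{
    struct demo_object *obj;

    obj = atomic_load_explicit(&vp->v_object, memory_order_relaxed);
    return (obj != NULL && obj->resident_page_count > trigger);
}
```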
9f682dd86f5924e34b389a7d0eada69ee729700c 16-Feb-2020 mjg <mjg@FreeBSD.org> vfs: check early for VCHR in vput_final to short-circuit in the common case

Otherwise the compiler inlines v_decr_devcount which keeps getting jumped over
in the common case of not dealing with a device.
d94b8a55113908a952592ec9becd433a1876ff69 14-Feb-2020 mjg <mjg@FreeBSD.org> vfs: remove no longer needed atomic_load_ptr casts
8534caa69e2eccf6df52867966afcfb82e5219b6 12-Feb-2020 mjg <mjg@FreeBSD.org> vfs: refactor vputx and add more comments

Reviewed by: jeff (previous version)
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D23530
693023e586115e7abe124aac81dc8682533f61c4 12-Feb-2020 mjg <mjg@FreeBSD.org> vfs: switch to smp_rendezvous_cpus_retry for vfs_op_thread_enter/exit

In particular on amd64 this eliminates an atomic op in the common case,
trading it for IPIs in the uncommon case of catching CPUs executing the
code while the filesystem is getting suspended or unmounted.
cbc55c85d01e5ae3f31b1087ddd71119831ec42d 11-Feb-2020 mjg <mjg@FreeBSD.org> vfs: fix vhold race in mnt_vnode_next_lazy_relock

vdrop can set the hold count to 0 and wait for the ->mnt_listmtx held by
mnt_vnode_next_lazy_relock caller. The routine incorrectly asserted the
count has to be > 0.

Reported by: pho
Tested by: pho
11a0ed935594e97b91e36f95fea6f26db0099c6c 10-Feb-2020 mjg <mjg@FreeBSD.org> vfs: fix device count leak on vrele racing with vgone

The race is:

make v_usecount 0
sees v_usecount == 0, no updates
vp->v_rdev = NULL;
sees v_rdev == NULL, no updates

In this scenario si_devcount decrement is not performed.

Note this can only happen if the vnode lock is not held.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23529
759146989097842d1a3c003ad4a018e61e49053c 10-Feb-2020 mjg <mjg@FreeBSD.org> vfs: fix lock recursion in vrele

vrele is supposed to be called with an unlocked vnode, but this was never
asserted for if v_usecount was > 0. For such counts the lock is never touched
by the routine. As a result the kernel has several consumers which expect
vunref semantics and get away with calling vrele since they happen to never do
it when this is the last reference (and for some of them this may happen to be
a guarantee).

Work around the problem by changing vrele semantics to tolerate being called
with a lock. This eliminates a possible bug where the lock is already held and
vputx takes it anyway.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23528
5dfa167a47bd60a7492210e677b428537337c841 08-Feb-2020 mjg <mjg@FreeBSD.org> vfs: tidy up vget_finish and vn_lock

- remove assertion which duplicates vn_lock
- use VNPASS instead of retyping the failure
- report what flags were passed if panicking on them
11e74d9fa26b9a1948b06d8cce76932b0ac05212 02-Feb-2020 kevans <kevans@FreeBSD.org> Provide O_SEARCH

O_SEARCH is defined by POSIX [0] to open a directory for searching, skipping
permissions checks on the directory itself after the initial open(). This is
close to the semantics we've historically applied for O_EXEC on a directory,
which is UB according to POSIX. Conveniently, O_SEARCH on a file is also
explicitly undefined behavior according to POSIX, so O_EXEC would be a fine
choice. The spec goes on to state that O_SEARCH and O_EXEC need not be
distinct values, but they're not defined to be the same value.

This was pointed out as an incompatibility with other systems that had made
its way into libarchive, which had assumed that O_EXEC was an alias for

This defines compatibility O_SEARCH/FSEARCH (equivalent to O_EXEC and FEXEC
respectively) and expands our UB for O_EXEC on a directory. O_EXEC on a
directory is checked in vn_open_vnode already, so for completeness we add a
NOEXECCHECK when O_SEARCH has been specified on the top-level fd and do not
re-check that when descending in namei.

[0] https://pubs.opengroup.org/onlinepubs/9699919799/

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23247
94ee14c445db089b3d4153f24fc6de73ab532b48 02-Feb-2020 mjg <mjg@FreeBSD.org> vfs: remove the now empty vop_unlock_post
d382db405ebf5af540e47b51243ac29092e57816 01-Feb-2020 mjg <mjg@FreeBSD.org> vfs: replace VOP_MARKATIME with VOP_MMAPPED

The routine is only provided by ufs and is only used on mmap and exec.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23422
2e8e2366c04a6a8886741d9db61ff63440c0700c 01-Feb-2020 mjg <mjg@FreeBSD.org> vfs: add vrefactn

Differential Revision: https://reviews.freebsd.org/D23427
9250db86b8e29aaf2f085ea6b7075f5ca7eb4c19 31-Jan-2020 mjg <mjg@FreeBSD.org> vfs: revert the overzealous assert added in r357285 to vgone

The intent was to make it more likely to catch filesystems with custom
need_inactive routines which fail to call vn_need_pageq_flush (or do an

One immediate case which is missed is vgone being called by inactive itself.

A better assertion may land later. The routine is not added to vputx because
it is of no use to tmpfs et al.

Reported by: syzbot+5f697ec11f89b60941db@syzkaller.appspotmail.com
3f1b6975a2baf85a71ec51bdaffe8126168031a8 30-Jan-2020 mjg <mjg@FreeBSD.org> Remove duplicated empty lines from kern/*.c

No functional changes.
215f047a0d8ac5450cf3305412e45573da343893 30-Jan-2020 mjg <mjg@FreeBSD.org> vfs: assert that doomed vnodes don't need to call vm_object_page_clean

... after the optional inactive processing.
eaa936ec5160a41e828f79b7bbdd14b38686e307 30-Jan-2020 mjg <mjg@FreeBSD.org> vfs: unlazy before dooming the vnode

With this change having the listmtx lock held postpones dooming the vnode.
Use this fact to simplify iteration over the lazy list. It also allows
filters to safely access ->v_data.

Reviewed by: kib (early version)
Differential Revision: https://reviews.freebsd.org/D23397
eb9cc37db32d321baf0beb868786b2ba8a2ce57f 30-Jan-2020 glebius <glebius@FreeBSD.org> Fix text format definition for kern.maxvnodes, vfs.wantfreevnodes. This
is a regression from r356642, r356645.
16c3f2767afcb24c0ceedac70502e719adb14c6f 26-Jan-2020 mjg <mjg@FreeBSD.org> vfs: do an unlocked check before iterating the lazy list

For most filesystems it is expected to be empty most of the time.
97a4244ffe0eb3e884aabaa4db8d47f2669a2459 26-Jan-2020 mjg <mjg@FreeBSD.org> vfs: fix freevnodes count update race against preemption

vdbatch_process leaves the critical section too early, opening a time
window where another thread can get scheduled and modify vd->freevnodes.
Once the preempted thread gets back it overrides the value with 0.

Just move critical_exit to the end of the function.
f600a862f4bd03d16e5ab5e34d8cf435731d4db0 26-Jan-2020 mjg <mjg@FreeBSD.org> vfs: predict vn_lock failure as unlikely in vget
ec983c9c6f45c495571d2bd160ea07d7921b94f5 24-Jan-2020 mjg <mjg@FreeBSD.org> vfs: allow v_usecount to transition 0->1 without the interlock

There is nothing to do but to bump the count even during said transition.
There are 2 places which can do it:
- vget only does this after locking the vnode, meaning there is no change in
contract versus inactive or reclamation
- vref only ever did it with the interlock held which did not protect against
either (that is, it would always succeed)

VCHR vnodes retain special casing due to the need to maintain dev use count.

Reviewed by: jeff, kib
Tested by: pho (previous version)
Differential Revision: https://reviews.freebsd.org/D23185
1c80afbb99071bee67a1642de27bb2bd6b3840af 24-Jan-2020 mjg <mjg@FreeBSD.org> vfs: stop handling VI_OWEINACT in vget

vget is almost always called with LK_SHARED, meaning the flag (if present) is
almost guaranteed to get cleared. Stop handling it in the first place and
instead let the thread which wanted to do inactive handle the bumped usecount.

Reviewed by: jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23184
837eba9f6ff430d90f63deff91c7f88df216bcb8 24-Jan-2020 mjg <mjg@FreeBSD.org> vfs: stop unlocking the vnode upfront in vput

Doing so runs into races with filesystems which make half-constructed vnodes
visible to other users, while depending on the chain vput -> vinactive ->
vrecycle to be executed without dropping the vnode lock.

Impediments for making this work got cleared up (notably vop_unlock_post now
does not do anything and lockmgr stops touching the lock after the final
write). Stacked filesystems keep vhold/vdrop across unlock, which arguably can
now be eliminated.

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23344
44fb75786324f762ac29483567a094299145d87e 19-Jan-2020 mjg <mjg@FreeBSD.org> vfs: allow v_holdcnt to transition 0->1 without the interlock

Since r356672 ("vfs: rework vnode list management") there is nothing to do
apart from altering freevnodes count, but this much can be safely done based
on the result of atomic_fetchadd.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D23186
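
A minimal sketch of the 0->1 hold transition described above, assuming C11 atomics in place of the kernel primitives. demo_vnode, demo_freevnodes and demo_vhold are hypothetical names. The point is that fetch-and-add returns the previous value, so exactly one caller sees 0 and performs the free-count adjustment.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for struct vnode and the free vnode count. */
struct demo_vnode {
    atomic_int v_holdcnt;
};

static atomic_long demo_freevnodes;

/*
 * Take a hold without any lock. Only the caller which observes the old
 * value 0 moved the vnode out of the "free" state, so only that caller
 * adjusts the free count.
 */
static void
demo_vhold(struct demo_vnode *vp)
{
    if (atomic_fetch_add(&vp->v_holdcnt, 1) == 0)
        atomic_fetch_sub(&demo_freevnodes, 1);
}
```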
3fd0fc148dae5406b6db86bbb9e955e79e52a496 19-Jan-2020 mjg <mjg@FreeBSD.org> vfs: plug a conditional assignment of lo_name in getnewvnode

It only matters for witness. No functional changes.
5acc96f1079e1d5020a717a559e60430619bec6d 18-Jan-2020 mjg <mjg@FreeBSD.org> vfs: distribute freevnodes counter per-cpu

It gets rolled up to the global when deferred requeueing is performed.
A dedicated read routine makes sure to return a value only off by a certain

This soothes a global serialisation point for all 0<->1 hold count transitions.

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23235
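
The per-CPU counter scheme above can be pictured with the following single-threaded sketch. All names (DEMO_NCPU, demo_pcpu_free and the demo_* routines) are hypothetical, and locking and preemption control are deliberately left out. It only shows the split between a cheap per-CPU update, a rollup into the global, and the difference between an approximate and a precise read.

```c
#include <assert.h>

#define DEMO_NCPU 4

static long demo_global_free;          /* rolled-up total */
static long demo_pcpu_free[DEMO_NCPU]; /* per-CPU deltas, may be negative */

/* Fast path: touch only the calling CPU's slot. */
static void
demo_free_adjust(int cpu, long delta)
{
    demo_pcpu_free[cpu] += delta;
}

/* Slow path: fold one CPU's delta into the global total. */
static void
demo_rollup(int cpu)
{
    demo_global_free += demo_pcpu_free[cpu];
    demo_pcpu_free[cpu] = 0;
}

/* Cheap read: ignores deltas which have not been rolled up yet. */
static long
demo_read_approx(void)
{
    return (demo_global_free);
}

/* Precise read: also walks every per-CPU slot. */
static long
demo_read_exact(void)
{
    long sum = demo_global_free;

    for (int i = 0; i < DEMO_NCPU; i++)
        sum += demo_pcpu_free[i];
    return (sum);
}
```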
152d94131ef5e38fa1a59406370b79ac86285452 17-Jan-2020 mjg <mjg@FreeBSD.org> vfs: shorten lock hold time in vdbatch_process
90033c6131eb78700deb74b0bce4dfd9e74e41ee 16-Jan-2020 mjg <mjg@FreeBSD.org> vfs: increment numvnodes without the vnode list lock unless under pressure

The vnode list lock is only needed to reclaim free vnodes or kick the vnlru
thread (or to block and not miss a wake up (but note the sleep has a timeout so
this would not be a correctness issue)). Try to get away without the lock by
just doing an atomic increment.

The lock is contended e.g., during poudriere -j 104 where about half of all
acquires come from vnode allocation code.

Note the entire scheme needs a rewrite, the above just reduces its SMP impact.

Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D23140
c3918a4c542d7ae1a5c737ec5f5ba3701fdb5df7 16-Jan-2020 mjg <mjg@FreeBSD.org> vfs: refactor vnode allocation

Semantics are almost identical. Some code is deduplicated and there are
fewer memory accesses.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D23158
2fbceb1cd7ea2e75f9bb11b54ab8629a42e90397 16-Jan-2020 mjg <mjg@FreeBSD.org> vfs: reimplement vlrureclaim to actually use LRU

Take advantage of global ordering introduced in r356672.

Reviewed by: mckusick (previous version)
Differential Revision: https://reviews.freebsd.org/D23067
9bce1ff1fbc521e710710f68b43134f779117d9b 13-Jan-2020 mjg <mjg@FreeBSD.org> vfs: per-cpu batched requeuing of free vnodes

Constant requeuing adds significant lock contention in certain
workloads. Lessen the problem by batching it.

Per-cpu areas are locked in order to synchronize against UMA freeing

vnode's v_mflag is converted to short to prevent the struct from

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 122.38s user 1780.45s system 6242% cpu 30.480 total
patched: 144.84s user 985.90s system 4856% cpu 23.282 total

Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22998
9b80414f5c7df36fff6f9efda508f8419bec9fe5 13-Jan-2020 mjg <mjg@FreeBSD.org> vfs: rework vnode list management

The current notion of an active vnode is eliminated.

Vnodes transition between 0<->1 hold counts all the time and the
associated traversal between different lists induces significant
scalability problems in certain workloads.

Introduce a global list containing all allocated vnodes. They get
unlinked only when UMA reclaims memory and are only requeued when
hold count reaches 0.

Sample result from an incremental make -s -j 104 bzImage on tmpfs:
stock: 118.55s user 3649.73s system 7479% cpu 50.382 total
patched: 122.38s user 1780.45s system 6242% cpu 30.480 total

Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22997
4e5b4032abe0668b1c5e64f420b8102fd60cd4c8 13-Jan-2020 mjg <mjg@FreeBSD.org> vfs: add per-mount vnode lazy list and use it for deferred inactive + msync

This obviates the need to scan the entire active list looking for vnodes
of interest.

msync is handled by adding all vnodes with write count to the lazy list.

deferred inactive directly adds vnodes as it sets the VI_DEFINACT flag.

Vnodes get dequeued from the list when their hold count reaches 0.

Newly added MNT_VNODE_FOREACH_LAZY* macros support filtering so that
spurious locking is avoided in the common case.

Reviewed by: jeff
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D22995
165ba25434b2bb8b19301b4b2a10e65eaf2de2fa 12-Jan-2020 mjg <mjg@FreeBSD.org> Add KERNEL_PANICKED macro for use in place of direct panicstr tests
f378822de743890fc8570affdb5754f01aa68a49 11-Jan-2020 mjg <mjg@FreeBSD.org> vfs: only recalculate watermarks when limits are changing

Previously they would get recalculated all the time, in particular in:
getnewvnode -> vcheckspace -> vspace
f3f621443747ae01ee19a754a48e511e423da2ac 11-Jan-2020 mjg <mjg@FreeBSD.org> vfs: deduplicate vnode allocation logic

This creates a dedicated routine (vn_alloc) to allocate vnodes.

As a side effect code duplication with getnewvnode_reserve is eliminated.

Add vn_free for symmetry.
d9c6cac9c348b207bf8410ce7c309010592c0291 11-Jan-2020 mjg <mjg@FreeBSD.org> vfs: prealloc vnodes in getnewvnode_reserve

Having a reserved vnode count does not guarantee that getnewvnode won't
block later. Said blocking partially defeats the purpose of reserving in
the first place.

Preallocate instead. The only consumer was always passing "1" as count
and never nesting reservations.
85edb793f694ee4f4e7915a96b08de790f3cd8b1 11-Jan-2020 mjg <mjg@FreeBSD.org> vfs: incomplete pass at converting more ints to u_long

Most notably numvnodes and freevnodes were u_long, but parameters used to
govern them remained as ints.
e21b9aa538e7e3005dbc6437733e8563260c1176 11-Jan-2020 mjg <mjg@FreeBSD.org> vfs: add missing CTLFLAG_MPSAFE annotations

This covers all kern/vfs_*.c files.
7bf1b9c7de740ee8973953aa8d123adc843794cc 07-Jan-2020 mjg <mjg@FreeBSD.org> vfs: handle doomed vnodes in vdefer_inactive

vgone dooms the vnode while keeping VI_OWEINACT set and then drops the

vputx can pick up the interlock and pass it to vdefer_inactive since the
flag is set.

The race is harmless, just don't defer anything as vgone will take care of it.

Reported by: pho
1204b9c8821b43d6b09eabf97a1c61a4ecb14711 07-Jan-2020 mjg <mjg@FreeBSD.org> vfs: reimplement deferred inactive to use a dedicated flag (VI_DEFINACT)

The previous behavior of leaving VI_OWEINACT vnodes on the active list without
a hold count is eliminated. Hold count is kept and inactive processing gets
explicitly deferred by setting the VI_DEFINACT flag. The syncer is then
responsible for vdrop.

Reviewed by: kib (previous version)
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23036
55696801396bb03c7df814a44d45691854971f41 07-Jan-2020 mjg <mjg@FreeBSD.org> vfs: trylock in vfs_msync and refactor the func

- use LK_NOWAIT instead of calling VOP_ISLOCKED before deciding to lock
- evaluate flags before looping over vnodes

Reviewed by: kib
Tested by: pho (in a larger patch, previous version)
Differential Revision: https://reviews.freebsd.org/D23035
037b78139b289a50e462178fb5d37db2660b76d6 07-Jan-2020 mjg <mjg@FreeBSD.org> vfs: use a dedicated counter for free vnode recycling

Otherwise vlrureclaim activity is mixed in and it is hard to tell which
vnodes got reclaimed.
b1939ef45067c47639bf1e7bedda1646776cd721 07-Jan-2020 mjg <mjg@FreeBSD.org> vfs: prevent numvnodes and freevnodes re-reads when appropriate

Otherwise in code like this:
if (numvnodes > desiredvnodes)
vnlru_free_locked(numvnodes - desiredvnodes, NULL);

numvnodes can drop below desiredvnodes prior to the call and if the
compiler generated another read the subtraction would get a negative
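
One way to avoid the re-read described in the numvnodes commit is to take a single snapshot of the counter and use it for both the comparison and the subtraction. The sketch below assumes C11 atomics. demo_numvnodes, demo_desiredvnodes and demo_excess are hypothetical names, not the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_ulong demo_numvnodes;
static unsigned long demo_desiredvnodes = 100;

/*
 * Returns how many vnodes to free. The counter is read once into a local;
 * both the comparison and the subtraction use that same snapshot, so a
 * concurrent decrement cannot make the difference wrap around.
 */
static unsigned long
demo_excess(void)
{
    unsigned long n;

    n = atomic_load_explicit(&demo_numvnodes, memory_order_relaxed);
    if (n > demo_desiredvnodes)
        return (n - demo_desiredvnodes);
    return (0);
}
```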
6facf4df8b251ad201d0386c7dd6e6cb12b9726c 07-Jan-2020 mjg <mjg@FreeBSD.org> vfs: annotate numvnodes and vnode_free_list_mtx with __exclusive_cache_line
d5666b82d255a5824be01a329ffe329d639a8ba6 07-Jan-2020 mjg <mjg@FreeBSD.org> vfs: eliminate v_tag from struct vnode

There was only one consumer and it was using it incorrectly.

It is given an equivalent hack.

Reviewed by: jeff
Differential Revision: https://reviews.freebsd.org/D23037
bb33b4e736d552f8ba6f5e990d2aa93f7609535d 07-Jan-2020 mjg <mjg@FreeBSD.org> vfs: add a helper for allocating marker vnodes
22dbb8bf5eb601e5a364d589afe1c26915279006 05-Jan-2020 mjg <mjg@FreeBSD.org> vfs: drop thread argument from vinactive
12c96f78301506fd1bcc9643bbff08b09f53c591 05-Jan-2020 mjg <mjg@FreeBSD.org> vfs: patch up vnode count assertions to report found value
f121d45000fd1c42611ca1e54872bd4545398933 03-Jan-2020 mjg <mjg@FreeBSD.org> vfs: drop the mostly unused flags argument from VOP_UNLOCK

Filesystems which want to use it in limited capacity can employ the

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D21427
a437db75e81004c7bb15004791444b241f803610 01-Jan-2020 mjg <mjg@FreeBSD.org> vfs: drop an always-false check from vlrureclaim

The vnode gets held a few lines prior, making the VI_FREE condition
af6891923640e394110f44cc502b71f3881036fe 27-Dec-2019 mjg <mjg@FreeBSD.org> vfs: remove production kernel checks and mp == NULL support from vdrop

1. The only place in the tree which calls getnewvnode with mp == NULL does it
for vp_crossmp which will never execute this codepath. Any vnode which legally
has ->v_mount == NULL is also doomed, which once more won't execute this code.
2. Remove an assertion for v_holdcnt from production kernels. It gets taken care
of by refcount macros in debug kernels.

Any code which would want to pass NULL mp can construct a fake one instead.

Reviewed by: kib (previous version)
Differential Revision: https://reviews.freebsd.org/D22722
048a894ebc2b0a0bac4b0c8f0e42b54a06fc049b 16-Dec-2019 mjg <mjg@FreeBSD.org> vfs: flatten vop vectors

This eliminates the following loop from all VOP calls:

while(vop != NULL && \
vop->vop_spare2 == NULL && vop->vop_bypass == NULL)
vop = vop->vop_default;

Reviewed by: jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22738
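
To illustrate what flattening buys, here is a sketch which resolves the vop_default chain once, at registration time, so that each call can use its slot directly. The two-slot demo_vops structure and all demo_* names are hypothetical. The real vector has many more entries and the real flattening code may differ.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*demo_op_t)(void);

/* Hypothetical two-slot operations vector with an inheritance link. */
struct demo_vops {
    struct demo_vops *vop_default;
    demo_op_t vop_open;
    demo_op_t vop_close;
};

/*
 * Resolve inheritance once: every slot left NULL is filled from the
 * nearest ancestor providing it. Calls can then use the slot directly
 * instead of walking vop_default on each invocation.
 */
static void
demo_vops_flatten(struct demo_vops *v)
{
    for (struct demo_vops *d = v->vop_default; d != NULL;
        d = d->vop_default) {
        if (v->vop_open == NULL)
            v->vop_open = d->vop_open;
        if (v->vop_close == NULL)
            v->vop_close = d->vop_close;
    }
}

static int demo_default_open(void) { return (1); }
static int demo_default_close(void) { return (2); }
static int demo_fs_open(void) { return (10); }
```

After flattening, a filesystem vector which only overrides open still has a usable close slot inherited from its default.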
c023a15140276342d1b9178a9d6511514265b126 10-Dec-2019 mjg <mjg@FreeBSD.org> vfs: refactor vhold and vdrop

No functional changes.
bcfa67ab8b8212a8a15763e8a6855a272b8fde22 08-Dec-2019 mjg <mjg@FreeBSD.org> vfs: introduce v_irflag and make v_type smaller

The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by: kib, jeff
Differential Revision: https://reviews.freebsd.org/D22715
4b9989aca8e353089ff9fa8125817de42094997a 08-Dec-2019 mjg <mjg@FreeBSD.org> vfs: clean up vputx a little

1. replace hand-rolled macros for operation type with enum
2. unlock the vnode in vput itself, there is no need to branch on it. existence
of VPUTX_VPUT remains significant in that the inactive variant adds LK_NOWAIT
to locking request.
3. remove the useless v_usecount assertion. a few lines above, the code checks if
v_usecount > 0 and leaves. should the value be negative, refcount would fail.
4. the CTR return vnode %p to the freelist is incorrect as vdrop may find the
vnode with holdcnt > 1. if the like should exist, it should be moved there
5. no need to error = 0 for everyone

Reviewed by: kib, jeff (previous version)
Differential Revision: https://reviews.freebsd.org/D22718
872f296f3c5f2288aa4dbce995d55486b4eec6ab 08-Dec-2019 mjg <mjg@FreeBSD.org> vfs: factor out vnode destruction out of vdrop

Sponsored by: The FreeBSD Foundation
0a3ea4b564d90ec2281a0d45f23806e70188a89b 07-Dec-2019 mjg <mjg@FreeBSD.org> vfs: clean up delmntque similarly to vdrop r355414
818ef82e15705798cfb333be5e7364e80d294c9e 07-Dec-2019 mjg <mjg@FreeBSD.org> vfs: catch vn_printf up with reality

- add the missing VV_VMSIZEVNLOCK and VV_READLINK flags
- add decoding v_mflag

While here sort flags.
3e04f4b855030bc49ba7df282764556fdd11cba3 05-Dec-2019 mjg <mjg@FreeBSD.org> vfs: remove 'active' variable from _vdrop

No functional changes.
41890de3345edc39bb87d0dbe1b0e8551f9bea50 20-Nov-2019 mjg <mjg@FreeBSD.org> vfs: perform a more racy check in vfs_notify_upper

Locking mp does not buy anything in terms of correctness and only contributes to
b1e239e6e210d67a351248db47d67634bd658a99 20-Nov-2019 mjg <mjg@FreeBSD.org> vfs: change si_usecount management to count used vnodes

Currently si_usecount is effectively a sum of usecounts from all associated
vnodes. This is maintained by special-casing for VCHR every time usecount is
modified. Apart from complicating the code a little bit, it has a scalability
impact since it forces a read from a cacheline shared with said count.

There are no consumers of the feature in the ports tree. In head there are only
2: revoke and devfs_close. Both can get away with a weaker requirement than the
exact usecount, namely just the count of active vnodes. Changing the meaning to
the latter means we only need to modify it on 0<->1 transitions, avoiding the
check plenty of times (and entirely in something like vrefact).

Reviewed by: kib, jeff
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22202
bff69757f02a3d7ded266c0c97a24c8627f925a3 29-Oct-2019 jeff <jeff@FreeBSD.org> Replace OBJ_MIGHTBEDIRTY with a system using atomics. Remove the TMPFS_DIRTY
flag and use the same system.

This enables further fault locking improvements by allowing more faults to
proceed with a shared lock.

Reviewed by: kib
Tested by: pho
Differential Revision: https://reviews.freebsd.org/D22116
a6dbd937982cd33461ac0e0867109d5e8aa52030 23-Oct-2019 kib <kib@FreeBSD.org> Fix undefined behavior.

Create a sequence point by ending a full expression for call to
vspace() and use of the globals which are modified by vspace().

Reported and reviewed by: imp
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D22126
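
The shape of this kind of fix can be sketched as follows: a function with side effects on a global is called in its own full expression, and the global is read only afterwards. demo_space, demo_low_water and demo_below_low_water are hypothetical names, and the arithmetic is invented for the example.

```c
#include <assert.h>

static int demo_low_water; /* global modified by demo_space() */

/* Recomputes the global watermark and returns the current free space. */
static int
demo_space(int free_now)
{
    demo_low_water = free_now / 2 + 1;
    return (free_now);
}

/*
 * The call is its own full expression, so its side effect on
 * demo_low_water is complete before the global is read on the next line.
 * Writing "demo_space(free_now) < demo_low_water" would leave the order of
 * the call and the read up to the compiler.
 */
static int
demo_below_low_water(int free_now)
{
    int space;

    space = demo_space(free_now);
    return (space < demo_low_water);
}
```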
2393dd146cdb837d522c66946f96bbb693b963a2 23-Oct-2019 kib <kib@FreeBSD.org> vn_printf(): Decode VI_TEXT_REF.

Sponsored by: The FreeBSD Foundation
MFC after: 3 days
7cb37ce3115b6b2183332f7ea2a9109db20c2045 13-Oct-2019 mjg <mjg@FreeBSD.org> vfs: add MNTK_NOMSYNC

On many filesystems the traversal is effectively a no-op. Add a way to avoid
the overhead.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22009
c576b0223d6d1046572683f9c3fabfb636274e46 13-Oct-2019 mjg <mjg@FreeBSD.org> vfs: return free vnode batches in sync instead of vfs_msync

It is a more natural fit. vfs_msync only deals with active vnodes.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D22008
8ed5fe7c0df3c6d11d4a166c2e983453a5b27963 06-Oct-2019 mjg <mjg@FreeBSD.org> vfs: add optional root vnode caching

Root vnodes are looked up all the time, e.g. when crossing a mount point.
Currently used routines always perform a costly lookup which can be
trivially avoided.

Reviewed by: jeff (previous version), kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21646
580abaccd79dd59f69d4d9085ef5ced8be79069f 04-Oct-2019 vangyzen <vangyzen@FreeBSD.org> Add CTLFLAG_STATS to some vfs sysctl OIDs

Add CTLFLAG_STATS to the following OIDs:


Refer to r353111.

MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
3913895c3d89bdae0eb1c446b01d21dd2e8e73a2 02-Oct-2019 emaste <emaste@FreeBSD.org> simplify path handling in sysctl_try_reclaim_vnode

MAXPATHLEN / PATH_MAX includes space for the terminating NUL, and namei
verifies the presence of the NUL. Thus there is no need to increase the
buffer size here.

The sysctl passes the string excluding the NUL, so req->newlen equal to
PATH_MAX is too long.

Reviewed by: kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21876
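
The length rule from the sysctl_try_reclaim_vnode change can be sketched as a bounds check on a string which arrives without its terminating NUL. DEMO_PATH_MAX and demo_copy_path are hypothetical; the value 1024 is only an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PATH_MAX 1024 /* hypothetical; counts the terminating NUL */

/*
 * Copy a caller-supplied path which arrives without its terminating NUL.
 * newlen == DEMO_PATH_MAX would leave no room for the NUL, so it is
 * rejected. Returns 0 on success, -1 if the path is too long.
 */
static int
demo_copy_path(char buf[DEMO_PATH_MAX], const char *src, size_t newlen)
{
    if (newlen >= DEMO_PATH_MAX)
        return (-1);
    memcpy(buf, src, newlen);
    buf[newlen] = '\0';
    return (0);
}
```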
f0e5ce5f1009017c6d16ba8a9a38bf20d7ef151d 23-Sep-2019 sef <sef@FreeBSD.org> Add two options to allow mount to avoid covering up existing mount points.
The two options are

* nocover/cover: Prevent/allow mounting over an existing root mountpoint.
E.g., "mount -t ufs -o nocover /dev/sd1a /usr/local" will fail if /usr/local
is already a mountpoint.
* emptydir/noemptydir: Prevent/allow mounting on a non-empty directory.
E.g., "mount -t ufs -o emptydir /dev/sd1a /usr" will fail.

Neither of these options is intended to be a default, for historical and
compatibility reasons.

Reviewed by: allanjude, kib
Differential Revision: https://reviews.freebsd.org/D21458
6090f91124648b9dd92a8f04505d75f405a398cf 16-Sep-2019 mjg <mjg@FreeBSD.org> vfs: convert struct mount counters to per-cpu

There are 3 counters modified all the time in this structure - one for
keeping the structure alive, one for preventing unmount and one for
tracking active writers. Exact values of these counters are very rarely
needed, which makes them a prime candidate for conversion to a per-cpu
scheme, resulting in much better performance.

Sample benchmark performing fstatfs (modifying 2 out of 3 counters) on
a 104-way 2 socket Skylake system:
before: 852393 ops/s
after: 76682077 ops/s

Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21637
e19820cd96bd905655fdef26c8f1174fa7ef875c 16-Sep-2019 mjg <mjg@FreeBSD.org> vfs: manage mnt_lockref with atomics

See r352424.

Reviewed by: kib, jeff
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21574
bec2ffc72ab27f17354881026adfb47440a2d5c5 16-Sep-2019 mjg <mjg@FreeBSD.org> vfs: manage mnt_ref with atomics

New primitive is introduced to denote sections which can operate locklessly
on aspects of struct mount, but which can also be disabled if necessary.
This provides an opportunity to start scaling common case modifications
while providing stable state of the struct when facing unmount, write
suspension or other events.

mnt_ref is the first counter to start being managed in this manner with
the intent to make it per-cpu.

Reviewed by: kib, jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21425
de19b5c6a0f95def974a1e14991883e4d7604573 13-Sep-2019 mjg <mjg@FreeBSD.org> vfs: release usecount using fetchadd

1. If we release the last usecount we take ownership of the hold count, which
means the vnode will remain allocated until we vdrop it.
2. If someone else vrefs they will find no usecount and will proceed to add
their own hold count.
3. No code has a problem with v_usecount transitioning to 0 without the

These facts combined mean we can fetchadd instead of having a cmpset loop.

Reviewed by: kib (previous version)
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21528
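
For comparison, the two ways of dropping a reference mentioned above can be sketched with C11 atomics. Both routines are hypothetical illustrations, not the kernel's refcount code. Each returns true when the caller released the last reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Release using a compare-and-set loop; returns true on the 1->0 drop. */
static bool
demo_release_cmpset(atomic_uint *count)
{
    unsigned int old = atomic_load(count);

    /* On failure 'old' is refreshed with the current value; retry. */
    while (!atomic_compare_exchange_weak(count, &old, old - 1))
        continue;
    return (old == 1);
}

/* Release using one unconditional atomic subtraction. */
static bool
demo_release_fetchadd(atomic_uint *count)
{
    return (atomic_fetch_sub(count, 1) == 1);
}
```

The second form never retries, which is why it is attractive once nothing depends on inspecting the value before modifying it.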
7c2370b33ce485f15ead77d9f5497622b0763fc4 05-Sep-2019 mjg <mjg@FreeBSD.org> vfs: temporarily revert r351825

There are 2 problems:
- it introduces a funny bug where it can end up trylocking the same vnode [1]
- it exposes a pre-existing softdep deadlock [2]

Both are easier to run into than the bug which got fixed, so revert until
a complete solution is worked out.

Reported by: cy [1], pho [2]
Sponsored by: The FreeBSD Foundation
7555f0f24e3cd0be29dbfdc86274c0a9fa11ff8c 04-Sep-2019 mjg <mjg@FreeBSD.org> vfs: fully hold vnodes in vnlru_free_locked

Currently the code only bumps holdcnt and clears the VI_FREE flag, not
performing actual vhold. Since the vnode is still visible elsewhere, a
potential new user can find it and incorrectly assume it is properly held.

Use vholdl instead to correctly hold the vnode. Another place recycling
(vlrureclaim) does this already.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21522
c5b9ed7f3328b164dd0edf2ebcd0db507a67ac9b 03-Sep-2019 mjg <mjg@FreeBSD.org> vfs: implement usecount implying holdcnt

vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.

Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.
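
A rough sketch of the idea, assuming the fast path is "the use count is already nonzero, so some existing user already pins the hold count". All names here (`toy_vnode`, `acquire_if_not_zero`, `toy_ref`) are hypothetical, and the slow path is shown without the locking the kernel would need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical two-counter object; not the kernel's struct vnode. */
struct toy_vnode {
	atomic_uint holdcnt;	/* keeps the object allocated */
	atomic_uint usecount;	/* marks the object as actively used */
};

/* Increment *cnt only if it is currently nonzero. */
static bool
acquire_if_not_zero(atomic_uint *cnt)
{
	unsigned int old = atomic_load(cnt);

	while (old != 0) {
		if (atomic_compare_exchange_weak(cnt, &old, old + 1))
			return (true);
	}
	return (false);
}

static void
toy_ref(struct toy_vnode *vp)
{
	assert(vp != NULL);
	/* Fast path: an existing user already holds the object for us. */
	if (acquire_if_not_zero(&vp->usecount))
		return;
	/* Slow path (unsynchronized in this sketch): take both counts. */
	atomic_fetch_add(&vp->holdcnt, 1);
	atomic_fetch_add(&vp->usecount, 1);
}
```

In the fast path only one counter is touched, which is where the saving in atomic operations comes from.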

Reviewed by: kib
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21471
5451b35f06da65eff6282c76a7a1061fb222f9ae 01-Sep-2019 markj <markj@FreeBSD.org> Extend uma_reclaim() to permit different reclamation targets.

The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme

Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667
eb4d1499a27851dce381f709ba0f1895aa09a223 30-Aug-2019 mjg <mjg@FreeBSD.org> vfs: add a missing VNODE_REFCOUNT_FENCE_REL to v_incr_usecount_locked

Sponsored by: The FreeBSD Foundation
4f45cc9c1d677e390ebbe51ca666995af1890b59 30-Aug-2019 mjg <mjg@FreeBSD.org> vfs: tidy up assertions in vfs_subr

- assert unlocked vnode interlock in vref
- assert right counts in vputx
- print debug info for panic in vdrop

Sponsored by: The FreeBSD Foundation
54ba4b35c0f5c3fb6309e51f537872ce90e49993 29-Aug-2019 kib <kib@FreeBSD.org> Rework v_object lifecycle for vnodes.

Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it is prepared to handle the
vm object destruction for a live vnode. Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.

Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM(). This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation. Remove code from
vnode_create_vobject() to handle races with the parallel destruction.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412
e8f0ed264eff0af69a0e51984c1a49babe94b570 28-Aug-2019 mjg <mjg@FreeBSD.org> vfs: add VOP_NEED_INACTIVE

vnode usecount drops to 0 all the time (e.g. for directories during path lookup).
When that happens the kernel would always lock the exclusive lock for the vnode
in order to call vinactive(). This blocks other threads who want to use the vnode
for lookup.

vinactive is very rarely needed and can be tested for without the vnode lock held.

This patch gives filesystems an opportunity to do it, sample total wait time for
tmpfs over 500 minutes of poudriere -j 104:

before: 557563641706 (lockmgr:tmpfs)
after: 46309603301 (lockmgr:tmpfs)

Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21371
98c369f63d492b3df27573f2f95ec5d97a4b58b8 27-Aug-2019 mjg <mjg@FreeBSD.org> vfs: stop passing LK_INTERLOCK to VOP_UNLOCK

The plan is to drop the flags argument. There is also a temporary bug
now that nullfs ignores the flag.

Reviewed by: kib
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21252
2ccfe333361c195ff676241f9a842e5b9cbbcf58 25-Aug-2019 mjg <mjg@FreeBSD.org> vfs: add vholdnz (for already held vnodes)

Reviewed by: kib (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21358
8b8c10ee6c0b10e96aecd0be35e71acd219d8305 23-Aug-2019 kib <kib@FreeBSD.org> De-commission the MNTK_NOINSMNTQ kernel mount flag.

After all the changes, its dynamic scope is same as for MNTK_UNMOUNT,
but to allow the syncer vnode to be re-installed on unmount failure.
But the case of syncer was already handled by using the VV_FORCEINSMQ
flag for quite some time.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
c25a7c144952a6d00847074dcb5aecd8026622b5 19-Aug-2019 jeff <jeff@FreeBSD.org> Use an atomic reference count for paging in progress so that callers do not
require the object lock.

Reviewed by: markj
Tested by: pho (as part of a larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21311
c21cd5cd124c7b1252e360f8fda66d1e95af0627 28-Jul-2019 asomers <asomers@FreeBSD.org> Better comments for vlrureclaim

MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
3c6bfc0920fc21632b1049b2d4883cf12c71dc0e 28-Jul-2019 asomers <asomers@FreeBSD.org> Add v_inval_buf_range, like vtruncbuf but for a range of a file

v_inval_buf_range invalidates all buffers within a certain LBA range of a
file. It will be used by fusefs(5). This commit is a partial merge of
r346162, r346606, and r346756 from projects/fuse2.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21032
e0dae398511bea1d4f28e2024ea8685af159b29e 06-Jun-2019 asomers <asomers@FreeBSD.org> Add a testing facility to manually reclaim a vnode

Add the debug.try_reclaim_vnode sysctl. When a pathname is written to it, the
corresponding vnode will be reclaimed, as long as it isn't already reclaimed or
doomed. The purpose is to
gain test coverage for vnode reclamation, which is otherwise hard to

Add the debug.ftry_reclaim_vnode sysctl. It does the same thing, except
that its argument is a file descriptor instead of a pathname.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20519
6798617f6dda96e8adc7fc32e0cea5ba6c203c05 24-May-2019 asomers <asomers@FreeBSD.org> Remove "struct ucred*" argument from vtruncbuf

vtruncbuf takes a "struct ucred*" argument. AFAICT, it's been unused ever
since that function was first added in r34611. Remove it. Also, remove some
"struct ucred" arguments from fuse and nfs functions that were only used by

Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20377
3038f1af7b5d53f0c56c2b7edc86ad5edf2dcd23 21-May-2019 cem <cem@FreeBSD.org> Include ktr.h in more compilation units

Similar to r348026, exhaustive search for uses of CTRn() and cross reference
ktr.h includes. Where it was obvious that an OS compat header of some kind
included ktr.h indirectly, .c files were left alone. Some of these files
clearly got ktr.h via header pollution in some scenarios, or tinderbox would
not be passing prior to this revision, but go ahead and explicitly include it
in files using it anyway.

Like r348026, these CUs did not show up in tinderbox as missing the include.

Reported by: peterj (arm64/mp_machdep.c)
X-MFC-With: r347984
Sponsored by: Dell EMC Isilon
312be5657cd27f08fbad8ece33294a4ed2c7dc3c 19-May-2019 kib <kib@FreeBSD.org> Fix rw->ro remount when there is a text vnode mapping.

Reported and tested by: hrs
Sponsored by: The FreeBSD Foundation
MFC after: 16 days
2dc0d9edaa7487c11806a0ea8cae77e3a4a79785 05-May-2019 kib <kib@FreeBSD.org> Switch to use shared vnode locks for text files during image activation.

kern_execve() locks text vnode exclusive to be able to set and clear
VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0

The change removes VV_TEXT, replacing it with the condition
v_writecount <= -1, and puts v_writecount under the vnode interlock.
Each text reference decrements v_writecount. To clear the text
reference when the segment is unmapped, it is recorded in the
vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and
v_writecount is incremented on the map entry removal

The operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that
v_writecount does not contradict the desired change. vn_writecheck()
is now racy and its use was eliminated everywhere except access.
Atomic check for writeability and increment of v_writecount is
performed by the VOP. vn_truncate() now increments v_writecount
around VOP_SETATTR() call, lack of which is arguably a bug on its own.

nullfs bypasses v_writecount to the lower vnode always, so nullfs
vnode has its own v_writecount correct, and lower vnode gets all
references, since object->handle is always lower vnode.

On the text vnode's vm object dealloc, the v_writecount value is reset
to zero, and deadfs vop_unset_text short-circuit the operation.
Reclamation of lowervp always reclaims all nullfs vnodes referencing
lowervp first, so no stray references are left.

Reviewed by: markj, trasz
Tested by: mjg, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 month
Differential revision: https://reviews.freebsd.org/D19923
34e6bdddf03c57016bf6b80d7840cd87568c4285 11-Mar-2019 mckusick <mckusick@FreeBSD.org> Update the main loop in the flushbuflist() routine to properly select
buffers for flushing when requested to flush both normal and extended
attributes buffers.

Sponsored by: Netflix
e8a9c3693fb0d71700ac532d6c018bb01eef5ede 08-Jan-2019 tuexen <tuexen@FreeBSD.org> Avoid overflow in vtruncbuf()

Using daddr_t instead of int avoids trunclbn to become negative when it
This issue was found by running syzkaller.
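
The kind of truncation described here can be illustrated with a small sketch. It assumes the block number is computed by rounding a 64-bit file length up to a block count; `trunc_lbn_wide` and `trunc_lbn_narrow` are hypothetical helpers, with `int64_t` standing in for `daddr_t`. Converting an out-of-range value to `int` is implementation-defined in C, so the exact narrow result depends on the platform.

```c
#include <assert.h>
#include <stdint.h>

/* Round length up to a whole number of blocks, keeping 64 bits. */
static int64_t
trunc_lbn_wide(int64_t length, int blksize)
{
	assert(blksize > 0);
	return ((length + blksize - 1) / blksize);
}

/* Same computation, but the result is squeezed into an int. */
static int
trunc_lbn_narrow(int64_t length, int blksize)
{
	assert(blksize > 0);
	/* For large lengths the quotient no longer fits in an int. */
	return ((int)((length + blksize - 1) / blksize));
}
```

For an 8 TiB length and 4096-byte blocks the wide result is 2^31, which cannot be represented in a 32-bit `int`; on common two's-complement platforms the narrow version yields a negative number instead.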

Reviewed by: mckusick, kib, markj
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D18763
51739d277bc9b0860b6973c506d24e2975e76c04 27-Dec-2018 mckusick <mckusick@FreeBSD.org> When loading an inode from disk, verify that its mode is valid.
If invalid, return EINVAL. Note that inode check-hashes greatly
reduce the chance that these errors will go undetected.

Reported by: Christopher Krah <krah@protonmail.com>
Reported as: FS-5-UFS-2: Denial Of Service in nmount-3 (ffs_read)
Reviewed by: kib
MFC after: 1 week
Sponsored by: Netflix

M sys/fs/ext2fs/ext2_vnops.c
M sys/kern/vfs_subr.c
M sys/ufs/ffs/ffs_snapshot.c
M sys/ufs/ufs/ufs_vnops.c
f31f9d541398d61aaa2430642b20a474b04d43e3 23-Dec-2018 kib <kib@FreeBSD.org> Properly test for vmio buffer in bnoreuselist().

The presence of an allocated v_object does not imply that the buffer is
necessarily of the VMIO kind. The buffer might have been allocated before
the object was created, in which case the buffer is malloced. Although we
try to avoid such a situation, it seems to be still legitimate.

Reported and tested by: pho
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
7e31d1de7edc43eb4db8f4e6248b0441a44c1e5e 11-Dec-2018 mjg <mjg@FreeBSD.org> Remove unused argument to priv_check_cred.

Patch mostly generated with Coccinelle:

expression E1,E2;

- priv_check_cred(E1,E2,0)
+ priv_check_cred(E1,E2)

Sponsored by: The FreeBSD Foundation
d1a00acf4dbc12204b03a1bfa39d43e2715288ae 17-Aug-2018 markj <markj@FreeBSD.org> Typo.

X-MFC with: r337974
4e68a99c040aaab0228dbf78890910e1959a6ea9 17-Aug-2018 markj <markj@FreeBSD.org> Add INVARIANTS-only fences around lockless vnode refcount updates.

Some internal KASSERTs access the v_iflag field without the vnode
interlock held after such a refcount update. The fences are needed for
the assertions to be correct in the face of store reordering.

Reported and tested by: jhibbits
Reviewed by: kib, mjg
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D16756
ddc936d6509d85749917eebe49480e03d8c422a7 06-Jun-2018 jhibbits <jhibbits@FreeBSD.org> Revert r334708

This is the wrong place to put the barrier.
Requested by: kib,mjg
74490551987990ea0766bbee268a319ba60e5c97 06-Jun-2018 jhibbits <jhibbits@FreeBSD.org> Add a memory barrier after taking a reference on the vnode holdcnt in _vhold

This is needed to avoid a race between the VNASSERT() below, and another
thread updating the VI_FREE flag, on weakly-ordered architectures.

On a 72-thread POWER9, without this barrier a 'make -j72 buildworld' would
panic on the assert regularly.

It may be possible to use a weaker barrier, and I'll investigate that once
all stability issues are worked out on POWER9.
4eacc085864b7704258c49fbea5cb660ce9a8abc 19-May-2018 mmacy <mmacy@FreeBSD.org> vfs: annotate variables only used by debug builds as __unused
1c11f552d63c8d13159b579aed059a7649bbf5aa 04-May-2018 jamie <jamie@FreeBSD.org> Make it easier for filesystems to count themselves as jail-enabled,
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of. This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.

Reviewed by: kib
Differential Revision: D14681
9d79658aab1a30f34fee169ce74bdff4ca405c18 06-Apr-2018 brooks <brooks@FreeBSD.org> Move most of the contents of opt_compat.h to opt_global.h.

opt_compat.h is mentioned in nearly 180 files. In-progress network
driver compatibility improvements may add over 100 more so this is
closer to "just about everywhere" than "only some files" per the
guidance in sys/conf/options.

Keep COMPAT_LINUX32 in opt_compat.h as it is confined to a subset of
sys/compat/linux/*.c. A fake _COMPAT_LINUX option ensures opt_compat.h
is created on all architectures.

Move COMPAT_LINUXKPI to opt_dontuse.h as it is only used to control the
set of compiled files.

Reviewed by: kib, cem, jhb, jtl
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D14941
81c19620beed0e9d4076711d451ad8bd3a2b2699 28-Mar-2018 avg <avg@FreeBSD.org> ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count

It's neither sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use count. Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.

Also, the change required making vfs_refcount_release_if_not_last()
public. I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now. While making the move I've dropped the
vfs_ prefix.

Reviewed by: mjg
MFC after: 2 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D14869
e3be9f8fb6d01fd2604b306b6f8ab52afb2d1173 20-Feb-2018 jeff <jeff@FreeBSD.org> Further parallelize the buffer cache.

Provide multiple clean queues partitioned into 'domains'. Each domain manages
its own bufspace and has its own bufspace daemon. Each domain has a set of
subqueues indexed by the current cpuid to reduce lock contention on the cleanq.

Refine the sleep/wakeup around the bufspace daemon to use atomics as much as

Add a B_REUSE flag that is used to requeue bufs during the scan to approximate
LRU rather than locking the queue on every use of a frequently accessed buf.

Implement bufspace_reserve with only atomic_fetchadd to avoid loop restarts.
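
A reservation built on fetch-and-add alone might look like the following sketch. It is an assumption-laden illustration, not the kernel's bufspace_reserve(): `toy_domain` and `toy_reserve` are hypothetical names, and C11 atomics stand in for atomic(9).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical space accounting for one buffer domain. */
struct toy_domain {
	atomic_long space;	/* bytes currently reserved */
	long limit;		/* maximum bytes that may be reserved */
};

/*
 * Optimistically add the request, then undo it if the limit was exceeded.
 * There is no compare-and-swap loop, so contention never forces a restart.
 */
static bool
toy_reserve(struct toy_domain *d, long size)
{
	long old;

	assert(size > 0);
	old = atomic_fetch_add(&d->space, size);
	if (old + size > d->limit) {
		atomic_fetch_sub(&d->space, size);
		return (false);
	}
	return (true);
}
```

The trade-off is that the counter can briefly overshoot the limit between the add and the undo, which concurrent reservers may observe.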

Reviewed by: markj
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14274
dca0cf286959f3fe786bba10e4567e09793ce932 31-Jan-2018 mckusick <mckusick@FreeBSD.org> One of the vnode fields listed by vn_printf is the union of pointers
whose type depends on the type of vnode. Correct vn_printf so that
it correctly identifies the name of the pointer that it is printing.

Submitted by: Andreas Longwitz <longwitz at incore.de>
MFC after: 1 week
749ff5c6102e255c22f49e88604e7e08111abea9 12-Jan-2018 mjg <mjg@FreeBSD.org> vfs: tidy up vdrop

Skip vfs_refcount_release_if_not_last if the interlock is held and just
go straight to refcount_release.

While here do cosmetic rearrangement of _vhold to better show it contains
equivalent behaviour.
421a929b1ebeb7e81e8fe54b895d211985f2f6f1 27-Dec-2017 eadler <eadler@FreeBSD.org> kernel: Fix several typos and minor errors

- duplicate words
- typos
- references to old versions of FreeBSD

Reviewed by: imp, benno
c8da6fae2c8073f216b7a35739f7dfa140c9a8d9 25-Dec-2017 kan <kan@FreeBSD.org> Do pass removing some write-only variables from the kernel.

This reduces noise when kernel is compiled by newer GCC versions,
such as one used by external toolchain ports.

Reviewed by: kib, andrew(sys/arm and sys/arm64), emaste(partial), erj(partial)
Reviewed by: jhb (sys/dev/pci/* sys/kern/vfs_aio.c and sys/kern/kern_synch.c)
Differential Revision: https://reviews.freebsd.org/D10385
4736ccfd9c3411d50371d7f21f9450a47c19047e 20-Nov-2017 pfg <pfg@FreeBSD.org> sys: further adoption of SPDX licensing ID tags.

Mainly focus on files that use BSD 3-Clause license.

The Software Package Data Exchange (SPDX) group provides a specification
to make it easier for automated tools to detect and summarize well known
opensource licenses. We are gradually adopting the specification, noting
that the tags are considered only advisory and do not, in any way,
supersede or replace the license texts.

Special thanks to Wind River for providing access to "The Duke of
Highlander" tool: an older (2014) run over FreeBSD tree was useful as a
starting point.
ce2d698d244d1d64ddb76591c4243604fbc05f95 20-Oct-2017 markj <markj@FreeBSD.org> Avoid the nbp lookup in the final loop iteration in flushbuflist().

The end of the loop must re-lookup the next buf since the bufobj lock
is dropped in the loop body. If the lookup fails, the loop is restarted.
This mechanism non-obviously also terminates the loop when the end of
the buf list is reached. Split up the two loop termination cases to
make the code a bit less fragile. No functional change intended.

Reviewed by: kib
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12730
9fce26667971748be51ba9f3b16033e70324e5cb 17-Oct-2017 markj <markj@FreeBSD.org> Fix a racy VI_DOOMED check in MNT_VNODE_FOREACH_ALL().

MNT_VNODE_FOREACH_ALL() is supposed to avoid returning doomed vnodes,
but the VI_DOOMED check it used was done without the vnode interlock
held, so it could race with a concurrent vgone().

Submitted by: Don Morris <don.morris@isilon.com>
Reviewed by: kib, mckusick
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12704
23d65de60e03a3f2dd2ce6443e57660e8217a99c 19-Sep-2017 kib <kib@FreeBSD.org> For unlinked files, do not msync(2) or sync on the vnode deactivation.

One consequence of the patch is that msyncing unlinked file mappings
no longer reduces the amount of the dirty memory in the system, but I
do not think that there are users of msync(2) that utilize it for such

Reported and tested by: tjil
PR: 222356
Reviewed by: alc
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D12411
79ee71cc3881a3db75370a73152f3eb9247bde29 28-Aug-2017 bdrewery <bdrewery@FreeBSD.org> Allow vdrop() of a vnode not yet on the per-mount list after r306512.

The old code allowed calling vdrop() before insmntque() to place the vnode back
onto the freelist for later recycling. Some downstream consumers may rely on
this support. Normally insmntque() failing is fine since it uses vgone() and
immediately frees the vnode rather than attempting to add it to the freelist if
vdrop() were used instead.

Also assert that vhold() cannot be used on such a vnode.

Reviewed by: kib, cem, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D12126
b8035d686eae55068f027e714749b4da24639970 20-Aug-2017 kib <kib@FreeBSD.org> Allow vinvalbuf() to operate with the shared vnode lock.

This mode allows other clean buffers to arrive while we flush the buf
lists for the vnode, which is fine for the targeted use. We only need
that all buffers that existed at the time of the function start are
flushed. In fact, only one assert has to be relaxed.

In collaboration with: pho
Reviewed by: rmacklem
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
X-Differential revision: https://reviews.freebsd.org/D12083
8a7f8bb123cb2056ca239a472deaef02962bdd3e 02-Jun-2017 glebius <glebius@FreeBSD.org> For UNIX sockets make vnode point not to the socket, but to the UNIX PCB,
since the latter is the thing that links together VFS and sockets.

While here, make the union in the struct vnode anonymous.
99c3c32cc1177983d75d18d4a3d03993ba03b0eb 15-May-2017 kib <kib@FreeBSD.org> mnt_vnode_next_active: use conventional lock order when trylock fails.

Previously, when the VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, this would guarantee that we would become a hog under any
deadlock condition (e.g. a race with vdrop(9) on the same vnode). In
the presence of many concurrent threads in sync(2) or vdrop etc, the
victim could hang for a long time.

Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.

Submitted by: Tom Rix <trix@juniper.net>
Tested by: pho
Differential revision: https://reviews.freebsd.org/D10692
19618f8455d0f59bb06f16c5f1a5af3738738744 05-Apr-2017 kib <kib@FreeBSD.org> Add V_VMIO flag for vinvalbuf(9) to indicate that the flush request
was issued during VM-initiated i/o (pageout), so that the function
does not try to flush or remove pages or wait for the vm object
paging-in-progress counter.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
X-Differential revision: https://reviews.freebsd.org/D10241
2d2b5c686150a82cd7e8193f9d9eb2f00a13d063 04-Apr-2017 brooks <brooks@FreeBSD.org> Correct a kernel stack leak in 32-bit compat when vfc_name is short.

Don't zero unused pointer members again.

Per discussion with secteam we are not issuing an advisory for this
issue as we have no current evidence it leaks exploitable information.

Reviewed by: rwatson, glebius, delphij
MFC after: 1 day
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D10227
016bffc695eb698c22e43c3f2af5b34eb89ab553 12-Mar-2017 ian <ian@FreeBSD.org> Change 'Hz' back to 'HZ'... it's referring to the kernel config option
named HZ, not being used as an abbreviation of the unit of measure.
2788e31637decc0341b631bcaf90195c5ee22f79 12-Mar-2017 ian <ian@FreeBSD.org> Correct the abbreviations for microseconds (us, not ms), and for Hz (not HZ).
68b155428477481308b1a17d02cd8c35d2017dbd 05-Feb-2017 mjg <mjg@FreeBSD.org> vfs: use atomic_fcmpset in vfs_refcount_*
8869161bb2b5ba0d2c214525595dea8944eb377f 22-Jan-2017 trasz <trasz@FreeBSD.org> Improve debugging printf.
e90da0df921ebdc1c9aed3421f3c06dac7b923f4 21-Jan-2017 mjg <mjg@FreeBSD.org> vfs: hide the getvnode NULL mp message behind DIAGNOSTIC

Since crossmp vnode changes the message was being printed on each boot.

Reported by: trasz
Discussed with: kib
941219909870df681731b03d2f8a51132ec42048 31-Dec-2016 mjg <mjg@FreeBSD.org> vfs: switch nodes_created, recycles_count and free_owe_inact to counter(9)

Reviewed by: kib
f03b37f3e8c9e424c3a74ed50c1ed0fcf5053530 12-Dec-2016 mjg <mjg@FreeBSD.org> vfs: add vrefact, to be used when the vnode has to be already active

This allows blind increment of relevant counters which under contention
is cheaper than inc-not-zero loops at least on amd64.

Use it in some of the places which are guaranteed to see already active

Reviewed by: kib (previous version)
e06021a9453b7e163e41513e0bb59c650ceab866 26-Nov-2016 markj <markj@FreeBSD.org> Launder VPO_NOSYNC pages upon vnode deactivation.

As of r234483, vnode deactivation causes non-VPO_NOSYNC pages to be
laundered. This behaviour has two problems:

1. Dirty VPO_NOSYNC pages must be laundered before the vnode can be
reclaimed, and this work may be unfairly deferred to the vnlru process
or an unrelated application when the system is under vnode pressure.
2. Deactivation of a vnode with dirty VPO_NOSYNC pages requires a scan of
the corresponding VM object's memq for non-VPO_NOSYNC dirty pages; if
the laundry thread needs to launder pages from an unreferenced such
vnode, it will reactivate and deactivate the vnode with each laundering,
potentially resulting in a large number of expensive scans.

Therefore, ensure that all dirty pages are laundered upon deactivation,
i.e., when all maps of the vnode are removed and all references are

Reviewed by: alc, kib
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D8641
af2e2494f1d6a05dc3154f6772a2ae49b44503ae 08-Oct-2016 mjg <mjg@FreeBSD.org> vfs: clear the tmp free list flag before taking the free vnode list lock

Safe access is already guaranteed because of the mnt_listmx lock.
fc32cf9f5037a197feef799e9b3dc4f5189ef383 06-Oct-2016 bdrewery <bdrewery@FreeBSD.org> vrefl: Assert that the interlock is held.

Sponsored by: Dell EMC Isilon
MFC after: 2 weeks
f3e00c4570bfc0d014ad8e4f118c5981ad7fb1a2 06-Oct-2016 bdrewery <bdrewery@FreeBSD.org> Add vrecyclel() to vrecycle() a vnode with the interlock already held.

Obtained from: OneFS
Sponsored by: Dell EMC Isilon
MFC after: 2 weeks
2d8b77f87e3f6b9a2117e9fe1934017aa36a935f 04-Oct-2016 bdrewery <bdrewery@FreeBSD.org> Correct some comments after r294299.

Sponsored by: Dell EMC Isilon
b06af67fb0acd4c0b2cf1ab3ca331020b592dec6 30-Sep-2016 mjg <mjg@FreeBSD.org> vfs: batch free vnodes in per-mnt lists

Previously free vnodes would always be directly returned to the global
LRU list. With this change up to mnt_free_list_batch vnodes are collected

syncer runs always return the batch regardless of its size.

While vnodes on per-mnt lists are not counted as free, they can be
returned in case of vnode shortage.

Reviewed by: kib
Tested by: pho
6a50fe29a5f57b9f8d0f794e8b19c3a726409921 30-Sep-2016 mjg <mjg@FreeBSD.org> vfs: remove the __bo_vnode field from struct vnode

The pointer can be obtained using __containerof instead.
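
The container-of technique recovers a pointer to an enclosing structure from a pointer to one of its embedded members, which is why a stored back-pointer becomes redundant. The sketch below uses a portable `CONTAINER_OF` macro as a stand-in for FreeBSD's `__containerof`; the structures and `bo2vnode` are hypothetical simplifications.

```c
#include <assert.h>
#include <stddef.h>

/* Portable stand-in for __containerof: subtract the member's offset. */
#define CONTAINER_OF(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, heavily simplified structures. */
struct demo_bufobj {
	int bo_numoutput;
};

struct demo_vnode {
	int v_type;
	struct demo_bufobj v_bufobj;	/* embedded, not a pointer */
};

/* Map an embedded bufobj back to the vnode that contains it. */
static struct demo_vnode *
bo2vnode(struct demo_bufobj *bo)
{
	assert(bo != NULL);
	return (CONTAINER_OF(bo, struct demo_vnode, v_bufobj));
}
```

This only works for a bufobj that really is embedded in a vnode; applying it to a free-standing bufobj would produce a bogus pointer.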

Reviewed by: kib
00b67b15b9ffa1019fe84745accba07152c64e44 15-Sep-2016 emaste <emaste@FreeBSD.org> Renumber license clauses in sys/kern to avoid skipping #3
cf980f6c7a374f1a73e47bafc3e2f03549a63f47 12-Aug-2016 trasz <trasz@FreeBSD.org> Print vnode details when vnode locking assertion gets triggered.

MFC after: 1 month
255ed885fae2f6cdcee12dc1787193b7fa977971 10-Aug-2016 trasz <trasz@FreeBSD.org> Replace all remaining calls to vprint(9) with vn_printf(9), and remove
the old macro.

MFC after: 1 month
522c3665caf6959241e075e635baf32f652ead95 04-Aug-2016 trasz <trasz@FreeBSD.org> Remove unused - never actually implemented - vnode lock types
from vnode_if.src.

MFC after: 1 month
9d45f8230f14803f8949181d033089987b632f4f 11-Jul-2016 kib <kib@FreeBSD.org> Fix grammar.

Submitted by: alc
MFC after: 2 weeks
c859b3f77dd41ec74749a01cf70c742432c4f74d 11-Jul-2016 kib <kib@FreeBSD.org> In vgonel(), postpone setting BO_DEAD until VOP_RECLAIM() is called,
if vnode is VMIO. For VMIO vnodes, set BO_DEAD in vm_object_terminate().

The vnode_destroy_object(), when calling into vm_object_terminate(),
must be able to flush buffers. BO_DEAD purpose is to quickly destroy
buffers on write when the underlying vnode is not operable any more
(one example is the devfs node after geom is gone). Setting BO_DEAD
for reclaiming vnode before object is terminated is premature, and
results in inability to flush buffers with live SU dependencies from
vinvalbuf() in vm_object_terminate().

Reported by: David Cross <dcrosstech@gmail.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
c31bb3499e1215ae394355f759575e2bb6b3b562 03-Jul-2016 kib <kib@FreeBSD.org> Remove racy assert. The thread which changes vnode usecount from 0 to 1
does it under the vnode interlock, but the interlock is not owned by the
asserting thread. As result, we might read increased use counter but also
still see VI_OWEINACT.

In collaboration with: nwhitehorn
Hardware donated by: IBM LTC
Sponsored by: The FreeBSD Foundation (kib)
Approved by: re (gjb)
06125ebef50f0ddd7807915589bd685571774531 20-Jun-2016 kib <kib@FreeBSD.org> Fix typo. Note that atomic is still required even for interlocked case.

Sponsored by: The FreeBSD Foundation
Approved by: re (marius)
ed393257f09838840b18e74272f45f2f53a249d5 17-Jun-2016 mjg <mjg@FreeBSD.org> vfs: ifdef out noop vop_* primitives on !DEBUG_VFS_LOCKS kernels

This removes calls to empty functions like vop_lock_{pre/post} from
common vfs routines.

Approved by: re (gjb)
40488ecf8467fa970cb0682384828a46d11726d3 17-Jun-2016 kib <kib@FreeBSD.org> Add VFS interface to flush specified amount of free vnodes belonging
to mount points with the given filesystem type, specified by mount
vfs_ops pointer.

Based on patch by: mckusick
Reviewed by: avg, mckusick
Tested by: allanjude, madpilot
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
661ae931713af6ff0c98bb5ce8361f289885d7b3 31-May-2016 trasz <trasz@FreeBSD.org> Cosmetics - add missing space after ellipses in shutdown messages.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation
594a7170305c0649a184418dfb6b51eb1a9a15a2 16-May-2016 avg <avg@FreeBSD.org> vfs_read_dirent: increment ncookies after adding a cookie

It seems that at present vfs_read_dirent() is used only with filesystems
that do not support cookies, so the bug never manifested itself.

MFC after: 1 week
838f7d1304c723206899cd4965a1a8ba0761d68f 03-May-2016 kib <kib@FreeBSD.org> Add EVFILT_VNODE open, read and close notifications.

While there, order EVFILT_VNODE notes descriptions alphabetically.

Based on submission, and tested by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks
9eb6f0fde4c92d101df058d2ad6dea964988cebb 02-May-2016 kib <kib@FreeBSD.org> Issue NOTE_EXTEND when a directory entry is added to or removed from
the monitored directory as the result of rename(2) operation. The
renames staying in the directory are not reported.

Submitted by: Vladimir Kondratyev <wulf@cicgroup.ru>
MFC after: 2 weeks
5bc5df16639fd16299b315b38cdba281bf17a539 02-May-2016 kib <kib@FreeBSD.org> Fix reporting of NOTE_LINK when directory link count changes due to
rename removing or adding subdirectory entry.

Discussed with and tested by: Vladimir Kondratyev <wulf@cicgroup.ru>
NetBSD PR: 48958 (http://gnats.netbsd.org/48958)
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
28823d06561e2e9911180b17a57e05ff19d7cbf6 29-Apr-2016 pfg <pfg@FreeBSD.org> sys/kern: spelling fixes in comments.

No functional change.
fc01419148d065603607b1008d536431465f3bc3 26-Apr-2016 pfg <pfg@FreeBSD.org> sys: extend use of the howmany() macro when available.

We have a howmany() macro in the <sys/param.h> header that is
convenient to re-use as it makes things easier to read.
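
For readers unfamiliar with the macro, howmany(x, y) computes how many units of size y are needed to cover x, i.e. a ceiling division for non-negative integers. The sketch defines it locally (guarded, in case a system header already provides it) and wraps it in a hypothetical helper, `blocks_for`.

```c
#include <assert.h>

/* Same shape as the <sys/param.h> macro; defined here only if missing. */
#ifndef howmany
#define howmany(x, y)	(((x) + ((y) - 1)) / (y))
#endif

/* Hypothetical helper: number of blksize-byte blocks covering nbytes. */
static long
blocks_for(long nbytes, long blksize)
{
	assert(nbytes >= 0 && blksize > 0);
	return (howmany(nbytes, blksize));
}
```

Compared with an open-coded `(n + size - 1) / size`, the macro name states the intent directly, which is the readability gain the commit refers to.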
392fea70ccccfb399c24065c3b18ae3f4f2cd8a9 24-Feb-2016 kib <kib@FreeBSD.org> Provide more correct sizing of the KVA consumed by a vnode, used by
the virtvnodes calculation. Include the size of fs-specific v_data as
the nfs nclnode inline, the NFS nclnode is bigger than either ZFS
znode or UFS inode. Include the size of namecache_ts and short cache
path element, multiplied by the name cache population factor, again

Inline defines are used to avoid pollution of the vnode.h with the
subsystem-private objects. Non-significant unsynchronized changes of
the definitions are fine, we do not care about that precision, and
e.g. ZFS consumes much malloced memory per vnode for reasons
unaccounted in the formula.

Lower the partition of kmem dedicated to vnodes, from 1/7 to 1/10.

The measures reduce vnode cache pressure on kmem and bring the vnode
cache memory use below some apparent thresholds that were exceeded by
r291244 due to more robust vnode reuse.

Reported and tested by: marius (i386, previous version)
Reviewed by: bde
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
a5f88cf93faec476bfcd6a7b8a86d9a1962e4746 17-Feb-2016 kib <kib@FreeBSD.org> In bnoreuselist(), check both ends of the specified logical block
numbers range.

This effectively skips indirect and extdata blocks on the buffer
queue. Since their logical block numbers are negative, bnoreuselist()
could loop infinitely.

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
09fb369fc5164b2baaba44740ab7379c43306071 18-Jan-2016 markj <markj@FreeBSD.org> Add vrefl(), a locked variant of vref(9).

This API has no in-tree consumers at the moment but is useful to at least
one out-of-tree consumer, and naturally complements existing vnode refcount
functions (vholdl(9), vdropl(9)).

Obtained from: kib (sys/ portion)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4947
Differential Revision: https://reviews.freebsd.org/D4953
8c46f725d5083f7e2c52a22367c441fc217ca7b1 05-Jan-2016 kib <kib@FreeBSD.org> Two fixes for excessive iterations after r292326.

Advance the logical block number to the lblkno of the found block plus
one, instead of incrementing the block number which was used for
lookup. This change skips sparsely populated buffer ranges, similar
to r292325, instead of doing useless lookups.

Do not restart the bnoreuselist() from the start of the range if
buffer lock cannot be obtained without sleep. Only retry lookup and
lock for the same queue and same logical block number.

Reported by: benno
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
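
The cursor-advance rule described above can be sketched in user space. This is not the kernel code: a sorted array stands in for the buffer queue, and lookup_ge() and visit_range() are hypothetical names. The point is only that moving the cursor to "found block + 1" makes the number of lookups proportional to the number of instantiated blocks rather than to the width of the range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t lblkno_t;	/* stand-in for the kernel's daddr_t */

/*
 * Hypothetical "lookup greater-or-equal" query: index of the first entry
 * >= key in a sorted array, or n if there is none.
 */
static size_t
lookup_ge(const lblkno_t *blks, size_t n, lblkno_t key)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (blks[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (lo);
}

/*
 * Visit every instantiated block in [start, end).  Returns the number of
 * blocks visited and stores the number of lookups performed in *lookups.
 */
static size_t
visit_range(const lblkno_t *blks, size_t n, lblkno_t start, lblkno_t end,
    size_t *lookups)
{
	size_t visited = 0;
	lblkno_t lblkno = start;

	assert(start <= end);
	*lookups = 0;
	for (;;) {
		size_t i = lookup_ge(blks, n, lblkno);

		(*lookups)++;
		/* Check both ends of the range before using the block. */
		if (i == n || blks[i] >= end || blks[i] < start)
			break;
		visited++;
		lblkno = blks[i] + 1;	/* jump over the hole after it */
	}
	return (visited);
}
```

With three blocks scattered over a range of a million block numbers, the loop performs four lookups (three hits and one terminating miss) instead of one per block number.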
b5160b028098d5dcb7369b864d338c7e3aeaddcd 16-Dec-2015 kib <kib@FreeBSD.org> Optimize vop_stdadvise(POSIX_FADV_DONTNEED). Instead of looking up a
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater than or equal to the next lblkno. This significantly speeds up the
iteration for a sparsely populated range.

Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().

Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
764a2409cb5541ae3a247c163556269e90a152bb 16-Dec-2015 kib <kib@FreeBSD.org> Simplify the loop step in the flushbuflist() and make it independent of
the type stability of the buffers memory. Instead of memoizing
pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.

Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
1a9ecd3df94c4288283e8beb306463d6464a4302 04-Dec-2015 mckusick <mckusick@FreeBSD.org> We need to zero out the clustering variables in a freed vnode structure.
For completeness add a VNASSERT that there are no threads waiting on a
range lock (this was previously checked on every vnode free).

Reported by: Rick Macklem
Fix from: Mateusz Guzik
PR: 204949
25671cd0d574bba616ecf957b39e1f1029403d42 03-Dec-2015 mckusick <mckusick@FreeBSD.org> We need to zero out the union of pointers in a freed vnode structure.

PR: 204949
Fix from: Mateusz Guzik
Tested by: Jason Unovitch
6f0b4b3366b2f6b2fe1ba4c0f7f4ac0c2da3e8b7 29-Nov-2015 mckusick <mckusick@FreeBSD.org> As the kernel allocates and frees vnodes, it fully initializes them
on every allocation and fully releases them on every free. These
are not trivial costs: it starts by zeroing a large structure then
initializes a mutex, a lock manager lock, an rw lock, four lists,
and six pointers. And looking at vfs.vnodes_created, these operations
are being done millions of times an hour on a busy machine.

As a performance optimization, this code update uses the uma_init
and uma_fini routines to do these initializations and cleanups only
as the vnodes enter and leave the vnode_zone. With this change the
initializations are only done kern.maxvnodes times at system startup
and then only rarely again. The frees are done only if the vnode_zone
shrinks which never happens in practice. For those curious about the
avoided work, look at the vnode_init() and vnode_fini() functions in
kern/vfs_subr.c to see the code that has been removed from the main
vnode allocation/free path.

Reviewed by: kib
Tested by: Peter Holm
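
The idea of paying the initialization cost only when an object enters the cache can be sketched in plain C. The code below does not use the UMA API; struct obj, obj_init(), obj_alloc() and obj_free() are hypothetical names, and the matching teardown ("fini") step for objects leaving the cache is omitted for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object standing in for struct vnode. */
struct obj {
	int		 lock_ready;	/* set once by obj_init() */
	int		 data;		/* reset on every allocation */
	struct obj	*next;		/* free-list linkage */
};

static struct obj *freelist;
static int init_calls;			/* how often the expensive setup ran */

/* Expensive one-time setup, analogous to a zone "init" callback. */
static void
obj_init(struct obj *o)
{
	o->lock_ready = 1;
	init_calls++;
}

/* Allocation: reuse a cached, already initialized object when possible. */
static struct obj *
obj_alloc(void)
{
	struct obj *o = freelist;

	if (o != NULL) {
		freelist = o->next;
	} else {
		o = calloc(1, sizeof(*o));
		if (o == NULL)
			return (NULL);
		obj_init(o);
	}
	o->data = 0;			/* cheap per-allocation reset */
	o->next = NULL;
	return (o);
}

/* Free: keep the object initialized and park it on the free list. */
static void
obj_free(struct obj *o)
{
	assert(o->lock_ready);
	o->next = freelist;
	freelist = o;
}
```

Allocating, freeing and allocating again returns the same object and runs obj_init() only once, which is the saving the commit message describes for the vnode allocation path.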
d58541e2279973980a7ebc8ae3c108724f6ed855 27-Nov-2015 kib <kib@FreeBSD.org> Remove VI_AGE vnode iflag, it is unused.

Noted by: bde
Sponsored by: The FreeBSD Foundation
3cb64bd1cedb85352b1ef883f83dec1ad91b24ce 27-Nov-2015 kib <kib@FreeBSD.org> Move the comment about resident pages preventing vnode from leaving
active list, into the header comment for vdrop(), which is the
function that decides whether to leave the vnode on the list. Note
that dirty page write-out in vinactive() is asynchronous.

Discussed with: alc
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
896302b6a8ad03423851a3c4a57224c84ac914bb 24-Nov-2015 kib <kib@FreeBSD.org> Rework the vnode cache recycling to meet free and unused vnodes
targets. See the comment above wantfreevnodes variable for the
description of the algorithm.

The vfs.vlru_alloc_cache_src sysctl is removed. New code frees
namecache sources as the last chance to satisfy the highest watermark,
instead of selecting the source vnodes randomly. This provides good
enough behaviour to keep vn_fullpath() working in most situations.
The filesystem layout with deep trees, where the removed knob was
required, is thus handled automatically.

Submitted by: bde
Discussed with: mckusick
Tested by: pho
MFC after: 1 month
6460e3db4e6d4865cbc26bceee8a5e18a88e6adc 20-Nov-2015 glebius <glebius@FreeBSD.org> Remove remnants of the old NFS from vnode pager.

Reviewed by: kib
Sponsored by: Netflix
bb2ac24a51cee9e3e4c5079b2573346889379d54 26-Sep-2015 markj <markj@FreeBSD.org> Remove a check for a condition that is always false by a preceding KASSERT
that was added in r144704.
3eefaf542b436c06090bc5ed2458fa05e57cf8af 26-Sep-2015 markj <markj@FreeBSD.org> Fix argument ordering in vn_printf().

MFC after: 3 days
9fe341127ea25693bd07e64b1d48b37c082b8088 15-Sep-2015 cem <cem@FreeBSD.org> kevent(2): Note DOOMED vnodes with NOTE_REVOKE

In poll mode, check for and wake VBAD vnodes. (Vnodes that are VBAD at
registration will never be woken by the RECLAIM trigger.)

Add post-VOP_RECLAIM hook to trigger notes on vnode reclamation. (Vnodes that
were fine at registration but are vgoned while being monitored should signal

Reviewed by: kib
Approved by: markj (mentor)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3675
17357462a0f336a402d0f8216be99b499905f00f 06-Sep-2015 mckusick <mckusick@FreeBSD.org> Track changes to kern.maxvnodes and appropriately increase or decrease
the size of the name cache hash table (mapping file names to vnodes)
and the vnode hash table (mapping mount point and inode number to vnode).
An appropriate locking strategy is the key to changing hash table sizes
while they are in active use.

Reviewed by: kib
Tested by: Peter Holm
Differential Revision: https://reviews.freebsd.org/D2265
MFC after: 2 weeks
c1207ee9b8eb51f8deceb0d4dac7fcde9862beff 24-Aug-2015 trasz <trasz@FreeBSD.org> Make vfs_unmountall() unmount /dev after /, not before. The only
reason this didn't result in an unclean shutdown is that devfs ignores

Reviewed by: kib@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D3467
9e31188bdc0785a826b4a3fbd33d2b89264516f4 23-Aug-2015 trasz <trasz@FreeBSD.org> After r286237 it should be fine to call vgone(9) on a busy GEOM vnode;
remove KASSERT that would prevent forced devfs unmount from working.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation
4a54322f0b8dcd72b1946004f9b964ef0093af1a 05-Aug-2015 ed <ed@FreeBSD.org> Make it possible to implement poll(2) on top of kqueue(2).

It looks like EVFILT_READ and EVFILT_WRITE trigger under the same
conditions as poll()'s POLLRDNORM and POLLWRNORM as described by POSIX.
The only difference is that POLLRDNORM has to be triggered on regular
files unconditionally, whereas EVFILT_READ only triggers when not EOF.

Introduce a new flag, NOTE_FILE_POLL, that can be used to make
EVFILT_READ and EVFILT_WRITE behave identically to poll(). This flag
will be used by cloudlibc's poll() function.

Reviewed by: jmg
Differential Revision: https://reviews.freebsd.org/D3303
b8165ab6a9cf903e39418903598efa0909a6ce46 04-Aug-2015 trasz <trasz@FreeBSD.org> Mark vgonel() as static. It was already declared static earlier;
no idea why compilers don't warn about this.

MFC after: 1 month
Sponsored by: The FreeBSD Foundation
28fa5eedfe3c8c9827db6b8b82eab79e95f15bdd 16-Jul-2015 mjg <mjg@FreeBSD.org> vfs: implement v_holdcnt/v_usecount manipulation using atomic ops

Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by: kib (earlier version)
Tested by: pho
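
A user-space sketch of this scheme, using C11 atomics, is shown below. It is an illustration under stated assumptions, not the kernel implementation: struct refobj and all function names are hypothetical, a spinning atomic_flag stands in for the vnode interlock, and a boolean stands in for free-list membership. Increments and decrements that do not cross zero are lock-free; the 0->1 and 1->0 transitions are made under the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct refobj {
	atomic_int	count;		/* reference count */
	atomic_flag	interlock;	/* stand-in for the vnode interlock */
	bool		on_freelist;	/* protected by interlock */
};

static void
refobj_init(struct refobj *o)
{
	atomic_init(&o->count, 0);
	atomic_flag_clear(&o->interlock);
	o->on_freelist = true;
}

static void
interlock_lock(struct refobj *o)
{
	while (atomic_flag_test_and_set_explicit(&o->interlock,
	    memory_order_acquire))
		;	/* spin */
}

static void
interlock_unlock(struct refobj *o)
{
	atomic_flag_clear_explicit(&o->interlock, memory_order_release);
}

static void
ref_acquire(struct refobj *o)
{
	int old = atomic_load(&o->count);

	/* Lock-free path: only when this is not the 0->1 transition. */
	while (old > 0) {
		if (atomic_compare_exchange_weak(&o->count, &old, old + 1))
			return;
	}
	interlock_lock(o);
	if (atomic_fetch_add(&o->count, 1) == 0)
		o->on_freelist = false;	/* 0->1: leave the free list */
	interlock_unlock(o);
}

static void
ref_release(struct refobj *o)
{
	int old = atomic_load(&o->count);

	assert(old > 0);
	/* Lock-free path: only when this is not the 1->0 transition. */
	while (old > 1) {
		if (atomic_compare_exchange_weak(&o->count, &old, old - 1))
			return;
	}
	interlock_lock(o);
	if (atomic_fetch_sub(&o->count, 1) == 1)
		o->on_freelist = true;	/* 1->0: enter the free list */
	interlock_unlock(o);
}
```

Because both zero-crossing transitions happen under the same lock, the free-list state can never disagree with the counter even when other threads race on the lock-free paths.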
b3b0716b638ef3918b3c9f796e6b1d0d74fbde0d 11-Jul-2015 mjg <mjg@FreeBSD.org> vfs: always clear VI_OWEINACT in consumers bumping v_usecount

Previously vputx would detect the condition and clear the flag.

With this change it is invalid to have both v_usecount > 0 and the flag
set. Assert the condition is met in all relevant places.

Reviewed by: kib
74d7b1e72e856bb50946f5ac909aeea06fd2141e 11-Jul-2015 mjg <mjg@FreeBSD.org> vfs: move si_usecount manipulation to dedicated functions

Reviewed by: kib
564e88d4999f488b5bfa09bee75872f9c90ce973 11-Jul-2015 kib <kib@FreeBSD.org> Do not allow creation of the dirty buffers for the dead buffer
objects, i.e. for buffer objects which vnode was reclaimed. Buffer
cache cannot write such buffers. Return the error and discard the
buffer immediately on write attempt.

BO_DIRTY now always set during vnode reclamation, since it is used not
only for the INVARIANTS checks. Do allow placement of the clean
buffers on dead bufobj list, otherwise filesystems cannot use bufcache
at all after the devvp reclaim.

Reported and tested by: trasz
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
f94d7a8603d94f28c2c67817a28a10f38aa7eea4 05-Jul-2015 markj <markj@FreeBSD.org> Remove a stale descriptive comment for gbincore().

The splay trees referenced in the comment were converted to
path-compressed tries in r250551.

MFC after: 3 days
110a049369fb1fc8bf011fe87e28a699b154ecbe 23-Jun-2015 jmg <jmg@FreeBSD.org> zero this struct as it depends upon it...

Reviewed by: mjg
Differential Revision: https://reviews.freebsd.org/D2890
becc575eec74ac5ab5b5c8e2033984ef817278cd 17-Jun-2015 kib <kib@FreeBSD.org> vfs_msync(), called from syncer vnode fsync VOP, only iterates over
the active vnode list for the given mount point, with the assumption
that vnodes with dirty pages are active. This is enforced by
vinactive() doing vm_object_page_clean() pass over the vnode pages.

The issue is, if vinactive() cannot be called during vput() due to the
vnode being only shared-locked, we might end up with the dirty pages
for the vnode on the free list. Such vnode is invisible to syncer,
and pages are only cleaned on the vnode reactivation. In other words,
the race results in the broken guarantee that user data, written
through the mmap(2), is written to the disk not later than in 30
seconds after the write.

Fix this by keeping the vnode which is freed but still owing
inactivation, on the active list. When syncer loops find such vnode,
it is deactivated and cleaned by the final vput() call.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
d77dbf3761f318998d4f01063cea23ad73dc7bdf 27-May-2015 kib <kib@FreeBSD.org> Right now, dounmount() is called with unreferenced mount point.
Nothing stops a parallel unmount from succeeding before the given call to
dounmount() checks and locks the covered vnode. Prevent dounmount()
from acting on the freed (although type-stable) memory by changing the
interface to require the mount point to be referenced. dounmount()
consumes the reference on return, regardless of the successful or
erroneous result.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
ad77d0b1c1c3c09a9bf76bebb397102d08ed35c6 15-Apr-2015 rmacklem <rmacklem@FreeBSD.org> File systems that do not use the buffer cache (such as ZFS) must
use VOP_FSYNC() to perform the NFS server's Commit operation.
This patch adds a mnt_kern_flag called MNTK_USES_BCACHE which
is set by file systems that use the buffer cache. If this flag
is not set, the NFS server always does a VOP_FSYNC().
This should be ok for old file system modules that do not set
MNTK_USES_BCACHE, since calling VOP_FSYNC() is correct, although
it might not be optimal for file systems that use the buffer cache.

Reviewed by: kib
MFC after: 2 weeks
3bc9cbc06ab1177bb780e4c4e3f0d71e65acaf21 27-Feb-2015 kib <kib@FreeBSD.org> The VNASSERT in vflush() FORCECLOSE case is trying to panic early to
prevent errors from yanking devices out from under filesystems. Only
care about special vnodes on devfs, special nodes on other kinds of
filesystems do not have special properties.

Sponsored by: EMC / Isilon Storage Division
Submitted by: Conrad Meyer
MFC after: 1 week
994a2af400b8f7d3fad70460c95a8d6791133c20 17-Feb-2015 ngie <ngie@FreeBSD.org> Add the mnt_lockref field to the ddb(4) 'show mount' command

MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D1688
Submitted by: Conrad Meyer <conrad.meyer@isilon.com>
Sponsored by: EMC / Isilon Storage Division
4247c4fbb3c08542de966e57e8767f2c079df26a 14-Feb-2015 jhb <jhb@FreeBSD.org> Add two new counters for vnode life cycle events:
- vfs.recycles counts the number of vnodes forcefully recycled to avoid
exceeding kern.maxvnodes.
- vfs.vnodes_created counts the number of vnodes created by successful
calls to getnewvnode().

Differential Revision: https://reviews.freebsd.org/D1671
Reviewed by: kib
MFC after: 1 week
71715e274d4165b5e34b83a6a25c3f769083671f 25-Jan-2015 jhb <jhb@FreeBSD.org> Change the default VFS timestamp precision from seconds to microseconds.

Discussed on: arch@
MFC after: 2 weeks
ca2d4fd9c1acb3914333668ef0b231d092912fc0 13-Dec-2014 kib <kib@FreeBSD.org> The vinactive() call in vgonel() may start writes for the dirty pages,
creating delayed write buffers belonging to the reclaimed vnode. Put
the buffer cleanup code after inactivation.

Add asserts that ensure that buffer queues are empty and add BO_DEAD
flag for bufobj to check that no buffers are added after the cleanup.
BO_DEAD is only used by INVARIANTS-enabled kernels.

Reported and tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
86353785aa22b50719f7e109c7714ffd5c1482b8 09-Dec-2014 kib <kib@FreeBSD.org> Apply chunk forgotten in r275620. Remove local variable for real.

CID: 1257462
Sponsored by: The FreeBSD Foundation
1a8d4344d0a617f3f123dabedfb9c1615b94ebfa 08-Dec-2014 kib <kib@FreeBSD.org> Add functions syncer_suspend() and syncer_resume(), which are supposed
to be called before suspension and after resume, correspondingly. The
syncer_suspend() ensures that all filesystems dirty data and metadata
are saved to the permanent storage, and stops kernel threads which
might modify filesystems. The syncer_resume() restores stopped

For now, only syncer is stopped. This is needed, because each sync
loop causes superblock updates for UFS.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
143261aa90afa1d15189f4eb6ee1885d3766713f 15-Oct-2014 mjg <mjg@FreeBSD.org> Don't take devmtx unnecessarily in vn_isdisk.

MFC after: 1 week
eba83cccb3037f8be4fded0d6617347b34177193 01-Oct-2014 will <will@FreeBSD.org> In the syncer, drop the sync mutex while patting the watchdog.

Some watchdog drivers (like ipmi) need to sleep while patting the watchdog.
See sys/dev/ipmi/ipmi.c:ipmi_wd_event(), which calls malloc(M_WAITOK).

Submitted by: asomers
MFC after: 1 month
Sponsored by: Spectra Logic
MFSpectraBSD: 637548 on 2012/10/04
17b754a0261fb1acd7f0974416bde5667c726f31 03-Aug-2014 kib <kib@FreeBSD.org> Remove Giant acquisition from the mount and unmount paths.

It could be claimed that two things were reasonably protected by
Giant. One is vfsconf list links, which is converted to the new
dedicated sx vfsconf_sx. Another is vfsconf.vfc_refcount, which is
now updated with atomics.

Note that vfc_refcount still has the same races now as it has under
the Giant, the unload of filesystem modules can happen while the
module is still in use.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
717539518a7a07bad076cc7bf65f26f8f7ce42b4 29-Jul-2014 kib <kib@FreeBSD.org> Remove one-time use macros which check for the vnode lifecycle. More,
some parts of the checks are in fact redundant in the surrounding
code, and it is more clear what the conditions are by direct testing
of the flags. Two of the three macros were only used in assertions.

In vnlru_free(), all relevant parts of vholdl() were already inlined,
except the increment of v_holdcnt itself. Do not call vholdl() to do
the increment as well, this allows to make assertions in
vholdl()/vhold() more strict.

In v_incr_usecount(), call vholdl() before incrementing other ref
counters. The change is a no-op, but it makes it less surprising to see
the vnode state in debugger if interrupted inside v_incr_usecount().

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
9e5655843e8a396b95fa03ef3c67bb34e39b0775 12-Jun-2014 mav <mav@FreeBSD.org> Implement simple direct-mapped cache for popular filesystem identifiers to
avoid congestion on global mountlist_mtx mutex in vfs_busyfs(), while
traversing through the list of mount points.

This change significantly improves NFS server scalability, since it had
to do this translation for every request, and the global lock becomes quite

This code is more optimized for relatively small number of mount points.
On systems with hundreds of active mount points this simple cache may have
many collisions. But the original traversal code in that case should also
behave much worse, so we are not losing much.

Reviewed by: attilio
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
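
The shape of such a direct-mapped cache can be sketched as follows. Everything here is illustrative: the cache size, the hash in fsid_slot(), struct mnt and the function names are assumptions made for the example, and the locking and invalidation on unmount that real code needs are left out. Only the two-word layout of struct fsid matches fsid_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	FSID_CACHE_SIZE	256	/* illustrative; must be a power of two */

struct fsid { int32_t val[2]; };	/* same shape as fsid_t */
struct mnt { struct fsid fsid; };	/* hypothetical stand-in for struct mount */

static struct mnt *fsid_cache[FSID_CACHE_SIZE];

/* Hypothetical hash: fold both words into a slot index. */
static unsigned
fsid_slot(const struct fsid *f)
{
	uint32_t h = (uint32_t)f->val[0] ^ (uint32_t)f->val[1];

	h ^= h >> 16;
	h ^= h >> 8;
	return (h & (FSID_CACHE_SIZE - 1));
}

static int
fsid_equal(const struct fsid *a, const struct fsid *b)
{
	return (a->val[0] == b->val[0] && a->val[1] == b->val[1]);
}

/* Slow path: linear walk of all mounts (the part a global lock protects). */
static struct mnt *
mountlist_find(struct mnt *list, size_t n, const struct fsid *f)
{
	for (size_t i = 0; i < n; i++)
		if (fsid_equal(&list[i].fsid, f))
			return (&list[i]);
	return (NULL);
}

/* Fast path first; on a miss, fall back to the walk and refill the slot. */
static struct mnt *
lookup_mount(struct mnt *list, size_t n, const struct fsid *f, int *hit)
{
	unsigned slot = fsid_slot(f);
	struct mnt *mp = fsid_cache[slot];

	assert(hit != NULL);
	if (mp != NULL && fsid_equal(&mp->fsid, f)) {
		*hit = 1;
		return (mp);
	}
	*hit = 0;
	mp = mountlist_find(list, n, f);
	if (mp != NULL)
		fsid_cache[slot] = mp;	/* a colliding entry is evicted */
	return (mp);
}
```

Each slot holds at most one mount, so two popular filesystems that hash to the same slot keep evicting each other. That is the collision cost the commit message accepts for systems with hundreds of mount points.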
60cbac59442236d3d8d9a9d3475a8b1b7421ae33 11-Jun-2014 mav <mav@FreeBSD.org> Remove unneeded mountlist_mtx acquisition from sync_fsync().

All struct mount fields accessed by sync_fsync() are protected by MNT_MTX.
6106e186e66f75454d6257dad160bddf21d99d2a 08-Jun-2014 mav <mav@FreeBSD.org> Remove extra branching from r267232.

MFC after: 2 weeks
14c8389ad08e251819c8c35fcaa75e0519595437 08-Jun-2014 mav <mav@FreeBSD.org> Use atomics to modify numvnodes variable.

This allows to mostly avoid lock usage in getnewvnode_[drop_]reserve(),
that reduces number of global vnode_free_list_mtx mutex acquisitions
from 4 to 2 per NFS request on ZFS, improving SMP scalability.

Reviewed by: kib
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
7280c1da3992c942f0dc88a032005d9e88e283c0 21-May-2014 bjk <bjk@FreeBSD.org> Check for mismatched vref()/vdrop()

Assert that the hold count has not fallen below the use count, a situation
that would only happen when a vref() (or similar) is erroneously paired
with a vdrop(). This situation has not been observed in the wild, but
could be helpful for someone implementing a new filesystem.

Reviewed by: kib
Approved by: hrs (mentor)
6fcf6199a4a9aefe9f2e59d947f0e0df171367b5 22-Mar-2014 bdrewery <bdrewery@FreeBSD.org> Rename global cnt to vm_cnt to avoid shadowing.

To reduce the diff struct pcpu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version in case some out-of-tree module/code relies on
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
d973ab2c238486eba90659984a91cc2843190d7f 09-Oct-2013 kib <kib@FreeBSD.org> Do not flush buffers when the v_object of the passed vnode does not
really belong to it. Such vnodes, with the pointers to other vnodes
v_objects, are typically instantiated by the bypass filesystems.
Invalidating mappings of other vnode pages and the pages is wrong,
since reclamation of the upper vnode does not imply that lower vnode
is reclaimed too.

One of the consequences of the improper reclamation was destruction of
the wired mappings of the lower vnode pages, triggering miscellaneous
assertions in the VM system.

Reported by: John Marshall <john.marshall@riverwillow.com.au>
Tested by: John Marshall <john.marshall@riverwillow.com.au>, pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)
2f645d31a99aa44e5c7cb6007b01fe32eac949f8 01-Oct-2013 kib <kib@FreeBSD.org> When printing the vnode information from ddb, print the lengths of the
dirty and clean buffer queues.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (gjb)
26571b5c445dd5dd8a8fed19d62788c331e16ce4 29-Sep-2013 kib <kib@FreeBSD.org> For vunref(), try to upgrade the vnode lock if the function was called
with the vnode shared-locked. If upgrade succeeded, the inactivation
can be done immediately, instead of being postponed.

Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (glebius)
c58dbf73e0a6e0d54f57e8c3b51b2b554625defe 26-Sep-2013 kib <kib@FreeBSD.org> Acquire a hold reference on the vnode when a knote is instantiated.
Otherwise, knote keeps a pointer to a vnode which could become invalid
any time.

Reported by: many
Tested by: Patrick Lamaiziere <patfbsd@davenulle.org>
Discussed with: jmg
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: re (marius)
635a029a89244f62779cfb14643818f6544fdf31 17-Aug-2013 pjd <pjd@FreeBSD.org> In r114945 the line 'nmp = TAILQ_NEXT(mp, mnt_list);' was duplicated.
Instead of just removing the duplicate, convert the loop to TAILQ_FOREACH().
6660649d5cef4f23ef970e0df3195a939b46cbb3 28-Jul-2013 kib <kib@FreeBSD.org> When creation of the v_pollinfo raced and our instance of vpollinfo
must be destroyed, knlist_clear() and seldrain() calls could be
avoided, since vpollinfo was not used. More, the knlist_clear()
calling protocol requires the knlist locked, which is not true at the
call site.

Split the destruction into the helper destroy_vpollinfo_free(), and
call it when raced, instead of destroy_vpollinfo().

Reported and tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
fcfcea28a9ec4ca136a55eaa0a83020c836d673a 17-Jul-2013 kib <kib@FreeBSD.org> Clear the vnode knotes before destroying vpollinfo.

Reported and tested by: Patrick Lamaiziere <patfbsd@davenulle.org>
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
da0ffbeb0e4ca70cbf67d19e93ab1c5af518aa23 03-Jun-2013 kib <kib@FreeBSD.org> Be more generous when donating the current thread time to the owner of
the vnode lock while iterating over the free vnode list. Instead of
yielding, pause for 1 tick. The change is reported to help in some
virtualized environments.

Submitted by: Roger Pau Monné <roger.pau@citrix.com>
Discussed with: jilles
Tested by: pho
MFC after: 2 weeks
d7efebc4db8c1b875c915fb1a2759cb9df4f2956 31-May-2013 jeff <jeff@FreeBSD.org> - Convert the bufobj lock to rwlock.
- Use a shared bufobj lock in getblk() and inmem().
- Convert softdep's lk to rwlock to match the bufobj lock.
- Move INFREECNT to b_flags and protect it with the buf lock.
- Remove unnecessary locking around bremfree() and BKGRDINPROG.

Sponsored by: EMC / Isilon Storage Division
Discussed with: mckusick, kib, mdf
1cfa4a3bc403c2063b3d466b40803508808ccd88 12-May-2013 jeff <jeff@FreeBSD.org> - Add a new general purpose path-compressed radix trie which can be used
with any structure containing a uint64_t index. The tree code
auto-generates type safe wrappers.
- Eliminate the buf splay and replace it with pctrie. This is not only
significantly faster with large files but also allows for the possibility
of shared locking.

Reviewed by: alc, attilio
Sponsored by: EMC / Isilon Storage Division
dfd7a7f46d4cb6fa7e3cfa78fcff22319891d343 11-May-2013 kib <kib@FreeBSD.org> - Fix nullfs vnode reference leak in nullfs_reclaim_lowervp(). The
null_hashget() obtains the reference on the nullfs vnode, which must
be dropped.

- Fix a wart which existed from the introduction of the nullfs
caching, do not unlock lower vnode in the nullfs_reclaim_lowervp().
It should be innocent, but now it is also formally safe. Inform the
nullfs_reclaim() about this using the NULLV_NOUNLOCK flag set on
nullfs inode.

- Add a callback to the upper filesystems for the lower vnode
unlinking. When inactivating a nullfs vnode, check if the lower
vnode was unlinked, indicated by nullfs flag NULLV_DROP or VV_NOSYNC
on the lower vnode, and reclaim upper vnode if so. This allows
nullfs to purge cached vnodes for the unlinked lower vnode, avoiding
excessive caching.

Reported by: Göran Löwkrantz <goran.lowkrantz@ismobile.com>
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
bd49099038824e1817f3a5da01c52f82eebe392e 09-May-2013 marcel <marcel@FreeBSD.org> Add option WITNESS_NO_VNODE to suppress printing LORs between VNODE
locks. To support this, VNODE locks are created with the LK_IS_VNODE
flag. This flag is propagated down using the LO_IS_VNODE flag.

Note that WITNESS still records the LOR. Only the printing and the
optional entering into the kernel debugger is bypassed with the
f106bf787a663f94d44ad2b9916394298e13428e 04-May-2013 mdf <mdf@FreeBSD.org> Add missing vdrop() in error case.

Submitted by: Fahad (mohd.fahadullah@isilon.com)
MFC after: 1 week
90b89365fd2130fd73f9b1e2c3af63c35ce2a3d1 16-Apr-2013 rmacklem <rmacklem@FreeBSD.org> Allow the vnode to be unlocked for the weird case of
LK_EXCLOTHER. LK_EXCLOTHER is only used to acquire a
usecount on a vnode during NFSv4 recovery from an
expired lease.

Reported and tested by: pho
MFC after: 2 weeks
fa887dba7b511d88568bd53f128474c5635f9264 06-Apr-2013 jeff <jeff@FreeBSD.org> Prepare to replace the buf splay with a trie:

- Don't insert BKGRDMARKER bufs into the splay or dirty/clean buf lists.
No consumers need to find them there and it complicates the tree.
These flags are all FFS specific and could be moved out of the buf
- Use pbgetvp() and pbrelvp() to associate the background and journal
bufs with the vp. Not only is this much cheaper it makes more sense
for these transient bufs.
- Fix the assertions in pbget* and pbrel*. It's not safe to check list
pointers which were never initialized. Use the BX flags instead. We
also check B_PAGING in reassignbuf() so this should cover all cases.

Discussed with: kib, mckusick, attilio
Sponsored by: EMC / Isilon Storage Division
15bf891afe5ecb096114725fc8e6dc1cc3ef70d6 20-Feb-2013 attilio <attilio@FreeBSD.org> Rename VM_OBJECT_LOCK(), VM_OBJECT_UNLOCK() and VM_OBJECT_TRYLOCK() to
their "write" versions.

Sponsored by: EMC / Isilon storage division
658534ed5a02db4fef5b0630008502474d6c26d6 20-Feb-2013 attilio <attilio@FreeBSD.org> Switch vm_object lock to be a rwlock.
* VM_OBJECT_LOCK and VM_OBJECT_UNLOCK are mapped to write operations
* VM_OBJECT_SLEEP() is introduced as a general purpose primitive to
get a sleep operation using a VM_OBJECT_LOCK() as protection
* The approach must bear with vm_pager.h namespace pollution so many
files require including directly rwlock.h
5a71b324fb7db5e1e684b63e0aadd839a7f176be 14-Jan-2013 kib <kib@FreeBSD.org> Add a trivial comment to record the proper commit log for r245407:

Set the v_hash for a new vnode in the getnewvnode() to the value
calculated based on the vnode structure address. Filesystems using
vfs_hash_insert() override the v_hash using the standard formula of
(inode_number + mnt_hashseed). For other filesystems, the
initialization allows the vfs_hash_index() to provide useful hash too.

Suggested, reviewed and tested by: peter
Sponsored by: The FreeBSD Foundation
MFC after: 5 days
cef86179d2638a11b0d1650a4ab6ac5a856bce57 14-Jan-2013 kib <kib@FreeBSD.org> diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 7c243b6..0bdaf36 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -279,6 +279,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
#define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt)
#define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt)

+static int vnsz2log;

* Initialize the vnode management data structures.
@@ -293,6 +294,7 @@ SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
static void
vntblinit(void *dummy __unused)
+ u_int i;
int physvnodes, virtvnodes;

@@ -332,6 +334,9 @@ vntblinit(void *dummy __unused)
syncer_maxdelay = syncer_mask + 1;
mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF);
cv_init(&sync_wakeup, "syncer");
+ for (i = 1; i <= sizeof(struct vnode); i <<= 1)
+ vnsz2log++;
+ vnsz2log--;

@@ -1067,6 +1072,14 @@ alloc:

+ /*
+ * For the filesystems which do not use vfs_hash_insert(),
+ * still initialize v_hash to have vfs_hash_index() useful.
+ * E.g., nullfs uses vfs_hash_index() on the lower vnode for
+ * its own hashing.
+ */
+ vp->v_hash = (uintptr_t)vp >> vnsz2log;
*vpp = vp;
return (0);
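
The arithmetic in the diff above can be checked in isolation. The sketch below reuses the loop shape from the diff to compute floor(log2(size)) and then derives a hash from an address by shifting out the low bits. The names size_to_log2() and addr_hash() are hypothetical, and the sizes in the usage note are arbitrary examples, not the real sizeof(struct vnode).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* floor(log2(size)), computed with the same loop shape as the diff. */
static int
size_to_log2(size_t size)
{
	int log = 0;

	assert(size > 0);
	for (size_t i = 1; i <= size; i <<= 1)
		log++;
	return (log - 1);
}

/* Drop the low address bits that vary little between same-sized objects. */
static uintptr_t
addr_hash(const void *p, size_t size)
{
	return ((uintptr_t)p >> size_to_log2(size));
}
```

For a 480-byte object the loop counts nine doublings (1 through 256) and the decrement yields 8; for exactly 512 bytes the result is 9. Shifting the address by that amount gives nearby objects of that size distinct, densely packed hash values.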
27fa8d59ff9ba9772b2c8708ac558a08e5514100 26-Dec-2012 attilio <attilio@FreeBSD.org> Fixup r244240: mp_ncpus will be 1 also in the !SMP and smp_disabled=1
case. There is no point in optimizing further the code and use a TRUE
litteral for a path that does heavyweight stuff anyway (like lock acq),
at the price of obfuscated code.

Use the appropriate check where necessary and remove a macro.

Sponsored by: EMC / Isilon storage division
MFC after: 3 days
0d14b65c785757387aa6e75157e63bdd8bb2a0bc 21-Dec-2012 attilio <attilio@FreeBSD.org> Fixup r218424: uio_yield() was scaling directly to userland priority.
When kern_yield() was introduced with the possibility to specify
a new priority, the behaviour changed by not lowering priority at all
in the consumers, making the yielding mechanism highly ineffective for
high priority kthreads like bufdaemon, syncer, vlrudaemon, etc.
There is no evidence that consumers could bear with such a change in
semantics and this situation could finally lead to bugs similar to the
ones fixed in r244240.
Re-specify userland pri for kthreads involved.

Tested by: pho
Reviewed by: kib, mdf
MFC after: 1 week
28185626518e530030999ed40cd3ea13edef36ec 15-Dec-2012 kib <kib@FreeBSD.org> When mnt_vnode_next_active iterator cannot lock the next vnode and
yields, specify the user priority for the yield. Otherwise, a
higher-priority (kernel) thread could fall into the priority-inversion
with the thread owning the mutex lock.

On single-processor machines or UP kernels, do not loop adaptively
when the next vnode cannot be locked, instead yield unconditionally.

Restructure the iteration initializer and the iterator to remove code
duplication. Put the code to fetch and lock a vnode next to the
current marker, into the mnt_vnode_next_active() function, and use it
instead of repeating the loop.

Reported by: hrs, rmacklem
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 3 days
ac910a885b8cfd1135628cae925006f2525c7956 10-Dec-2012 kib <kib@FreeBSD.org> Do not yield while owning a mutex. The Giant reacquire in the
kern_yield() is problematic then.

The owned mutex is the mount interlock, and it is in fact not needed
to guarantee the stability of the mount list of active vnodes, so fix
the issue by only taking the mount interlock for MNT_REF and
MNT_REL operations.

While there, augment the unconditional yield by some amount of
spinning [1].

Reported and tested by: pho
Reviewed by: attilio
Submitted by: attilio [1]
MFC after: 3 days
69dfcf0272e9843f5132e35241d7124bbae253ba 03-Dec-2012 kib <kib@FreeBSD.org> The vnode_free_list_mtx is required unconditionally when iterating
over the active list. The mount interlock is not enough to guarantee
the validity of the tailq link pointers. The __mnt_vnode_next_active()
and __mnt_vnode_first_active() active list iterator helper functions
did not provide the necessary stability for the list, allowing the
iterators to pick garbage.

This was uncovered after the r243599 made the active list iterators

Since a vnode interlock is before the vnode_free_list_mtx, obtain the
vnode ilock in the non-blocking manner when under vnode_free_list_mtx,
and restart iteration after the yield if the lock attempt failed.

Assert that a vnode found on the list is active, and assert that the
helpers return the vnode with interlock owned.

Reported and tested by: pho
MFC after: 1 week
adc108b87e4c1bc0bc3165e6588350f49580effd 27-Nov-2012 davidxu <davidxu@FreeBSD.org> Take first active vnode correctly.

Reviewed by: kib
MFC after: 3 days
9c2d52ecdeada4a95b085a934ffcfb6123bfea18 24-Nov-2012 avg <avg@FreeBSD.org> assert_vop_locked: make the assertion race-free and more efficient

this is really a minor improvement for the sake of correctness

MFC after: 6 days
c20b4131f0fb070ac22f89cb415c9e89167c62bd 22-Nov-2012 avg <avg@FreeBSD.org> remove vop_lookup_pre and vop_lookup_post

Suggested by: kib
MFC after: 5 days
e331787780eea9671a3c0fa076f46f68aed00b42 19-Nov-2012 attilio <attilio@FreeBSD.org> insmntque() is always called with the lock held in exclusive mode,
- assume the lock is held in exclusive mode and remove a moot check
about the lock acquisition.
- in the destructor remove !MPSAFE specific chunk.

Reviewed by: kib
MFC after: 2 weeks
4726ba44fc9798f831806588e9c7a021fcd00bfc 19-Nov-2012 avg <avg@FreeBSD.org> assert_vop_locked should treat LK_EXCLOTHER as the not locked case

... from a perspective of the current thread.

Spotted by: mjg
Discussed with: kib
MFC after: 18 days
4d2c561ebf8be7ac1fcb05244c285a9eb14363d7 19-Nov-2012 avg <avg@FreeBSD.org> vnode_if: fix locking protocol description for lookup and cachedlookup

Also remove the checks from vop_lookup_pre and vop_lookup_post, which
are now completely redundant (before this change they were partially

Discussed with: kib
MFC after: 10 days
d5d551ec46edfdd7b35884740863514cf342c207 09-Nov-2012 attilio <attilio@FreeBSD.org> Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
0e30ee6e51d14354615209958bb9ad07f1a90590 05-Nov-2012 kib <kib@FreeBSD.org> A clarification to the behaviour of the active vnode list management
regarding the vnode page cleaning.

In collaboration with: pho
MFC after: 1 week
58ab2ca2eb887d7b629107d491de73eb43b2e14b 04-Nov-2012 kib <kib@FreeBSD.org> Add decoding of the missed MNT_KERN_ flags to ddb "show mount" command.

MFC after: 3 weeks
f8b34c9be8cbdb04363ea85620c3887ca03a7438 04-Nov-2012 kib <kib@FreeBSD.org> Add decoding of the missed VI_ and VV_ flags to ddb "show vnode" command.

MFC after: 3 days
84d617582f332d392fc3a1ad7a2b6a5b6fb175df 04-Nov-2012 kib <kib@FreeBSD.org> Order the enumeration of the MNT_ flags to be the same as the order of
their definitions.

MFC after: 3 days
560aa751e0f5cfef868bdf3fab01cdbc5169ef82 22-Oct-2012 kib <kib@FreeBSD.org> Remove the support for using non-mpsafe filesystem modules.

In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by: attilio
Tested by: pho
cc37fedcd0324cd8e44b810a580a7e9fcdd26f99 14-Oct-2012 kib <kib@FreeBSD.org> Add a KPI that allows reserving some amount of space in the numvnodes
counter, without actually allocating the vnodes. The supposed use of
the getnewvnode_reserve(9) is to reclaim enough free vnodes while the
code still does not hold any resources that might be needed during the
reclamation, and to consume the slack later for getnewvnode() calls
made from the innards. After the critical block is finished, the
caller shall free any reserve left, by getnewvnode_drop_reserve(9).

Reviewed by: avg
Tested by: pho
MFC after: 1 week
d63ec4c4ce1de03b00bdd4813aeee11ee77907ea 13-Sep-2012 attilio <attilio@FreeBSD.org> Remove all the checks on curthread != NULL with the exception of some MD
trap checks (eg. printtrap()).

Generally this check is not needed anymore, as there is not a legitimate
case where curthread == NULL, after pcpu 0 area has been properly

Reviewed by: bde, jhb
MFC after: 1 week
5a86a6849adc026d814651fd6afc864b2377b9c8 09-Sep-2012 kib <kib@FreeBSD.org> Add a facility for vgone() to inform the set of subscribed mounts
about vnode reclamation. Typical use is for the bypass mounts like
nullfs to get a notification about lower vnode going away.

Now, vgone() calls new VFS op vfs_reclaim_lowervp() with an argument
lowervp which is reclaimed. It is possible to register several
reclamation event listeners, to correctly handle the case of several
nullfs mounts over the same directory.

For the filesystem not having nullfs mounts over it, the overhead
added is a single mount interlock lock/unlock in the vnode reclamation

In collaboration with: pho
MFC after: 3 weeks
9d2e20143f198869149c9e2dc114da6153f2df12 22-Aug-2012 kib <kib@FreeBSD.org> Provide some compat32 shims for sysctl vfs.conflist. It is required
for getvfsbyname(3) operation when called from 32bit process, and
getvfsbyname(3) is used by recent bsdtar import.

Reported by: many
Tested by: David Naylor <naylor.b.david@gmail.com>
MFC after: 5 days
85a02186bcd95689dfa239312a24eb374da5a373 03-Jun-2012 avg <avg@FreeBSD.org> free wdog_kern_pat calls in post-panic paths from under SW_WATCHDOG

Those calls are useful with hardware watchdog drivers too.

MFC after: 3 weeks
6f4e16f8338923e9fd89009ec9cb4a5a3d770983 30-May-2012 kib <kib@FreeBSD.org> Add a rangelock implementation, intended to be used for range-locking
the i/o regions of the vnode data space. The implementation is quite
simple-minded, it uses the list of the lock requests, ordered by
arrival time. Each request may be for read or for write. The
implementation is fair FIFO.
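
As a rough illustration of the fair-FIFO behaviour described above, the
following Python model grants a request only if no earlier request in the
arrival-ordered list conflicts with it. This is a sketch under stated
assumptions, not the kernel code: the Request tuple, the half-open byte
ranges, and the granted() helper are all hypothetical names introduced for
illustration.

```python
from collections import namedtuple

# Hypothetical request record: half-open byte range [start, end) and a mode.
Request = namedtuple("Request", "name start end write")


def overlaps(a, b):
    """True when the two half-open ranges share at least one byte."""
    return a.start < b.end and b.start < a.end


def conflicts(a, b):
    """Overlapping requests conflict unless both are reads."""
    return overlaps(a, b) and (a.write or b.write)


def granted(queue):
    """Return the names of requests that may proceed, in arrival order.

    A request is granted only if no earlier request in the queue conflicts
    with it, so a waiting writer also holds back later overlapping readers.
    That is the assumed meaning of "fair FIFO" in this model.
    """
    result = []
    for i, req in enumerate(queue):
        if not any(conflicts(req, earlier) for earlier in queue[:i]):
            result.append(req.name)
    return result


queue = [
    Request("r1", 0, 100, write=False),
    Request("w1", 50, 150, write=True),    # overlaps r1, must wait
    Request("r2", 60, 80, write=False),    # queued behind the writer w1
    Request("r3", 200, 300, write=False),  # disjoint range, may proceed
]
print(granted(queue))
```

In this model the reader r2 waits even though it is compatible with r1,
because the writer w1 arrived first; that ordering is what keeps writers
from being starved.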

MFC after: 2 month
023bd7c6bf24588c8650d3cdc2bbe1a905ec1820 23-Apr-2012 trasz <trasz@FreeBSD.org> Remove unused thread argument to vrecycle().

Reviewed by: kib
baac623cd9f021cfdf826109aecab017d5cae570 23-Apr-2012 trasz <trasz@FreeBSD.org> Remove unused thread argument from vtruncbuf().

Reviewed by: kib
d9895ac1fe988335e2619f56d942d954d835bb07 20-Apr-2012 mckusick <mckusick@FreeBSD.org> This update uses the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point to replace
MNT_VNODE_FOREACH_ALL in the vfs_msync, ffs_sync_lazy, and qsync

The vfs_msync routine is run every 30 seconds for every writably
mounted filesystem. It ensures that any files mmap'ed from the
filesystem with modified pages have those pages queued to be
written back to the file from which they are mapped.

The ffs_lazy_sync and qsync routines are run every 30 seconds for
every writably mounted UFS/FFS filesystem. The ffs_lazy_sync routine
ensures that any files that have been accessed in the previous
30 seconds have had their access times queued for updating in the
filesystem. The qsync routine ensures that any files with modified
quotas have those quotas queued to be written back to their
associated quota file.

In a system configured with 250,000 vnodes, less than 1000 are
typically active at any point in time. Prior to this change all
250,000 vnodes would be locked and inspected twice every minute
by the syncer. For UFS/FFS filesystems they would be locked and
inspected six times every minute (twice by each of these three
routines since each of these routines does its own pass over the
vnodes associated with a mount point). With this change the syncer
now locks and inspects only the tiny set of vnodes that are active.
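
The figures in the paragraph above can be worked through as a small
calculation. The numbers are the ones quoted in this commit message; the
constant and function names are only illustrative, and 1000 is used as an
upper bound for the active set.

```python
# Figures quoted in the commit message above.
TOTAL_VNODES = 250_000
ACTIVE_VNODES = 1_000      # upper bound: "less than 1000 are typically active"
PASSES_PER_MINUTE = 2      # each routine runs every 30 seconds
UFS_ROUTINES = 3           # vfs_msync, ffs_sync_lazy and qsync


def inspections_per_minute(vnodes, routines):
    """Lock-and-inspect operations per minute for one mount point."""
    return vnodes * PASSES_PER_MINUTE * routines


before_any_fs = inspections_per_minute(TOTAL_VNODES, 1)
before_ufs = inspections_per_minute(TOTAL_VNODES, UFS_ROUTINES)
after_ufs = inspections_per_minute(ACTIVE_VNODES, UFS_ROUTINES)

print(before_any_fs, before_ufs, after_ufs)
print("reduction factor for UFS:", before_ufs // after_ufs)
```

With these inputs the UFS case drops from 1,500,000 to at most 6,000
lock-and-inspect operations per minute, a factor of about 250.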

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
5b7b29e35b332608560671cf15919cf03d76f9ca 20-Apr-2012 mckusick <mckusick@FreeBSD.org> This change creates a new list of active vnodes associated with
a mount point. Active vnodes are those with a non-zero use or hold
count, e.g., those vnodes that are not on the free list. Note that
this list is in addition to the list of all the vnodes associated
with a mount point.

To avoid adding another set of linkage pointers to the vnode
structure, the active list uses the existing linkage pointers
used by the free list (previously named v_freelist, now renamed

This update adds the MNT_VNODE_FOREACH_ACTIVE interface that loops
over just the active vnodes associated with a mount point (typically
less than 1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
a9a210460f06cd26a6306c11698f006e5227d818 18-Apr-2012 mckusick <mckusick@FreeBSD.org> Delete a no longer useful VNASSERT missed during changes in 234400.

Suggested by: kib
be8731298f76613cdf510faf03b669409c82ed02 18-Apr-2012 mckusick <mckusick@FreeBSD.org> Fix a memory leak of M_VNODE_MARKER introduced in 234386.

Found by: Peter Holm
841f20af50d952e3673fbec1bcd442c1310b31b5 17-Apr-2012 mckusick <mckusick@FreeBSD.org> Drop export of vdestroy() function from kern/vfs_subr.c as it is
used only as a helper function in that file. Replace sole call to
vbusy() with inline code in vholdl(). Replace sole calls to vfree()
and vdestroy() with inline code in vdropl().

The Clang compiler already inlines these functions, so they do not
show up in a kernel backtrace which is confusing. Also you cannot
set their frame in kgdb which means that it is impossible to view
their local variables. So, while the produced code is unchanged,
the debugging should be easier.

Discussed with: kib
MFC after: 2 weeks
ffee40eeff5500634fb5111e387f11dee781d197 17-Apr-2012 mckusick <mckusick@FreeBSD.org> Replace the MNT_VNODE_FOREACH interface with MNT_VNODE_FOREACH_ALL.
The primary changes are that the user of the interface no longer
needs to manage the mount-mutex locking and that the vnode that
is returned has its mutex locked (thus avoiding the need to check
to see if it is DOOMED or other possible end of life scenarios).

To minimize compatibility issues for third-party developers, the
old MNT_VNODE_FOREACH interface will remain available so that this
change can be MFC'ed to 9. Following the MFC to 9, MNT_VNODE_FOREACH
will be removed in head.

The reason for this update is to prepare for the addition of the
MNT_VNODE_FOREACH_ACTIVE interface that will loop over just the
active vnodes associated with a mount point (typically less than
1% of the vnodes associated with the mount point).

Reviewed by: kib
Tested by: Peter Holm
MFC after: 2 weeks
7901256b30173c2a03a22934f725363bd169a6b9 11-Apr-2012 mckusick <mckusick@FreeBSD.org> Export vinactive() from kern/vfs_subr.c (e.g., make it no longer
static and declare its prototype in sys/vnode.h) so that it can be
called from process_deferred_inactive() (in ufs/ffs/ffs_snapshot.c)
instead of the body of vinactive() being cut and pasted into

Reviewed by: kib
MFC after: 2 weeks
5abd2bb7cbd2cc42a6de61b0ed4363777f59a304 09-Mar-2012 kib <kib@FreeBSD.org> Decommission mnt_noasync. Introduce MNTK_NOASYNC mnt_kern_flag which
allows a filesystem to request VFS to not allow MNTK_ASYNC.

MFC after: 1 week
87f7f0cfe8d3d17bc154ca2c0dcd51bca1444006 25-Feb-2012 trociny <trociny@FreeBSD.org> When detaching an unix domain socket, uipc_detach() checks
unp->unp_vnode pointer to detect if there is a vnode associated with
(bound to) this socket and does necessary cleanup if there is.

The issue is that after forced unmount this check may be too late as
the unp_vnode is reclaimed and the reference is stale.

To fix this provide a helper function that is called on a socket vnode
reclamation to do necessary cleanup.

Pointed by: kib
Reviewed by: kib
MFC after: 2 weeks
52c17430bc70cd8c1e6dc2ff5c7786cc3f4871e4 06-Feb-2012 kib <kib@FreeBSD.org> Current implementations of sync(2) and syncer vnode fsync() VOP use
mnt_noasync counter to temporarily remove MNTK_ASYNC mount option, which
is needed to guarantee a synchronous completion of the initiated i/o
before syscall or VOP return. Global removal of MNTK_ASYNC option is
harmful because not only i/o started from corresponding thread becomes
synchronous, but all i/o is synchronous on the filesystem which is
initiated during sync(2) or syncer activity.

Instead of removing MNTK_ASYNC from mnt_kern_flag, provide a local
thread flag to disable async i/o for current thread only. Use the
opportunity to move DOINGASYNC() macro into sys/vnode.h and
consistently use it through places which tested for MNTK_ASYNC.

Some testing demonstrated 60-70% improvements in run time for the
metadata-intensive operations on async-mounted UFS volumes, but still
with great deviation due to other reasons.

Reviewed by: mckusick
Tested by: scottl
MFC after: 2 weeks
cc993a6b7525aff45c9554a479cd681d4bf2f573 25-Jan-2012 kib <kib@FreeBSD.org> When doing vflush(WRITECLOSE), clean vnode pages.

Unmounts do vfs_msync() before calling VFS_UNMOUNT(), but there is
still a race allowing a process to dirty pages after msync
finished. Remounts rw->ro just left dirty pages in system.

Reviewed by: alc, tegge (long time ago)
Tested by: pho
MFC after: 2 weeks
af2e331939df2a83069ad4a6f5ab17eea9f82e8b 17-Jan-2012 mckusick <mckusick@FreeBSD.org> Make sure all intermediate variables holding mount flags (mnt_flag)
and that all internal kernel calls passing mount flags are declared
as uint64_t so that flags in the top 32-bits are not lost.

MFC after: 2 weeks
cdafa9e162a703f6bd82dc3309887bd15ea8def0 06-Jan-2012 jhb <jhb@FreeBSD.org> Use proper argument structure types for the extattr post-VOP hooks.
The wrong structure happened to work since the only argument used was
the vnode which is in the same place in both VOP_SETATTR() and the two
extattr VOPs.

MFC after: 3 days
33f06e36cf482807ff85997787e2b99952b68c4c 23-Dec-2011 jhb <jhb@FreeBSD.org> Add post-VOP hooks for VOP_DELETEEXTATTR() and VOP_SETEXTATTR() and use
these to trigger a NOTE_ATTRIB EVFILT_VNODE kevent when the extended
attributes of a vnode are changed.

Note that OS X already implements this behavior.

Reviewed by: rwatson
MFC after: 2 weeks
78c075174e74e727279365476d0d076d6c3e3075 04-Nov-2011 jhb <jhb@FreeBSD.org> Add the posix_fadvise(2) system call. It is somewhat similar to
madvise(2) except that it operates on a file descriptor instead of a
memory region. It is currently only supported on regular files.

Just as with madvise(2), the advice given to posix_fadvise(2) can be
divided into two types. The first type provides hints about data access
patterns; these are used in the file read and write routines to modify the
I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are
thus filesystem independent. Note that to ease implementation (and
since this API is only advisory anyway), only a single non-normal
range is allowed per file descriptor.

The second type of hints are used to hint to the OS that data will or
will not be used. These hints are implemented via a new VOP_ADVISE().
A default implementation is provided which does nothing for the WILLNEED
request and attempts to move any clean pages to the cache page queue for
the DONTNEED request. This latter case required two other changes.
First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests
vinvalbuf() to only flush clean buffers for the vnode from the buffer
cache and to not remove any backing pages from the vnode. This is
used to ensure clean pages are not wired into the buffer cache before
attempting to move them to the cache page queue. The second change adds
a new vm_object_page_cache() method. This method is somewhat similar to
vm_object_page_remove() except that instead of freeing each page in the
specified range, it attempts to move clean pages to the cache queue if

To preserve the ABI of struct file, the f_cdevpriv pointer is now reused
in a union to point to the currently active advice region if one is
present for regular files.

Reviewed by: jilles, kib, arch@
Approved by: re (kib)
MFC after: 1 month
b4486349bd6ee7fb7358793aa61b4efbdefd9937 27-Oct-2011 jhb <jhb@FreeBSD.org> Whitespace fix.
b49c6b382f186d015afccf6b313c3dde18c689d5 25-Oct-2011 pjd <pjd@FreeBSD.org> The v_data field is a pointer, so set it to NULL, not 0.

MFC after: 3 days
9c782bcdf637c77a4fd07dbf73eb4515c5085b65 07-Oct-2011 jonathan <jonathan@FreeBSD.org> Change one printf() to log().

As noted in kern/159780, printf() is not very jail-friendly, since it can't be easily monitored by jail management tools. This patch reports an error via log() instead, which, if nobody is watching the log file, still prints to the console.

Approved by: mentor (rwatson)
Submitted by: Eugene Grosbein <eugen@eg.sd.rdtc.ru>
MFC after: 5 days
cc9fd3ad5383affc8a0eb5db3fe3cdbedf00ef85 04-Oct-2011 kib <kib@FreeBSD.org> Move parts of the commit log for r166167, where Tor explained the
interaction between vnode locks and vfs_busy(), into comment.

MFC after: 1 week
683d7a54ce4dd84b1a8914748ed15c18a63164d8 25-Aug-2011 attilio <attilio@FreeBSD.org> Fix a deficiency in the selinfo interface:
If a selinfo object is recorded (via selrecord()) and then it is
quickly destroyed, with the waiters missing the opportunity to awake,
at the next iteration they will find the selinfo object destroyed,
causing a PF#.

That happens because the selinfo interface has no way to drain the
waiters before destroying the registered selinfo object. Also this
race is quite rare to get in practice, because it would require a
selrecord(), a poll request by another thread and a quick destruction
of the selrecord()'ed selinfo object.

Fix this by adding the seldrain() routine which should be called
before destroying the selinfo objects (in order to avoid such a case),
and fix the present cases where it might have already been called.
Sometimes, the context is safe enough to prevent this type of race,
as happens in device drivers which install selinfo objects on
poll callbacks. There, the destruction of the selinfo object happens
at driver detach time, when all the file descriptors should be already
closed, thus there cannot be a race.
For this case, the mfi(4) device driver can serve as an example, as it
implements fully correct logic for preventing this from happening.

Sponsored by: Sandvine Incorporated
Reported by: rstone
Tested by: pluknet
Reviewed by: jhb, kib
Approved by: re (bz)
MFC after: 3 weeks
ffeefed9fc8fa85d3fcbd19640ba38e51e2ff4da 24-Jul-2011 mckusick <mckusick@FreeBSD.org> Move the MNTK_SUJ flag in mnt_kern_flag to MNT_SUJ in mnt_flag
so that it is visible to userland programs. This change enables
the `mount' command with no arguments to be able to show if a
filesystem is mounted using journaled soft updates as opposed
to just normal soft updates.

Approved by: re (bz)
21902be08cad124037a2152459b485a54308e5ca 29-Jun-2011 alc <alc@FreeBSD.org> Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this
option to vm_object_page_remove() asserts that the specified range of pages
is not mapped, or more precisely that none of these pages have any managed
mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on
the pages.

This change not only saves time by eliminating pointless calls to
pmap_remove_all(), but it also eliminates an inconsistency in the use of
pmap_remove_all() versus related functions, like pmap_remove_write(). It
eliminates harmless but pointless calls to pmap_remove_all() that were being
performed on PG_UNMANAGED pages.

Update all of the existing assertions on pmap_remove_all() to reflect this

Reviewed by: kib
d77535edb69e85936620addc80ca253da18f745e 24-Jun-2011 jonathan <jonathan@FreeBSD.org> Tidy up a capabilities-related comment.

This comment refers to an #ifdef that hasn't been merged [yet?]; remove it.

Approved by: rwatson
bbbc4c545502c531d59efe43f9b7bde9b9a92979 13-May-2011 mdf <mdf@FreeBSD.org> Use a name instead of a magic number for kern_yield(9) when the priority
should not change. Fetch the td_user_pri under the thread lock. This
is probably not necessary but a magic number also seems preferable to
knowing the implementation details here.

Requested by: Jason Behmer < jason DOT behmer AT isilon DOT com >
d685681d59c6feed493660adbeb6140c4fdab936 28-Apr-2011 attilio <attilio@FreeBSD.org> Add the watchdogs patting during the (shutdown time) disk syncing and
disk dumping.
With the option SW_WATCHDOG on, these operations are doomed to let
watchdog fire, if they take too long.

I implemented the stubs this way because I really want wdog_kern_*
KPI to not be dependent on SW_WATCHDOG being on (and really, the option
only enables watchdog activation in hardclock) and also avoid to
call them when not necessary (avoiding involuntary watchdog

Sponsored by: Sandvine Incorporated
Discussed with: emaste, des
MFC after: 2 weeks
872195caf937fd6dc952eeeed83c61258243bb32 23-Apr-2011 rmacklem <rmacklem@FreeBSD.org> Fix a LOR in vfs_busy() where, after msleeping, it would lock
the mutexes in the wrong order for the case where the
MBF_MNTLSTLOCK is set. I believe this did have the
potential for deadlock. For example, if multiple nfsd threads
called vfs_busyfs(), which calls vfs_busy() with MBF_MNTLSTLOCK.
Thanks go to pho for catching this during his testing.

Tested by: pho
Submitted by: kib
MFC after: 2 weeks
6d33997006747ff0335fff5289414202b0680dcc 04-Apr-2011 pluknet <pluknet@FreeBSD.org> Remove malloc type M_NETADDR unused since splitting into vfs_subr.c
and vfs_export.c.

MFC after: 1 week
4d0733e0f8bd37f600ca86b0f1323a24ed9c7fae 08-Mar-2011 kib <kib@FreeBSD.org> Do not assert buffer lock in VFS_STRATEGY() when kernel already panicked.

Sponsored by: The FreeBSD Foundation
MFC after: 1 week
33ee365b5548dd6130fd6a2707e2169369e1fab6 08-Feb-2011 mdf <mdf@FreeBSD.org> Based on discussions on the svn-src mailing list, rework r218195:

- entirely eliminate some calls to uio_yield() as being unnecessary,
such as in a sysctl handler.

- move should_yield() and maybe_yield() to kern_synch.c and move the
prototypes from sys/uio.h to sys/proc.h

- add a slightly more generic kern_yield() that can replace the
functionality of uio_yield().

- replace source uses of uio_yield() with the functional equivalent,
or in some cases do not change the thread priority when switching.

- fix a logic inversion bug in vlrureclaim(), pointed out by bde@.

- instead of using the per-cpu last switched ticks, use a per thread
variable for should_yield(). With PREEMPTION, the only reasonable
use of this is to determine if a lock has been held a long time and
relinquish it. Without PREEMPTION, this is essentially the same as
the per-cpu variable.
b291e9a36525d7da10edd8e2df1a5b70f92905af 02-Feb-2011 mdf <mdf@FreeBSD.org> Put the general logic for being a CPU hog into a new function
should_yield(). Use this in various places. Encapsulate the common
case of check-and-yield into a new function maybe_yield().

Change several checks for a magic number of iterations to use
should_yield() instead.

MFC after: 1 week
f5d7ab843b98fb85ec392d084b5362e83acfa842 25-Jan-2011 kib <kib@FreeBSD.org> When vtruncbuf() iterates over the vnode buffer list, lock buffer object
before checking the validity of the next buffer pointer. Otherwise, the
buffer might be reclaimed after the check, causing iteration to run into
wrong buffer.

Reported and tested by: pho
MFC after: 1 week
ca68749f2ad8bc8992280cb0a9ad055f2958f9c8 18-Jan-2011 mdf <mdf@FreeBSD.org> Specify a CTLTYPE_FOO so that a future sysctl(8) change does not need
to rely on the format string.
f6a71a40b2504dd316580a282777fc25cc857800 12-Jan-2011 mdf <mdf@FreeBSD.org> sysctl(9) cleanup checkpoint: amd64 GENERIC builds cleanly.

Commit the kernel changes.
3a7fd5f8c7bf1c040291ae6d1a90a9a7c363a513 06-Jan-2011 jhb <jhb@FreeBSD.org> - Restore dropping the priority of syncer down to PPAUSE when it is idle.
This was lost when it was converted to using a condition variable instead
of lbolt.
- Drop the priority of flowtable down to PPAUSE when it is idle as well
since it is a similar background task.

MFC after: 2 weeks
747e187ac46599961803e8330326df72f4ecbef1 27-Dec-2010 kib <kib@FreeBSD.org> Teach ddb "show mount" about MNTK_SUJ flag.
87501c4bfefe33b233bd5e4f18fc8cb483a689a2 24-Nov-2010 kib <kib@FreeBSD.org> Allow shared-locked vnode to be passed to vunref(9).
When shared-locked vnode is supplied as an argument to vunref(9) and
resulting usecount is 0, set VI_OWEINACT and do not try to upgrade vnode
lock. The latter could cause vnode unlock, allowing the vnode to be
reclaimed meantime.

Tested by: pho
MFC after: 1 week
7980fb6d3a02d0329ef2263493dc0c6e59d193d9 19-Nov-2010 kib <kib@FreeBSD.org> Remove prtactive variable and related printf()s in the vop_inactive
and vop_reclaim() methods. They seem to be unused, and the reported
situation is normal for the forced unmount.

MFC after: 1 week
X-MFC-note: keep prtactive symbol in vfs_subr.c
d3c1b43ec6f7b4a016b944a1218469f80bcd8496 14-Nov-2010 brucec <brucec@FreeBSD.org> Fix some more style(9) issues.
34e35cbdead8771354fbf95aaddee934f313d158 14-Nov-2010 brucec <brucec@FreeBSD.org> Fix style(9) issues from r215281 and r215282.

MFC after: 1 week
76f2d09c8d23879fca40e24956aa3f0296312247 14-Nov-2010 brucec <brucec@FreeBSD.org> Add descriptions to some more sysctls.

PR: kern/148510
MFC after: 1 week
107ea66c07f4e9862fd2961da07c078eab84d05f 11-Sep-2010 kib <kib@FreeBSD.org> Protect mnt_syncer with the sync_mtx. This prevents a (rare) vnode leak
when mount and update are executed in parallel.

Encapsulate syncer vnode deallocation into the helper function
vfs_deallocate_syncvnode(), to not externalize sync_mtx from vfs_subr.c.

Found and reviewed by: jh (previous version of the patch)
Tested by: pho
MFC after: 3 weeks
5216dea167c38680c62c1f24a81f70232eafadb7 01-Sep-2010 emaste <emaste@FreeBSD.org> As long as we are going to panic anyway, there's no need to hide additional
information behind DIAGNOSTIC.
c2cb836190c82be4859c68f5fabd199508638762 30-Aug-2010 jh <jh@FreeBSD.org> execve(2) has a special check for file permissions: a file must have at
least one execute bit set, otherwise execve(2) will return EACCES even
for a user with PRIV_VFS_EXEC privilege.

Add the check also to vaccess(9), vaccess_acl_nfs4(9) and
vaccess_acl_posix1e(9). This makes access(2) agree better with
execve(2). Because ZFS doesn't use vaccess(9) for VEXEC, add the check
to zfs_freebsd_access() too. There may be other file systems which are
not using vaccess*() functions and need to be handled separately.
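
The rule described above can be modelled in a few lines. This is a
simplified sketch, not the vaccess(9) implementation: the function name
and its boolean parameters are hypothetical, and the ordinary permission
check is reduced to a single flag supplied by the caller.

```python
import stat

# Any of the three execute bits (owner, group, other).
ANY_EXEC = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def exec_permitted(mode, normal_check_passes, has_exec_privilege):
    """Model of the special execute check.

    If the file has no execute bit set at all, execution is refused
    (EACCES in the kernel) even for a caller holding the exec privilege.
    Otherwise the result is the ordinary check, with the privilege acting
    as an override.
    """
    if not mode & ANY_EXEC:
        return False
    return normal_check_passes or has_exec_privilege


# Privileged caller, file mode 0644: refused, no execute bit anywhere.
print(exec_permitted(0o644, False, True))
# Privileged caller, file mode 0744: allowed, one execute bit is enough.
print(exec_permitted(0o744, False, True))
```

The point of the model is that the privilege can override who may execute
the file, but not whether the file is executable at all.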

PR: kern/125009
Reviewed by: bde, trasz
Approved by: pjd (ZFS part)
bc73fabf276ad141999d460cfc27ecb6624741a0 28-Aug-2010 pjd <pjd@FreeBSD.org> There is a bug in vfs_allocate_syncvnode() failure handling in mount code.
Actually it is hard to properly handle such a failure, especially in MNT_UPDATE
case. The only reason for the vfs_allocate_syncvnode() function to fail is
getnewvnode() failure. Fortunately it is impossible for current implementation
of getnewvnode() to fail, so we can assert this and make
vfs_allocate_syncvnode() void. This in turn frees us from handling its failures
in the mount code.

Reviewed by: kib
MFC after: 1 month
ade28bdd4036313b9b17b915d917a7851f02122f 12-Aug-2010 kib <kib@FreeBSD.org> The buffer's b_vflags field is not always properly protected by
bufobj lock. If b_bufobj is not NULL, then bufobj lock should be
held when manipulating the flags. Not doing this sometimes leaves
BV_BKGRDINPROG to be erroneously set, causing softdep's getdirtybuf() to
get stuck indefinitely in "getbuf" sleep, waiting for background write to
finish which is not actually performed.

Add BO_LOCK() in the cases where it was missed.

In collaboration with: pho
Tested by: bz
Reviewed by: jeff
MFC after: 1 month
b6ec5a5f0a2402c67b95bcef3364f3b7fbfc6ffc 04-Aug-2010 alc <alc@FreeBSD.org> In order for MAXVNODES_MAX to be an "int" on powerpc and sparc, we must
cast PAGE_SIZE to an "int". (Powerpc and sparc, unlike the other
architectures, define PAGE_SIZE as a "long".)

Submitted by: Andreas Tobler
329f9f043551c019dc1b38d678b2b49c227ebee8 02-Aug-2010 alc <alc@FreeBSD.org> Update the "desiredvnodes" calculation. In particular, make the part of
the calculation that is based on the kernel's heap size more conservative.
Hopefully, this will eliminate the need for MAXVNODES_MAX, but for the
time being set MAXVNODES_MAX to a large value.

Reviewed by: jhb@
MFC after: 6 weeks
76489ac1ea604f511232838164573ea21e9a74a8 21-Jun-2010 ed <ed@FreeBSD.org> Use ISO C99 integer types in sys/kern where possible.

There are only about 100 occurrences of the BSD-specific u_int*_t
datatypes in sys/kern. The ISO C99 integer types are used here more
b3024a4af99afe33dd2b989126703db18d03ed46 17-Jun-2010 pjd <pjd@FreeBSD.org> Backout r207970 for now, it can lead to deadlocks.

Reported by: kan
MFC after: 3 days
2ba33ab98eb607efe3f9a55b0cd863ca7a7fad43 03-Jun-2010 kib <kib@FreeBSD.org> Sometimes vnodes share the lock despite being different vnodes on
different mount points, e.g. the nullfs vnode and the covered vnode
from the lower filesystem. In this case, existing assertion in
vop_rename_pre() may be triggered.

Check for vnode locks equality instead of the vnodes themselves to
not trip over the situation.

Submitted by: Mikolaj Golub <to.my.trociny@gmail.com>
Tested by: pho
MFC after: 2 weeks
773cda6040b0b1d6fe89422c1946947d521fd2eb 12-May-2010 zml <zml@FreeBSD.org> Add VOP_ADVLOCKPURGE so that the file system is called when purging
locks (in the case where the VFS impl isn't using lf_*)

Submitted by: Matthew Fleming <matthew.fleming@isilon.com>
Reviewed by: zml, dfr
05f836c1c3fdb085eb93908c94736251d6a86f6d 12-May-2010 pjd <pjd@FreeBSD.org> When there is no memory or KVA, try to help by reclaiming some vnodes.
This helps with 'kmem_map too small' panics.

No objections from: kib
Tested by: Alexander V. Ribchansky <shurik@zk.informjust.ua>
MFC after: 1 week
f1b200bbcc33b7b3791b8dd805a525fe2f856723 11-May-2010 pjd <pjd@FreeBSD.org> I added vfs_lowvnodes event, but it was only used for a short while and now
it is totally unused. Remove it.

MFC after: 3 days
a57449541074720475dfc21dfb8b025695b573eb 24-Apr-2010 jeff <jeff@FreeBSD.org> - Merge soft-updates journaling from projects/suj/head into head. This
brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.

Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm
6b4bef1bca4df32af447cc3bd048188556c784ff 04-Apr-2010 jh <jh@FreeBSD.org> Add missing MNT_NFS4ACLS.
bd6cec6aca6f9a9972968a44bede22ab98da6c91 03-Apr-2010 pjd <pjd@FreeBSD.org> Fix some whitespace nits.
f0663e1c41c37cf06ece9fe56ae81e509a220ac7 03-Apr-2010 pjd <pjd@FreeBSD.org> Add missing mnt_kern_flag flags in 'show mount' output.
86c35b90b77323913c46c2542f163e5102676b8c 02-Apr-2010 kib <kib@FreeBSD.org> Add function vop_rename_fail(9) that performs needed cleanup for locks
and references of the VOP_RENAME(9) arguments. Use vop_rename_fail()
in deadfs_rename().

Tested by: Mikolaj Golub
MFC after: 1 week
1c4578d2239497e62b1656eb3cc8b6a85b145fad 17-Jan-2010 kib <kib@FreeBSD.org> Add new function vunref(9) that decrements vnode use count (and hold
count) while vnode is exclusively locked.

The code for vput(9), vrele(9) and vunref(9) is merged.

In collaboration with: pho
Reviewed by: alc
MFC after: 3 weeks
56603546c67f75a4f5ead37f6bdeb0c40db07c8e 28-Dec-2009 kib <kib@FreeBSD.org> Add a knob to allow reclaim of the directory vnodes that are source of
the namecache records. The reclamation is not enabled by default because
for typical workload it would make namecache unusable, but large nested
directory tree easily puts any process that accesses filesystem into 1
second wait for vlru.

Reported by: yar (long time ago)
MFC after: 3 days
a78d6a0fdab8ef9df3cfbb1d0b47b6f76cb8a71c 26-Dec-2009 trasz <trasz@FreeBSD.org> Now that all the callers seem to be fixed, add KASSERTs to make sure VAPPEND
is not being used improperly.
b79e14054c8e7da84bd67e9e6e02fccceac99d76 21-Dec-2009 kib <kib@FreeBSD.org> VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object
flag. Besides providing redundant information, the need to update both
vnode and object flags causes more acquisitions of the vnode interlock.
OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects.

Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for
vnode-backed vm objects.

Suggested and reviewed by: alc
Tested by: pho
MFC after: 3 weeks
0c0aa71530aa1d7a3f406f6e7c6c075b16f6d76d 19-Nov-2009 jh <jh@FreeBSD.org> Extend ddb(4) "show mount" command to print active string mount options.
Note that only option names are printed, not values.

Reviewed by: pjd
Approved by: trasz (mentor)
MFC after: 2 weeks
d5661d631d28fee99ac0a8758c56535ac2c9b01b 01-Oct-2009 trasz <trasz@FreeBSD.org> Provide default implementation for VOP_ACCESS(9), so that filesystems which
want to provide VOP_ACCESSX(9) don't have to implement both. Note that
this commit makes implementation of either of these two mandatory.

Reviewed by: kib
53eaed07bcf1ec0f1a38e0b490d95f1ee307055a 12-Sep-2009 rwatson <rwatson@FreeBSD.org> Use C99 initialization for struct filterops.

Obtained from: Mac OS X
Sponsored by: Apple Inc.
MFC after: 3 weeks
3119d26c952b144660c1507510826da62234c628 09-Sep-2009 kib <kib@FreeBSD.org> In vfs_mark_atime(9), be resistant against reclaimed vnodes.
Assert that necessary locks are taken, since vop might not be called.

Tested by: pho
MFC after: 3 days
c0264518e962730af09892510b13ca35bdb38741 02-Jul-2009 jamie <jamie@FreeBSD.org> Call prison_check from vfs_suser rather than re-implementing it.

Approved by: re (kib), bz (mentor)
e1cb2941d4424de90eb68716d6c4d95f4c0af0ba 10-Jun-2009 kib <kib@FreeBSD.org> Adapt vfs kqfilter to the shared vnode lock used by zfs write vop. Use
vnode interlock to protect the knote fields [1]. The locking assumes
that shared vnode lock is held, thus we get exclusive access to knote
either by exclusive vnode lock protection, or by shared vnode lock +
vnode interlock.

Do not use kl_locked() method to assert either lock ownership or the
fact that curthread does not own the lock. For shared locks, ownership
is not recorded, e.g. VOP_ISLOCKED can return LK_SHARED for the shared
lock not owned by curthread, causing false positives in kqueue subsystem
assertions about knlist lock.

Remove kl_locked method from knlist lock vector, and add two separate
assertion methods kl_assert_locked and kl_assert_unlocked, that are
supposed to use proper asserts. Change knlist_init accordingly.

Add convenience function knlist_init_mtx to reduce number of arguments
for typical knlist initialization.

Submitted by: jhb [1]
Noted by: jhb [2]
Reviewed by: jhb
Tested by: rnoland
f4934662e5d837053e785525653e390ce6933d2b 05-Jun-2009 rwatson <rwatson@FreeBSD.org> Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERIC
and used in a large number of files, but also because an increasing number
of incorrect uses of MAC calls were sneaking in due to copy-and-paste of
MAC-aware code without the associated opt_mac.h include.

Discussed with: pjd
e6a06610ffed1cb06d6b1cfa7d9350bde5926b4d 30-May-2009 attilio <attilio@FreeBSD.org> Remove the now invalid (and possibly unused) debug.mpsafevfs

Reviewed by: emaste
Sponsored by: Sandvine Incorporated
0c63bcbfa4fb9f208ea176334478d17cafd66eac 30-May-2009 trasz <trasz@FreeBSD.org> Add VOP_ACCESSX, which can be used to query for newly added V*
permissions, such as VWRITE_ACL. For filesystems that don't
implement it, there is a default implementation, which works
as a wrapper around VOP_ACCESS.

Reviewed by: rwatson@
a013e0afcbb44052a86a7977277d669d8883b7e7 27-May-2009 jamie <jamie@FreeBSD.org> Add hierarchical jails. A jail may further virtualize its environment
by creating a child jail, which is visible to that jail and to any
parent jails. Child jails may be restricted more than their parents,
but never less. Jail names reflect this hierarchy, being MIB-style
dot-separated strings.

Every thread now points to a jail, the default being prison0, which
contains information about the physical system. Prison0's root
directory is the same as rootvnode; its hostname is the same as the
global hostname, and its securelevel replaces the global securelevel.
Note that the variable "securelevel" has actually gone away, which
should not cause any problems for code that properly uses
securelevel_gt() and securelevel_ge().

Some jail-related permissions that were kept in global variables and
set via sysctls are now per-jail settings. The sysctls still exist for
backward compatibility, used only by the now-deprecated jail(2) system

Approved by: bz (mentor)
1dcb84131b6679f5d53452314d4ca1dfe2d8c5f0 11-May-2009 attilio <attilio@FreeBSD.org> Remove the thread argument from the FSD (File-System Dependent) parts of
the VFS. Now all the VFS_* functions and related parts no longer take the
context, since it always refers to curthread.

In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.

While here fix a bug: now, UFS_EXTATTR can be compiled alone without the

The VFS KPI is heavily changed by this commit, so third-party modules need
to be recompiled. Bump __FreeBSD_version in order to signal such
ca739eae4a06d758f8eae83c7b26012ff5898d3d 29-Mar-2009 kan <kan@FreeBSD.org> Replace v_dd vnode pointer with v_cache_dd pointer to struct namecache
in directory vnodes. Allow namecache dotdot entry to be created pointing
from child vnode to parent vnode if no existing links in opposite
direction exist. Use direct link from parent to child for dotdot lookups

This restores more efficient dotdot caching in NFS filesystems which
was lost when vnodes stopped being type stable.

Reviewed by: kib
e17295cf6f26850137ec6961ede7e51335800027 02-Mar-2009 kan <kan@FreeBSD.org> Change vfs_busy to wait until an outcome of pending unmount
operation is known and to retry or fail according to that
outcome. This fixes the problem with namespace traversing
programs failing with random ENOENT errors if someone just
happened to try to unmount that same filesystem at the same

Reported by: dhw
Reviewed by: kib, attilio
Sponsored by: Juniper Networks, Inc.
f856c6d618010b9224df31d6521124b672608255 06-Feb-2009 jhb <jhb@FreeBSD.org> Tweak the output of VOP_PRINT/vn_printf() some.
- Align the fifo output in fifo_print() with other vn_printf() output.
- Remove the leading space from lockmgr_printinfo() so its output lines up
in vn_printf().
- lockmgr_printinfo() now ends with a newline, so remove an extra newline
from vn_printf().
d102122bd0838dfffdce66a0c2a7a5dded257d04 06-Feb-2009 trasz <trasz@FreeBSD.org> Add KASSERTs to make it easier to debug problems like the one fixed
in r188141.

Reviewed by: kib,attilio
Approved by: rwatson (mentor)
Tested by: pho
Sponsored by: FreeBSD Foundation
beddfe59b032db4229498b2f018c8d86779f5536 05-Feb-2009 attilio <attilio@FreeBSD.org> Add more KTR_VFS logging points in order to have more effective tracing.

Reviewed by: brueffer, kib
Tested by: Gianni Trematerra <giovanni D trematerra A gmail D com>
d7c8a44c0dac083e6c66996e0ef16ab8af82fc9e 23-Jan-2009 jhb <jhb@FreeBSD.org> Tweak the wording for vfs_mark_atime() since the I/O it is avoiding by not
updating va_atime via VOP_SETATTR() isn't always synchronous. For some
filesystems it is asynchronous.

Suggested by: bde
4efa7c83e1eec7569c05a340284a2286efb7d56e 23-Jan-2009 jhb <jhb@FreeBSD.org> Push down Giant in the vnlru kproc main loop so that it is only acquired
around calls to vlrureclaim() on non-MPSAFE filesystems. Specifically,
vnlru no longer needs Giant for the common case of waking up and deciding
there is nothing for it to do.

MFC after: 2 weeks
2939c2f76f4c6f9ca6b56d817cdf9f28de9bd66f 21-Jan-2009 jhb <jhb@FreeBSD.org> Fix a few style bogons.

Submitted by: bde
47455a7b41fddec8ed401d12470434bd77477189 21-Jan-2009 jhb <jhb@FreeBSD.org> Move the VA_MARKATIME flag for VOP_SETATTR() out into its own VOP:
VOP_MARKATIME() since unlike the rest of VOP_SETATTR(), VA_MARKATIME
can be performed while holding a shared vnode lock (the same functionality
is done internally by VOP_READ which can run with a shared vnode lock).
Add missing locking of the vnode interlock to the ufs implementation and
remove a special note and test from the NFS client about not supporting the

Inspired by: ups
Tested by: pho
cbb8defa10711f9b65be83f41f04cbabab84b5fc 20-Jan-2009 kib <kib@FreeBSD.org> FFS puts the extended attributes blocks at the negative blocks for the
vnode, from -1 down. When vinvalbuf(vp, V_ALT) is done for the vnode, it
incorrectly does vm_object_page_remove(0, 0), removing all pages from
the underlying vm object, not only the pages that back the extended
attributes data.

Change vinvalbuf() to not remove any pages from the object when
V_NORMAL or V_ALT are specified. Instead, the only in-tree caller
in ffs_inode.c:ffs_truncate() that specifies V_ALT explicitly
removes the corresponding page range. The V_NORMAL caller
does vnode_pager_setsize(vp, 0) immediately after the call to
vinvalbuf(V_NORMAL) already.

Reported by: csjp
Reviewed by: ups
MFC after: 3 weeks
697e2a94e4778901b53481a53119fcb1da110429 16-Dec-2008 attilio <attilio@FreeBSD.org> 1) Fix a deadlock in the VFS:
- threadA runs vfs_rel(mp1)
- threadB unmounts the mp1 fs, sets MNTK_UNMOUNT and drops MNT_ILOCK()
- threadA runs vfs_busy(mp1) and, since MNTK_UNMOUNT is set, sleeps
waiting for threadB to complete the unmount
- threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting
for the refcount to expire.

Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the
unmounter is waiting for mnt_ref to expire.
The vfs_busy contenders are awakened and fail, and if they retry,
MNTK_REFEXPIRE won't allow them to sleep again.

2) Significantly simplify the code of vfs_mount_destroy() by trimming
unnecessary code:
- once every reference has gone away, it is no longer possible to have
write ops (primary and secondary) in progress.
- there is no need to drop and reacquire the mount lock.
- filling the structures with dummy values is useless, since it is
going to be freed.

Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
Discussed with: kib
bf74bb2e167fcc7089b250309aa7131a27b672e2 29-Nov-2008 kib <kib@FreeBSD.org> In the nfsrv_fhtovp(), after the vfs_getvfs() function found the pointer
to the fs, but before a vnode on the fs is locked, unmount may free fs
structures, causing access to destroyed data and freed memory.

Introduce a vfs_busymp() function that looks up and busies found
fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the
implementation of the handle syscalls.

Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in
sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular,
sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl
handler, that prevents Giant-locked unmount code to interfere with it.

Noted by: tegge
Reviewed by: dfr
Tested by: pho
MFC after: 1 month
bbe899b96e388a8b82439f81ed3707e0d9c6070d 17-Nov-2008 pjd <pjd@FreeBSD.org> Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This brings a huge number of changes; I'll enumerate only the user-visible ones:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.


- L2ARC

Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by them. Be very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Previously, if a write request failed, the system panicked. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris
26a604f3bccb7c4a377c7cbf4facbae8c20e1fed 03-Nov-2008 attilio <attilio@FreeBSD.org> Remove the mnt_holdcnt and mnt_holdcntwaiters because they are useless.
Really, the concept of holdcnt in the struct mount is represented by
the mnt_ref (which prevents the type-stable structure from being
"recycled") handled through vfs_ref() and vfs_rel().
In this light, switch the holdcnt acquisition into an emulated vfs_ref()
(and the subsequent release into vfs_rel()).

Discussed with: kib
Tested by: pho
e1f493235eaeaf3c0cd0e029936db2b5b6e32bf7 02-Nov-2008 attilio <attilio@FreeBSD.org> Improve VFS locking:
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
- Due to the change above, remove the mnt_lock lockmgr because it is now
- Due to the change above, vfs_busy() is no longer linked to a lockmgr.
Change its KPI accordingly by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr already) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used in vfs_mount_destroy(), which allows overriding the
mnt_ref if running for more than 3 seconds, makes it totally useless.
Remove it, as it was meant to work in older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix it properly rather than trusting such a hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.

This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.

Discussed with: kib
Tested by: pho
0ad8692247694171bf2d3f963f24b15f5223a0de 28-Oct-2008 trasz <trasz@FreeBSD.org> Introduce accmode_t. This is required for NFSv4 ACLs - it will be necessary
to add more V* constants, and the variables changed by this patch were often
being assigned to mode_t variables, which is 16 bit.

Approved by: rwatson (mentor)
b9b0d2c54ca54660de39e9aa6f9bfd4c9653adb3 28-Oct-2008 kib <kib@FreeBSD.org> Style return statements in vn_pollrecord().
86b5e61ab2f215486d5afad7de1d637bebd75dca 28-Oct-2008 kib <kib@FreeBSD.org> Protect check for v_pollinfo == NULL and assignment of the newly allocated
vpollinfo with vnode interlock. Fully initialize vpollinfo before putting
pointer to it into vp->v_pollinfo.

Discussed with: dwhite
Tested by: pho
MFC after: 1 week
015479d466a4e5609bed16fe5bc0d7c02cd1a239 20-Oct-2008 kib <kib@FreeBSD.org> In vfs_busy(), lockmgr() cannot legitimately sleep, because code checked
MNTK_UNMOUNT before, and mnt_mtx is used as interlock. vfs_busy() always
tries to obtain a shared lock on mnt_lock, the other user is unmount who
tries to drain it, setting MNTK_UNMOUNT before.

Reviewed by: tegge, attilio
Tested by: pho
MFC after: 2 weeks
b8bf37e5857b138031059dd4768deb4937efe183 10-Oct-2008 attilio <attilio@FreeBSD.org> Remove the useless struct thread argument from the bufobj interface.
In particular, the KPI of the following functions is modified:
- bufobj_invalbuf()
- bufsync()

and BO_SYNC() "virtual method" of the buffer objects set.
Main consumers of bufobj functions are affected by this change too and,
in particular, functions which changed their KPI are:
- vinvalbuf()
- g_vfs_close()

Due to the KPI breakage, __FreeBSD_version will be bumped in a later

As a side note, please consider the 'curthread' argument passed to
VOP_SYNC() (in bufsync()) as temporary, as it will be axed ASAP

Reviewed by: kib
Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
e2ca413d09b408b7f5caff24c9c36ddb1c43dc7c 31-Aug-2008 attilio <attilio@FreeBSD.org> Decontextualize vfs_busy(), vfs_unbusy() and vfs_mount_alloc() functions.

Manpages are updated accordingly.

Tested by: Diego Sardina <siarodx at gmail dot com>
dbf35e279f37ad4a573bf93923d141cb4a454c7d 28-Aug-2008 attilio <attilio@FreeBSD.org> Decontextualize the couplet VOP_GETATTR / VOP_SETATTR as the passed thread
was always curthread and totally useless.

Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>
dd53532f9375b0bc12ccefe8249b0f02d0ea09c4 28-Aug-2008 kib <kib@FreeBSD.org> Introduce the VV_FORCEINSMQ vnode flag. It instructs the insmntque() function
to ignore the unmounting and forces insertion of the vnode into the mount
vnode list.

Change insmntque() to fail when forced unmount is in progress and
VV_FORCEINSMQ is not specified.

Add an assertion to the insmntque(), requiring the vnode to be
exclusively locked for mp-safe filesystems.

Use the VV_FORCEINSMQ for the creation of the syncvnode.

Tested by: pho
Reviewed by: tegge
MFC after: 1 month
e30e00f1b7d10c5c1372d16f3ece0f6cceeb5442 24-Aug-2008 csjp <csjp@FreeBSD.org> Remove worrying printf warning on bootup when processing vnodes which
have NULL mount-points. This is the case for special vnodes, such as the
one used in nameiinit() which is used for crossing mount points in lookup()
to avoid lock ordering issues.

MFC after: 2 weeks
Discussed with: rwatson, kib
faa0cddcb0c74e532ea21c55581a6bc848ad0c4a 30-Jul-2008 ed <ed@FreeBSD.org> Remove the use of lbolt from the VFS syncer.

It seems we only use `lbolt' inside the VFS syncer and the TTY layer
now. Because I'm planning to replace the TTY layer next month, there's
no reason to keep `lbolt' if it's only used in a single thread inside
the kernel.

Because the syncer code wanted to wake up the syncer thread before the
timeout, it called sleepq_remove(). Because we now just use a condvar(9)
with a timeout value of `hz', we can wake it up using cv_broadcast()
without waking up any unrelated threads.

Reviewed by: phk
3f1807709d7cfe3766d846df960fb657feef0ffb 27-Jul-2008 pjd <pjd@FreeBSD.org> Assert for exclusive vnode lock in vinactive(), vrecycle() and vgonel()

Reviewed by: kib
4dd19696a7df7d5a2ba853c021b68dfe52ebbe5a 27-Jul-2008 pjd <pjd@FreeBSD.org> - Move vp test for being NULL under IGNORE_LOCK().
- Check if panicstr isn't set, if it is ignore the lock. This helps to avoid
confusion, because lockmgr is a no-op when panicstr isn't NULL, so
asserting anything at this point doesn't make sense and can just race with
other panic.

Discussed with: kib
823ce79a5bcafc8eb919f84fef6937a5636d1978 21-Jul-2008 attilio <attilio@FreeBSD.org> - Disallow XFS mounting in write mode. The write support never really worked
and there is no need to maintain it.
- Fix vn_get() in order to let it call vget(9) with a valid locking
request. vget(9) returns the vnode locked in order to prevent recycling,
but in this case internal XFS locks already prevent it from happening, so
it is safe to drop the vnode lock before returning from vn_get().
- Add a VNASSERT() in vget(9) in order to catch malformed locking requests.

Discussed with: kan, kib
Tested by: Lothar Braun <lothar at lobraun dot de>
a1af6d977bb920efdae391bf7af304a535058d10 18-May-2008 pjd <pjd@FreeBSD.org> Be more friendly for DDB pager.

Educated by: jhb's BSDCan presentation
bb68298f623ded730e0dc981b2f072c542325a78 04-May-2008 attilio <attilio@FreeBSD.org> sync_vnode() has some messy code about locking in order to deal with
mount fs needing Giant to be held when processing bufobjs.
Use a different subqueue for pending workitems on filesystems requiring
Giant. This simplifies the code notably and also reduces the number of
Giant acquisitions (and the whole processing cost).

Suggested by: jeff
Reviewed by: kib
Tested by: pho
cb7610bd5214be1b0977bd2b83d8e95686c8881b 26-Apr-2008 pjd <pjd@FreeBSD.org> Implement 'show mount' command in DDB. Without argument, it prints short
info about all currently mounted file systems. When an address is given
as an argument, prints detailed info about the given mount point.

MFC after: 2 weeks
9f2031da023ed595ea1534b042896a4d05803dc5 24-Apr-2008 kib <kib@FreeBSD.org> Allow the vnode zone to return the unused memory. The vnode reference
count is/shall be properly maintained for the long time, and VFS
shall be safe against the vnode memory reclamation.

Proposed by: jeff
Tested by: pho
52243403eb48561abd7b33995f5a4be6a56fa1f0 16-Apr-2008 kib <kib@FreeBSD.org> Move the head of byte-level advisory lock list from the
filesystem-specific vnode data to the struct vnode. Provide the
default implementation for the vop_advlock and vop_advlockasync.
Purge the locks on the vnode reclaim by using the lf_purgelocks().
The default implementation is augmented for the nfs and smbfs.
In the nfs_advlock, push the Giant inside the nfs_dolock.

Before the change, the vop_advlock and vop_advlockasync have taken the
unlocked vnode and dereferenced the fs-private inode data, racing with
the vnode reclamation due to forced unmount. Now, the vop_getattr
under the shared vnode lock is used to obtain the inode size, and
later, in the lf_advlockasync, after locking the vnode interlock, the
VI_DOOMED flag is checked to prevent an operation on the doomed vnode.

The implementation of the lf_purgelocks() is submitted by dfr.

Reported by: kris
Tested by: kris, pho
Discussed with: jeff, dfr
MFC after: 2 weeks
639ca8f21b04fc75a1084ba1d38635e3716dd4e6 02-Apr-2008 jeff <jeff@FreeBSD.org> - Destroy the bo mtx when the vnode is destroyed.
7e107a0c8ce6820c1a2604de4575a1551b907341 28-Mar-2008 attilio <attilio@FreeBSD.org> b_waiters cannot be adequately protected by the interlock because it is
dropped after the call to lockmgr() so just revert this approach using
something similar to the previous one:
BUF_LOCKWAITERS() just checks if there are waiters (not the actual number
of them) and it is based on the newly introduced lockmgr_waiters() which
returns whether the lockmgr has waiters or not. The name has been chosen
to differ from the old lockwaiters() in order not to confuse them.

The KPI is extended by this commit, so a __FreeBSD_version bump and
manpage update will be happening soon.
'struct buf' also changes, so kernel ABI is disturbed.

Bug found by: jeff
Approved by: jeff, kib
3ad75daf19094d458a872196403a2d5a77437da1 24-Mar-2008 jeff <jeff@FreeBSD.org> - Greatly simplify vget() by removing the guarantee that any new
references to a vnode with VI_OWEINACT set will force the vinactive()
call. The kernel makes no guarantees about which reference was the
last to close a file or when the actual inactive processing will
happen. The previous code was designed to preserve existing semantics
in the face of shared locks, however, this was unnecessary.

Discussed with: mckusick
8103d042fbc5493bdaa92b21536c617c3e281ad3 23-Mar-2008 jeff <jeff@FreeBSD.org> - Only return 1 from sync_vnode() in cases where the vnode is still
at the head of the sync list. This prevents sched_sync() from
re-queueing a vnode which may have been freed already.

Discussed with: kib
73b6a5597c0960ea6e9c14c579f5923be3e6a3a9 23-Mar-2008 jeff <jeff@FreeBSD.org> - Pass BO_MTX(bo) to lockmgr in vtruncbuf, we don't own the vnode
interlock here anymore.

Reported by: kris
a9d123c3ab34baa9fe2c8c25bd9acfbfb31b381e 22-Mar-2008 jeff <jeff@FreeBSD.org> - Complete part of the unfinished bufobj work by consistently using
BO_LOCK/UNLOCK/MTX when manipulating the bufobj.
- Create a new lock in the bufobj to lock bufobj fields independently.
This leaves the vnode interlock as an 'identity' lock while the bufobj
is an io lock. The bufobj lock is ordered before the vnode interlock
and also before the mnt ilock.
- Exploit this new lock order to simplify softdep_check_suspend().
- A few sync related functions are marked with a new XXX to note that
we may not properly interlock against a non-zero bv_cnt when
attempting to sync all vnodes on a mountlist. I do not believe this
race is important. If I'm wrong this will make these locations easier
to find.

Reviewed by: kib (earlier diff)
Tested by: kris, pho (earlier diff)
877d7c65ba9b74233df6c9197fc39c770e809d02 16-Mar-2008 rwatson <rwatson@FreeBSD.org> In keeping with style(9)'s recommendations on macros, use a ';'
after each SYSINIT() macro invocation. This makes a number of
lightweight C parsers much happier with the FreeBSD kernel
source, including cflow's prcc and lxr.

MFC after: 1 month
Discussed with: imp, rink
0d873341312bfcbee292129a09cf72ab59e3ef38 01-Mar-2008 attilio <attilio@FreeBSD.org> - Handle buffer lock waiters count directly in the buffer cache instead
of relying on the lockmgr support [1]:
* bump the waiters only if the interlock is held
* let brelvp() return the waiters count
* rely on brelvp() rather than BUF_LOCKWAITERS() in order to check
for the waiters number
- Remove a namespace pollution introduced recently with lockmgr.h
including lock.h by including lock.h directly in the consumers and
making it mandatory for using lockmgr.
- Modify flags accepted by lockinit():
* introduce LK_NOPROFILE which disables lock profiling for the
specified lockmgr
* introduce LK_QUIET which disables ktr tracing for the specified
lockmgr [2]
* disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it
can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer

This patch breaks the KPI, so __FreeBSD_version will be bumped and manpages
updated by further commits. Additionally, the 'struct buf' changes result in
a disturbed ABI as well.

[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.

[1] Submitted by: kib
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
4014b558307253555f43f360be60f49ea39b7ceb 25-Feb-2008 attilio <attilio@FreeBSD.org> Axe the 'thread' argument from VOP_ISLOCKED() and lockstatus() as it is
always curthread.

As KPI gets broken by this patch, manpages and __FreeBSD_version will be
updated by further commits.

Tested by: Andrea Barberio <insomniac at slackware dot it>
e1db4e70b3e0f4dcd6006b5380fd6484833c54e5 08-Feb-2008 attilio <attilio@FreeBSD.org> Convert all explicit instances of VOP_ISLOCKED(arg, NULL) into
VOP_ISLOCKED(arg, curthread). Now, VOP_ISLOCKED() and lockstatus() should
only take curthread as an argument; this will lead to axing the additional
argument from both functions, making the code cleaner.

Reviewed by: jeff, kib
7213f4c32b94b60add6400f4213c1ca347bd609f 24-Jan-2008 attilio <attilio@FreeBSD.org> Cleanup lockmgr interface and exported KPI:
- Remove the "thread" argument from the lockmgr() function as it is
always curthread now
- Axe lockcount() function as it is no longer used
- Axe LOCKMGR_ASSERT() as it is really bogus and not currently used.
Hopefully this will soon be replaced by something suitable for it.
- Remove the prototype for dumplockinfo() as the function is no longer

- Introduce a KASSERT() in lockstatus() in order to let it accept only
curthread or NULL as they should only be passed
- Do a little bit of style(9) cleanup on lockmgr.h

The KPI is heavily broken by this change, so manpages and
FreeBSD_version will be modified accordingly by further commits.

Tested by: matteo
caa2ca048b3049232f35da2c8dc2e7b2cb199d71 19-Jan-2008 attilio <attilio@FreeBSD.org> - Introduce the function lockmgr_recursed() which returns true if the
lockmgr lkp, when held in exclusive mode, is recursed
- Introduce the function BUF_RECURSED() which does the same for bufobj
locks based on the top of lockmgr_recursed()
- Introduce the function BUF_ISLOCKED() which works like the counterpart
VOP_ISLOCKED(9), showing the state of lockmgr linked with the bufobj

BUF_RECURSED() and BUF_ISLOCKED() entirely replace the usage of the bogus
BUF_REFCNT() in a more explicit and SMP-compliant way.
This allows us to axe BUF_REFCNT(), leaving the function
lockcount() totally unused in our stock kernel. Further commits will
axe lockcount() as well as part of the lockmgr() cleanup.

The KPI is, obviously, broken, so further commits will update manpages
and the FreeBSD version.

Tested by: kris (on UFS and NFS)
71b7824213151e91b40ee4afa9fa7f100c90ed0b 13-Jan-2008 attilio <attilio@FreeBSD.org> VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in
conjunction with the 'thread' argument, which is always curthread.
Remove the useless extra argument and explicitly pass curthread to lower
layer functions, when necessary.

The KPI is broken by this change, which should affect several ports, so
a version bump and manpage update will be committed later.

Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>
18d0a0dd51c7995ce9e549616f78ef724096b1bd 10-Jan-2008 attilio <attilio@FreeBSD.org> vn_lock() is currently only used with the 'curthread' passed as argument.
Remove this argument and pass curthread directly to underlying
VOP_LOCK1() VFS method. This modification makes the code cleaner and in
particular removes an annoying dependency, helping the next lockmgr() cleanup.
The KPI is, obviously, changed.

Manpage and FreeBSD_version will be updated through further commits.

As a side note, would be valuable to say that next commits will address
a similar cleanup about VFS methods, in particular vop_lock1 and

Tested by: Diego Sardina <siarodx at gmail dot com>,
Andrea Di Pasquale <whyx dot it at gmail dot com>
166f16a0c65fbc0905ba3a63268a239a2540a250 28-Dec-2007 rwatson <rwatson@FreeBSD.org> In "show lockedvnods" DDB command, use db_printf() rather than printf()
so that the results end up in the DDB output stream rather than the
console output stream.

This should likely also be done for the vprint() function it calls.

MFC after: 3 months
d9b244638ead855a1ff6847b1f72eec87015d416 27-Dec-2007 attilio <attilio@FreeBSD.org> As LK_EXCLUPGRADE is used in conjunction with LK_NOWAIT, LK_UPGRADE becomes
equivalent to it, so make the switch.

That call is the only one remaining LK_EXCLUPGRADE consumer and removing
it will prepare the ground for LK_EXCLUPGRADE axing and further
lockmgr improvements.

Discussed with: jeff, ups
bdee30611dab246a5227856892385a02c7352f12 25-Dec-2007 rwatson <rwatson@FreeBSD.org> Add a new 'why' argument to kdb_enter(), and a set of constants to use
for that argument. This will allow DDB to detect the broad category of
reason why the debugger has been entered, which it can use for the
purposes of deciding which DDB script to run.

Assign approximate why values to all current consumers of the
kdb_enter() interface.
53229c8ee97f8b4b7ae8a8ecea863a22054ec1b3 05-Dec-2007 kib <kib@FreeBSD.org> Use curthread instead of the FIRST_THREAD_IN_PROC for vnlru and syncer,
when applicable.

Acquire Giant slightly later for vnlru.

In the syncer, acquire Giant only when a vnode belongs to a
non-MPsafe fs.

In both speedup_syncer() and syncer_shutdown(), remove the syncer thread from
the lbolt sleep queue after the syncer state is modified, not before.

Herded by: attilio
Tested by: Peter Holm
Reviewed by: ups
MFC after: 1 week
60570a92bf794d255e5f8ed235b49c553776ad92 24-Oct-2007 rwatson <rwatson@FreeBSD.org> Merge first in a series of TrustedBSD MAC Framework KPI changes
from Mac OS X Leopard--rationalize naming for entry points to
the following general forms:


The previous naming scheme was inconsistent and mostly
reversed from the new scheme. Also, make object types more
consistent and remove spaces from object types that contain
multiple parts ("posix_sem" -> "posixsem") to make mechanical
parsing easier. Introduce a new "netinet" object type for
certain IPv4/IPv6-related methods. Also simplify, slightly,
some entry point names.

All MAC policy modules will need to be recompiled, and modules
not updated as part of this commit will need to be modified to
conform to the new KPI.

Sponsored by: SPARTA (original patches against Mac OS X)
Obtained from: TrustedBSD Project, Apple Computer
51d643caa6efc11780104da450ee36a818170f81 20-Oct-2007 julian <julian@FreeBSD.org> Rename the kthread_xxx (e.g. kthread_create()) calls
to kproc_xxx as they actually make whole processes.
This makes way for us to add REAL kthread_create() and friends
that actually make threads. It turns out that most of these
calls actually end up being moved back to the thread version
when it's added, but we need to make this cosmetic change first.

I'd LOVE to do this rename in 7.0 so that we can eventually MFC the
new kthread_xxx() calls.
e651705b7e9100bab6eada7793ec8c47c24d65aa 12-Sep-2007 kib <kib@FreeBSD.org> When restoring the mount after umount failed, the MNTK_UNMOUNT flag
prevents insmntque() from placing reallocated syncer vnode on mount
list, that causes panic in vfs_allocate_syncvnode().

Introduce MNTK_NOINSMNTQ flag, that marks the period when insmntque() is
not allowed to succeed, instead of MNTK_UNMOUNT. The MNTK_NOINSMNTQ is
set and cleared simultaneously with MNTK_UNMOUNT, except on umount error
path, where it is cleared just before the syncer vnode is going to be

Reported by: Peter Jeremy <peterjeremy optushome com au>
Suggested by: tegge
Approved by: re (rwatson)
8d074382c8e722d1563c515d0859fbfe6d6182f6 13-Aug-2007 pjd <pjd@FreeBSD.org> Improve vn_printf() by:
- adding missing vnode flags,
- printing unknown flags as numbers,
- using strlcat() instead of strcat().

Approved by: re (bmah)
00b02345d424dac8a490ff28ff75fd9386196583 12-Jun-2007 rwatson <rwatson@FreeBSD.org> Eliminate now-unused SUSER_ALLOWJAIL arguments to priv_check_cred(); in
some cases, move to priv_check() if it was an operation on a thread and
no other flags were present.

Eliminate caller-side jail exception checking (also now-unused); jail
privilege exception code now goes solely in kern_jail.c.

We can't yet eliminate suser() due to some cases in the KAME code where
a privilege check is performed and then used in many different deferred
paths. Do, however, move those prototypes to priv.h.

Reviewed by: csjp
Obtained from: TrustedBSD Project
7dd8ed88a925a943f1963baa072f4b6c6a8c9930 31-May-2007 attilio <attilio@FreeBSD.org> Revert VMCNT_* operations introduction.
Probably, a general approach is not the best solution here, so we should
solve the sched_lock protection problems separately.

Requested by: alc
Approved by: jeff (mentor)
79a2e408120d207be00a63c991d92c31c218bfeb 27-May-2007 rwatson <rwatson@FreeBSD.org> Universally adopt most conventional spelling of acquire.
162fa8dc6d0f4fe3538d66df083f7a92e01928b3 18-May-2007 kib <kib@FreeBSD.org> Since the renaming of vop_lock to _vop_lock, pre- and post-condition
function calls are no longer generated for vop_lock.
Rename _vop_lock to vop_lock1 to satisfy tools/vnode_if.awk assumption
about vop naming conventions. This restores pre/post-condition calls.
e1996cb9609d2e55a26ee78dddbfce4ba4073b53 18-May-2007 jeff <jeff@FreeBSD.org> - define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating
vmcnts. This can be used to abstract away pcpu details but also changes
to use atomics for all counters now. This means sched lock is no longer
responsible for protecting counts in the switch routines.

Contributed by: Attilio Rao <attilio@FreeBSD.org>
ad49fbe326807d31cbb5c33e73cd8dd65d29f317 13-Apr-2007 pjd <pjd@FreeBSD.org> Fix jails and jail-friendly file systems handling:
- We need to allow for PRIV_VFS_MOUNT_OWNER inside a jail.
- Move security checks to vfs_suser() and deny unmounting and updating
for jailed root from different jails, etc.

OK'ed by: rwatson
e140c1e4f16396661dc38b9dbc4ac9e2400b757d 13-Apr-2007 pjd <pjd@FreeBSD.org> When we are running low on vnodes, there is currently no way to ask other
subsystems to release some vnodes. Implement backpressure based on
vfs_lowvnodes event (similar to vm_lowmem for memory).
c6b82992cd5fdc75cb11b07e95a80240f7595630 10-Apr-2007 pjd <pjd@FreeBSD.org> Minor style cleanups (mostly removal of trailing whitespaces).
592c863ef386bf41bf0904c4874ac0cdf1a6973f 10-Apr-2007 pjd <pjd@FreeBSD.org> Correct typos.
c20a93a3456abbbbf22cf096edafe5cb40333087 01-Apr-2007 pjd <pjd@FreeBSD.org> Now that the vdropl() function is public, assert that the vnode interlock
is held.
b0b258dcad8e2fefcf53d7274cbe03b227d1a7f8 31-Mar-2007 des <des@FreeBSD.org> Make vdropl() public; zfs needs it. There is also plenty of existing
file system code (mostly *_reclaim()) which look like this:

/* examine vp */

This can now be rewritten to:

/* examine vp */
vdropl(vp); /* will unlock vp */

MFC after: 1 week
3436aa65049f61a72781bd4880a79d2c079d2974 27-Mar-2007 marcel <marcel@FreeBSD.org> PowerPC is the only architecture with mpsafe_vfs=0. This is now
broken. Rudimentary tests show that PowerPC can run with
mpsafe_vfs=1. Make it so...
214bc5723c38739a6060170f2c421e59d87b2c82 13-Mar-2007 tegge <tegge@FreeBSD.org> Make insmntque() externally visible and allow it to fail (e.g. during
late stages of unmount). On failure, the vnode is recycled.

Add insmntque1(), to allow for file system specific cleanup when
recycling vnode on failure.

Change getnewvnode() to no longer call insmntque(). Previously,
embryonic vnodes were put onto the list of vnodes belonging to a file
system, which is unsafe for a file system marked MPSAFE.

Change vfs_hash_insert() to no longer lock the vnode. The caller now
has that responsibility.

Change most file systems to lock the vnode and call insmntque() or
insmntque1() after a new vnode has been sufficiently setup. Handle
failed insmntque*() calls by propagating errors to callers, possibly
after some file system specific cleanup.

Approved by: re (kensmith)
Reviewed by: kib
In collaboration with: kib
0c00ea16dbadcde1937dfc2bef577820b762e95a 13-Nov-2006 kmacy <kmacy@FreeBSD.org> change vop_lock handling to allow tracking of callers' file and line for
acquisition of lockmgr locks

Approved by: scottl (standing in for mentor rwatson)
25a000d49f5cae561b322d4e4bf6da5f28040410 07-Nov-2006 jhb <jhb@FreeBSD.org> Simplify operations with sync_mtx in sched_sync():
- Don't drop the lock just to reacquire it again to check rushjob, this
only wastes time.
- Use msleep() to drop the mutex while sleeping instead of explicitly
unlocking around tsleep.

Reviewed by: pjd
f4782279b7cf92b6d32fc109acc82f9db76727e1 07-Nov-2006 jhb <jhb@FreeBSD.org> Fix comment typo and function declaration.
10d0d9cf473dc5f0ce1bf263ead445ffe7819154 06-Nov-2006 rwatson <rwatson@FreeBSD.org> Sweep kernel replacing suser(9) calls with priv(9) calls, assigning
specific privilege names to a broad range of privileges. These may
require some future tweaking.

Sponsored by: nCircle Network Security, Inc.
Obtained from: TrustedBSD Project
Discussed on: arch@
Reviewed (at least in part) by: mlaier, jmg, pjd, bde, ceri,
Alex Lyashkov <umka at sevcity dot net>,
Skip Ford <skip dot ford at verizon dot net>,
Antoine Brodin <antoine dot brodin at laposte dot net>
c524521d2f579301e91dba59f2cab05751984415 04-Nov-2006 pjd <pjd@FreeBSD.org> Typo, 'from' vnode is locked here, not 'to' vnode.
036e929548382eba04c176d581bb24928a5d4155 31-Oct-2006 pjd <pjd@FreeBSD.org> Add gjournal specific code to the UFS file system:
- Add FS_GJOURNAL flag which enables gjournal support on a file system.
- Add cg_unrefs field to the cylinder group structure which holds
number of unreferenced (orphaned) inodes in the given cylinder group.
- Add fs_unrefs field to the super block structure which holds
total number of unreferenced (orphaned) inodes.
- When a file or a directory is orphaned (last reference is removed, but
object is still open), increase fs_unrefs and cg_unrefs fields,
which is a hint for fsck in which cylinder groups to look for such
(orphaned) objects.
- When a file is last closed, decrease {fs,cg}_unrefs fields.
- Add VV_DELETED vnode flag which points at orphaned objects.

Sponsored by: home.pl
7beaaf5cd2391ef1f8159791b46dbeb83ab0c2fb 22-Oct-2006 rwatson <rwatson@FreeBSD.org> Complete break-out of sys/sys/mac.h into sys/security/mac/mac_framework.h
begun with a repo-copy of mac.h to mac_framework.h. sys/mac.h now
contains the userspace and user<->kernel API and definitions, with all
in-kernel interfaces moved to mac_framework.h, which is now included
across most of the kernel instead.

This change is the first step in a larger cleanup and sweep of MAC
Framework interfaces in the kernel, and will not be MFC'd.

Obtained from: TrustedBSD Project
Sponsored by: SPARTA
aa82c7280868d41b2efd0d802a50550c404120fa 02-Oct-2006 kib <kib@FreeBSD.org> Correct the comment: numvnodes is decreased on vdestroying the vnode.

OKed by: tegge
Approved by: pjd (mentor)
MFC after: 1 week
f42473d76b149849661d41c1f44a36dd01096d40 26-Sep-2006 tegge <tegge@FreeBSD.org> Add mnt_noasync counter to better handle interleaved calls to nmount(),
sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
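
The relationship between the three pieces of state in the commit above can be sketched in user-space C. This is only an illustrative model, not the kernel code: the `sk_` names, the flag values and the helper functions are hypothetical, and the real implementation updates these fields under the mount interlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values and structure, for illustration only. */
#define SK_MNT_ASYNC	0x1u	/* models MNT_ASYNC: async I/O was requested */
#define SK_MNTK_ASYNC	0x1u	/* models MNTK_ASYNC: async I/O is allowed now */

struct sk_mount {
	unsigned flag;		/* models mnt_flag */
	unsigned kern_flag;	/* models mnt_kern_flag */
	int noasync;		/* models mnt_noasync */
};

/* Recompute the derived flag: set only if requested and not suppressed. */
void
sk_update_async(struct sk_mount *mp)
{
	if ((mp->flag & SK_MNT_ASYNC) != 0 && mp->noasync == 0)
		mp->kern_flag |= SK_MNTK_ASYNC;
	else
		mp->kern_flag &= ~SK_MNTK_ASYNC;
}

/* A caller that needs synchronous behaviour bumps the counter... */
void
sk_noasync_enter(struct sk_mount *mp)
{
	mp->noasync++;
	sk_update_async(mp);
}

/* ...and drops it again when done; the user's request is never touched. */
void
sk_noasync_exit(struct sk_mount *mp)
{
	assert(mp->noasync > 0);
	mp->noasync--;
	sk_update_async(mp);
}

/* I/O paths consult the derived flag rather than the requested one. */
bool
sk_async_allowed(const struct sk_mount *mp)
{
	return ((mp->kern_flag & SK_MNTK_ASYNC) != 0);
}
```

Because the suppression is a counter rather than a saved-and-restored flag, interleaved callers cannot clobber each other's state, which is the property the commit message is after.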
83154f853d9ca39deb1add01e032aff1f0678514 26-Sep-2006 tegge <tegge@FreeBSD.org> Use mount interlock to protect all changes to mnt_flag and mnt_kern_flag.
This eliminates a race where MNT_UPDATE flag could be lost when nmount()
raced against sync(), sync_fsync() or quotactl().
12baf6e1ec9554f0d2da00985e15d179fce94c88 04-Sep-2006 pjd <pjd@FreeBSD.org> Add 'show vnode <addr>' DDB command.
a32a200792ebecceec10edd874e410c1913c939c 10-Aug-2006 pjd <pjd@FreeBSD.org> getnewvnode() can be called with NULL mp.

Found by: Coverity Prevent (tm)
Coverity ID: 1521
Confirmed by: phk
7f9e892ea9ffc1021556a11a2be7eae73baeafe9 09-Aug-2006 pjd <pjd@FreeBSD.org> Add a bandaid to avoid a deadlock in a situation, when we are trying to suspend
a file system, but need to obtain a vnode. We may not be able to do it, because
all vnodes could be already in use and other processes cannot release them,
because they are waiting in "suspfs" state.

In such a situation, we allow a vnode to be allocated anyway.

This is a temporary fix - there is no backpressure to free vnodes allocated in
those circumstances.

MFC after: 1 week
Reviewed by: tegge
9119bbc087b47818871346f83dd9e07be800f46d 06-Aug-2006 rwatson <rwatson@FreeBSD.org> Improve commenting of vaccess(), making sure to be clear that the ifdef
capabilities code is there for reference and never actually used. Slight
style tweak.
3944e271246bb7ef8706b8e4f1a805af60a8c643 15-Jul-2006 alc <alc@FreeBSD.org> Enable debug.mpsafevfs by default on arm. Since every architecture except
powerpc has debug.mpsafevfs enabled by default, it is shorter to enumerate
the architectures on which debug.mpsafevfs is off.

Tested by: cognet@
95ef2e0daa48aacb3177579c4bfae3346afc4a64 05-Jul-2006 kib <kib@FreeBSD.org> Back out my rev. 1.674. The better fix (rev. 1.637) is already in tree.

Approved by: kan (mentor)
f0555f2de979cc15b2f5899edf00461f6d7ead98 26-Jun-2006 babkin <babkin@FreeBSD.org> Backed out the change by request from rwatson.

PR: kern/14584
3d8be823b0a2fba7792c161abc25de7109e6ecfa 25-Jun-2006 babkin <babkin@FreeBSD.org> The common UID/GID space implementation. It has been discussed on -arch
in 1999, and there are changes to the sysctl names compared to PR,
according to that discussion. The description is in sys/conf/NOTES.
Lines in the GENERIC files are added in commented-out form.
I'll attach the test script I've used to PR.

PR: kern/14584
Submitted by: babkin
241c4b444c281d2bfae64acabb4435969895760e 08-Jun-2006 kib <kib@FreeBSD.org> Fix the LOR that occurs when MAC is compiled into the kernel
and a vnode is destroyed.

Reviewed by: rwatson
LOR: 189
MFC after: 2 weeks
Approved by: kan (mentor)
4eb5a7d9ee279be7c2af44ea69f761d7ddb4ddaf 25-May-2006 ups <ups@FreeBSD.org> Do not set B_NOCACHE on buffers when releasing them in flushbuflist().
If B_NOCACHE is set the pages of vm backed buffers will be invalidated.
However clean buffers can be backed by dirty VM pages so invalidating them
can lead to data loss.
Add support for flushing dirty pages in the data invalidation function
of some network file systems.

This fixes data losses during vnode recycling (and other code paths
using invalbuf(*,V_SAVE,*,*)) for data written using an mmaped file.

Collaborative effort by: jhb@,mohans@,peter@,ps@,ups@
Reviewed by: tegge@
MFC after: 7 days
0f921e0992f543c4aafd5604a99a6edaa059ff36 12-May-2006 jhb <jhb@FreeBSD.org> Remove various bits of conditional Alpha code and fixup a few comments.
abf5b08807b3f6a761ee0e34e92490f2cd8291c8 29-Apr-2006 pjd <pjd@FreeBSD.org> vn_start_write()/vn_finished_write() is not needed here, because
vn_start_write() is always called earlier in the code path and calling
the function recursively may lead to a deadlock.

Confirmed by: tegge
MFC after: 2 weeks
eee673a6a7b12865b2bc6bbecd62f5c98f16fc56 28-Apr-2006 jeff <jeff@FreeBSD.org> - Add a BO_NEEDSGIANT flag to the bufobj. This flag forces all child
buffers to go on the buf daemon's DIRTYGIANT queue.
- Set BO_NEEDSGIANT on ffs's devvp since the ffs_copyonwrite handler
runs in the context of the buf daemon and may require Giant.
275c043cbe67c0778a690932c6ed437ed4af83a3 04-Apr-2006 jeff <jeff@FreeBSD.org> - VFS_LOCK_GIANT when recycling a vnode via getnewvnode. We may be
recycling for an unrelated filesystem. I really don't like potentially
acquiring giant in the context of a giantless filesystem but there
are reasonable objections to removing the recycling from this path.

Sponsored by: Isilon Systems, Inc.
db0836bdc320d1ca56591a60e6b6d18ab2b2ef10 31-Mar-2006 jeff <jeff@FreeBSD.org> - Add an assert to vgone. It is illegal to call vgone without a reference
to the vnode. Without a reference the vnode will never be vdestroy'd
and the memory will never be reclaimed.

Sponsored by: Isilon Systems, Inc.
b9e82e7feff34d930f11589e465848e6fe444742 31-Mar-2006 jeff <jeff@FreeBSD.org> - Hold a reference from the time vfs_busy starts until vfs_unbusy is
- vfs_getvfs has to return a reference to prevent the returned mountpoint
from changing identities.
- Release references acquired via vfs_getvfs.

Discussed with: tegge
Tested by: kris
Sponsored by: Isilon Systems, Inc.
2086f279cf6f34c84b68782651af859bd337032d 31-Mar-2006 jeff <jeff@FreeBSD.org> - Add the B_NEEDSGIANT flag which is only set if the vnode that owns a buf
requires Giant. It is set in bgetvp and cleared in brelvp.
- Create QUEUE_DIRTY_GIANT for dirty buffers that require giant.
- In the buf daemon, only grab giant when processing QUEUE_DIRTY_GIANT and
only if we think there are buffers in that queue.

Sponsored by: Isilon Systems, Inc.
1a9351b430b5dff12778dbdd564057d6120e6e15 19-Mar-2006 jeff <jeff@FreeBSD.org> - Correct an assert in vop_rename_pre. fdvp may be locked if it is either
the target directory or file. This case should fail in the filesystem
anyway and perhaps kern_rename() should catch it.

Sponsored by: Isilon Systems, Inc.
2e0e03c06ff6c78e7d6f269c98e20d8e2eeb58dc 08-Mar-2006 tegge <tegge@FreeBSD.org> Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.
774f51ad2c9890088551e2fef8c7c2ec8bc1e446 02-Mar-2006 tegge <tegge@FreeBSD.org> Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must
be called without any vnode locks held. Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.
0c56ddfb5dac85b58bec183d384224465e9315b4 02-Mar-2006 tegge <tegge@FreeBSD.org> Don't try to show marker nodes.
0951f797b292b58c87659eb994f19c7bca41f2db 02-Mar-2006 jeff <jeff@FreeBSD.org> - Move softdep from using a global worklist to per-mount worklists. This
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.

Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris
63c47d3ba4ae5bc60372b5af574be6fb12f7eec0 23-Feb-2006 jeff <jeff@FreeBSD.org> - Release the mount ref once the vnode has been recycled rather than once
the last reference is dropped. I forgot that vnodes can stick around
for a very long time until processes discover that they are dead. This
means that a vnode reference is not sufficient to keep the mount
referenced and even more code will be required to ref mount points.

Discovered by: kris
d099befc573acb0eb5b095b7f0d22beb38abc067 22-Feb-2006 jeff <jeff@FreeBSD.org> - Grab a mnt ref in vfs_busy() before dropping the interlock. This will
prevent the mount point from going away while we're waiting on the lock.
The ref does not need to persist once we have the lock because the
lock prevents the mount point from being unmounted.

MFC After: 1 week
4c912bf42a10c895b13ee13c4eb058800b35f9c4 06-Feb-2006 jeff <jeff@FreeBSD.org> - Add a ref count to the mount structure. Sleep for up to 3 seconds in
vfs_mount_destroy waiting for this ref to hit 0. We don't print an
error if we are rebooting as the root mount always retains some references
by init proc.
- Acquire a mnt ref for every vnode allocated to a mount point. Drop this
ref only once vdestroy() has been called and the mount has been freed.
- No longer NULL the v_mount pointer in delmntque() so that we may release
the ref after vgone() has been called. This allows us to guarantee
that the mount point structure will be valid until the last vnode has
lost its last ref.
- Fix a few places that rely on checking v_mount to detect recycling.

Sponsored by: Isilon Systems, Inc.
MFC After: 1 week
47857ecfe15e8979cdb376b6904ce987b0668a83 01-Feb-2006 jeff <jeff@FreeBSD.org> - Solve a race where we could lose a call to VOP_INACTIVE. If vget() waiting
on a lock held the last usecount ref on a vnode and the lock failed we
would not call INACTIVE. Solve this by only holding a holdcnt to prevent
the vnode from disappearing while we wait on vn_lock. Other callers
may now VOP_INACTIVE while we are waiting on the lock, however this race
is acceptable, while losing INACTIVE is not.

Discussed with: kan, pjd
Tested by: kkenn
Sponsored by: Isilon Systems, Inc.
MFC After: 1 week
a70f9992d4d20ade14008ebc80a7b04f20f54a8e 28-Jan-2006 kris <kris@FreeBSD.org> Back out r1.653; it turns out that the race (or at least the printf) is
actually not hard to trigger, and it can cause a lot of console spam.

Approved by: kan
f04c2fbb7d0d318d5909779a7923fc936892e9cf 21-Jan-2006 rwatson <rwatson@FreeBSD.org> Convert remaining functions in vfs_subr.c from K&R prototypes to ANSI C
prototypes, as the majority of new functions added have been in this
style. Changing prototype style now results in gcc noticing that the
implementation of vn_pollrecord() has a 'short' argument instead of
'int' as prototyped in vnode.h, so correct that definition. In practice
this didn't matter as only poll flags in the lower 16 bits are used.

MFC after: 1 week
d344c1186100b5323aefd8a08d585a95db5aa73c 09-Jan-2006 tegge <tegge@FreeBSD.org> Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by: truckman
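
The marker technique named in the commit above can be sketched with the `TAILQ` macros from `<sys/queue.h>`. This is a minimal single-threaded model under stated assumptions: the `sk_` names, the `is_marker` field and the demo callback are hypothetical, and the kernel version additionally has to deal with locking that is dropped and reacquired around each visit.

```c
#include <stddef.h>
#include <sys/queue.h>

struct sk_node {
	TAILQ_ENTRY(sk_node) link;
	int is_marker;		/* hypothetical: distinguishes markers from real nodes */
	int id;
};
TAILQ_HEAD(sk_list, sk_node);

typedef void (*sk_visit_fn)(struct sk_list *, struct sk_node *);

/*
 * Visit every real node.  A marker is parked after the current node so
 * that the position survives even if the callback unlinks that node.
 */
int
sk_foreach(struct sk_list *head, sk_visit_fn fn)
{
	struct sk_node marker = { .is_marker = 1 };
	struct sk_node *np;
	int visited = 0;

	np = TAILQ_FIRST(head);
	while (np != NULL) {
		if (np->is_marker) {	/* skip markers left by other iterators */
			np = TAILQ_NEXT(np, link);
			continue;
		}
		TAILQ_INSERT_AFTER(head, np, &marker, link);
		fn(head, np);		/* may remove np from the list */
		visited++;
		np = TAILQ_NEXT(&marker, link);
		TAILQ_REMOVE(head, &marker, link);
	}
	return (visited);
}

/* Demo callback: unlink nodes with an even id while they are being visited. */
void
sk_drop_even(struct sk_list *head, struct sk_node *np)
{
	if (np->id % 2 == 0)
		TAILQ_REMOVE(head, np, link);
}

/* Build ids 1..5, iterate with removal, return the sum of surviving ids. */
int
sk_demo(int *visited)
{
	struct sk_list head;
	struct sk_node nodes[5];
	struct sk_node *np;
	int i, sum = 0;

	TAILQ_INIT(&head);
	for (i = 0; i < 5; i++) {
		nodes[i].is_marker = 0;
		nodes[i].id = i + 1;
		TAILQ_INSERT_TAIL(&head, &nodes[i], link);
	}
	*visited = sk_foreach(&head, sk_drop_even);
	for (np = TAILQ_FIRST(&head); np != NULL; np = TAILQ_NEXT(np, link))
		sum += np->id;
	return (sum);
}
```

The iterator never reads the link fields of the node it just handed to the callback; it resumes from the marker, so removing the current node does not cause later nodes to be skipped.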
2cf01da41229a069a9e67de3d163a840b308d5da 29-Dec-2005 pjd <pjd@FreeBSD.org> Print a warning when we miss vinactive() call, because of race in vget().
The race is very real, but conditions needed for triggering it are rather
hard to meet now.
When gjournal is committed (where it is quite easy to trigger) we need
to fix it.

For now, verify if it is really hard to trigger.

Discussed with: kan
0bcdf7c0332746573fc0de7a81621778fcca937b 09-Nov-2005 dwhite <dwhite@FreeBSD.org> This is a workaround for a complicated issue involving VFS cookies and devfs.
The PR and patch have the details. The ultimate fix requires architectural
changes and clarifications to the VFS API, but this will prevent the system
from panicking when someone does "ls /dev" while running in a shell under the

This issue affects HEAD and RELENG_6 only.

PR: 88249
Submitted by: "Devon H. O'Dell" <dodell@ixsystems.com>
MFC after: 3 days
be4f357149ecc68e1bf349f69f702cad430aec97 31-Oct-2005 rwatson <rwatson@FreeBSD.org> Normalize a significant number of kernel malloc type names:

- Prefer '_' to ' ', as it results in more easily parsed results in
memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion. Similar changes are required for UMA zone names.
4bb62bb5637d6ee2a9592bb52cfa46d1aa1cf7fb 14-Oct-2005 kris <kris@FreeBSD.org> mpsafevm has been stable and defaulted to 1 on sparc64 for over 6 months,
so we are ready for mpsafevfs=1 by default on sparc64 too. I have been
running this on all my sparc64 machines for over 6 months, and have not
encountered MD problems.

MFC after: 1 week
0fb2e655fdaa468d5188d6e87ef07cb383149b5b 12-Oct-2005 dds <dds@FreeBSD.org> Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by: bde
MFC after: 2 weeks
414043e88df9287115e6133263316c8346532605 30-Sep-2005 truckman <truckman@FreeBSD.org> Un-staticize runningbufwakeup() and staticize updateproc.

Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than
checking the proc pointer vs. the known proc pointers for these two
threads. A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk. This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk. The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding

See <http://www.holm.cc/stress/log/cons143.html>
63fab0fe2d627784fac90b843f392304beef139d 16-Sep-2005 tegge <tegge@FreeBSD.org> Break out of loop if next buffer pointer has become invalid while flushing
current buffer.

Reviewed by: kan
f2fa5d310d516a21d7e2249d27ccea17fbce92f1 12-Sep-2005 rwatson <rwatson@FreeBSD.org> In vfs_kqfilter(), return EINVAL instead of 1 (EPERM) when an unsupported
kqueue filter type is requested on a vnode.

MFC after: 3 days
57e487868544fc44467f97d70c98087f44f9b24a 12-Sep-2005 jkim <jkim@FreeBSD.org> use monotonic `time_uptime' instead of `time_second'

Approved by: anholt (mentor)
Discussed on: arch
4e50b9ebd8baff78363d11817588ca53e4dbf062 12-Sep-2005 phk <phk@FreeBSD.org> Introduce vfs_read_dirent() which can help VOP_READDIR() implementations
by handling all the cookie stuff.
3041058fada4e43fce4a29b8186c157e2a2e516f 28-Aug-2005 ssouhlal <ssouhlal@FreeBSD.org> Fix a typo in vop_rename_pre() where we ended up using vholdl()
instead of vhold(), even though the vnode interlock is unlocked.

MFC after: 3 days
aa31faa377ed01e569a534e5afd8b5ea3264a292 23-Aug-2005 truckman <truckman@FreeBSD.org> Back out the removal of LK_NOWAIT from the VOP_LOCK() call in
vlrureclaim() in vfs_subr.c 1.636 because waiting for the vnode
lock aggravates an existing race condition. It is also undesirable
according to the commit log for 1.631.

Fix the tiny race condition that remains by rechecking the vnode
state after grabbing the vnode lock and grabbing the vnode interlock.

Fix the problem of other threads being starved (which 1.636 attempted
to fix by removing LK_NOWAIT) by calling uio_yield() periodically
in vlrureclaim(). This should be more deterministic than hoping
that VOP_LOCK() without LK_NOWAIT will block, which may not happen
in this loop.

Reviewed by: kan
MFC after: 5 days
867f71548b0982d44cdb47ab501c79b90e8f95b9 20-Aug-2005 rwatson <rwatson@FreeBSD.org> Silence "busy" warnings when unmounting devfs at system shutdown. This
is a workaround for non-symmetric teardown of the file systems at
shutdown with respect to the mount order at boot. The proper long term
fix is to properly detach devfs from the root mount before unmounting
each, and should be implemented, but since the problem is non-harmful,
this temporary band-aid will prevent false positive bug reports and
unnecessary error output for 6.0-RELEASE.

MFC after: 3 days
Tested by: pav, pjd
f94807ecebdf54659f933acc85c0143f819e1b07 13-Aug-2005 marcel <marcel@FreeBSD.org> Make mpsafe_vfs=1 the default on ia64.
9590889861c0a63193bf1d1cbd589aaea09674e3 10-Aug-2005 kan <kan@FreeBSD.org> Do not drop the vnode interlock if vdropl is called on already doomed vnode.
vdropl callers expect it to return with interlock still being held.

MFC after: 2 days
1f4d3e95efcd3e9a56a016f0c3586048991fce25 06-Aug-2005 ssouhlal <ssouhlal@FreeBSD.org> Holding a vnode doesn't prevent v_mount from disappearing (when the
vnode is inactivated), possibly leading to a NULL dereference when
checking if the mount wants knotes to be activated in the VOP hooks.
So, we add a new vnode flag VV_NOKNOTE that is only set in getnewvnode(),
if necessary, and check it when activating knotes.
Since the flags are not erased when a vnode is being held, we can safely
read them.

Reviewed by: kris@
MFC after: 3 days
df3babd63b330e2298807a8f27c8388297836f6f 03-Aug-2005 jeff <jeff@FreeBSD.org> - Unlock before we call mac_destroy_vnode to prevent a lock order reversal.

Found by: trhodes
1b2743636cf7931f817712daaed5e6d2e3701222 20-Jul-2005 jeff <jeff@FreeBSD.org> - Allow vnlru to drop giant if the filesystem does not require it. The
vnlru proc is extremely inefficient, potentially iterating over tens of
thousands of vnodes without blocking. Dropping Giant allows other threads
to preempt us although we should revisit the algorithm to fix the runtime
problems especially since this may hold up all vnode allocations.
- Remove the LK_NOWAIT from the VOP_LOCK in vlrureclaim. This provides
a natural blocking point to help alleviate the situation described above
although it may not technically be desirable.
- yield after we make a pass on all mount points to prevent us from
blocking other threads which require Giant.

MFC after: 2 weeks
38bf7eadf9f55b6db1943612df38342cb56bf7a4 05-Jul-2005 pjd <pjd@FreeBSD.org> Fix one "wrong b_bufobj" panic in reassignbuf() by moving VI_UNLOCK(vp)
below KASSERT()s, which means there was no real problem here, we just
needed better locking for assertions.

OK'ed by: jeff
Approved by: re (scottl)
efe31cd3da51660534ea5ec76bd1566fe89689d2 01-Jul-2005 ssouhlal <ssouhlal@FreeBSD.org> Fix the recent panics/LORs/hangs created by my kqueue commit by:

- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.

- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.

Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone
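
The pluggable-locking idea in the commit above (three optional function pointers, defaulting to the mutex operations when NULL) can be sketched as follows. This is a user-space illustration only: the `sk_` names, the structure layout and the fake lock type are assumptions, not the real `knlist_init()` signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a lock: records state instead of blocking. */
struct sk_lock {
	int held;
	int acquisitions;
};

/* Default operations, standing in for mtx_lock/mtx_unlock/mtx_owned. */
void sk_mtx_lock(void *arg) { struct sk_lock *l = arg; l->held = 1; l->acquisitions++; }
void sk_mtx_unlock(void *arg) { struct sk_lock *l = arg; l->held = 0; }
int sk_mtx_owned(void *arg) { struct sk_lock *l = arg; return (l->held); }

/* Alternative operations, standing in for a vnode lock. */
int sk_vn_lock_calls;
void sk_vn_lock(void *arg) { struct sk_lock *l = arg; l->held = 1; sk_vn_lock_calls++; }
void sk_vn_unlock(void *arg) { struct sk_lock *l = arg; l->held = 0; }
int sk_vn_locked(void *arg) { struct sk_lock *l = arg; return (l->held); }

struct sk_knlist {
	void (*kl_lock)(void *);
	void (*kl_unlock)(void *);
	int (*kl_locked)(void *);
	void *kl_lockarg;
	int kl_count;		/* number of notes on the list */
};

/* NULL callbacks fall back to the mutex-style defaults. */
void
sk_knlist_init(struct sk_knlist *kl, void *lockarg, void (*lock)(void *),
    void (*unlock)(void *), int (*locked)(void *))
{
	kl->kl_lock = lock != NULL ? lock : sk_mtx_lock;
	kl->kl_unlock = unlock != NULL ? unlock : sk_mtx_unlock;
	kl->kl_locked = locked != NULL ? locked : sk_mtx_owned;
	kl->kl_lockarg = lockarg;
	kl->kl_count = 0;
}

/* List operations only ever go through the callbacks. */
void
sk_knlist_add(struct sk_knlist *kl)
{
	kl->kl_lock(kl->kl_lockarg);
	assert(kl->kl_locked(kl->kl_lockarg));
	kl->kl_count++;
	kl->kl_unlock(kl->kl_lockarg);
}
```

With this shape, the list code does not care which kind of lock protects it, so a vnode's own lock can serve as the list lock and no second lock has to be taken while a mutex is held.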
59704179663d0e0437703c8e6cf359e319c1e73c 18-Jun-2005 jeff <jeff@FreeBSD.org> - Try to catch the wrong bufobj panics a little earlier. I believe they
are actually caused by a buf with both VNCLEAN and VNDIRTY set. In
the traces it is clear that the buf is removed from the dirty queue while
it is actually on the clean queue which leaves the tail pointer set.
Assert that both flags are not set in buf_vlist_add and buf_vlist_remove.

Sponsored by: Isilon Systems, Inc.
Approved by: re (blanket vfs)
ca07a9f0126c41d3a69a8abee199c9153ae92cfb 16-Jun-2005 jeff <jeff@FreeBSD.org> - Change holdcnt use around vnode recycling. We now always keep a holdcnt
ref while we're calling vgone(). This prevents transient refs from
re-adding us to the free list. Previously, a vfree() triggered via
vinvalbuf() getting rid of all of a vnode's pages could place a partially
destructed vnode on the free list where vtryrecycle() could find it. The
first call to vtryrecycle would hang up on the vnode lock, but when it
failed it would place a now dead vnode onto the free list, and another
call to vtryrecycle() would free an already free vnode. There were many
complications of having a zero ref count while freeing which can now go
- Change vdropl() to release the interlock before returning. All callers
now respect this, so vdropl() directly frees VI_DOOMED vnodes once the
last ref is dropped. This means that we'll never have VI_DOOMED vnodes
on the free list.
- Separate v_incr_usecount() into v_incr_usecount(), v_decr_usecount() and
v_decr_useonly(). The incr/decr split is so that incr usecount can
return with the interlock still held while decr drops the interlock so
it can call vdropl() which will potentially free the vnode. The calling
function can't drop the lock of an already free'd node. v_decr_useonly()
drops a usecount without dropping the hold count. This is done so the
usecount reaches zero in vput() before we recycle, however the holdcount
is still 1 which prevents any new references from placing the vnode
back on the free list.
- Fix vnlrureclaim() to vhold the vnode since it doesn't do a vget(). We
wouldn't want vnlrureclaim() to bump the usecount since this has
different semantics. Also change vnlrureclaim() to do a NOWAIT on the
vn_lock. When this function runs we're usually in a desperate situation
and we wouldn't want to wait for any specific vnode to be released.
- Fix a bunch of misc comments to reflect the new behavior.
- Add vhold() and vdrop() to vflush() for the same reasons that we do in
vlrureclaim(). Previously we held no reference and a vnode could have
been freed while we were waiting on the lock.
- Get rid of vlruvp() and vfreehead(). Neither is used. vlruvp() should
really be rethought before it's reintroduced.
- vgonel() always returns with the vnode locked now and never puts the
vnode back on a free list. The vnode will be freed as soon as the last
reference is released.

Sponsored by: Isilon Systems, Inc.
Debugging help from: Kris Kennaway, Peter Holm
Approved by: re (blanket vfs)
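
The use/hold count split described in this entry can be sketched as below. This is a simplified single-threaded model under stated assumptions: the toy_ names are hypothetical, all locking (interlock, vnode lock) is omitted, and v_freed merely marks the point where the real code would release the vnode.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_vnode {
	int	v_usecount;	/* active users */
	int	v_holdcnt;	/* holds; each use reference also takes one */
	bool	v_doomed;	/* stands in for VI_DOOMED */
	bool	v_onfreelist;
	bool	v_freed;	/* set where the real code would free memory */
};

static void
toy_vhold(struct toy_vnode *vp)
{
	vp->v_holdcnt++;
	vp->v_onfreelist = false;
}

/* Dropping the last hold frees a doomed vnode instead of listing it. */
static void
toy_vdrop(struct toy_vnode *vp)
{
	assert(vp->v_holdcnt > 0);
	if (--vp->v_holdcnt > 0)
		return;
	if (vp->v_doomed)
		vp->v_freed = true;
	else
		vp->v_onfreelist = true;
}

static void
toy_incr_usecount(struct toy_vnode *vp)
{
	vp->v_usecount++;
	toy_vhold(vp);
}

/* Drops the use reference and the hold that came with it. */
static void
toy_decr_usecount(struct toy_vnode *vp)
{
	assert(vp->v_usecount > 0);
	vp->v_usecount--;
	toy_vdrop(vp);
}

/* Drops only the use reference; the hold keeps the vnode off the free list. */
static void
toy_decr_useonly(struct toy_vnode *vp)
{
	assert(vp->v_usecount > 0);
	vp->v_usecount--;
}
```

In this model, calling toy_decr_useonly() leaves v_usecount at 0 and v_holdcnt at 1, so nothing can put the vnode back on the free list until the final toy_vdrop().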
909b5b7c58acbee4d16abcaa2d0b153bbee659c0 14-Jun-2005 jeff <jeff@FreeBSD.org> - In reassignbuf() add many asserts to validate the head and tail pointers
of the clean and dirty lists. This is in an attempt to catch the wrong
bufobj problem sooner.
- In vgonel() don't acquire an extra reference in the active case, the
vnode lock and VI_DOOMED protect us from recursively cleaning.
- Also in vgonel() clean up some stale comments.

Sponsored by: Isilon Systems, Inc.
Approved by: re (blanket vfs)
7a825fb4571a4a801e6216d71f9ef2f76fe66ff9 13-Jun-2005 jeff <jeff@FreeBSD.org> - Don't make vgonel() globally visible, we want to change its prototype
anyway and it's not used outside of vfs_subr.c.
- Change vgonel() to accept a parameter which determines whether or not
we'll put the vnode on the free list when we're done.
- Use the new vgonel() parameter rather than VI_DOOMED to signal our
intentions in vtryrecycle().
- In vgonel() return if VI_DOOMED is already set, this vnode has already
been reclaimed.

Sponsored by: Isilon Systems, Inc.
2ef7df2a1a2c79347a9ead2d4db6a54318f55032 13-Jun-2005 jeff <jeff@FreeBSD.org> - Add KTR_VFS events to vdestroy, vtruncbuf, vinvalbuf, vfreehead.

Sponsored by: Isilon Systems, Inc.
306b180d66894696f556a1750064802cb3bb7f71 11-Jun-2005 jeff <jeff@FreeBSD.org> - Assert that we're not in the name cache anymore in vdestroy().

Sponsored by: Isilon Systems, Inc.
3625e8746bd4a897a2b444c15432fe7efa9e0751 11-Jun-2005 jeff <jeff@FreeBSD.org> - Add KTR_VFS tracing to track the life of vnodes. Eventually KTR_VFS
events could be added to cover other interesting details.
- Add some VNASSERTs to discover places where we access vnodes after
they have been uma_zfree'd before we try to free them again.
- Add a few more VNASSERTs to vdestroy() to be certain that the vnode is
really unused.

Sponsored by: Isilon Systems, Inc.
0835f7b4a9a7e80823912ce250d4082b5a23a401 09-Jun-2005 ssouhlal <ssouhlal@FreeBSD.org> Allow EVFILT_VNODE events to work on every filesystem type, not just
UFS by:
- Making the pre and post hooks for the VOP functions work even when
DEBUG_VFS_LOCKS is not defined.
- Moving the KNOTE activations into the corresponding VOP hooks.
- Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct
mount that permits filesystems to disable the new behavior.
- Creating a default VOP_KQFILTER function: vfs_kqfilter()

My benchmarks have not revealed any performance degradation.

Reviewed by: jeff, bde
Approved by: rwatson, jmg (kqueue changes), grehan (mentor)
4a9af33a3f3f7f1b7338de4a17a6db62a0a0aaf4 07-Jun-2005 jeff <jeff@FreeBSD.org> - Clear OWEINACT prior to calling VOP_INACTIVE to remove the possibility
of a vget causing another call to INACTIVE before we're finished.
e513415af91332f009d9733c7156e1a4a60c244f 06-May-2005 cperciva <cperciva@FreeBSD.org> If we are going to
1. Copy a NULL-terminated string into a fixed-length buffer, and
2. copyout that buffer to userland,
we really ought to
0. Zero the entire buffer

Security: FreeBSD-SA-05:08.kmem
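
The pattern this entry asks for can be sketched in userland C as follows. toy_fill_name() is a hypothetical helper, not a function from the tree; the point is only that the whole fixed-length buffer is zeroed before the string is copied in, so no stale bytes follow the terminator when the full buffer is later copied out.

```c
#include <assert.h>
#include <string.h>

/*
 * Fill a fixed-length buffer that will later be exported in full.
 * Zeroing first guarantees that every byte after the string is 0.
 */
static void
toy_fill_name(char *dst, size_t dstlen, const char *src)
{
	assert(dstlen > 0);
	memset(dst, 0, dstlen);			/* step 0: zero the entire buffer */
	strncpy(dst, src, dstlen - 1);		/* step 1: copy, keeping a final NUL */
}
```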
92f17d1e6acf7fd66c58215578dbc6bdb5f51829 03-May-2005 jeff <jeff@FreeBSD.org> - A vnode may have made its way onto the free list while it was being
vgone'd. We must remove it from the freelist before returning in
vtryrecycle() or we may get a duplicate free.

Reported by: kkenn
431f1afe8c934f27b93fcad229dec8c37ac7ac88 02-May-2005 csjp <csjp@FreeBSD.org> Since it is not possible for curthread to be NULL in this context,
drop the check+initialization for a straight initialization. Also
assert that curthread will never be NULL just to be sure.

Discussed with: rwatson, peter
MFC after: 1 week
dd41538cd8ddfa93239188813c57c2febbfd398f 01-May-2005 jeff <jeff@FreeBSD.org> - All buffers should either be clean or dirty. If neither of these flags
are set when we attempt to remove a buffer from a queue we should panic.
Hopefully this will catch the source of the wrong bufobj panics.

Sponsored by: Isilon Systems, Inc.
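
The invariant behind this check (and the later assertion that both flags are never set together) can be sketched as a single predicate. The flag values and the toy_ names below are assumptions for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the real BX_VNCLEAN/BX_VNDIRTY bits may differ. */
#define	TOY_VNCLEAN	0x1u
#define	TOY_VNDIRTY	0x2u

/* A queued buffer must be on exactly one of the clean and dirty lists. */
static bool
toy_buf_queue_state_ok(unsigned int xflags)
{
	unsigned int q = xflags & (TOY_VNCLEAN | TOY_VNDIRTY);

	return (q == TOY_VNCLEAN || q == TOY_VNDIRTY);
}
```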
7354fc5e28617865d36977353eed2a1cca71ad54 30-Apr-2005 jeff <jeff@FreeBSD.org> - In vnlru_free() remove the vnode from the free list before we call
vtryrecycle(). We could sometimes get into situations where two threads
could try to recycle the same vnode before this.
- vtryrecycle() is now responsible for returning the vnode to the free list
if it fails and someone else hasn't done it.
- Make a new function vfreehead() which moves a vnode to the head of the
free list and use it in vgone() to clean up that code a bit.

Sponsored by: Isilon Systems, Inc.
Reported by: pho, kkenn
0e56b01ed6ecd47900b8028cae6e77660470d18c 27-Apr-2005 jeff <jeff@FreeBSD.org> - Don't vgonel() via vgone() or vrecycle() if the vnode is already doomed.
This fixes forced unmounts via nullfs.

Reported by: kkenn
Sponsored by: Isilon Systems, Inc.
a80bbe799ec94bb1dd35b8102daecbf774fd7452 27-Apr-2005 jeff <jeff@FreeBSD.org> - Stop setting vxthread, we've asserted that it was useless for several
weeks now.
31cfb7f24206d3a671772a4a3cd12167a7eb8543 22-Apr-2005 jeff <jeff@FreeBSD.org> - Disable code which allows getnewvnode() to fail. Many ffs_vget() callers
do not correctly deal with failures. This presently risks deadlock
problems if dependency processing is held up by failures to allocate
a vnode, however, this is better than the situation with the failures.

Sponsored by: Isilon Systems, Inc.
4bd811c8dd42f8aa4e81dbf1c945e21e08bb4cd1 18-Apr-2005 phk <phk@FreeBSD.org> Initialize mountlist_mtx with an MTX_SYSINIT(), we need it to be ready
5642885b84d3a8dfdbf202dfbab02e5c4a93576f 13-Apr-2005 jeff <jeff@FreeBSD.org> - Change vop_lookup_post assertions to reflect recent vfs_lookup changes.

Sponsored by: Isilon Systems, Inc.
b391d2675bcd83d9558c90ff9d22227a862de870 11-Apr-2005 jeff <jeff@FreeBSD.org> - Enable ASSERT_VOP_ELOCKED and assert_vop_elocked() now that vnode_if.awk
uses it.

Sponsored by: Isilon Systems, Inc.
17be4cbfa047a243a155d8a89ab2d2d73385453b 11-Apr-2005 jeff <jeff@FreeBSD.org> - Change the VOP_LOCK UPGRADE in vput() to do a LK_NOWAIT to avoid a
potential lock order reversal. Also, don't unlock the vnode if this
fails, lockmgr has already unlocked it for us.
- Restructure vget() now that vn_lock() does all of VI_DOOMED checking
for us and also handles the case where there is no real lock type.
- If VI_OWEINACT is set, we need to upgrade the lock request to EXCLUSIVE
so that we can call inactive. It's not legal to vget a vnode that hasn't
had INACTIVE called yet.

Sponsored by: Isilon Systems, Inc.
60d07eec30ec71879fa602ff8f7bccf45bf7b57c 06-Apr-2005 jeff <jeff@FreeBSD.org> - Assert that the bufobj matches in flushbuflists. I still haven't gotten
to root cause on exactly how this happens.
- If the assert is disabled, we presently try to handle this case, but the
BUF_UNLOCK was missing. Thus, if this condition ever hit we would leak
a buf lock.

Many thanks to Peter Holm for all his help in finding this bug. He really
put more effort into it than I did.
d42252c15869835b05a5453280cdff3c5b98bb60 05-Apr-2005 jeff <jeff@FreeBSD.org> - Move NDFREE() from vfs_subr to vfs_lookup where namei() is.
d8b17b2eac0bc6aa7fcc63f8f2aed9016c3a885d 04-Apr-2005 jeff <jeff@FreeBSD.org> - Add a missing unlock of the vnode_free_list_mtx.

Spotted by: Antoine Brodin
b6f8b968c23aec5369c008801a56e489183a2792 04-Apr-2005 jeff <jeff@FreeBSD.org> - Instead of waiting forever to get a vnode in getnewvnode() wait for
one to become available for one second and then return ENFILE. We
can run out of vnodes, and there must be a hard limit because without
one we can quickly run out of KVA on x86. Presently the system can
deadlock if there are maxvnodes directories in the namecache. The
original 4.x BSD behavior was to return ENFILE if we reached the max,
but 4.x BSD did not have the vnlru proc so it was less profitable to
e4d4b610ba46c67741445dba97004a3acf77973d 31-Mar-2005 jeff <jeff@FreeBSD.org> - Disable vfs shared locks by default. They must be specifically enabled
on filesystems which safely support them. It appears that many
network filesystems specifically are not shared lock safe.

Sponsored by: Isilon Systems, Inc.
97c40ebd4979f7f9f856c27d894bb4b0e30c5c1c 31-Mar-2005 jeff <jeff@FreeBSD.org> - LK_NOPAUSE is a nop now.

Sponsored by: Isilon Systems, Inc.
d3e0f098bea6d2a53ef921866861466d1d128145 30-Mar-2005 das <das@FreeBSD.org> Eliminate v_id and v_ddid. The name cache now holds references to
vnodes whose names it caches, so we no longer need a `generation
number' to tell us if a referenced vnode is invalid. Replace the use
of the parent's v_id in the hash function with the address of the
parent vnode.

Tested by: Peter Holm
Glanced at by: jeff, phk
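
Hashing on the parent vnode's address rather than a generation number could look roughly like the sketch below. The use of FNV-1 and the names toy_fnv1_32() and toy_cache_hash() are assumptions made for illustration; the entry itself only says that the parent's address replaces v_id in the hash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1 over a byte buffer, continuing from hval. */
static uint32_t
toy_fnv1_32(const void *buf, size_t len, uint32_t hval)
{
	const unsigned char *p = buf;

	while (len-- > 0) {
		hval *= 16777619u;	/* FNV 32-bit prime */
		hval ^= *p++;
	}
	return (hval);
}

/* Hash a name, then mix in the address of the parent directory vnode. */
static uint32_t
toy_cache_hash(const void *dvp, const char *name, size_t namelen)
{
	uintptr_t addr = (uintptr_t)dvp;
	uint32_t h;

	h = toy_fnv1_32(name, namelen, 2166136261u);	/* FNV offset basis */
	return (toy_fnv1_32(&addr, sizeof(addr), h));
}
```

Because the name cache holds a reference on the vnodes it caches, the address stays stable for as long as the entry exists, which is what makes it usable as a hash input.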
b82462f0081a350485e15d373e6c0db663abaeb1 29-Mar-2005 jeff <jeff@FreeBSD.org> - Don't clear OWEINACT in vbusy(), we still owe an inactive call if someone
vhold()s us.
- Avoid an extra mutex acquire and release in the common case of vgonel()
by checking for OWEINACT at the start of the function.
- Fix the case where we set OWEINACT in vput(). LK_EXCLUPGRADE drops our
shared lock if it fails.

Sponsored by: Isilon Systems, Inc.
2059b48294bea78c7f475d3982eb36101387d76c 29-Mar-2005 jeff <jeff@FreeBSD.org> - Don't initialize v_dd here, let cache_purge() do it for us.

Sponsored by: Isilon Systems, Inc.
8c749eb801f645d30dfa06ba195c308a01dd541d 28-Mar-2005 jeff <jeff@FreeBSD.org> - Move code that should probably be an assert above the main body of
vrele so that we can decrease the indentation of the real work and
make things slightly more clear.

Sponsored by: Isilon Systems, Inc.
b25a47299357437c7d4212884ceaecdab1369edc 28-Mar-2005 jeff <jeff@FreeBSD.org> - Adjust asserts in vop_lookup_post() to match the new post PDIRUNLOCK

Sponsored by: Isilon Systems, Inc.
eac95420b80054c258d4e34487b087258410d1f5 27-Mar-2005 phk <phk@FreeBSD.org> Remove another ';' after if().

Also spotted by: bz
4b5dfbb1ae0878d1137c31291fa6f3b372a324de 27-Mar-2005 phk <phk@FreeBSD.org> Remove extra ; at end of if().

Found by: bz
6d72a7bd6045cc1f68678453d98166ef7d09756a 25-Mar-2005 jeff <jeff@FreeBSD.org> - Don't recycle vnodes anymore. Free them once they are dead. getnewvnode
now always allocates a new vnode.
- Define a new function, vnlru_free, which frees vnodes from the free list.
It takes as a parameter the number of vnodes to free, which is
wantfreevnodes - freevnodes when called from vnlru_proc or 1 when
called from getnewvnode(). For now, getnewvnode() still tries to reclaim
a free vnode before creating a new one when we are near the limit.
- Define a function, vdestroy, which handles the actual release of memory
and teardown of locks, etc. This could become a uma_dtor() routine.
- Get rid of minvnodes. Now wantfreevnodes is 1/4th the max vnodes. This
keeps more unreferenced vnodes around so that files which have only
been stat'd are less likely to be kicked out of the system before we
have a chance to read them, etc. These vnodes may still be freed via
the normal vnlru_proc() routines which may some day become a real lru.
0210925e420b2f938988d2f44a571ed45d741fd8 24-Mar-2005 jeff <jeff@FreeBSD.org> - Pass LK_EXCLUSIVE to VFS_ROOT() to satisfy the new flags argument. For
now, all calls to VFS_ROOT() should still acquire exclusive locks.

Sponsored by: Isilon Systems, Inc.
bf2e6f43e88a9251233204f4226aba4d6956c835 24-Mar-2005 jeff <jeff@FreeBSD.org> - If vput() is called with a shared lock it must upgrade to an exclusive
before it can call VOP_INACTIVE(). This must use the EXCLUPGRADE path
because we may violate some lock order with another locked vnode if
we drop and reacquire the lock. If EXCLUPGRADE fails, we mark the
vnode with VI_OWEINACT. This case should be very rare.
- Clear VI_OWEINACT in vinactive() and vbusy().
- If VI_OWEINACT is set in vgone() do the VOP_INACTIVE call here as well.

Sponsored by: Isilon Systems, Inc.
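
The upgrade-or-defer logic in this entry can be modelled as below. This is a single-threaded sketch under stated assumptions: the toy_ names are hypothetical, the lock is reduced to a sharer count and an exclusive flag, and a failed upgrade is assumed to drop the caller's shared lock, as a later entry in this log notes for LK_EXCLUPGRADE.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_vnode {
	int	sharers;	/* number of shared lock holders */
	bool	exclusive;	/* true while exclusively locked */
	bool	oweinact;	/* stands in for VI_OWEINACT */
	int	inactive_calls;	/* how often the inactive hook ran */
};

/*
 * Non-sleeping upgrade. It succeeds only for the sole sharer; on failure
 * the caller's shared lock is gone.
 */
static bool
toy_exclupgrade(struct toy_vnode *vp)
{
	assert(vp->sharers > 0 && !vp->exclusive);
	vp->sharers--;
	if (vp->sharers > 0)
		return (false);
	vp->exclusive = true;
	return (true);
}

/* Requires the exclusive lock; running it clears the debt. */
static void
toy_vinactive(struct toy_vnode *vp)
{
	assert(vp->exclusive);
	vp->oweinact = false;
	vp->inactive_calls++;
}

/* Last use reference dropped while holding only a shared lock. */
static void
toy_vput_shared(struct toy_vnode *vp)
{
	if (!toy_exclupgrade(vp)) {
		vp->oweinact = true;	/* a later exclusive holder must pay */
		return;
	}
	toy_vinactive(vp);
	vp->exclusive = false;		/* unlock */
}
```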
d289cc6b5d146ab20a01f7ed37ce664f92f49af4 15-Mar-2005 jeff <jeff@FreeBSD.org> - Now that there are no external users of vfree() make it static.
no one else attempts to grow a dependency on them.
- Now that objects with pages hold the vnode we don't have to do unlocked
checks for the page count in the vm object in VSHOULDFREE. These three
macros could simply check for holdcnt state transitions to determine
whether the vnode is on the free list already, but the extra safety
the flag affords us is probably worth the minimal cost.
- The leafonly sysctl and code have been dead for several years now,
remove the sysctl and the code that employed it from vtryrecycle().
- vtryrecycle() also no longer has to check the object's page count as
the object holds the vnode until it reaches 0.

Sponsored by: Isilon Systems, Inc.
2115694bbc6bb40d45f659eef89bac4b98ad6585 15-Mar-2005 jeff <jeff@FreeBSD.org> - Expose vholdl() so it may be used outside of vfs_subr.c
3fcb9112fb1915c98e275295e779c38aa7a0b32a 14-Mar-2005 jeff <jeff@FreeBSD.org> - Increment the holdcnt once for each usecount reference. This allows us
to use only the holdcnt to determine whether a vnode may be recycled,
simplifying the V* macros as well as vtryrecycle(), etc.

Sponsored by: Isilon Systems, Inc.
2a81e8df21b1d63b9464460805910362704b6076 14-Mar-2005 jeff <jeff@FreeBSD.org> - We do not have to check the object's ref_count in VSHOULDFREE or
vtryrecycle(). All obj refs also ref the vnode.
- Consistently use v_incr_usecount() to increment the usecount. This will
be more important later.

Sponsored by: Isilon Systems, Inc.
bb63517e7e0f7af6a62d67c01c83cb9c5cb2f8f7 14-Mar-2005 jeff <jeff@FreeBSD.org> - Slightly rearrange vrele() to move the common case in one indentation

Sponsored by: Isilon Systems, Inc.
a307ec6ef809153ac619d169f0bf6680db24e4ea 14-Mar-2005 jeff <jeff@FreeBSD.org> - Rework vget() so we drop the usecount in two failure cases that were
missed by my last commit.

Sponsored by: Isilon Systems, Inc.
d29b61a36562eb17d64543ec501e3ee4b5c98e57 13-Mar-2005 jeff <jeff@FreeBSD.org> - Remove vx_lock, vx_unlock, vx_wait, etc.
- Add a vn_start_write/vn_finished_write around vlrureclaim so we don't do
writing ops without suspending. This could suspend the vlruproc which
should not be a problem under normal circumstances.
- Manually implement VMIGHTFREE in vlrureclaim as this was the only instance
where it was used.
- Acquire a lock before calling vgone() as it now requires it.
- Move the acquisition of the vnode interlock from vtryrecycle() to
getnewvnode() so that if it fails we don't drop and reacquire the
- Check for a usecount or holdcount at the end of vtryrecycle() in case
someone grabbed a ref while we were recycling. Abort the recycle, and
on the final ref drop this vnode will be placed on the head of the free
- Move the redundant VOP_INACTIVE protection code into the local
vinactive() routine to avoid code bloat.
- Keep the vnode lock held across calls to vgone() in several places.
- vgonel() no longer uses XLOCK, instead callers must hold an exclusive
vnode lock. The VI_DOOMED flag is set to allow other threads to detect
a vnode which is no longer valid. This flag is set until the last
reference is gone, and there are no chances for a new ref. vgonel()
holds this lock across the entire function, which greatly simplifies
- Only vfree() in one place in vgone() not three.
- Adjust vget() to check the VI_DOOMED flag prior to waiting on the lock
in the LK_NOWAIT case. In other cases, check after we have slept and
acquired an exclusive lock. This will simulate the old vx_wait()

Sponsored by: Isilon Systems, Inc.
d2fecffa39eea996b742a72ed14b73f73cf0217d 23-Feb-2005 jeff <jeff@FreeBSD.org> - Enable SMP VFS by default on current. More users are needed to turn up
any remaining bugs. Anyone inconvenienced by this can still disable it
in the loader.

Sponsored by: Isilon Systems, Inc.
cd66df18cccbb54d9499965f62bdb5ffab891dae 23-Feb-2005 jeff <jeff@FreeBSD.org> - Only the xlock holder should be calling VOP_LOCK on a vp once VI_XLOCK
has been set. Assert that this is the case so that we catch filesystems
who are using naked VOP_LOCKs in illegal cases.

Sponsored by: Isilon Systems, Inc.
0d71606b28f40348cf50e0975609ef1438285b64 22-Feb-2005 jeff <jeff@FreeBSD.org> - Add a check for xlock in vop_lock_assert. Presently the xlock is
considered to be as good as an exclusive lock, although there is still a
possibility of someone acquiring a VOP LOCK while xlock is held.

Sponsored by: Isilon Systems, Inc.
31dd38da62328e545a0c4e79198b1b866d43b1d3 22-Feb-2005 phk <phk@FreeBSD.org> Zero the v_un container field to make sure everything is gone.
f1d058e0327e1be6ec9de3aae8a83b20ad9a7325 22-Feb-2005 phk <phk@FreeBSD.org> Reap more benefits from DEVFS:

List devfs_dirents rather than vnodes off their shared struct cdev, this
saves a pointer field in the vnode at the expense of a field in the
devfs_dirent. There are often 100 times more vnodes so this is a bargain.
In addition it makes it harder for people to try to do stupid things like
"finding the vnode from cdev".

Since DEVFS handles all VCHR nodes now, we can do the vnode related
cleanup in devfs_reclaim() instead of in dev_rel() and vgonel().
Similarly, we can do the struct cdev related cleanup in dev_rel()
instead of devfs_reclaim().

rename idestroy_dev() to destroy_devl() for consistency.

Add LIST_ENTRY de_alias to struct devfs_dirent.
Remove v_specnext from struct vnode.
Change si_hlist to si_alist in struct cdev.
String new devfs vnodes' devfs_dirent on si_alist when
we create them and take them off in devfs_reclaim().

Fix devfs_revoke() accordingly. Also don't clear fields
devfs_reclaim() will clear when called from vgone();

Let devfs_reclaim() call dev_rel() instead of vgonel().

Move the usecount tracking from dev_rel() to devfs_reclaim(),
and let dev_rel() take a struct cdev argument instead of vnode.

Destroy SI_CHEAPCLONE devices in dev_rel() (instead of
devfs_reclaim()) when they are no longer used. (This
should maybe happen in devfs_close() instead.)
cd21b2e10ce05cf965e2989f557834034cf53854 22-Feb-2005 phk <phk@FreeBSD.org> Remove vfinddev(), it is generally bogus when faced with jails and
chroot and has no legitimate use(r)s in the tree.
66dfd6396149e353342cd31080e6b88469aec335 19-Feb-2005 phk <phk@FreeBSD.org> Try to unbreak the vnode locking around vop_reclaim() (based mostly on
patch from kan@).

Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.
1fe081e9547b3378fe56a3029993aeec6de9339f 18-Feb-2005 phk <phk@FreeBSD.org> Make sure to drop the VI_LOCK in vgonel();

Spotted by: Taku YAMAMOTO <taku@tackymt.homeip.net>
af1fa2025c1adbd3aa43852fd421e5de95e7e48a 17-Feb-2005 phk <phk@FreeBSD.org> Introduce vx_wait{l}() and use it instead of home-rolled versions.
b6768ad7ab5e57fa2022ae70c6a59bf4b89bdfb8 17-Feb-2005 phk <phk@FreeBSD.org> Convert KASSERTS to VNASSERTS
5dd8d305754a018754a81424da7eb8d297b122eb 10-Feb-2005 phk <phk@FreeBSD.org> Make various vnode related functions static
5d1652b89d501106b492b3ed096ffadf96ecb325 10-Feb-2005 phk <phk@FreeBSD.org> Don't pass NULL to vprint()
06f7a532e93276cb96108556ec3b75f092592f21 08-Feb-2005 jeff <jeff@FreeBSD.org> - Add a new assert in the getnewvnode(). Assert that the usecount is still
0 to detect getnewvnode() races.
- Add the vnode address to a few panics near by to help in debugging.

Sponsored by: Isilon Systems, Inc.
628952636c9d2900c378289f5f819c2c9cbf7888 07-Feb-2005 phk <phk@FreeBSD.org> Access vmobject via the bufobj instead of the vnode
d2bbb620e902ad472a90d5c5d1fc3286b192a0a2 07-Feb-2005 phk <phk@FreeBSD.org> Don't call VOP_DESTROYVOBJECT(), trust that VOP_RECLAIM() did what
was necessary.
4f73d0b6fc7bb30749ac302949df13ac9a76c817 28-Jan-2005 phk <phk@FreeBSD.org> Remove unused argument to vrecycle()
f8b1ba904fabe423559402854cc1b025ada4641b 28-Jan-2005 phk <phk@FreeBSD.org> Integrate vclean() into vgonel().

Various associated polishing.
eaf84397bbb9ac3fb2d027ddc3c0146b4d1398f3 28-Jan-2005 phk <phk@FreeBSD.org> Remove register keyword
796d435574629a3a293e13d786e313d9d473a134 25-Jan-2005 phk <phk@FreeBSD.org> Don't use VOP_GETVOBJECT, use vp->v_object directly.
1d63b12e222b2e353a1eaa08ab50ba0423b04ae4 24-Jan-2005 phk <phk@FreeBSD.org> Eliminate the constant flags argument to vclean()
dc1cfea3cd94826bb9e1670c047671ff34d6ec7d 24-Jan-2005 phk <phk@FreeBSD.org> Change vprint() to vn_printf() which takes varargs.
Add #define for vprint() to call vn_printf().
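
A varargs printer with a fixed-label wrapper macro, in the spirit of this entry, might look like the sketch below. It is an illustration only: the toy_ names are hypothetical, and unlike the kernel function this version formats into a caller-supplied buffer so that the result can be inspected.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

struct toy_vnode {
	int	v_usecount;
	int	v_holdcnt;
};

/* Emit a printf-style prefix followed by a fixed summary of the vnode. */
static int
toy_vn_printf(char *out, size_t outlen, const struct toy_vnode *vp,
    const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(out, outlen, fmt, ap);
	va_end(ap);
	if (n < 0 || (size_t)n >= outlen)
		return (n);		/* error or prefix truncated */
	return (n + snprintf(out + n, outlen - (size_t)n,
	    "usecount %d, holdcnt %d\n", vp->v_usecount, vp->v_holdcnt));
}

/* The old fixed-label interface becomes a thin macro over the new one. */
#define	toy_vprint(out, outlen, label, vp) \
	toy_vn_printf((out), (outlen), (vp), "%s\n", (label))
```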
d5c135375c366bd87eec9c632eece3157076af35 24-Jan-2005 phk <phk@FreeBSD.org> Kill the VV_OBJBUF and test the v_object for NULL instead.
8cb539567897815e8f65bc05a1f410ce7ce6f5a1 24-Jan-2005 jeff <jeff@FreeBSD.org> - Add the tunable and sysctl for the mpsafevfs. It currently defaults
to off.
- Protect access to mnt_kern_flag with the mountpoint mutex.
- Remove some KASSERTs which are not legal checks without the appropriate
locks held.
- Use VCANRECYCLE() rather than rolling several slightly different
checks together.
- Return from vtryrecycle() with a recycled vnode rather than a locked
vnode. This simplifies some locking.
- Remove several GIANT_REQUIRED lines.
- Add a few KASSERTs to help with INACT debugging.

Sponsored By: Isilon Systems, Inc.
d3b1b2cc99641538c58e8937d2f63e3a9aa5b867 16-Jan-2005 phk <phk@FreeBSD.org> Fix a bug I introduced in 1.561 which has caused considerable filesystem
unhappiness lately.

As far as I can tell, no files that have made it safely to disk
have been endangered, but stuff in transit has been in peril.

Pointy hat: phk
cc0cbc6b34a1e285642b1061afc3cf87a868e769 14-Jan-2005 phk <phk@FreeBSD.org> Eliminate unused and unnecessary "cred" argument from vinvalbuf()
3760addae23efed2c59b081d8b911fffafea8c14 13-Jan-2005 phk <phk@FreeBSD.org> Ditch vfs_object_create() and make the callers call VOP_CREATEVOBJECT()
5a497775d6a793f5afeceff71192b3980f5a8ad2 11-Jan-2005 phk <phk@FreeBSD.org> Add BO_SYNC() and add a default which uses the secret vnode pointer
and VOP_FSYNC() for now.
437e41e06166553a5fe9ae4ea95f3fc1ca07fff1 11-Jan-2005 phk <phk@FreeBSD.org> More vnode -> bufobj migration.
649a01e1a513c0ebe5f2c4840ed4bea292e7a9c4 11-Jan-2005 phk <phk@FreeBSD.org> Give flushbuflist() a struct bufv as first argument and avoid home-rolling

Lose the error pointer argument and return any errors the normal way.

Return EAGAIN for the case where more work needs to be done.
da2718f1af898ee94e792d508153bc47de407fe3 11-Jan-2005 phk <phk@FreeBSD.org> Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC().

I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:

The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.

Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.

If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to be the
cached mount credential, or a credential cached along with
any delayed write data.

Discussed with: rwatson
20280f143170ee08a1e2cbd8871550105b276674 06-Jan-2005 imp <imp@FreeBSD.org> /* -> /*- for copyright notices, minor format tweaks as necessary
f2581a224ff97d3615d0cfa99b7f2cc70775057d 04-Jan-2005 phk <phk@FreeBSD.org> Since we do not support forceful unmount of DEVFS we can do away with
the partially implemented vnode-readoption code in vgonechrl().
0faeb292ed219df549f255578f74531a32900d18 20-Dec-2004 phk <phk@FreeBSD.org> We can only ever get to vgonechrl() from a devfs vnode, so we do not
need to reassign the vp->v_op to devfs_specops, we know that is the
value already.

Make devfs_specops private to devfs.
4a639d6164f667049cce0046d22e760ca1aad3f2 07-Dec-2004 phk <phk@FreeBSD.org> The remaining part of nmount/omount/rootfs mount changes. I cannot sensibly
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:

Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.

Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()

Remove vfs_omount() method, all filesystems are now converted.

Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.

Change rootmounting to use DEVFS trampoline:

Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.

Remove now unnecessary getdiskbyname().

Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.

Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.

Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworthy anyway. Correct information is provided by
4b1a11443659e50cf4dd8cb4be481621cde3db2c 03-Dec-2004 phk <phk@FreeBSD.org> Improve vprint() a little bit: break long lines, reduce indent and tell
if the VI_LOCK() is held.
59f305606cbc120b44978581149ef1a3e62bf3b4 01-Dec-2004 phk <phk@FreeBSD.org> Back when VOP_* was introduced, we did not have new-style struct
initializations but we did have lofty goals and big ideals.

Adjust to more contemporary circumstances and gain type checking.

Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to believe this would ever be feasible in the
first place.

Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.

Give coda correct prototypes and function definitions for
all vop_()s.

Generate a bit more data from the vnode_if.src file: a
struct vop_vector and prototype typedefs for all vop methods.

Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.

Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.

Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.

Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.

Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)
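
The typed operations vector with a chained default, and a generic call that locates a method by its offsetof() within the vector, can be sketched as follows. All toy_ names are hypothetical, only two methods are modelled, and the lookup walks the default chain at call time, which is an assumption about the mechanism rather than a copy of the generated code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vnode;
typedef int toy_vop_t(struct toy_vnode *);

/* Properly typed method table; unset slots are NULL. */
struct toy_vop_vector {
	const struct toy_vop_vector *vop_default;
	toy_vop_t	*vop_open;
	toy_vop_t	*vop_close;
};

struct toy_vnode {
	const struct toy_vop_vector *v_op;
};

static int toy_std_open(struct toy_vnode *vp) { (void)vp; return (0); }
static int toy_std_close(struct toy_vnode *vp) { (void)vp; return (7); }
static int toy_myfs_open(struct toy_vnode *vp) { (void)vp; return (42); }

static const struct toy_vop_vector toy_default_vnodeops = {
	.vop_default = NULL,
	.vop_open = toy_std_open,
	.vop_close = toy_std_close,
};

/* A filesystem overrides only what it needs and chains to the default. */
static const struct toy_vop_vector toy_myfs_vnodeops = {
	.vop_default = &toy_default_vnodeops,
	.vop_open = toy_myfs_open,
};

/* Generic call: find the slot at byte offset off, following vop_default. */
static int
toy_vcall(struct toy_vnode *vp, size_t off)
{
	const struct toy_vop_vector *vop;
	toy_vop_t *fn;

	for (vop = vp->v_op; vop != NULL; vop = vop->vop_default) {
		fn = *(toy_vop_t * const *)((const char *)vop + off);
		if (fn != NULL)
			return (fn(vp));
	}
	return (-1);	/* no implementation anywhere in the chain */
}
```

Because the vectors are ordinary initialized structures, the compiler checks each method's type, and a stacking layer can still forward any operation generically via toy_vcall(vp, offsetof(struct toy_vop_vector, vop_close)).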
ca008fe1710db893e5a8b7a8793ff2c5a0e2d9ee 15-Nov-2004 phk <phk@FreeBSD.org> Move pbgetvp() and pbrelvp() to vm_pager.c with the rest of the pbuf stuff.
d632cb3d949eaa28f904c39dadbb55d0f61d9e3f 14-Nov-2004 phk <phk@FreeBSD.org> Move the bit of the syncer which deals with vnodes into a separate
b99ff6bbb678feaf9573724b84ffc2c5584a84e0 13-Nov-2004 phk <phk@FreeBSD.org> Eliminate vop_revoke() function now that devfs_revoke() does the entire job.
9fdf02798896bafcd92b649de16e4cdfa56a96eb 10-Nov-2004 phk <phk@FreeBSD.org> Slim vnodes by another four bytes by eliminating the (now) unused field
cea50000575c49a4d1ad8fc023e281ede89058aa 10-Nov-2004 phk <phk@FreeBSD.org> Remove vn_todev()
7bef55364a66de53ef29f35f1db7f570128b78d2 09-Nov-2004 phk <phk@FreeBSD.org> Remove vnode->v_cachedfs.

It was only used for the highly dangerous "export all vnodes with a sysctl"
1e4caea88cc5cd144f57b486e15c73b10e77f111 04-Nov-2004 phk <phk@FreeBSD.org> Remove buf->b_dev field.
34a530853df4fb3d027482ec079ec1aa17b0dffa 03-Nov-2004 phk <phk@FreeBSD.org> Always initialize bo_private along with bo_ops in getnewvnode().

Spotted by: tegge
c928cc4c546f34011ddc8bb00e0f6c9e4fab7a46 29-Oct-2004 phk <phk@FreeBSD.org> Lose vfs_mountedon()
12ca46b3fce383c1db9e62c046b3d12002e45219 29-Oct-2004 phk <phk@FreeBSD.org> Give the bufobj a private __bo_vnode for now to keep the syncer floating [1]
At some point later the syncer will unlearn about vnodes and the filesystems
method called by the syncer will know enough about what's in bo_private to
do the right thing.

[1] Ok, I know, but I couldn't resist the pun.
56a7ee8e7fb7df8976226828bc05891fcd3065f5 27-Oct-2004 phk <phk@FreeBSD.org> Move the syncer linkage from vnode to bufobj.

This is not quite a perfect separation: the syncer still thinks it knows
that everything is a vnode.
c66aa10c8e6167eefdae7f218765e3e5f146074a 26-Oct-2004 phk <phk@FreeBSD.org> Put the I/O block size in bufobj->bo_bsize.

We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably moot
when filesystems sit on GEOM, so don't bother for now.
0e87ab8bc6e542c845f82c2bb526208587b200ad 25-Oct-2004 phk <phk@FreeBSD.org> Lose the v_dirty* and v_clean* alias macros.

Check the count field where we just want to know the full/empty state,
rather than using TAILQ_EMPTY() or TAILQ_FIRST().
3a8a530155f4b380ed48fd29c1dd0b435ff9f272 25-Oct-2004 phk <phk@FreeBSD.org> Remove vnode->v_bsize. This was a dead-end.
4ba53ec41b6cea4fe1d5f96985036c8d0e1e994b 25-Oct-2004 phk <phk@FreeBSD.org> Collapse vnode->v_object and buf->b_object into bufobj->bo_object.
1b25a5988640ac862e8e964c30aaccfd83e128cf 24-Oct-2004 phk <phk@FreeBSD.org> Move the buffer method vector (buf->b_op) to the bufobj.

Extend it with a strategy method.

Add bufstrategy() which does the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.

Rename ibwrite to bufwrite().

Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.

Add inlines for bwrite() and bstrategy() which calls through

Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
f046b536920c8b21efc158ad1ec9050595be4676 22-Oct-2004 rwatson <rwatson@FreeBSD.org> When MAC is enabled, warn if getnewvnode() is asked to produce a vnode
without a mountpoint. In this scenario, there's no useful source for
a label on the vnode, since we can't query the mountpoint for the
labeling strategy or default label.
2c3e47b66843948385e7152055338c67f3c6a0a0 22-Oct-2004 phk <phk@FreeBSD.org> Alas, poor SPECFS! -- I knew him, Horatio; A filesystem of infinite
jest, of most excellent fancy: he hath taught me lessons a thousand
times; and now, how abhorred in my imagination it is! my gorge rises
at it. Here were those hacks that I have curs'd I know not how
oft. Where be your kludges now? your workarounds? your layering
violations, that were wont to set the table on a roar?

Move the skeleton of specfs into devfs where it now belongs and
bury the rest.
52a089c5262ab46beaee3c8aaedbd0c47da5b403 22-Oct-2004 phk <phk@FreeBSD.org> Add b_bufobj to struct buf which eventually will eliminate the need for b_vp.

Initialize b_bufobj for all buffers.

Make incore() and gbincore() take a bufobj instead of a vnode.

Make inmem() local to vfs_bio.c

Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),

Make buf_vlist_add() take a bufobj instead of a vnode.

Eliminate other uses of bp->b_vp where bp->b_bufobj will do.

Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.
3833976d1250bf118a46939f409012d87e558de6 21-Oct-2004 phk <phk@FreeBSD.org> Move the VI_BWAIT flag into a new bo_flag element of bufobj and call it BO_WWAIT

Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().

Use these functions in all relevant places except in ffs_softdep.c where
the use of interlocked_sleep() makes this impossible.

Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
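
A write count with ref, drop and wait operations can be sketched in userland C as below. This is only an illustration of the pattern the commit describes, assuming POSIX threads. The names wcount, wcount_ref(), wcount_drop() and wcount_wait() are hypothetical and are not the kernel's bufobj_wref(), bufobj_wdrop() or bufobj_wwait(), which use the bo_mtx and the BO_WWAIT flag rather than a condition variable.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for the write count kept on a bufobj. */
struct wcount {
	pthread_mutex_t	lock;		/* protects numoutput */
	pthread_cond_t	idle;		/* signalled when numoutput hits 0 */
	int		numoutput;	/* writes currently in progress */
};

static void
wcount_init(struct wcount *wc)
{
	pthread_mutex_init(&wc->lock, NULL);
	pthread_cond_init(&wc->idle, NULL);
	wc->numoutput = 0;
}

/* Account for one more write in progress. */
static void
wcount_ref(struct wcount *wc)
{
	pthread_mutex_lock(&wc->lock);
	wc->numoutput++;
	pthread_mutex_unlock(&wc->lock);
}

/* One write finished; wake any waiters when the count reaches zero. */
static void
wcount_drop(struct wcount *wc)
{
	pthread_mutex_lock(&wc->lock);
	assert(wc->numoutput > 0);
	if (--wc->numoutput == 0)
		pthread_cond_broadcast(&wc->idle);
	pthread_mutex_unlock(&wc->lock);
}

/* Block until no writes are in progress. */
static void
wcount_wait(struct wcount *wc)
{
	pthread_mutex_lock(&wc->lock);
	while (wc->numoutput > 0)
		pthread_cond_wait(&wc->idle, &wc->lock);
	pthread_mutex_unlock(&wc->lock);
}
```

The point of the sketch is that the counter, the wakeup and the wait all use the same lock, so a waiter cannot miss the transition to zero.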
fdf614c0bad1564664cac5105be08477d1bcdd72 21-Oct-2004 phk <phk@FreeBSD.org> Add BO_* macros parallel to VI_* macros for manipulating the bo_mtx.

Initialize the bo_mtx when we allocate a vnode in getnewvnode(). For
now we point to the vnode's interlock mutex, that retains the exact
same locking semantics.

Move v_numoutput from vnode to bufobj. Add renaming macro to
postpone code sweep.
b436dad078f20cb7f18ead7553923d45b0d6fcdf 21-Oct-2004 phk <phk@FreeBSD.org> Polish vtruncbuf() to improve readability and style a bit.
350f8121036f9f84eb11edc8cf3bd5435ffd9b6b 21-Oct-2004 phk <phk@FreeBSD.org> Simplify buf_vlist_remove().

Now that we have encapsulated the splaytree related information
into a structure we can eliminate half of this function.
152055d94ba686525c26e82a33494a33654b2f85 06-Oct-2004 grog <grog@FreeBSD.org> vtryrecycle: Don't rely on type VBAD alone to mean that we don't need
to clean the vnode. If v_data is set, we still need to
clean it. This code change should catch all incidents of
the previous commit (INVARIANTS only).
882d69104e8714af21885a5c123811a74609b35b 06-Oct-2004 grog <grog@FreeBSD.org> getnewvnode: Weaken the panic "cleaned vnode isn't" to a warning.

Discussion: this panic (or warning) only occurs when the kernel is
compiled with INVARIANTS. Otherwise the problem (which means that
the vp->v_data field isn't NULL, and represents a coding error and
possibly a memory leak) is silently ignored by setting it to NULL
later on.

Panicking here isn't very helpful: by this time, we can only find
the symptoms. The panic occurs long after the reason for "not
cleaning" has been forgotten; in the case in point, it was the
result of severe file system corruption which left the v_type field
set to VBAD. That issue will be addressed by a separate commit.
2e5b8b98835509cddd3b2345fbdc9d31180a21d9 01-Oct-2004 phk <phk@FreeBSD.org> Fix a LOR relating to freeing cdevs.
5536d5757bcdba9f55338d879f2757ac320e7a43 24-Sep-2004 phk <phk@FreeBSD.org> Hold dev_lock and check for NULL devsw pointer when we determine
if a vnode is a disk.
3947e54e89a67263df7cb82659d2a60b22e36d66 23-Sep-2004 phk <phk@FreeBSD.org> Do not refcount the cdevsw, but rather maintain a cdev->si_threadcount
of the number of threads which are inside whatever is behind the
cdevsw for this particular cdev.

Make the device mutex visible through dev_lock() and dev_unlock().
We may want finer granularity later.

Replace spechash_mtx use with dev_lock()/dev_unlock().
02df7323ee036c2be0e7631cf9409b60dfd8c639 15-Sep-2004 phk <phk@FreeBSD.org> Remove unused B_WRITEINPROG flag
1912367ebb1a5029d72a6b3b028c32f0af41f0b5 07-Sep-2004 phk <phk@FreeBSD.org> Create simple function init_va_filerev() for initializing a va_filerev

Replace three instances of longhaired initialization of va_filerev fields.

Added XXX comment wondering why we don't use random bits instead of
uptime of the system for this purpose.
54d23a34f60fa406f7b0eebdfec259936c967881 20-Aug-2004 truckman <truckman@FreeBSD.org> Don't attempt to trigger the syncer thread final sync code in the
shutdown_pre_sync state if the RB_NOSYNC flag is set. This is the
likely cause of hangs after a system panic that are keeping crash
dumps from being done.

This is a MFC candidate for RELENG_5.

MFC after: 3 days
8f41b4e870a29aa1c1ce19d2e1440014b71d0078 16-Aug-2004 obrien <obrien@FreeBSD.org> s/MAX_SAFE_MAXVNODES/MAXVNODES_MAX/g
bc1805c6e871c178d0b6516c3baa774ffd77224a 15-Aug-2004 jmg <jmg@FreeBSD.org> Add locking to the kqueue subsystem. This also makes the kqueue subsystem
a more complete subsystem, and removes the knowledge of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using its filter ops.

Currently, it uses the MTX_DUPOK even though it is not always safe to
acquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).

Reviewed by: green, rwatson (both earlier versions)
371cf09cf75eb31267e0d4f07e35e6fba755d416 11-Aug-2004 rwatson <rwatson@FreeBSD.org> In v_addpollinfo(), we allocate storage to back vp->v_pollinfo. However,
we may sleep when doing so; check that we didn't race with another thread
allocating storage for the vnode after allocation is made to a local
pointer, and only update the vnode pointer if it's still NULL. Otherwise,
accept that another thread got there first, and release the local storage.

Discussed with: jmg
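
The allocate-then-recheck pattern described in this commit can be sketched in userland C as follows. This is a minimal illustration assuming POSIX threads. The struct node, struct pollinfo and node_add_pollinfo() names are hypothetical and do not reproduce the kernel's v_addpollinfo().

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct pollinfo {
	int	events;
};

/* Hypothetical stand-in for a vnode with lazily allocated poll state. */
struct node {
	pthread_mutex_t	 lock;		/* protects pollinfo */
	struct pollinfo	*pollinfo;	/* NULL until first needed */
};

/*
 * Returns 1 if this call installed the storage, 0 if another caller
 * got there first, and -1 if the allocation failed.
 */
static int
node_add_pollinfo(struct node *np)
{
	struct pollinfo *pi;
	int installed = 0;

	assert(np != NULL);
	/* Allocate into a local pointer first; this step may block. */
	pi = calloc(1, sizeof(*pi));
	if (pi == NULL)
		return (-1);

	pthread_mutex_lock(&np->lock);
	if (np->pollinfo == NULL) {
		/* Nobody raced us: publish the new storage. */
		np->pollinfo = pi;
		pi = NULL;
		installed = 1;
	}
	pthread_mutex_unlock(&np->lock);

	/* Lost the race: release the local copy (free(NULL) is a no-op). */
	free(pi);
	return (installed);
}
```

Allocating before taking the lock keeps the blocking call outside the critical section, at the cost of an occasional wasted allocation when two callers race.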
7e21ce666c82537fdaf6354b41e681ecfd85640c 10-Aug-2004 njl <njl@FreeBSD.org> Skip the syncing disks loop if there are no dirty buffers. Remove a
variable used to flag the initial printf.

Submitted by: truckman (earlier version)
47f728c0bc17917916748b0d7360860e771bcccf 02-Aug-2004 obrien <obrien@FreeBSD.org> Put a cap on the auto-tuning of kern.maxvnodes.

Cap value chosen by: scottl
774b91783e24d4a1d9f72bc1f9c4ba25ce398ffe 30-Jul-2004 njl <njl@FreeBSD.org> Minor message cleanup.
8c9258b82e736184b4a7b3976b0ed47bc9fc245f 27-Jul-2004 phk <phk@FreeBSD.org> Convert the vfsconf list to a TAILQ.

Introduce vfs_byname() function to find things on it.

Staticize vfs_nmount() function under the name vfs_donmount().

Various cleanups.
d9fecc83c80e01cb4e66d1fd0a02e96b1fcbcbc5 26-Jul-2004 cperciva <cperciva@FreeBSD.org> Rename suser_cred()'s PRISON_ROOT flag to SUSER_ALLOWJAIL. This is
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.

The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)

Discussed with: rwatson, scottl
Requested by: jhb
5297516e026506cc0daa09237372bf903023155d 25-Jul-2004 phk <phk@FreeBSD.org> Eliminate unused second argument to reassignbuf() and simplify it
c2337759f354111ae03587516e05d1bfe0a7e383 21-Jul-2004 alfred <alfred@FreeBSD.org> put several of the options for DEBUG_VFS_LOCKS under control of sysctls.
bcc5104ce1c31b23fe0d246ee43347f0673fcb5c 15-Jul-2004 alfred <alfred@FreeBSD.org> Cleanup shutdown output.
0389c188dede7411146ec973b6345a8fffff458a 15-Jul-2004 alfred <alfred@FreeBSD.org> Tidy up system shutdown.
8a1713aada9c142d3c2096e4857ff30970d9b1d0 12-Jul-2004 alfred <alfred@FreeBSD.org> Make VFS_ROOT() and vflush() take a thread argument.
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.
45c9fe9da751a37578e52b51cf2e863c1d025643 12-Jul-2004 alfred <alfred@FreeBSD.org> Dump the actual bad values when this assertion is tripped.
c20ced5cd2addb28ba6e7b3edfc3b473021124fa 10-Jul-2004 marcel <marcel@FreeBSD.org> Update for the KDB framework:
o Call kdb_enter() instead of Debugger().
b65386ecc32a78cdeb0fade38d6516b729100e05 08-Jul-2004 alfred <alfred@FreeBSD.org> fixup sysctl by fsid node
e0a5f530c25100eec4a2a3dfa0ac270bdff214b7 06-Jul-2004 alfred <alfred@FreeBSD.org> Introduce vfs_suser(), used to test if a user should have special privs
for a mount.
97a6f04270d14e570225b6e17134c79ed8796903 06-Jul-2004 alfred <alfred@FreeBSD.org> NFS mobility PHASE I, II & III (phase VI, and V pending):

Rebind the client socket when we experience a timeout. This fixes
the case where our IP changes for some reason.

Signal a VFS event when NFS transitions from up to down and vice versa.

Add a placeholder vfs_sysctl where we will put status reporting

Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.
690b842bc55872b2b6a648925842b29f58daf63c 05-Jul-2004 truckman <truckman@FreeBSD.org> Unconditionally set last_work_seen while in the SYNCER_RUNNING state
so that last_work_seen has a reasonable value at the transition
to the SYNCER_SHUTTING_DOWN state, even if net_worklist_len happened
to be zero at the time.

Initialize last_work_seen to zero as a safety measure in case the
syncer never ran in the SYNCER_RUNNING state.

Tested by: phk
471ab74bb24fc86706d4119e21c90d73160f3881 05-Jul-2004 truckman <truckman@FreeBSD.org> Rework syncer termination code:

Speed up the syncer when shutting down by sleeping for a shorter
period of time instead of cranking up rushjob and using the
normal one second sleep.

Skip empty worklist slots when shutting down to avoid lengthy
intervals of inactivity.

Give I/O more time to complete between steps by not speeding the
syncer quite as much.

Terminate the syncer after one full pass through the worklist
plus one second with the worklist containing nothing but syncer

Print an indication of shutdown progress to the console.

Add a sysctl, vfs.worklist_len, to allow the size of the syncer worklist
to be monitored.
49a3aa211e5f1ef71675dd052fba2feea97f0978 04-Jul-2004 phk <phk@FreeBSD.org> Give synthetic root filesystem device vnodes a v_bsize of DEV_BSIZE.
95f8f1e08906224cb7baf7d920a0142b89d31df9 04-Jul-2004 alfred <alfred@FreeBSD.org> Pass the operation in with the fsidctl.
Remove some fsidctls that we will not be using.
Correct prototypes for fs sysctls.
59c88fd71a233093e6ea7e8afb491fd5803ae5a7 04-Jul-2004 phk <phk@FreeBSD.org> Make the last commit handle non-phk root devices better.
b52c81e5db821f2344b775a5b98a58a7797433e4 04-Jul-2004 phk <phk@FreeBSD.org> Blocksize for I/O should be a property of the vnode and not found by groping
around in the vnodes surroundings when we allocate a block.

Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.

Please email all such warnings to me.
bbaa6c3ec045b7de225f726d3c9367510b287184 04-Jul-2004 alfred <alfred@FreeBSD.org> Introduce a new kevent filter. EVFILT_FS that will be used to signal
generic filesystem events to userspace. Currently only mount and unmount
of filesystems are signalled. Soon to be added, up/down status of NFS.

Introduce a sysctl node used to route requests to/from filesystems
based on filesystem ids.

Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/
entrypoint by the sysctl code to change individual filesystems.
4a61cff009f9e19ec6eca6369d42e3b6a57ebd63 04-Jul-2004 alfred <alfred@FreeBSD.org> Revision 1.496 would not boot on my system due to
ffs_mount -> bdevvp -> getnewvnode(..., mp = NULL, ...) ->
insmntqueue(vp, mp = NULL) -> KASSERT -> panic

Make getnewvnode() only call insmntqueue() if the mountpoint parameter
is not NULL.
070a613a48a55587e5a08b240d1c6b21a8256e76 04-Jul-2004 phk <phk@FreeBSD.org> When we traverse the vnodes on a mountpoint we need to look out for
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.

Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:

    for (vp = TAILQ_FIRST(&mp...);
        (vp = nvp) != NULL;
        nvp = TAILQ_NEXT(vp,...)) {
            if (vp->v_mount != mp)
                    goto loop;

The code which takes vnodes off a mountpoint looks like this:

    TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
    vp->v_mount = something;

(Take a moment and try to spot the locking error before you read on.)

On an SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously gets to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.


Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.

Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which do just
what the names say.

Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
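
The fix to the removal path can be sketched in userland C as below: the list unlink and the clearing of the back pointer happen under the same lock, so a traversal holding that lock never sees a node that is off the list but still claims to belong to the mount. This is a simplified sketch assuming POSIX threads and a hand-rolled list. The struct mnt, struct vn, mnt_insert(), mnt_remove() and mnt_count() names are hypothetical, and this is not the actual MNT_VNODE_FOREACH() macro or delmntque().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct mnt;

/* Hypothetical stand-in for a vnode on a per-mount list. */
struct vn {
	struct vn	*next, *prev;
	struct mnt	*mount;		/* back pointer, NULL when off-list */
};

/* Hypothetical stand-in for a mount point. */
struct mnt {
	pthread_mutex_t	 lock;		/* protects head and vn->mount */
	struct vn	*head;
};

static void
mnt_insert(struct mnt *mp, struct vn *vp)
{
	pthread_mutex_lock(&mp->lock);
	vp->prev = NULL;
	vp->next = mp->head;
	if (mp->head != NULL)
		mp->head->prev = vp;
	mp->head = vp;
	vp->mount = mp;
	pthread_mutex_unlock(&mp->lock);
}

/* Caller is assumed to hold a reference that keeps vp->mount stable. */
static void
mnt_remove(struct vn *vp)
{
	struct mnt *mp = vp->mount;

	assert(mp != NULL);
	pthread_mutex_lock(&mp->lock);
	if (vp->prev != NULL)
		vp->prev->next = vp->next;
	else
		mp->head = vp->next;
	if (vp->next != NULL)
		vp->next->prev = vp->prev;
	vp->next = vp->prev = NULL;
	vp->mount = NULL;	/* cleared before the lock is dropped */
	pthread_mutex_unlock(&mp->lock);
}

/* Walk the list under the lock; every node seen must point back at mp. */
static int
mnt_count(struct mnt *mp)
{
	struct vn *vp;
	int n = 0;

	pthread_mutex_lock(&mp->lock);
	for (vp = mp->head; vp != NULL; vp = vp->next) {
		assert(vp->mount == mp);
		n++;
	}
	pthread_mutex_unlock(&mp->lock);
	return (n);
}
```

With the unlocked ordering quoted in the commit message, the assertion in mnt_count() could not be relied on, which is exactly the window the commit closes.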
9ed03e6eb30a4961f22d094c43eb9702ef6841aa 01-Jul-2004 truckman <truckman@FreeBSD.org> When shutting down the syncer kernel thread, first tell it to run
faster and iterate over its work list a few times in an attempt
to empty the work list before the syncer terminates. This leaves
fewer dirty blocks to be written at the "syncing disks" stage and
keeps the "giving up on N buffers" problem from being triggered
by the presence of a large soft updates work list at system shutdown
time. The downside is that the syncer takes noticeably longer to

Tested by: "Arjan van Leeuwen" <avleeuwen AT piwebs DOT com>
Approved by: mckusick
40dd98a3bd2049465e7644b361b60da41a46efa0 17-Jun-2004 phk <phk@FreeBSD.org> Second half of the dev_t cleanup.

The big lines are:
udev_t -> dev_t
udev2dev() -> findcdev()

Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
dfd1f7fd50fffaf75541921fcf86454cd8eb3614 16-Jun-2004 phk <phk@FreeBSD.org> Do the dreaded s/dev_t/struct cdev */
Bump __FreeBSD_version accordingly.
ba2d5c49372a310a32b231271d4240226147a0ab 14-Jun-2004 phk <phk@FreeBSD.org> Remove a left over from userland buffer-cache access to disks.
afc098b3e1d0431a14fa7a4d4046e47b5ea60666 31-May-2004 rwatson <rwatson@FreeBSD.org> Assert Giant in vrele().
79217d1505106247a4cf6498b5062cac0af7a376 11-Apr-2004 mux <mux@FreeBSD.org> Put deprecated sysctl code inside BURN_BRIDGES.
74cf37bd00b1e09a0b991b7b1edd335d8e0c2355 05-Apr-2004 imp <imp@FreeBSD.org> Remove advertising clause from University of California Regent's license,
per letter dated July 22, 1999.

Approved by: core
5e91995e52c8d411afa15de8d01b17016bfabc6f 29-Mar-2004 peter <peter@FreeBSD.org> Kill some XXXKSE's. vnlru/syncer are single threaded.
2a5e157787a7e1d72f1b7fd7d1ecb1e91b9e5251 11-Mar-2004 phk <phk@FreeBSD.org> Properly vector all bwrite() and BUF_WRITE() calls through the same path
and s/BUF_WRITE()/bwrite()/ since it now does the same as bwrite().
eeb7579130ee3380d0348cb447d9c484c6c4b45e 11-Mar-2004 phk <phk@FreeBSD.org> Remove unused second arg to vfinddev().
Don't call addaliasu() on VBLK nodes.
e795b7939d0e6a89ca34a05f6658f74df6c7644c 06-Mar-2004 kan <kan@FreeBSD.org> Always call vn_finished_write after vn_start_write was called. All
occurrences of 'goto done' after vn_start_write invocation were cleaning
up incompletely.
d25301c8586567f23a4a1420292fec042e6496e1 27-Feb-2004 jhb <jhb@FreeBSD.org> Switch the sleep/wakeup and condition variable implementations to use the
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep queues in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleeps no longer inherently bump the priority. Instead, this is solely
a property of msleep() which explicitly calls sched_prio() before
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.
1de257deb3229812024de5861eb0aaa41e471448 26-Feb-2004 truckman <truckman@FreeBSD.org> Split the mlock() kernel code into two parts, mlock(), which unpacks
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.

Enable the RLIMIT_MEMLOCK checking code in kern_mlock().

Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.

Nuke the vslock() and vsunlock() implementations, which are no
longer used.

Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.

Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.

Modify the callers of sysctl_wire_old_buffer() to look for the
error return.

Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.

Reviewed by: bms
711ff67b9052cfc5036ea0c293d33008f3f1a4aa 21-Feb-2004 phk <phk@FreeBSD.org> Check for NODEV return from udev2dev()
5551e292d8b6185cfd720b86e124a883103cfc15 21-Feb-2004 phk <phk@FreeBSD.org> Device megapatch 6/6:

This is what we came here for: Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:

Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver

Hold a dev_t reference while the device is open.

KASSERT that we do not enter the driver on a non-referenced dev_t.

Remove old D_NAG code, anonymous dev_t's are not a problem now.

When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list. When the refcount drops, free it.

Check that cdevsw->d_version is correct. If not, set all methods
to the dead_*() methods to prevent entrance into driver. Print
warning on console to this effect. The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.
39fb4aef3d058a6d6726a689f365d2c8a3333178 21-Feb-2004 phk <phk@FreeBSD.org> Device megapatch 5/6:

Remove the unused second argument from udev2dev().

Convert all remaining users of makedev() to use udev2dev(). The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not, like makedev(), create a new one.

Apart from the tiny well controlled window in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()
443a35f2bfabaf0a5203b2f894cf8f7476ce4008 05-Jan-2004 kan <kan@FreeBSD.org> More style fixes.

Obtained from: bde
2ffbc2464e12bb7cf32b64345eb2c12bd2fa7cc6 05-Jan-2004 kan <kan@FreeBSD.org> style(9):

Add empty line before first code line in functions with no local
Properly terminate comment sentences.
Indent lines which are longer than 80 characters.
Move v_addpollinfo closer to the rest of poll-related functions.
Move DEBUG_VFS_LOCKS ifdefed block to the end of file.

Obtained from: bde (partly)
28729219677869590f99511a1349b0460ae536c8 04-Jan-2004 kan <kan@FreeBSD.org> Cosmetics: strip '\n' from a string passed to Debugger().
7d9162647744b033ca2d175b0117a245d5092e2e 28-Dec-2003 bde <bde@FreeBSD.org> v_vxproc was a bogus name for a thread (pointer).
5443bd4c65a5c33099df3c11c47c7dcb4acbfeac 16-Dec-2003 jeff <jeff@FreeBSD.org> - In vget() if LK_NOWAIT is specified we should return EBUSY and not ENOENT.

Submitted by: Stephan Uphoff <ups@stups.com>
aa712bc6e42a435add89c314511c2cba31f0ea8c 16-Dec-2003 jeff <jeff@FreeBSD.org> - When doing a forced unmount, VFS attempts to keep VCHR vnodes valid by
reassigning their v_ops field to specfs, detaching from the mountpoint, etc.
However, this is not sufficient. If we vclean() the vnode the pages owned
by the vnode are lost, potentially while buffers reference them. Implement
parts of vclean() separately in vgonechrl() so that the pages and bufs
associated with a device vnode are not destroyed while in use.
e35dcab926fd190efecffaa0c03ff365770b53cc 30-Nov-2003 jeff <jeff@FreeBSD.org> - Don't forget to unlock the vnode interlock in the LK_NOWAIT case.

Submitted by: Stephan Uphoff <ups@stups.com>
Approved by: re (rwatson)
7eade05dfa5c79c8765c89ae76635f31451fe886 09-Nov-2003 tanimura <tanimura@FreeBSD.org> - Implement selwakeuppri() which allows raising the priority of a
thread being woken up. The woken thread can run at a priority as
high as after tsleep().

- Replace selwakeup()s with selwakeuppri()s and pass appropriate

- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.

Not objected in: -arch, -current
36d60f3bb735f38bbec69f4cc40ef27a24629c54 05-Nov-2003 kan <kan@FreeBSD.org> Remove mntvnode_mtx and replace it with per-mountpoint mutex.
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.

Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.

Discussed with: jeff
dabc1a332d9183829807d43facdf8e41c5005396 23-Oct-2003 wollman <wollman@FreeBSD.org> Add appropriate const poisoning to the assert_*locked() family so that I can
call ASSERT_VOP_LOCKED(vp, __func__) without a diagnostic.

Inspired by: the evil and rude OpenAFS cache manager code
512489f3013cd39afcf15c4934ce2af09940d38e 20-Oct-2003 alc <alc@FreeBSD.org> Initialize the buf's b_object in pbgetvp(). Clear it in pbrelvp(). (This
facilitates synchronization of the vm page's valid field using the
vm object's lock.)

Suggested by: tegge
888092f3177624c3c2e4fefc32d70472e7640a0f 17-Oct-2003 phk <phk@FreeBSD.org> Simplify count_dev()
edaf9214c4b4dc701135ce636be7e682e3922740 12-Oct-2003 phk <phk@FreeBSD.org> Simplify vn_isdisk() a bit.
5b9cc4b22e6e819a64dfbf8705d2602d2bab874a 11-Oct-2003 jeff <jeff@FreeBSD.org> - Fix a typo, I meant & and not |. This was causing lockups from the syncer
looping forever due to list corruption.

Solved by: tegge
2c3fea92c8d026d10b303e99639cc0b6fcad3e56 05-Oct-2003 jeff <jeff@FreeBSD.org> - Fix an XXX. Check the error of vn_lock() in vflush(). Don't specify
LK_RETRY either, we don't want this vnode if it turns into another.
- Remove the code that checks the mount point after acquiring the lock
we are guaranteed to either fail or get the vnode that we wanted.
6b8324e8418c1bac6312ec7563ead5159b038fb4 05-Oct-2003 jeff <jeff@FreeBSD.org> - Rename vcanrecycle() to vtryrecycle() to reflect its new role.
- In vtryrecycle() try to vgonel the vnode if all of the previous checks
passed. We won't vgonel if someone has either acquired a hold or usecount
or started the vgone process elsewhere. This is because we may have been
removed from the free list while we were inspecting the vnode for
- The VI_TRYLOCK stops two threads from entering getnewvnode() and recycling
the same vnode. To further reduce the likelihood of this event, requeue
the vnode on the tail of the list prior to calling vtryrecycle(). We can
not actually remove the vnode from the list until we know that it's
going to be recycled because other interlock holders may see the VI_FREE
flag and try to remove it from the free list.
- Kill a bogus XXX comment. If XLOCK is set we shouldn't wait for it
regardless of MNT_WAIT because the vnode does not actually belong to
this filesystem.
e15704d590a022640849657ff004d6a6c4ad8fb6 05-Oct-2003 jeff <jeff@FreeBSD.org> - Don't cache_purge() in getnewvnode. It's done in vclean(). With this
purge, the purge in vclean, and the filesystems purge, we had 3 purges
per vnode.
- Move the insmntque(vp, 0) to vclean() so that we may remove it from the
two vgone() functions and reduce the number of lock operations required.
8d0f78003f6eab88720d169d6ea2e52de66708f0 05-Oct-2003 jeff <jeff@FreeBSD.org> - Solve a LOR with the sync_mtx by using the VI_ONWORKLST flag to determine
whether or not the sync failed. This could potentially get set between
the time that we VOP_UNLOCK and VI_LOCK() but the race would harmlessly
lead to the sync being delayed by an extra 30 seconds. If we do not move
the vnode it could cause an endless loop if it continues to fail to sync.
- Use vhold and vdrop to stop the vnode from changing identities while we
have it unlocked. Other internal vfs lists are likely to follow this
8d72c437638263277c823c617471854787c92fc7 05-Oct-2003 jeff <jeff@FreeBSD.org> - Move the xlock 'locking' code into vx_lock() and vx_unlock().
- Create a new function, vgonechrl(), which performs vgone for an in-use
character device. Move the code from vflush() that did this into
- Hold the xlock across the entirety of vgonel() and vgonechrl() so that
at no point will an invalid vnode exist on any list without XLOCK set.
- Move the xlock code out of vclean() now that it is in the vgone*()
55547647ecb82bcf7ff95e5f63008eba907da828 04-Oct-2003 jeff <jeff@FreeBSD.org> - In sched_sync() test our preconditions prior to dropping the sync_mtx.
This is so that we may grab the interlock while still holding the
sync_mtx. We have to VI_TRYLOCK() because in all other cases the lock
order runs the other way.
- If we don't meet any of the preconditions, reinsert the vp into the
list for the next second.
- We don't need to panic if we fail to sync here because each FSYNC
function handles this case. Removing this redundant code also
simplifies locking.
bb400139122e7bbd3615ad11a450cc9b8428cb92 04-Oct-2003 jeff <jeff@FreeBSD.org> - In a Giantless world, the vn_lock() in vcanrecycle() could legitimately
fail. Remove the panic from that case and document why it might fail.
- Document the reason for calling cache_purge() on a newly created vnode.
- In insmntque() order the operations so that we can call mtx_unlock()
one fewer times. This makes the code somewhat clearer as well.
- Add XXX comments in sched_sync() and vflush().
- In vget(), do not sleep while waiting for XLOCK to clear if LK_NOWAIT is
- In vclean() we don't need to acquire a lock around a single TAILQ_FIRST
call. It's ok if we race here, the vinvalbuf will just do nothing.
- Increase the scope of the lock in vgonel() to reduce the number of lock
operations that are performed.
517dcea6c887615d6df5b301e114c77ba2c25a87 20-Sep-2003 jeff <jeff@FreeBSD.org> - In reassignbuf() don't unlock vp and lock newvp if they are the same.
Doing so creates a race where the buf is on neither list.
- Only vfree() in an error case in vclean() if VSHOULDFREE() thinks we
- Convert the error case in vclean() to INVARIANTS from DIAGNOSTIC as this
really should not happen and is fast to check.
45f3b1b270fcb49574340e636d50c7a1b645a115 19-Sep-2003 jeff <jeff@FreeBSD.org> - Remove spls(). The locking that has replaced them is in place and they
no longer serve as guidelines for future work.
cf77f9f005002ec560d7731f3cf05a20f573f95c 19-Sep-2003 kan <kan@FreeBSD.org> Eliminate one case of VI_UNLOCK followed by an immediate
37641f86f1a209d796b3679ab72c92f2ace89fb7 07-Aug-2003 jhb <jhb@FreeBSD.org> Consistently use the BSD u_int and u_short instead of the SYSV uint and
ushort. In most of these files, there was a mixture of both styles and
this change just makes them self-consistent.

Requested by: bde (kern_ktrace.c)
eb30c92e4910efd8f5aa99168e8b74d8414f3eb1 22-Jul-2003 phk <phk@FreeBSD.org> Revert stuff which accidentally ended up in the previous commit.
c4a9334fa698660a5dd1a0c4fddb61ed0893fc58 22-Jul-2003 phk <phk@FreeBSD.org> Don't attempt to inline large functions mb_alloc() and mb_free(),
it more than doubles the text size of this file.

GCC has wisely ignored us on this previously
3b8fff9e4cedc4d9df3fb1ff39f5b668abdb9676 11-Jun-2003 obrien <obrien@FreeBSD.org> Use __FBSDID().
ddccb1d2874e518630a10fcca97810a4d6805d37 31-May-2003 phk <phk@FreeBSD.org> Remove unused variable and now unbalanced call to splbio();

Found by: FlexeLint
53638c7027cec977f4b8b80b0c575119b1a1af75 23-May-2003 alc <alc@FreeBSD.org> Make the maximum number of vnodes a function of both the physical memory
size and the kernel's heap size, specifically, vm_kmem_size. This
function allows a maximum of 40% of the vm_kmem_size to be used for
vnodes and vm objects. This is a conservative bound based upon recent
problem reports. (In other words, a slight increase in this percentage
may be safe.)

Finally, machines with less than ~3GB of RAM should be unaffected
by this change, i.e., the maximum number of vnodes should remain
the same. If necessary, machines with 3GB or more of RAM can increase
the maximum number of vnodes by increasing vm_kmem_size.

Desired by: scottl
Tested by: jake
Approved by: re (rwatson,scottl)
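
The 40% bound described above can be illustrated with a small C sketch. It assumes the cap is the smaller of a physical-memory-based limit and two fifths of the kernel heap divided by the per-vnode footprint. The function name vnode_cap() and all three parameters are hypothetical; the exact expression used in the kernel is not given in this log.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical sketch: bound the vnode count by both a limit derived
 * from physical memory and 40% of the kernel heap.
 *
 * physmem_limit   - vnode limit derived from physical memory (assumed input)
 * kmem_size       - kernel heap size in bytes (stands in for vm_kmem_size)
 * bytes_per_vnode - combined size of one vnode and one vm object
 */
static unsigned long
vnode_cap(unsigned long physmem_limit, unsigned long kmem_size,
    size_t bytes_per_vnode)
{
	unsigned long kmem_limit;

	assert(bytes_per_vnode > 0);
	/* 40% == 2/5; divide first to avoid overflowing on large heaps. */
	kmem_limit = kmem_size / 5 * 2 / bytes_per_vnode;
	return (physmem_limit < kmem_limit ? physmem_limit : kmem_limit);
}
```

When the heap-based limit is the larger of the two, the result is simply the physical-memory limit, which matches the commit's note that smaller machines should be unaffected.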
80040f21a37a2d2eef35514413b78a71a3e10e1a 16-May-2003 truckman <truckman@FreeBSD.org> Detect that a vnode has been reclaimed while vflush() was waiting to lock
the vnode and restart the loop. Vflush() is vulnerable since it does not
hold a reference to the vnode and it holds no other locks while waiting
for the vnode lock. The vnode will no longer be on the list when the
loop is restarted.

Approved by: re (rwatson)
0422418ef409d77459414f521175563ac8f3100f 13-May-2003 alc <alc@FreeBSD.org> Optimize the use of splay in gbincore(). During a "make buildworld" the
desired buffer is found at one of the roots more than 60% of the time.
Thus, checking both roots before performing either splay eliminates
unnecessary splays on the first tree splayed.

Approved by: re (jhb)
c21d149f29c38c8ba5f7b5a459b8087f179b6e0c 12-May-2003 rwatson <rwatson@FreeBSD.org> Remove bogus locking from DDB's "show lockedvnods" command: using
synchronization primitives from inside DDB is generally a bad idea,
and in this case it frequently results in panics due to DDB commands
being executed from the sio fast interrupt context on a serial
console. Replace the locking with a note that a lack of locking
means that DDB may see inconsistent views of the mount and vnode
lists, which could also result in a panic. More frequently,
though, this avoids a panic than causes it.

Discussed with ages ago: bde
Approved by: re (scottl)
410b675ed9777017eff651997cc5bcd53a747cb2 03-May-2003 alc <alc@FreeBSD.org> - Revert kern/vfs_subr.c revision 1.444. The vm_object's size isn't
trustworthy for vnode-backed objects.
- Restore the old behavior of vm_object_page_remove() when the end
of the given range is zero. Add a comment to vm_object_page_remove()
regarding this behavior.

Reported by: iedowse
e9c4374a870ff015fdecd6820a2b46caaa97bb4f 01-May-2003 alc <alc@FreeBSD.org> Lock accesses to the vm_object's ref_count and resident_page_count.
d5ac0bc4531af5962cb46af2eb16a5cc61170717 26-Apr-2003 alc <alc@FreeBSD.org> Various changes to vm_object_page_remove():
- Eliminate an odd, special-case feature:
if start == end == 0 then all pages are removed. Only one caller
used this feature and that caller can trivially pass the object's
- Assert that the vm_object is locked on entry; don't bother testing
for a NULL vm_object.
- Style: Fix lines that are longer than 80 characters.
373b18b5c3e31cb4f3cb8521afee7c1c65ef6143 26-Apr-2003 alc <alc@FreeBSD.org> - Convert vm_object_pip_wait() from using tsleep() to msleep().
- Make vm_object_pip_sleep() static.
- Lock the vm_object when performing vm_object_pip_wait().
87da2c3cf37429dd7554a13d8b70a046ccc3edcf 24-Apr-2003 alc <alc@FreeBSD.org> - Acquire the vm_object's lock when performing vm_object_page_clean().
- Add a parameter to vm_pageout_flush() that tells vm_pageout_flush()
whether its caller has locked the vm_object. (This is a temporary
measure to bootstrap vm_object locking.)
83fe46be1839e24062ca7d8914d86a3a2142e114 18-Apr-2003 alc <alc@FreeBSD.org> Update locking around vm_object_page_remove() to use the new macros.
227f7746c40cbcd9836592987094174c1cc11b90 13-Apr-2003 alc <alc@FreeBSD.org> Use vm_object_pip_wait() rather than reimplementing it.
5e14826743f78c1b683ca05407f54b99a075c1de 26-Mar-2003 tegge <tegge@FreeBSD.org> Adjust the number of vnodes scanned by vlrureclaim() according to the
size of the vnode list.
f9968b4d9f2a3eda78b16752959ad5181f4eed5c 22-Mar-2003 yar <yar@FreeBSD.org> We shouldn't assert that a vnode is locked in vop_lock_post()
if VOP_LOCK() has failed.

Reviewed by: jeff
ec5374265bec154d9330f861f2a089de4212a69d 13-Mar-2003 jeff <jeff@FreeBSD.org> - Remove a dead check for bp->b_vp == vp in vtruncbuf(). This has not been
possible for some time.
- Lock the buf before accessing fields. This should very rarely be locked.
- Assert that B_DELWRI is set after we acquire the buf. This should always
be the case now.
ae3c8799daed67832db23a2957d5f5e47250cad9 13-Mar-2003 jeff <jeff@FreeBSD.org> - Remove a race between fsync like functions and flushbufqueues() by
requiring locked bufs in vfs_bio_awrite(). Previously the buf could
have been written out by fsync before we acquired the buf lock if it
weren't for giant. The cluster_wbuild() handles this race properly but
the single write at the end of vfs_bio_awrite() would not.
- Modify flushbufqueues() so there is only one copy of the loop. Pass a
parameter in that says whether or not we should sync bufs with deps.
- Call flushbufqueues() a second time and then break if we couldn't find
any bufs without deps.
c50367da676eaa300253058b5594f783a3db948b 06-Mar-2003 alc <alc@FreeBSD.org> Remove ENABLE_VFS_IOOPT. It is a long unfinished work-in-progress.

Discussed on: arch@
5a225ad93319945ddf1088461e0b574dd6daf1f5 03-Mar-2003 njl <njl@FreeBSD.org> Finish cleanup of vprint() which was begun with changing v_tag to a string.
Remove extraneous uses of vop_null, instead deferring to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.
8e95e9172221aabf5978f7f50ae2308a105d6f6c 02-Mar-2003 jeff <jeff@FreeBSD.org> - Hold the vnode interlock across calls to bgetvp instead of acquiring it
internally. This is required to stop multiple bufs from being associated
with a single lblkno.
98d7696db02c68546b76fc3c542787bde97d6fe7 01-Mar-2003 jeff <jeff@FreeBSD.org> - gc USE_BUFHASH. The smp locking of the buf cache renders this useless.
6e9f6f2d6d91a3e8de65a21ee6585d75c52d8d56 25-Feb-2003 mckusick <mckusick@FreeBSD.org> Prevent large files from monopolizing the system buffers. Keep
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.
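
A minimal sketch of the escalation described above, under stated
assumptions: the 90% figure comes from this message, while the names
toy_dirtybufthresh() and toy_bdwrite_policy() and the slack margin that
decides when to fall back to a full fsync are hypothetical.

```c
#include <assert.h>

enum toy_action { TOY_NONE, TOY_WRITE_ONE, TOY_FSYNC };

/* Threshold at 90% of the high-water mark, as stated in the message. */
int
toy_dirtybufthresh(int hidirtybuffers)
{
	return (hidirtybuffers * 9 / 10);
}

/*
 * Decide what to do after a delayed write on a vnode.  The slack
 * margin is an assumption made for this sketch, not a kernel value.
 */
enum toy_action
toy_bdwrite_policy(int vnode_dirty, int thresh, int slack)
{
	assert(thresh >= 0 && slack >= 0);
	if (vnode_dirty > thresh + slack)
		return (TOY_FSYNC);	/* drastic: sync the whole vnode */
	if (vnode_dirty > thresh)
		return (TOY_WRITE_ONE);	/* write one other dirty buffer */
	return (TOY_NONE);
}
```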

Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.

Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
9e4c9a6ce908881b1e6f83cbb906a9fce08dd3ab 25-Feb-2003 jeff <jeff@FreeBSD.org> - Add an interlock argument to BUF_LOCK and BUF_TIMELOCK.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.

Reviewed by: arch, mckusick
af9c7adfc31b63e403f969e5d819ab67553f7d6f 23-Feb-2003 phk <phk@FreeBSD.org> Bracket the kern.vnode sysctl in #ifdef notyet because it results
in massive locking issues on diskless systems.

It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.
cf874b345d0f766fb64cf4737e1c85ccc78d2bee 19-Feb-2003 imp <imp@FreeBSD.org> Back out M_* changes, per decision of the TRB.

Approved by: trb
bf8e8a6e8f0bd9165109f0a258730dd242299815 21-Jan-2003 alfred <alfred@FreeBSD.org> Remove M_TRYWAIT/M_WAITOK/M_WAIT. Callers should use 0.
Merge M_NOWAIT/M_DONTWAIT into a single flag M_NOWAIT.
1ec5f03e6d974f74f3960788e6bf478e5cce3bf1 29-Dec-2002 iedowse <iedowse@FreeBSD.org> Add a new vnode flag VI_DOINGINACT to indicate that a VOP_INACTIVE
call is in progress on the vnode. When vput() or vrele() sees a
1->0 reference count transition, it now returns without any further
action if this flag is set. This flag is necessary to avoid recursion
into VOP_INACTIVE if the filesystem inactive routine causes the
reference count to increase and then drop back to zero. It is also
used to guarantee that an unlocked vnode will not be recycled while
blocked in VOP_INACTIVE().
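
The recursion guard can be illustrated with a toy model. Everything here
is hypothetical (struct toy_vnode, toy_inactive(), toy_vrele()); the
doing_inact field merely plays the role of VI_DOINGINACT, and locking is
left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_vnode {
	int usecount;
	bool doing_inact;	/* plays the role of VI_DOINGINACT */
	int inactive_calls;
};

void toy_vrele(struct toy_vnode *vp);

/*
 * Hypothetical inactive routine that takes and drops a reference,
 * as the message describes for the soft updates and NFS cases.
 */
static void
toy_inactive(struct toy_vnode *vp)
{
	vp->inactive_calls++;
	vp->usecount++;		/* like vget() */
	toy_vrele(vp);		/* 1->0 again, while still inside inactive */
}

void
toy_vrele(struct toy_vnode *vp)
{
	assert(vp->usecount > 0);
	if (--vp->usecount > 0)
		return;
	if (vp->doing_inact)
		return;		/* already in the inactive routine: do nothing */
	vp->doing_inact = true;
	toy_inactive(vp);
	vp->doing_inact = false;
}
```

Without the doing_inact test, the nested toy_vrele() would call
toy_inactive() again and never terminate.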

There are at least two cases where the recursion can occur: one is
that the softupdates code called by ufs_inactive() via ffs_truncate()
can call vput() on the vnode. This has been reported by many people
as "lockmgr: draining against myself" panics. The other case is
that nfs_inactive() can call vget() and then vrele() on the vnode
to clean up a sillyrename file.

Reviewed by: mckusick (an older version of the patch)
2eae537376d77fb65bf39a558e5f769a5dac8ca7 29-Dec-2002 phk <phk@FreeBSD.org> Use a timeout of one second while we wait for the vnode washer,
this prevents a potential race and makes the system a little bit
less jerky under extreme loads.
90510abb6e701df9b6cdee3cd888c20b2e81daba 29-Dec-2002 phk <phk@FreeBSD.org> Vnodes pull in 800-900 bytes these days, all things counted, so we need
to treat desiredvnodes much more like a limit than as a vague concept.

On a 2GB RAM machine where desired vnodes is 130k, we run out of
kmem_map space when we hit about 190k vnodes.

If we wake up the vnode washer in getnewvnode(), sleep until it is done,
so that it has a chance to offer us a washed vnode. If we don't sleep
here we'll just race ahead and allocate yet another vnode which will never
get freed.

In the vnodewasher, instead of doing 10 vnodes per mountpoint per
rotation, do 10% of the vnodes distributed evenly across the
1496f0639dbcd32325c411769b38b1ad0442b134 28-Dec-2002 phk <phk@FreeBSD.org> KASSERT that vop_revoke() gets a VCHR.
6be448f26468f28f11e6cc26927f2108360832cf 15-Dec-2002 alc <alc@FreeBSD.org> Perform vm_object_lock() and vm_object_unlock() around
a7482ae294664d142373a8b5ad6022c247d63528 08-Dec-2002 alc <alc@FreeBSD.org> To avoid lock order reversals in getnewvnode(), the call to uma_zfree()
must be delayed until the vnode interlock is released.

Reported by: kris@
Approved by: re (jhb)
b4c9e243034607978f64c555ada7cb0d0ee455b4 27-Nov-2002 robert <robert@FreeBSD.org> Do not set a variable (vp->p_pollinfo) to NULL if we know
it already has that value.

Approved by: re
312cab0dee67b902f2b3c5b4d8873b978e5f0191 26-Oct-2002 rwatson <rwatson@FreeBSD.org> Slightly change the semantics of vnode labels for MAC: rather than
"refreshing" the label on the vnode before use, just get the label
right from inception. For single-label file systems, set the label
in the generic VFS getnewvnode() code; for multi-label file systems,
leave the labeling up to the file system. With UFS1/2, this means
reading the extended attribute during vfs_vget() as the inode is
pulled off disk, rather than hitting the extended attributes
frequently during operations later, improving performance. This
also corrects semantics for shared vnode locks, which were not
previously present in the system. This changes the cache
coherency properties WRT out-of-band access to label data, but in
an acceptable form. With UFS1, there is a small race condition
during automatic extended attribute start -- this is not present
with UFS2, and occurs because EAs aren't available at vnode
inception. We'll introduce a work around for this shortly.

Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
7320ebcf72f1e644f52ab27e6940a1d528ecd0f4 25-Oct-2002 phk <phk@FreeBSD.org> In vrele() we can actually have a VCHR with v_rdev == NULL if we
came from the bottom of addaliasu(). Don't panic.
6b1611bd949afb58a84ad54e4bbcf960b0ef28b1 25-Oct-2002 mckusick <mckusick@FreeBSD.org> Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by: DARPA & NAI Labs.
a3766d9d16bc89805a360820b49262aa915d639b 24-Oct-2002 phk <phk@FreeBSD.org> Fix the spechash lock order reversal by keeping an updated sum
of v_usecount in the dev_t which vcount() can return without
locking any vnodes.
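
The idea of a running sum can be sketched like this. The types and
function names are hypothetical and the field names are only
illustrative; the real code also has to handle locking, which is omitted.

```c
#include <assert.h>

/* Hypothetical device record holding the summed use count. */
struct toy_dev {
	int si_usecount;
};

/* Hypothetical device vnode; several may alias the same device. */
struct toy_devvnode {
	int v_usecount;
	struct toy_dev *v_rdev;
};

/* Every reference change updates the per-device sum as well. */
void
toy_vref(struct toy_devvnode *vp)
{
	vp->v_usecount++;
	vp->v_rdev->si_usecount++;
}

void
toy_vunref(struct toy_devvnode *vp)
{
	assert(vp->v_usecount > 0);
	vp->v_usecount--;
	vp->v_rdev->si_usecount--;
}

/* The total is read directly; no alias vnode has to be visited. */
int
toy_vcount(const struct toy_devvnode *vp)
{
	return (vp->v_rdev->si_usecount);
}
```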

Seen by: jhb
9d00a4a781eb15e7fa43415462c01f82087ded69 14-Oct-2002 mckusick <mckusick@FreeBSD.org> When scanning the freelist looking for candidate vnodes to recycle,
be sure to exit the loop with vp == NULL if no candidates are found.
Formerly, this bug would cause the last vnode inspected to be used,
even if it was not available. The result was a panic "vn_finished_write:
neg cnt".
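
The class of bug can be shown with a small stand-alone sketch. The names
struct toy_cand and toy_pick_candidate() are hypothetical, and an array
stands in for the free list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vnode on the free list. */
struct toy_cand {
	int available;
};

/*
 * Scan the candidates and return the first available one.  The
 * pointer is reset on every rejected entry, so a scan that finds
 * nothing ends with NULL instead of the last entry inspected.
 */
struct toy_cand *
toy_pick_candidate(struct toy_cand *list, size_t n)
{
	struct toy_cand *vp = NULL;
	size_t i;

	assert(list != NULL || n == 0);
	for (i = 0; i < n; i++) {
		vp = &list[i];
		if (vp->available)
			break;
		vp = NULL;	/* rejected: do not leak it to the caller */
	}
	return (vp);
}
```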

Sponsored by: DARPA & NAI Labs.
05ff8976a7d08218b1fd02fb650366f2deaa8765 14-Oct-2002 mckusick <mckusick@FreeBSD.org> Unconditionally reset vp->v_vnlock back to the default in the
vclean() function (e.g., vp->v_vnlock = &vp->v_lock) rather
than requiring filesystems that use alternate locks to do so
in their vop_reclaim functions. This change is a further cleanup
of the vop_stdlock interface.

Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sponsored by: DARPA & NAI Labs.
25230d4c6a8ce0a2007e1b2694fcc4ff0869e15c 14-Oct-2002 mckusick <mckusick@FreeBSD.org> Regularize the vop_stdlock'ing protocol across all the filesystems
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).
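
A toy model of the pointer indirection, with hypothetical names throughout
(struct toy_lock, struct toy_vn and the toy_* functions); the real lock is
a lockmgr lock, not an integer.

```c
#include <assert.h>
#include <stddef.h>

struct toy_lock {
	int held;
};

struct toy_vn {
	struct toy_lock v_lock;		/* default, embedded lock */
	struct toy_lock *v_vnlock;	/* lock the std functions use */
};

/* By default the pointer refers to the vnode's own lock. */
void
toy_getnewvnode(struct toy_vn *vp)
{
	vp->v_lock.held = 0;
	vp->v_vnlock = &vp->v_lock;
}

/* The std lock functions only ever follow the pointer. */
void
toy_stdlock(struct toy_vn *vp)
{
	assert(vp->v_vnlock != NULL);
	vp->v_vnlock->held = 1;
}

void
toy_stdunlock(struct toy_vn *vp)
{
	assert(vp->v_vnlock != NULL);
	vp->v_vnlock->held = 0;
}

/* On reclaim, point back at the default lock. */
void
toy_reclaim(struct toy_vn *vp)
{
	vp->v_vnlock = &vp->v_lock;
}
```

A stacked filesystem in this model would set the upper vnode's v_vnlock to
the lower vnode's lock, so that locking either one locks both.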

In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.

Sponsored by: DARPA & NAI Labs.
16ad96c43ce9e270c0bf2f3b58686de00fc36391 11-Oct-2002 mckusick <mckusick@FreeBSD.org> When considering a vnode for reuse in getnewvnode, we call
vcanrecycle to check a free vnode's availability. If it is
available, vcanrecycle returns an error code of zero and the
vnode in question locked. The getnewvnode routine then used
to call vn_start_write with the V_NOWAIT flag. If the filesystem
was suspended while taking a snapshot, the vn_start_write would
fail but getnewvnode would fail to unlock the vnode, instead
leaving it locked on the freelist. The result would be that the
vnode would be locked forever and would eventually hang the
system with a race to the root when it was attempted to recycle
it. This fix moves the vn_start_write check into vcanrecycle
where it will properly unlock the vnode if it is unavailable
for recycling due to filesystem suspension.

Sponsored by: DARPA & NAI Labs.
18d9db4bb5e43510810dd40acde3d990da838bf3 05-Oct-2002 sobomax <sobomax@FreeBSD.org> Fix problem introduced in rev.1.406, which can cause an already unlocked
mutex to be unlocked again, causing a system panic.
b55fa4540e1934439af02c5970859d779bba4360 01-Oct-2002 phk <phk@FreeBSD.org> Fix some harmless mis-indents.

Spotted by: FlexeLint
5d5060bddf1bf4c263e0d232b8a8c4f352c2f5f1 30-Sep-2002 rwatson <rwatson@FreeBSD.org> Move vnode MAC label initialization to after the release of the vnode
interlock in getnewvnode() to avoid possible sleeps while holding
the mutex. Note that the warning from Witness is a slight false
positive since we know there will be no contention on the interlock
since we haven't made the vnode available for use yet, but the theory
is not a bad one.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
1dfc2c167f0c0ee95c98088b2c05b50350b97ddb 28-Sep-2002 phk <phk@FreeBSD.org> Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by: FlexeLint warning #512
31b1ddae746006fc46febc4b1e86b3771da6163c 26-Sep-2002 jeff <jeff@FreeBSD.org> - Move ASSERT_VOP_*LOCK* functionality into functions in vfs_subr.c
- Make the VI asserts more orthogonal to the rest of the asserts by using a
new, common vfs_badlock() function and adding a 'str' arg.
- Adjust generated ASSERTS to match the new prototype.
- Adjust explicit ASSERTS to match the new prototype.
ee7cd9172dd5eea53b5dbaba9e95161c06ddb2cc 25-Sep-2002 jeff <jeff@FreeBSD.org> - Lock down the syncer with sync_mtx.
- Enable vfs_badlock_mutex by default.
- Assert that the vp is locked in VOP_UNLOCK.
- Use standard interlock macros in remaining code.
- Correct a race in getnewvnode().
- Lock access to v_numoutput with interlock.
- Lock access to buf lists and splay tree with interlock.
- Add VOP and VI asserts.
- Lock b_vnbufs with the vnode interlock.
- Add vrefcnt() for callers who want to retrieve the vnode ref without
holding a lock. Add a comment that describes when this is safe.
- Add vholdl() and vdropl() so that callers who already own the interlock
can avoid race conditions and unnecessary unlocking.
- Move the VOP_GETATTR() in vflush() into the WRITECLOSE conditional case.
- Hold the interlock before dropping the mntlist_mtx in vflush() to avoid
a race.
- Fix locking in vfs_msync().
00c79f5c92eb1bef381fbc7e267b6e746fab0f9c 18-Sep-2002 njl <njl@FreeBSD.org> Remove any VOP_PRINT that redundantly prints the tag.
Move lockmgr_printinfo() into vprint() for everyone's benefit.

Suggested by: bde
0590c43070aac7fb636a1f4c4b94469046a317a0 14-Sep-2002 njl <njl@FreeBSD.org> Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by: phk
Reviewed by: bde, rwatson (earlier version)
06f500f894a5fa364ff68ee549577b947f480478 11-Sep-2002 julian <julian@FreeBSD.org> Indentation does not make a block.. need curly braces too.
Submitted by: Eagle-eyes evans <bde@freebsd.org>
5702a380a53c99a00275cb7e2836033a7497bef8 11-Sep-2002 julian <julian@FreeBSD.org> Completely redo thread states.

Reviewed by: davidxu@freebsd.org
3303b3f624a2e38f031911047956bf9b836ed368 05-Sep-2002 phk <phk@FreeBSD.org> Fix an inherited style bug: compare with NOCRED instead of NULL.

Sponsored by: DARPA & NAI Labs.
55be95d1615c2305ac667d206ff3231fca09b5d8 05-Sep-2002 phk <phk@FreeBSD.org> Introduce new extattr_check_cred() function which implements the canonical
credential washing for extended attributes.

Sponsored by: DARPA & NAI Labs.
7dd9d470599f145845572ac1f0d4b621c19c1cdb 25-Aug-2002 charnier <charnier@FreeBSD.org> Replace various spellings with FALLTHROUGH which is lint()able
da601a39acf6f61f64f7bc6f86c7e7a2493eefd9 23-Aug-2002 jeff <jeff@FreeBSD.org> - Fix a mistake in my last few commits. The PDROP flag stops msleep from
re-acquiring the mutex.

Pointy hat to: me
Noticed by: tegge
6c5497f47a5a81903384421e5b55e10c76905d75 22-Aug-2002 jeff <jeff@FreeBSD.org> - Make vn_lock() vget() and VOP_LOCK() all behave the same way WRT
LK_INTERLOCK. The interlock will never be held on return from these
functions even when there is an error. Errors typically only occur when
the XLOCK is held which means this isn't the vnode we want anyway. Almost
all users of these interfaces expected this behavior even though it was
not provided before.
1e39ba86206301d7f569b3f402577d596d55faa0 22-Aug-2002 jeff <jeff@FreeBSD.org> - Fix interlock handling in vn_lock(). Previously, vn_lock() could return
with interlock held in error conditions when the caller did not specify
- Add several comments to vn_lock() describing the rationale behind the code
flow since it was not immediately obvious.
275611472a7bb4a878b0f78838649ad45de960ec 21-Aug-2002 jeff <jeff@FreeBSD.org> - Document two cases, one in vget and the other in vn_lock, where the state
of interlock on exit is not consistent. There are probably several bugs
relating to this.
ca5f1feb36236af0004e071de7f9232aa3893c56 21-Aug-2002 jeff <jeff@FreeBSD.org> - If vn_lock fails with the LK_INTERLOCK flag set, interlock will not be
released. vcanrecycle() failed to unlock interlock under this condition.
- Remove an extra VOP_UNLOCK from a failure case in vcanrecycle().

Pointed out by: rwatson
2fc7835d260e2782c1016810b29130ac6cfbf26e 21-Aug-2002 jeff <jeff@FreeBSD.org> - Add two new debugging macros: ASSERT_VI_LOCKED and ASSERT_VI_UNLOCKED
- Use the new VI asserts in place of the old mtx_assert checks.
- Add the VI asserts to the automated lock checking in the VOP calls. The
interlock should not be held across vops with a few exceptions.
- Add the vop_(un)lock_{pre,post} functions to assert that interlock is held
when LK_INTERLOCK is set.
d18378e088f5d3ff05499097373f285b9b675338 13-Aug-2002 jeff <jeff@FreeBSD.org> - Extend the vnode_free_list_mtx to cover numvnodes and freevnodes. This
was done only some of the time before, and now it is uniformly applied.
f43070c32510080cd90cfad7384d00f798c8d852 10-Aug-2002 mux <mux@FreeBSD.org> - Introduce a new struct xvfsconf, the userland version of struct vfsconf.
- Make getvfsbyname() take a struct xvfsconf *.
- Convert several consumers of getvfsbyname() to use struct xvfsconf.
- Correct the getvfsbyname.3 manpage.
- Create a new vfs.conflist sysctl to dump all the struct xvfsconf in the
kernel, and rewrite getvfsbyname() to use this instead of the weird
existing API.
- Convert some {set,get,end}vfsent() consumers to use the new vfs.conflist
- Convert a vfsload() call in nfsiod.c to kldload() and remove the useless
vfsisloadable() and endvfsent() calls.
- Add a warning printf() in vfs_sysctl() to tell people they are using
an old userland.

After these changes, it's possible to modify struct vfsconf without
breaking the binary compatibility. Please note that these changes don't
break this compatibility either.

Once bp has updated mount_smbfs(8) with the patch I sent him, there
will be no more consumers of the {set,get,end}vfsent(), vfsisloadable()
and vfsload() API, and I will promptly delete it.
f91961bfedba6d6ec499013b1e1be16adc87734e 05-Aug-2002 jeff <jeff@FreeBSD.org> - Move some logic from getnewvnode() to a new function vcanrecycle()
- Unlock the free list mutex around vcanrecycle to prevent a lock order
02517b6731ab2da44ce9b49260429744cf0114d5 04-Aug-2002 jeff <jeff@FreeBSD.org> - Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
- Many functions in vfs_subr.c were restructured to provide for stronger

Idea stolen from: BSD/OS
a5dcc1fd3d5e9f463223566c3815b15cd9f304cc 01-Aug-2002 rwatson <rwatson@FreeBSD.org> Include file cleanup; mac.h and malloc.h at one point had ordering
relationship requirements, and no longer do.

Reminded by: bde
2ca172b7258cc03a502ae5ba4fcc74b22ec663fa 31-Jul-2002 des <des@FreeBSD.org> Nit in previous commit: the correct sysctl type is "S,xvnode"
9c7ec035025bb78115873c0b6219d41bb80cd997 31-Jul-2002 des <des@FreeBSD.org> Initialize v_cachedid to -1 in getnewvnode().
Reintroduce the kern.vnode sysctl and make it export xvnodes rather than

Sponsored by: DARPA, NAI Labs
6bb9b1da05cc9bdaeba479dadad0f58b3ed8cd59 31-Jul-2002 rwatson <rwatson@FreeBSD.org> Note that the privilege indicating flag to vaccess() originally used
by the process accounting system is now deprecated.
261170743ff711ddf5d9f5130927a9c19cc94385 31-Jul-2002 rwatson <rwatson@FreeBSD.org> Introduce support for Mandatory Access Control and extensible
kernel access control.

Invoke the necessary MAC entry points to maintain labels on vnodes.
In particular, initialize the label when the vnode is allocated or
reused, and destroy the label when the vnode is going to be released,
or reused. Wow, an object where there really is exactly one place
where it's allocated, and one other where it's freed. Amazing.

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
5dce00d8f10cc8069a67ca4bc4a8100a64145dd4 29-Jul-2002 jeff <jeff@FreeBSD.org> - Backout the patch made in revision 1.75 of vfs_mount.c. The vputs here
were hiding the real problem of the missing unlock in sync_inactive.
- Add the missing unlock in sync_inactive.

Submitted by: iedowse
b1555a27432152cc7817d0c592bbcebf95fdfd19 28-Jul-2002 truckman <truckman@FreeBSD.org> Wire the sysctl output buffer before grabbing any locks to prevent
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
7be639a7c001f8dc5f8e7b6d1951845156bb997d 22-Jul-2002 rwatson <rwatson@FreeBSD.org> Teach discretionary access control methods for files about VAPPEND

Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
b44cb5787c401c1aaf5bbd0e42c211579b8efc31 19-Jul-2002 mckusick <mckusick@FreeBSD.org> Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.
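
The read/write rules above can be sketched as a small normalization step.
The flag values and the name toy_rw_ioflags() are hypothetical; only the
rules themselves come from this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define TOY_IO_EXT	0x0400u
#define TOY_IO_NORMAL	0x0800u

/*
 * Normalize the flags for a read or write: both flags together are
 * rejected, and neither flag means the normal data area.
 * Returns 0 on success, -1 if the combination is not allowed.
 */
int
toy_rw_ioflags(unsigned ioflags, unsigned *out)
{
	unsigned which = ioflags & (TOY_IO_NORMAL | TOY_IO_EXT);

	assert(out != NULL);
	if (which == (TOY_IO_NORMAL | TOY_IO_EXT))
		return (-1);		/* mutually exclusive here */
	if (which == 0)
		ioflags |= TOY_IO_NORMAL;	/* backward compatibility */
	*out = ioflags;
	return (0);
}
```

A truncate path would skip the exclusivity check, since both areas may be
truncated in one call.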

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

	if (ioflags & IO_SYNC)
		flags |= BA_SYNC;
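
The disjoint bit ranges that make the merge possible can be sketched as
follows. All constants and function names here are hypothetical; the real
values are defined in the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values for illustration only. */
#define TOY_IO_SYNC	0x00000080u	/* an IO_-style flag, low half */
#define TOY_BA_CLRBUF	0x00010000u	/* a BA_-style flag, high half */
#define TOY_IO_MASK	0x0000ffffu
#define TOY_BA_MASK	0xffff0000u

/* Combine both families into one word; the halves cannot collide. */
uint32_t
toy_merge_flags(uint32_t ioflags, uint32_t baflags)
{
	assert((ioflags & ~TOY_IO_MASK) == 0);
	assert((baflags & ~TOY_BA_MASK) == 0);
	return (ioflags | baflags);
}

/* Either family can be recovered from the merged word with a mask. */
uint32_t
toy_io_part(uint32_t flags)
{
	return (flags & TOY_IO_MASK);
}

uint32_t
toy_ba_part(uint32_t flags)
{
	return (flags & TOY_BA_MASK);
}
```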

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, let's call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
attribute storage.

Sponsored by: DARPA & NAI Labs.
3abb526f86a27b005f352fb91f605228876fa8f7 17-Jul-2002 mckusick <mckusick@FreeBSD.org> Change utimes to set the file creation time (for filesystems that
support creation times such as UFS2) to the value of the
modification time if the value of the modification time is older
than the current creation time. See utimes(2) for further details.

Sponsored by: DARPA & NAI Labs.
da4e111a550227f9bab8d24999ea76a11cc5b780 10-Jul-2002 dillon <dillon@FreeBSD.org> Replace the global buffer hash table with per-vnode splay trees using a
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on. Extensive testing appears to show no
increase in overhead.

Dirties more cache lines during lookups.

Not as fast as a hash table lookup (but still N log N and optimal
when there is locality of reference).

vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
syncer operate more efficiently.

I get to rip out all the old hacks (some of which were mine) that tried
to keep the v_dirtyblkhd tailq sorted.

The per-vnode splay tree should be easier to lock / SMPng pushdown on
vnodes will be easier.

This commit along with another that Alan is working on for the VM page
global hash table will allow me to implement ranged fsync(), optimize
server-side nfs commit rpcs, and implement partial syncs by the
filesystem syncer (aka filesystem syncer would detect that someone is
trying to get the vnode lock, remembers its place, and skip to the
next vnode).

Note that the buffer cache splay is somewhat more complex than other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.

Suggested by: alc
fe9018671a29cc47f4fa9d86da09f152434be386 09-Jul-2002 jeff <jeff@FreeBSD.org> - Use standard locking functions in syncer's opv
- vput instead of vrele syncer vnodes in vfs_mount
- Add vop_lookup_{pre,post} to verify locking in VOP_LOOKUP
cca3a0ef3d4f089d8381461b007fc544f7c4026a 07-Jul-2002 jeff <jeff@FreeBSD.org> - Don't hold the vn lock while calling VOP_CLOSE in vclean().
8bf1a039cbc4ffad07555a6fbdf870dab6537d9a 07-Jul-2002 jeff <jeff@FreeBSD.org> - BUF_REFCNT() seems to be the preferred method for verifying a locked buf.
Tell vop_strategy_pre() to use this instead.
- Ignore B_CLUSTER bufs. Their components are locked but they don't really
exist so they don't have to be. This isn't ideal but it is safe.
f1b0400267ed5a6ad5d01b82e0b2c2d50adc97e4 06-Jul-2002 jeff <jeff@FreeBSD.org> Fix a mistake in my last commit. Don't grab an extra reference to the object
in bp->b_object.
0dd7645264ca202a719af56ac11adbf83a664a12 06-Jul-2002 jeff <jeff@FreeBSD.org> Fixup uses of GETVOBJECT.

- Cache a pointer to the vnode's object in the buf.
- Hold a reference to that object in addition to the vnode's reference just
to be consistent.
- Cleanup code that got the object indirectly through the vp and VOP calls.

This fixes at least one case where we were calling GETVOBJECT without a lock.
It also avoids an expensive layered call at the cost of another pointer in
struct buf.
908b0eb9a78bba073426ca3522bc7b515b842766 06-Jul-2002 jeff <jeff@FreeBSD.org> - Add vop_strategy_pre to validate VOP_STRATEGY locking.
- Disable original vop_strategy lock specification.
- Switch to the new vop_strategy_pre for lock validation.

VOP_STRATEGY requires only that the buf is locked UNLESS the block numbers need
to be translated. There may be other reasons, but as long as the underlying
layer uses a VOP to perform the operations they will be caught later.
3bce786a77bacda7dff2776820c46922a5505154 06-Jul-2002 jeff <jeff@FreeBSD.org> Add "vop_rename_pre" to do pre rename lock verification. This is enabled only
4f6ffa4183037ce887b9b69664f8eed5e8020022 03-Jul-2002 mux <mux@FreeBSD.org> Move vfs_rootmountalloc() in vfs_mount.c and remove lite2_vfs_mountroot()
which was #if 0'd and is not likely to be used now.
eb5a0f4a7e3bc12038c158f02c57a11ec098b5a8 02-Jul-2002 mux <mux@FreeBSD.org> Move all code related to mount(2) into a new file, vfs_mount.c.
The file vfs_conf.c which was dealing with root mounting has
been repo-copied into vfs_mount.c to preserve history.
This makes nmount-related development easier, and helps reduce
the size of vfs_syscalls.c, which is still an enormous file.

Reviewed by: rwatson
Repo-copy by: peter
4416f8270661dceb52114032c53cec62128437ba 01-Jul-2002 iedowse <iedowse@FreeBSD.org> Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by: mckusick
4db8ac83cb8d19620d8dc0773dc0592713a2a5ac 29-Jun-2002 obrien <obrien@FreeBSD.org> Rename the db command lockedvnodes to lockedvnods so that it fits on the
help screen and one doesn't think we have a lockedvnodesmap command.
708aac7550fba9994c498217bfe6f0446ff9f962 28-Jun-2002 alfred <alfred@FreeBSD.org> nuke caddr_t.
1ed9e0f3759b38531dd3735e14ab04f36afb0941 28-Jun-2002 jeff <jeff@FreeBSD.org> Improve the VOP locking asserts

- Add vfs_badlock_print to control whether or not we print lock violations
- Add vfs_badlock_panic to control whether we panic on lock violations

Both default to on to mimic the original behavior if DEBUG_VFS_LOCKS is on.
62d02a6b93413e38cea255ac4d89ff70ce01c5bc 28-Jun-2002 green <green@FreeBSD.org> Fix a case where a vnode got explicitly unlocked after the pointer to it
got set to NULL.

Revision 1.355: in the box
3770ca4156031fb85ea4a9d09d50e4de46db4236 20-Jun-2002 mux <mux@FreeBSD.org> Change the way we internally store the mount options to
a linked list. This is to allow the merging of the mount
options in the MNT_UPDATE case, as the current data structure
is unsuitable for this.

There are no functional differences in this commit.

Reviewed by: phk
49532dbc776e0ef91949a1d08a7ebcb85b7ec928 14-Jun-2002 mux <mux@FreeBSD.org> Change vfs_copyopt() so that the length argument passed to it
must be the exact same size as the mount option. This makes
vfs_copyopt() much more useful.
936333132d6fa08743617b6d66cfa51768bd6821 06-Jun-2002 des <des@FreeBSD.org> Move some sysctls from the debug tree to the vfs tree.
8aef2ace20ad053d0da4d8eb0cb07c51c0f8798e 06-Jun-2002 des <des@FreeBSD.org> Gratuitous whitespace cleanup.
28d42899b766c395e5a6476f5bfa88b1481a08c0 16-May-2002 trhodes <trhodes@FreeBSD.org> More s/file system/filesystem/g
84d9baf797fd0cb8bf24b083c62619d68490b6f3 16-May-2002 mux <mux@FreeBSD.org> o Fix vfs_copyopt(), the first argument to bcopy() is the source,
not the destination.
o Remove some code from vfs_getopt() which was making the interface
more complicated to use for a very slight gain.
74069a30eeaf885605bdc1fd76302640ecf6ae2e 07-May-2002 jeff <jeff@FreeBSD.org> Switch from just holding the interlock to holding the standard lock throughout
getnewvnode(). This is safer. In the future, we should investigate requiring
only the interlock to get the vnode object.
bfe0870a5677e80b9c46e2321657e424a6a20614 06-May-2002 jeff <jeff@FreeBSD.org> Hold the currently selected vnode's lock across the call to VOP_GETVOBJECT.

Don't try to create a vm object before the file system has a chance to finish
initializing it. This is incorrect for a number of reasons. Firstly, that
VOP requires a lock which the file system may not have initialized yet. Also,
open and others will create a vm object if it is necessary later.
5020d62430941b5ce79ce6a4fdb7a1d4f03cb13f 05-May-2002 phk <phk@FreeBSD.org> Expand the one-line function pbreassignbuf() in the only place it is or could
be used.
226cd40e3da30c9b459363d3da47dc3f4a309bf9 04-May-2002 dillon <dillon@FreeBSD.org> Remove obsolete code (that was already #if 0'd out).
Requested by: Hiten Pandya <hitmaster2k@yahoo.com>
db9aa81e239bb1c46b3b7ba560474cd954b78bf3 04-Apr-2002 jhb <jhb@FreeBSD.org> Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on: i386, alpha, sparc64
dc2e474f79c1287592679cd5e0c4c2307feccd60 01-Apr-2002 jhb <jhb@FreeBSD.org> Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on: smp@
124c6d3a2681db60183934e41d92ed78ad107c78 26-Mar-2002 mux <mux@FreeBSD.org> As discussed in -arch, add the new nmount(2) system call and the
new vfs_getopt()/vfs_copyopt() API. This is intended to be used
later, when there will be filesystems implementing the VFS_NMOUNT
operation. The mount(2) system call will disappear once all
filesystems have been converted to the new API. Documentation will
be committed in a while.

Reviewed by: phk
318cbeeecf54d416eb936f4bb65c00b18aab686b 20-Mar-2002 jeff <jeff@FreeBSD.org> Remove references to vm_zone.h and switch over to the new uma API.

Also, remove maxsockets. If you look carefully you'll notice that the old
zone allocator never honored this anyway.
357e37e023059920b1f80494e489797e2f69a3dd 19-Mar-2002 alfred <alfred@FreeBSD.org> Remove __P.
8d5b7b21f31ca8890b15763fcb98bc38d3f0cc26 05-Mar-2002 rwatson <rwatson@FreeBSD.org> Three p_ucred -> td_ucred's missed in jhb's earlier pass; all appear to
be safe.
3706cd350927f08fa8742cce9448c9ba8e4d6b2c 27-Feb-2002 jhb <jhb@FreeBSD.org> Simple p_ucred -> td_ucred changes to start using the per-thread ucred
68389bd8ba20d29b81d779b5922db3b6423ef1fc 18-Feb-2002 phk <phk@FreeBSD.org> Make v_addpollinfo() visible and non-inline.
Have callers only call it as needed.
Add necessary call in ufs_kqfilter().

Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>
68320d04d1ce77058a870d6070d61892172ef0ea 18-Feb-2002 phk <phk@FreeBSD.org> Remove yet another redundant VN_KNOTE() macro.
c2a47cdbe88de50d484d2cdb605874e1168626dc 17-Feb-2002 phk <phk@FreeBSD.org> Move the stuff related to select and poll out of struct vnode.
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.

This shaves about 60 bytes off struct vnode, which on my laptop means
600k less RAM used for vnodes.
4289b50433ef09f1bfa03246de59392ffe6bb601 07-Feb-2002 peter <peter@FreeBSD.org> Fix a couple of style bugs introduced (or touched by) previous commit.
b5eb64d6f0fccb72419da5552deee22cb6117fac 07-Feb-2002 julian <julian@FreeBSD.org> Pre-KSE/M3 commit.
This is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embedded there,
but remove all the code that assumes that in preparation for the next commit,
which will actually move it out.

Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
ca79facdf4005810aaa1b757a00e3a8428d8cfd8 02-Feb-2002 mckusick <mckusick@FreeBSD.org> In the routines vrele() and vput(), we must lock the vnode and
call VOP_INACTIVE before placing the vnode back on the free list.
Otherwise there is a race condition on SMP machines between
getnewvnode() locking the vnode to reclaim it and vrele()
locking the vnode to inactivate it. This window of vulnerability
becomes exaggerated in the presence of filesystems that have
been suspended as the inactive routine may need to temporarily
release the lock on the vnode to avoid deadlock with the syncer
f51ea914df90df0fcf2b355661d7c44837f8a331 19-Jan-2002 dillon <dillon@FreeBSD.org> Remove 'VXLOCK: interlock avoided' warnings. This can now occur in normal
operation. The vgonel() code has always called vclean() but until we
started proactively freeing vnodes it would never actually be called with
a dirty vnode, so this situation did not occur prior to the vnlru() code.
Now that we proactively free vnodes when kern.maxvnodes is hit, however,
vclean() winds up with work to do and improperly generates the warnings.

Reviewed by: peter
Approved by: re (for MFC)
MFC after: 1 day
b8d6599e4cce8ed2f9fb198c5a189d50da1e8353 15-Jan-2002 mckusick <mckusick@FreeBSD.org> When downgrading a filesystem from read-write to read-only, operations
involving file removal or file update were not always being fully
committed to disk. The result was lost files or corrupted file data.
This change ensures that the filesystem is properly synced to disk
before the filesystem is down-graded.

This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck. With this change, such files are found at the time of the
down-grade. Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.

Submitted by: "Sam Leffler" <sam@errno.com>
05b2183d53796b7c69ccc99b1d9abb1c582934d9 10-Jan-2002 dillon <dillon@FreeBSD.org> Add vlruvp() routine - implements LRU operation for vnode recycling.

We calculate a trigger point that both guarantees we will find a
sufficient number of vnodes to recycle and prevents us from recycling
vnodes with lots of resident pages. This particular section of
code is designed to recycle vnodes, not do unnecessary frees of
cached VM pages.
91aada8d5feb4ccebe48ea3ab211000016c9aeaf 25-Dec-2001 dillon <dillon@FreeBSD.org> Fix typo in previous commit (tsleep was using wrong rendezvous point)
ac9876d609290ddd585a1e5a67550061f01c20dd 20-Dec-2001 dillon <dillon@FreeBSD.org> Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release
d6d1e90f254dad69e8c6ce5e6d65d5ea3fe0331d 19-Dec-2001 peter <peter@FreeBSD.org> Do not initialize static/global variables to 0. Use bss instead of
taking up space in the data section.
12f2610cb5fd215c51d36de2cbe51871b7b98f5d 19-Dec-2001 peter <peter@FreeBSD.org> Use a different mechanism to get the vnlru process to wake up and notice
the shutdown request at reboot/halt time.
Disable the printf 'vnlru process getting nowhere, pausing...' and instead
export the count to the debug.vnlru_nowhere sysctl.
1750942f6f64d20cc8853d2d1a60a3daaeeb1110 18-Dec-2001 dillon <dillon@FreeBSD.org> This is a forward port of Peter's vlrureclaim() fix, with some minor mods
by me to make it more efficient. The original code had serious balancing
problems and could also deadlock easily. This code relegates the vnode
reclamation to its own kproc and relaxes the vnode reclamation requirements
to better maintain kern.maxvnodes. This code still doesn't balance as well
as it could, but it does a much better job than the original code.

Approved by: re@freebsd.org
Obtained from: ps, peter, dillon
MFS Assuming: Assuming no problems crop up in Yahoo testing
MFC after: 7 days
8e6d2fbcbd6632194a3b13da9a3450e36b23cfbb 14-Dec-2001 dillon <dillon@FreeBSD.org> A slightly different version of the vlrureclaim fix.

Reported by: peter, ps
a194c4400184a010d8a5bf8c6196b5072c814c22 13-Dec-2001 peter <peter@FreeBSD.org> If we were called to allocate a vnode that is not associated with a
mount point, do not dereference the NULL mp argument.
c9a56085cebe87767458dd48f7738acbe105619e 04-Nov-2001 dillon <dillon@FreeBSD.org> Add mnt_reservedvnlist so we can MFC to 4.x, in order to make all mount
structure changes now rather than piecemeal later on. mnt_nvnodelist
currently holds all the vnodes under the mount point. This will eventually
be split into a 'dirty' and 'clean' list. This way we only break kld's once
rather than twice. nvnodelist will eventually turn into the dirty list
and should remain compatible with the klds.
25f3ce60105a1f164488faca541f53d4baeb8cdf 02-Nov-2001 rwatson <rwatson@FreeBSD.org> Merge from POSIX.1e Capabilities development tree:

o POSIX.1e capabilities authorize overriding of VEXEC for VDIR based
appropriate conditionals to vaccess() to take that into account.
o Synchronization cap_check_xxx() -> cap_check() change.

Obtained from: TrustedBSD Project
b37309f7648547e5c69ae317a1d73d8b631bd55f 27-Oct-2001 dillon <dillon@FreeBSD.org> syncdelay, filedelay, dirdelay, metadelay are ints, not time_t's,
and can also be made static.
f883ef447af57985b21cde8cd13232ca845190a4 26-Oct-2001 dillon <dillon@FreeBSD.org> Implement kern.maxvnodes. Adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.

Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after: 1 week
306854a5691be5ea2534e3be14401148c97eaeb9 25-Oct-2001 dillon <dillon@FreeBSD.org> Add missing TAILQ_INSERT_TAIL's which somehow didn't get committed with
the recent vnode cleanup.
45a6fabe87ae3342c49f2e351d044e60daf8dfb3 23-Oct-2001 dillon <dillon@FreeBSD.org> Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after: 3 days
74303ee776672cc16514cd0828b98d3204b41a07 16-Oct-2001 dillon <dillon@FreeBSD.org> fix minor bug in kern.minvnodes sysctl. Use OID_AUTO.
414efe2875a2395686b51eff9af5c163057f3af7 08-Oct-2001 dillon <dillon@FreeBSD.org> WS Cleanup
34563ec4a5862b7ab60cf9005844d61552c0edcf 05-Oct-2001 dillon <dillon@FreeBSD.org> vinvalbuf() was only waiting for write-I/O to complete. It really has to
wait for both read AND write I/O to complete. Only NFS calls vinvalbuf()
on an active vnode (when the server indicates that the file is stale), so
this bug fix only affects NFS clients.

MFC after: 3 days
5a5b9f79f48be499ceca11f3d54c0525935b9ac1 01-Oct-2001 dillon <dillon@FreeBSD.org> After extensive testing it has been determined that adding complexity
to avoid removing higher level directory vnodes from the namecache has
no perceivable effect and will be removed. This is especially true
when vmiodirenable is turned on, which it is by default now. ( vmiodirenable
makes a huge difference in directory caching ). The vfs.vmiodirenable and
vfs.nameileafonly sysctls have been left in to allow further testing, but
I expect to rip out vfs.nameileafonly soon too.

I have also determined through testing that the real problem with numvnodes
getting too large is due to the VM Page cache preventing the vnode from
being reclaimed. The directory stuff made only a tiny dent relative
to Poul's original code, enough so that some tests succeeded. But tests
with several million small files show that the bigger problem is the VM Page
cache. This will have to be addressed by a future commit.

MFC after: 3 days
5596676e6c6c1e81e899cd0531f9b1c28a292669 12-Sep-2001 julian <julian@FreeBSD.org> KSE Milestone 2
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to the previous -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after: ha ha ha ha
4b437abe7829ed9af8a9434c256082b546c84682 27-Aug-2001 peter <peter@FreeBSD.org> If a file has been completely unlinked, stop automatically syncing the
file. ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them. This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.
6ca5d5c5c5540c58e8a58e1bb2685ff2faad6260 27-Jul-2001 peter <peter@FreeBSD.org> Revert previous accidental commit. FWIW, it was part of enabling
VM caching of disks through mmap() and stopping syncing of open files
that had their last reference in the fs removed (ie: their unsync'ed
pages get discarded on close already, so I made it stop syncing too).
18bc463cb61ab140a748f3aa099e2d859eae1e37 27-Jul-2001 peter <peter@FreeBSD.org> Fix cut/paste blunder. Serves me right for doing a last minute tweak
to what I had for some time.

Submitted by: bde
e028603b7e3e4fb35cdf00aab533f3965f4a13cc 04-Jul-2001 dillon <dillon@FreeBSD.org> With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
34fab2d86c2062d223b078ede44db24474823334 28-Jun-2001 jhb <jhb@FreeBSD.org> - Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.
a3f0842419d98da211706f921fc626e160cd960b 19-May-2001 alfred <alfred@FreeBSD.org> Introduce a global lock for the vm subsystem (vm_mtx).

vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb
dafd513732df8c8fa7b8c5069ae3af2203853494 16-May-2001 iedowse <iedowse@FreeBSD.org> Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by: phk, bp
33f5635b7705bc4e5d89ceb058b27b4468ded61a 11-May-2001 iedowse <iedowse@FreeBSD.org> In vrele() and vput(), avoid triggering the confusing "missed vn_close"
KASSERT when vp->v_usecount is zero or negative. In this case, the
"v*: negative ref cnt" panic that follows is much more appropriate.

Reviewed by: mckusick
161a28e7381eda8743b7a7793d1f0af1a776e287 26-Apr-2001 phk <phk@FreeBSD.org> vfs_subr.c is getting rather fat. The underlying repocopy and this
commit moves the filesystem export handling code to vfs_export.c
cdc83afc7f1e444c4646840f48592b7ff524fbea 25-Apr-2001 phk <phk@FreeBSD.org> Move the netexport structure from the fs-specific mountstructure
to struct mount.

This makes the "struct netexport *" parameter to the vfs_export
and vfs_checkexport interface unneeded.

Consequently that all non-stacking filesystems can use

At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>
1f5de3071891f86c0e7d51efde6705f5b8ac2959 23-Apr-2001 grog <grog@FreeBSD.org> Correct #includes to work with fixed sys/mount.h.
546a3cb874c001d7ef459f0d56d33c3d3c2f89d4 18-Apr-2001 tanimura <tanimura@FreeBSD.org> Reclaim directory vnodes held in namecache if few free vnodes are

Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.

The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).

Suggested by: tegge
Approved by: phk
378e561228360a3c0ad7f34be404abec95457c90 17-Apr-2001 phk <phk@FreeBSD.org> This patch removes the VOP_BWRITE() vector.

VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
58f9dcd6ced7e1926fc1dae75119651b0913fca2 23-Feb-2001 jlemon <jlemon@FreeBSD.org> Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean().
Use this to tell a filter attached to a vnode that the underlying vnode is
no longer valid, by returning EV_EOF.

PR: kern/25309, kern/25206
18d474781ff1acbc67429e2db4fa0cf9a0d3c51e 18-Feb-2001 green <green@FreeBSD.org> Switch to using a struct xucred instead of a struct ucred when not
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).

This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.

Reviewed by: bde
f364d4ac3621ae2689a3cc1b82c73eb491475a24 09-Feb-2001 bmilekic <bmilekic@FreeBSD.org> Change and clean the mutex lock interface.

mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

Similarly, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:


The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
e87f7a15ad62e1dd25061ddb301662e809692c2c 04-Feb-2001 phk <phk@FreeBSD.org> Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)
85976e83491a410e25a72d8dc20f376938d46c5d 31-Jan-2001 bp <bp@FreeBSD.org> Properly lock new vnode.

Reminded by: tegge
8d2ec1ebc4a9454e2936c6fcbe29a5f1fd83504f 24-Jan-2001 jasone <jasone@FreeBSD.org> Convert all simplelocks to mutexes and remove the simplelock implementations.
a7fc696a51f8152e18bc2ed4cb11aea0a55e9434 23-Jan-2001 rwatson <rwatson@FreeBSD.org> o The move to using VADMIN under vaccess() resulted in some system
calls returning EACCES instead of EPERM. This patch modifies vaccess()
to return EPERM instead of EACCES if VADMIN is among the requested
rights. This affects functions normally limited to the owners of
a file, such as chmod(), as EPERM is the error indicating that
privilege would allow the operation, rather than a change in mandatory
or discretionary rights.

Reported by: bde
cdfe59aac6cf43a9d742f76d7486a953d0f0d060 15-Dec-2000 jhb <jhb@FreeBSD.org> Stick the kthread API in a kthread_* namespace, and the specialized kproc
functions in a kproc_* namespace.

Reviewed by: -arch
8fb19aa301bf05a6d0e411ceb787bf9d53acf8a7 13-Dec-2000 mckusick <mckusick@FreeBSD.org> Use proper mutex locking when calling setrunnable from speedup_syncer().

Submitted by: Tor.Egge@fast.no
dd75d1d73b4f3034c1d9f621a49fff58b1d71eb1 08-Dec-2000 dwmalone <dwmalone@FreeBSD.org> Convert more malloc+bzero to malloc+M_ZERO.

Submitted by: josh@zipperup.org
Submitted by: Robert Drehmel <robd@gmx.net>
eb5dd3d06e8165ca67e70519b14c37d90a173e86 06-Dec-2000 peter <peter@FreeBSD.org> Untangle vfsinit() a bit. Use separate sysinit functions rather than
having a super-function calling bits all over the place.
397a29f11740a2c64b34ccd9e0b76d2f51c24df6 02-Dec-2000 gallatin <gallatin@FreeBSD.org> Correct int/long type mismatch in the proper place this time. freevnodes
and numvnodes are longs in the kernel. They should remain longs in systat,
what really needs to change is that they should be using SYSCTL_LONG rather
than SYSCTL_INT. I also changed wantfreevnodes to SYSCTL_LONG because I
happened to notice it.

I wish there was a way to find all of these automatically..

Pointed out by: bde
c91f8bd1fb2414d034600b374cbf8f241621b7f8 01-Dec-2000 jhb <jhb@FreeBSD.org> Use msleep() instead of mtx_exit()/tsleep() so that we release the lock and
go to sleep as an "atomic" operation.
4a644ba189de81a8b2af885e4eb1e0749e088dcc 30-Nov-2000 mckusick <mckusick@FreeBSD.org> Get rid of a bogus mtx_exit (it was attempting to release an
already released mutex).

Submitted by: "Chris Knight" <chris@aims.com.au>
2ace35208525bb250b47fe7af60ec2ce681c6c92 18-Nov-2000 dillon <dillon@FreeBSD.org> Implement a low-memory deadlock solution.

Removed most of the hacks that were trying to deal with low-memory
situations prior to now.

The new code is based on the concept that I/O must be able to function in
a low memory situation. All major modules related to I/O (except
networking) have been adjusted to allow allocation out of the system
reserve memory pool. These modules now detect a low memory situation but
rather than block they instead continue to operate, then return resources
to the memory pool instead of caching them or leaving them wired.

Code has been added to stall in a low-memory situation prior to a vnode
being locked.

Thus situations where a process blocks in a low-memory condition while
holding a locked vnode have been reduced to near nothing. Not only will
I/O continue to operate, but many prior deadlock conditions simply no
longer exist.

Implement a number of VFS/BIO fixes

(found by Ian): in biodone(), bogus-page replacement code, the loop
was not properly incrementing loop variables prior to a continue
statement. We do not believe this code can be hit anyway but we
aren't taking any chances. We'll turn the whole section into a
panic (as it already is in brelse()) after the release is rolled.

In biodone(), the foff calculation was incorrectly
clamped to the iosize, causing the wrong foff to be calculated
for pages in the case of an I/O error or biodone() called without
initiating I/O. The problem always caused a panic before. Now it
doesn't. The problem is mainly an issue with NFS.

Fixed casts for ~PAGE_MASK. This code worked properly before only
because the calculations use signed arithmetic. Better to properly
extend PAGE_MASK first before inverting it for the 64 bit masking

In brelse(), the bogus_page fixup code was improperly throwing
away the original contents of 'm' when it did the j-loop to
fix the bogus pages. The result was that it would potentially
invalidate parts of the *WRONG* page(!), leading to corruption.

There may still be cases where a background bitmap write is
being duplicated, causing potential corruption. We have identified
a potentially serious bug related to this but the fix is still TBD.
So instead this patch contains a KASSERT to detect the problem
and panic the machine rather than continue to corrupt the filesystem.
The problem does not occur very often.. it is very hard to
reproduce, and it may or may not be the cause of the corruption
people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
8e9f33e1ce2a8ec7d527ae256a6424a60e6e1c3b 02-Nov-2000 tegge <tegge@FreeBSD.org> Clear the VFREE flag when the vnode is removed from the free list in
getnewvnode(). Otherwise routines called from VOP_INACTIVE() might
attempt to remove the vnode from a free list the vnode isn't on,
causing corruption.
PR: 18012
4e063f553471c4c75d185894232cce516553505b 02-Nov-2000 phk <phk@FreeBSD.org> Take VBLK devices further out of their misery.

This should fix the panic I introduced in my previous commit on this topic.
d944886e4df0e6d88970157ff46e8c0ab0c6b2c8 20-Oct-2000 jhb <jhb@FreeBSD.org> Catch up to moving headers:
- machine/ipl.h -> sys/ipl.h
- machine/mutex.h -> sys/mutex.h