History log of /freebsd-head/sys/cddl/contrib/opensolaris/uts/common/sys/
Revision Date Author Comments (<<< Hide modified files) (Show modified files >>>)
5ea17928487a7e4727ac88353536af293baa0daf 25-Mar-2020 freqlabs <freqlabs@FreeBSD.org> MFOpenZFS: ZVOLs should not be allowed to have children

zfs create, receive and rename can bypass this hierarchy rule. Update
both userland and kernel module to prevent this issue and use pyzfs
unit tests to exercise the ioctls directly.

Note: this commit slightly changes zfs_ioc_create() ABI. This allow to
differentiate a generic error (EINVAL) from the specific case where we
tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>

Approved by: mav (mentor)
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
48b94864c51253da92e4444f0074eec36ef391fb 03-Feb-2020 imp <imp@FreeBSD.org> Remove sparc64 kernel support

Remove all sparc64 specific files
Remove all sparc64 ifdefs
Removee indireeect sparc64 ifdefs
8f9d69492c3da3a8c1ea7fa1bc82b7639cc3064b 21-Nov-2019 avg <avg@FreeBSD.org> MFV r354382,r354385: 10601 10757 Pool allocation classes


10601 Pool allocation classes
illumos port of ZoL Pool allocation classes. Includes at least these two
441709695 Pool allocation classes misplacing small file blocks
cc99f275a Pool allocation classes

10757 Add -gLp to zpool subcommands for alt vdev names
Port from ZoL of
d2f3e292d Add -gLp to zpool subcommands for alt vdev names
Note that a subsequent ZoL commit changed -p to -P
a77f29f93 Change full path subcommand flag from -p to -P

Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Portions contributed by: Håkan Johansson <f96hajo@chalmers.se>
Portions contributed by: Richard Yao <ryao@gentoo.org>
Portions contributed by: Chunwei Chen <david.chen@nutanix.com>
Portions contributed by: loli10K <ezomori.nozomu@gmail.com>
Author: Don Brady <don.brady@delphix.com>

11541 allocation_classes feature must be enabled to add log device


After the allocation_classes feature was integrated, one can no longer add a
log device to a pool unless that feature is enabled. There is an explicit check
for this, but it is unnecessary in the case of log devices, so we should handle
this better instead of forcing the feature to be enabled.

Author: Jerry Jelinek <jerry.jelinek@joyent.com>

FreeBSD notes.
I faithfully added the new -g, -L, -P flags, but only -g does something:
vdev GUIDs are displayed instead of device names. -L, resolve symlinks,
and -P, display full disk paths, do nothing at the moment.
The use of special vdevs is backward compatible for read-only access, so
root pools should be bootable, but exercise caution.

MFC after: 4 weeks
f3b212b1a5919ad51b644a9021cb3ef22e27caec 18-Nov-2019 avg <avg@FreeBSD.org> MFV r354378,r354379,r354386: 10499 Multi-modifier protection (MMP)

10499 Multi-modifier protection (MMP)
Port the following ZFS commits from ZoL to illumos.
379ca9cf2 Multi-modifier protection (MMP)
bbffb59ef Fix multihost stale cache file import
0d398b256 Do not initiate MMP writes while pool is suspended

10701 Correct lock ASSERTs in vdev_label_read/write
Port of ZoL commit:
0091d66f4e Correct lock ASSERTs in vdev_label_read/write
At a minimum, this fixes a blown assert during an MMP test run when running on
a DEBUG build.

11770 additional mmp fixes
Port a few additional MMP fixes from ZoL that came in after our
initial MMP port.
4ca457b065 ZTS: Fix mmp_interval failure
ca95f70dff zpool import progress kstat
(only minimal changes from above can be pulled in right now)
060f0226e6 MMP interval and fail_intervals in uberblock

Note from the committer (me).
I do not have any use for this feature and I have not tested it. I only
did smoke testing with multihost=off.
Please be aware.
I merged the code only to make future merges easier.

Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Portions contributed by: Tim Chase <tim@chase2k.com>
Portions contributed by: sanjeevbagewadi <sanjeev.bagewadi@gmail.com>
Portions contributed by: John L. Hammond <john.hammond@intel.com>
Portions contributed by: Giuseppe Di Natale <dinatale2@llnl.gov>
Portions contributed by: Prakash Surya <surya1@llnl.gov>
Portions contributed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Olaf Faaland <faaland1@llnl.gov>

MFC after: 4 weeks
facf75e0dd085a5c8170ab2b45e88bf9b172e7c8 07-Nov-2019 avg <avg@FreeBSD.org> MFV r354377: 10554 Implemented zpool sync command


During the port of MMP (illumos bug 10499) from ZoL, I found this
earlier ZoL project is a prerequisite. Here is the original
description. This addition will enable us to sync an open TXG to the
main pool on demand. The functionality is similar to 'sync(2)' but
'zpool sync' will return when data has hit the main storage instead of
potentially just the ZIL as is the case with the 'sync(2)' cmd.

Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Author: Alek Pinchuk <apinchuk@datto.com>
MFC after: 3 weeks
Relnotes: possibly
22b0eb13b24b859a3d50a37a0bcf90564d31896a 26-Oct-2019 asomers <asomers@FreeBSD.org> MFZoL: Avoid retrieving unused snapshot props

This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8077

MFC after: 2 weeks
dee41b6cfc86346fedea54fdbf6a2d44e95b7d3f 16-Oct-2019 avg <avg@FreeBSD.org> MFV r353630: 10809 Performance optimization of AVL tree comparator functions


Port ZoL ee36c709c3d Performance optimization of AVL tree comparator functions

This is a followup to r337567 that imported the ZoL commit directly into
FreeBSD. It seems that at the time we did not have some of the earlier
changes, so some pieces of the ZoL change were not applicable. Also,
the illumos version got a few style cleanups. Some changes were missed
or incorrectly merged (e.g., vdev_cache_lastused_compare and

Obtained from: ZoL, illumos
MFC after: 25 days
X-MFC after: r353634
9c182fc73e8fa6017d9a2c9df152850e4958faa8 08-Aug-2019 tsoome <tsoome@FreeBSD.org> loader: support com.delphix:removing

We should support removing vdev from boot pool. Update loader zfs reader
to support com.delphix:removing.

Reviewed by: allanjude
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18901
15ba49a9d03a8a4604ff38fae87f94c44e933929 17-Jul-2019 markj <markj@FreeBSD.org> Fix FASTTRAPIOC_GETINSTR.

This ioctl is used when a breakpoint is encountered while disassembling
a symbol in the target process. Since only one DTrace consumer can
toggle or enumerate fasttrap probes from a given process at time, this
ioctl does not appear to be used in practice.
65c40fbfd497edae99987584c9d34a5a93b908ab 26-Dec-2018 avg <avg@FreeBSD.org> MFV r342532: 5882 Temporary pool names

Note that this commit brings only formatting changes that were done
during the final review of the illumos change, because FreeBSD got the
main changes before illumos.


This is an import of the temporary pool names functionality from ZoL:
It is intended to assist the creation and management of virtual machines
that have their rootfs on ZFS on hosts that also have their rootfs on
ZFS. These situations cause SPA namespace collisions when the standard
name rpool is used in both cases. The solution is either to give each
guest pool a name unique to the host, which is not always desireable, or
boot a VM environment containing an ISO image to install it, which is

MFC after: 1 week
Sponsored by: Panzura
fa1f71dac0e008bd46c37fe7333d9064db07751d 21-Nov-2018 jhibbits <jhibbits@FreeBSD.org> DTrace/powerpc: Fix FBT return probes

The FBT fuction boundary prober was setting one return probe marker value,
but the dtrace handler was expecting another. This causes a hang when
tracing return probes.
639c5f49ef99935a075f98ed192816c869958625 03-Sep-2018 br <br@FreeBSD.org> Add support for 'C'-compressed ISA extension to DTrace FBT provider.

Approved by: re (kib)
Sponsored by: DARPA, AFRL
46cdbe4a7a57c13eb35e99a5abea862ef346524f 12-Aug-2018 mmacy <mmacy@FreeBSD.org> MFV/ZoL: Implement large_dnode pool feature

commit 50c957f702ea6d08a634e42f73e8a49931dd8055
Author: Ned Bass <bass6@llnl.gov>
Date: Wed Mar 16 18:25:34 2016 -0700

Implement large_dnode pool feature


This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.


The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

# zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:

New ZAP interfaces:

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.

If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.

The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data

ZFS Test Suite
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.

Resizing Dnodes
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode

Feature Reference Counting
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
5160573320065c83f812de8a62a0379272e70e37 10-Aug-2018 mmacy <mmacy@FreeBSD.org> Performance optimization of AVL tree comparator functions

commit ee36c709c3d5f7040e1bd11f5c75318aa03e789f
Author: Gvozden Neskovic <neskovic@gmail.com>
Date: Sat Aug 27 20:12:53 2016 +0200

perf: 2.75x faster ddt_entry_compare()
First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).

Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently. Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:

old 6.85789 s
new 2.49089 s

perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
Compute the result directly instead of using conditionals

perf: zfs_range_compare()
Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.

perf: spa_error_entry_compare()
`bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
perf: 2.8x faster mze_compare()
perf: faster dbuf_compare()
perf: faster compares in spa_misc
perf: 2.8x faster layout_hash_compare()
perf: 2.8x faster space_reftree_compare()
perf: libzfs: faster avl tree comparators
perf: guid_compare()
perf: dsl_deadlist_compare()
perf: perm_set_compare()
perf: 2x faster range_tree_seg_compare()
perf: faster unique_compare()
perf: faster vdev_cache _compare()
perf: faster vdev_uberblock_compare()
perf: faster fuid _compare()
perf: faster zfs_znode_hold_compare()

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Richard Elling <richard.elling@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5033
a68b7794d2a2a35fbd22a5a59f6999ae63250d2e 03-Aug-2018 mav <mav@FreeBSD.org> MFV r337223:
9580 Add a hash-table on top of nvlist to speed-up operations


Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
067c6a290ba51cf223b10ff853936295f66af6fc 03-Aug-2018 mav <mav@FreeBSD.org> MFV 337214:
9621 Make createtxg and guid properties public


Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Josh Paetzel <josh@tcbug.org>
7bffd764fe6a13954a274c41690d64e25ea7db2f 31-Jul-2018 mav <mav@FreeBSD.org> MFV r336991, r337001:
9102 zfs should be able to initialize storage devices

The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.


Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
a73aecd5f1b3afb968ba2977945b1a62d47d0353 08-Jun-2018 sef <sef@FreeBSD.org> This originated from ZFS On Linux, as

During scans (scrubs or resilvers), it sorts the blocks in each transaction
group by block offset; the result can be a significant improvement. (On my
test system just now, which I put some effort to introduce fragmentation into
the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the
changes.) I've seen similar rations on production systems.

Approved by: Alexander Motin
Obtained from: ZFS On Linux
Relnotes: Yes (improved scrub performance, with tunables)
Differential Revision: https://reviews.freebsd.org/D15562
f42a38887f9cf8de947b71e4419449cb76ccdf6f 12-Apr-2018 avg <avg@FreeBSD.org> allow ZFS pool to have temporary name for duration of current import

The change adds -t <name> option to zpool create and -t option to zpool
import in its form with an old name and a new name. This allows to
import (or create) a pool under a name that's different from its real,
permanent name without affecting that name. This is useful when working
with VM images or images of other physical systems if they happen to
have a ZFS pool with the same name as the host system.

The changes come from ZoL with some small tweaks.
The porting has been done by julian.

The change is being submitted to OpenZFS:

Submitted by: julian
Reviewed by: smh
MFC after: 2 weeks
Sponsored by: Panzura (porting)
Differential Revision: https://reviews.freebsd.org/D14972
e6907ec1f002ac4180725276fb5fc37aae2dd3e9 28-Mar-2018 mav <mav@FreeBSD.org> MFV r331706:
9235 rename zpool_rewind_policy_t to zpool_load_policy_t


We want to be able to pass various settings during import/open of a pool,
which are not only related to rewind. Instead of adding a new policy and
duplicate a bunch of code, we should just rename rewind_policy to a more
generic term like load_policy.

For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we want
those flags not only for the import case, but also for the open case. One
such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would
allow zfs to open a pool when logs are missing.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
b1ec8f2d01fd5f696ce4f11f8e2755935e979cc6 28-Mar-2018 mav <mav@FreeBSD.org> MFV r331695, 331700: 9166 zfs storage pool checkpoint


The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that. It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
7b8847266fd14edbbeb8e603599b21b1089daa58 24-Feb-2018 asomers <asomers@FreeBSD.org> Implement CTASSERT using _Static_assert

Prevents warnings about "unused typedef" with GCC-6

Reported by: GCC-6
MFC after: 18 days
X-MFC-With: 329722
6702021a28e1db00b5a9f9ef70eac98a2e19f6aa 22-Feb-2018 mav <mav@FreeBSD.org> MFV r329793, r329795:
9075 Improve ZFS pool import/load process and corrupted pool recovery


Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
Of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
e8f51d4c277a9ca0b98025c8a6b870afb337d1f6 21-Feb-2018 mav <mav@FreeBSD.org> MFV r329753: 8809 libzpool should leverage work done in libfakekernel


Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andrew Stormont <astormont@racktopsystems.com>
6da43894fe0991933503c427611a0fbe34f7cf4d 21-Feb-2018 mav <mav@FreeBSD.org> MFV r329502: 7614 zfs device evacuation/removal


This project allows top-level vdevs to be removed from the storage pool with
“zpool remove”, reducing the total amount of storage in the pool. This
operation copies all allocated regions of the device to be removed onto other
devices, recording the mapping from old to new location. After the removal is
complete, read and free operations to the removed (now “indirect”) vdev must
be remapped and performed at the new location on disk. The indirect mapping
table is kept in memory whenever the pool is loaded, so there is minimal
performance overhead when doing operations on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers in
the pool. An entry becomes obsolete when all the blocks that use it are
freed. An entry can also become obsolete when all the snapshots that
reference it are deleted, and the block pointers that reference it have been
“remapped” in all filesystems/zvols (and clones). Whenever an indirect block
is written, all the block pointers in it will be “remapped” to their new
(concrete) locations if possible. This process can be accelerated by using
the “zfs remap” command to proactively rewrite all indirect blocks that
reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of the data
that is copied. This makes the process much faster, but if it were used on
redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy
the wrong data, when we have the correct data on e.g. the other side of the
mirror. Therefore, mirror and raidz devices can not be removed.

Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prashanth Sreenivasa <pks@delphix.com>
d4332083869aabd68424ce6019848c718362b409 21-Feb-2018 mav <mav@FreeBSD.org> MFV r319737: 6939 add sysevents to zfs core for commands


Originally created https://smartos.org/bugview/OS-4489
sysevents should be fired in the kernel from ZFS whenever a command
is run that is logged in zpool history.
Example output
Terminal 1
root - gz sunos ~ # zfs create zones/foobar
root - gz sunos ~ # zfs set quota=10g zones/foobar
root - gz sunos ~ # zfs destroy zones/foobar
Terminal 2
root - gz sunos ~ # sysevent EC_zfs
nvlist version: 0
date = 2016-04-28T14:50:08.964Z
vendor = SUNW
publisher = zfs
class = EC_zfs
subclass = ESC_ZFS_history_event
pid = 0
data = (embedded nvlist)
nvlist version: 0
pool_name = zones
pool_guid = 0x40c964e8f9a7a694
history_record = (embedded nvlist)
nvlist version: 0
dsname = zones/foobar
dsid = 0x1525
history internal str =
internal_name = create
history txg = 0x4c4ef3

Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Joshua M. Clulow <jmc@joyent.com>
Reviewed by: Josh Wilsdon <jwilsdon@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Alan Somers <asomers@gmail.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Dave Eddy <dave@daveeddy.com>
65ce973bc573398c8897a7ac00a13f52ef36df8c 21-Feb-2018 mav <mav@FreeBSD.org> MFV r319736: 6396 remove SVM


LVM = SVM = Solaris Volume Manager
dead code and not using with ZFS based platform.

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
eb978148c027c50d7f4b3e1b428530ba2d268175 21-Feb-2018 mav <mav@FreeBSD.org> MFV r318941: 7446 zpool create should support efi system partition


Since we support whole-disk configuration for boot pool, we also will need
whole disk support with UEFI boot and for this, zpool create should create efi-
system partition.
I have borrowed the idea from oracle solaris, and introducing zpool create -
B switch to provide an way to specify that boot partition should be created.
However, there is still an question, how big should the system partition be.
For time being, I have set default size 256MB (thats minimum size for FAT32
with 4k blocks). To support custom size, the set on creation "bootsize"
property is created and so the custom size can be set as: zpool create B -
o bootsize=34MB rpool c0t0d0
After pool is created, the "bootsize" property is read only. When -B switch is
not used, the bootsize defaults to 0 and is shown in zpool get output with
value ''. Older zfs/zpool implementations are ignoring this property.

Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>

This commit makes no sense for FreeBSD, that is why I blocked the option,
but it should be good to stay closer to upstream.
2dd60f22d792c4e7709a4174f40f0c71b442923e 22-Jan-2018 mav <mav@FreeBSD.org> MFV r328251: 8652 Tautological comparisons with ZPROP_INVAL


Clang and GCC prefer to use unsigned ints to store enums. With Clang, that
causes tautological comparison warnings when comparing a zfs_prop_t or
zpool_prop_t variable to the macro ZPROP_INVAL. It's likely that error
handling code is being silently removed as a result.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alan Somers <asomers@gmail.com>
27fedeb8ad418bc3ebab226a879536c247aeb2db 22-Jan-2018 mav <mav@FreeBSD.org> MFV r328247: 8959 Add notifications when a scrub is paused or resumed


Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Sean Eric Fagan <sef@ixsystems.com>
2700f9ece142c06d1b799eec1066946fd20e436f 21-Jan-2018 mav <mav@FreeBSD.org> MFV r328220: 8677 Open-Context Channel Programs


We want to be able to run channel programs outside of synching context.
This would greatly improve performance of channel program that just gather
information, as we won't have to wait for synching context anymore.

This feature should introduce the following:
- A new command line flag in "zfs program" to specify our intention to
run in open context.
- A new flag/option within the channel program ioctl which selects the
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
1bfc3a6a76928ffcb24f31c3492daee94a9f9491 12-Jan-2018 markj <markj@FreeBSD.org> Add "jid" and "jailname" variables to DTrace.

These return the jail ID and jail name for the traced process,
respectively, and are analogous to "zonename" on Solaris/illumos.
"zonename" is now aliased to "jailname".

Also add some stress tests for the new variables.

Submitted by: Domagoj Stolfa <domagoj.stolfa@gmail.com>
Reviewed by: dteske (previous version)
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D13877
e62792a7ebc7d0c7d2be700a3bf052b5291f3882 24-Dec-2017 dim <dim@FreeBSD.org> Remove obsolete register keyword from opensolaris's sysmacros.h. When
compiling zfsd with recent clang, it leads to a warning about the
register storage class being incompatible with C++17.

MFC after: 3 days
b0b9b4fcf4a7a712c3270a5fde519c199b11fc87 11-Dec-2017 markj <markj@FreeBSD.org> Pass the trap frame to fasttrap hooks.

The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve

MFC after: 2 weeks
6b319ae2aabfa0a0bc36cb18c2677e01ecf822f8 01-Oct-2017 avg <avg@FreeBSD.org> MFV r323530,r323533,r323534: 7431 ZFS Channel Programs, and followups

7431 ZFS Channel Programs


ZFS channel programs (ZCP) adds support for performing compound ZFS
administrative actions via Lua scripts in a sandboxed environment (with time
and memory limits).
This initial commit includes both base support for running ZCP scripts, and a
small initial library of API calls which support getting properties and
listing, destroying, and promoting datasets.
Testing: in addition to the included unit tests, channel programs have been in
use at Delphix for several months for batch destroying filesystems. The
dsl_destroy_snaps_nvl() call has also been replaced with

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Chris Williamson <chris.williamson@delphix.com>

8552 ZFS LUA code uses floating point math


In the LUA interpreter used by "zfs program", the lua format() function
accidentally includes support for '%f' and friends, which can cause compilation
problems when building on platforms that don't support floating-point math in
the kernel (e.g. sparc). Support for '%f' friends (%f %e %E %g %G) should be
removed, since there's no way to supply a floating-point value anyway (all
numbers in ZFS LUA are int64_t's).

Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

8590 memory leak in dsl_destroy_snapshots_nvl()


In dsl_destroy_snapshots_nvl(), "snaps_normalized" is not freed after it is
added to "arg".

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

FreeBSD notes:
- zfs-program.8 manual page is taken almost as is from the vendor repository,
no FreeBSD-ification done
- fixed multiple instances of NULL being used where an integer is expected
- replaced ETIME and ECHRNG with ETIMEDOUT and EDOM respectively

This commit adds a modified version of Lua 5.2.4 under
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lua, mirroring the
upstream. See README.zfs in that directory for the description of Lua
See zfs-program.8 on how to use the new feature.

MFC after: 5 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D12528
16b20104c426b76ed302fa467895eec5fc045765 26-Sep-2017 avg <avg@FreeBSD.org> MFV r323535: 8585 improve batching done in zil_commit()

FreeBSD notes:
- this MFV reverts FreeBSD commit r314549 to make the merge easier
- at present our emulation of cv_timedwait_hires is rather poor,
so I elected to use cv_timedwait_sbt directly
Please see the differential revision for details.
Unfortunately, I did not get any positive reviews, so there could be
bugs in the FreeBSD-specific piece of the merge.
Hence, the long MFC timeout.


The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12355
2766e30c3a2e294c077f8910fb306c918998adc4 09-Sep-2017 avg <avg@FreeBSD.org> MFV r323107: 8414 Implemented zpool scrub pause/resume


FreeBSD note: rather than merging the zpool.8 update I copied the zpool
scrub section from the illumos zpool.1m to FreeBSD zpool.8 almost
verbatim. Now that the illumos page uses the mdoc format, it was an
easier option. Perhaps the change is not in perfect compliance with the
FreeBSD style, but I think that it is acceptible.

This issue tracks the port of scrub pause from ZoL: https://github.com/zfsonlinux/zfs/pull/6167
Currently, there is no way to pause a scrub. Pausing may be useful when
the pool is busy with other I/O to preserve bandwidth.


This patch adds the ability to pause and resume scrubbing. This is achieved
by maintaining a persistent on-disk scrub state. While the state is 'paused'
we do not scrub any more blocks. We do however perform regular scan
housekeeping such as freeing async destroyed and deadlist blocks while paused.

Motivation and Context

Scrub pausing can be an I/O intensive operation and people have been asking
for the ability to pause a scrub for a while. This allows one to preserve scrub
progress while freeing up bandwidth for other I/O.

Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Alek Pinchuk <apinchuk@datto.com>

MFC after: 2 weeks
2f08138bea55f8cda81cf62b501e519c2eeecd69 21-Aug-2017 jhb <jhb@FreeBSD.org> Add a guard around _ILP32 for mips.

This is already done for other architectures in this file and fixes the
build with clang.
3364e8aea957b8f29913ef5da0c80f4081ffad76 07-Aug-2017 br <br@FreeBSD.org> o Replace __riscv__ with __riscv
o Replace __riscv64 with (__riscv && __riscv_xlen == 64)

This is required to support new GCC 7.1 compiler.
This is compatible with current GCC 6.1 compiler.

RISC-V is extensible ISA and the idea here is to have built-in define
per each extension, so together with __riscv we will have some subset
of these as well (depending on -march string passed to compiler):


Reviewed by: ngie
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D11901
b50f2868e54b258b1d74b361d14452c2e01fa4c6 24-Feb-2017 avg <avg@FreeBSD.org> zfs: clean up unused files and definitions

MFC after: 1 month
X-MFC after: r314048
ab48acbf0055b0bca33f59e855c7d4390e8ff142 09-Feb-2017 asomers <asomers@FreeBSD.org> Fix setting birthtime in ZFS

* In zfs_freebsd_setattr, if the caller wants to set the birthtime,
set the bits that zfs_settattr expects

* In zfs_setattr, if XAT_CREATETIME is set, set xoa_createtime,
expected by zfs_xvattr_set. The two levels of indirection seem
excessive, but it minimizes diffs vs OpenZFS.

* In zfs_setattr, check for overflow of va_birthtime (from delphij)

* Remove red herring in zfs_getattr

* Un-booby-trap some macros

New tests are under review at https://github.com/pjd/pjdfstest/pull/6

Reviewed by: avg
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D9353
d879975195aa1d0b434177f3a2f355d74a2d4f10 05-Feb-2017 markj <markj@FreeBSD.org> Use PC-relative relocations for USDT probe sites on i386 and amd64.

When recording probe site addresses in the output DOF file, dtrace -G
needs to emit relocations for the .SUNW_dof section in order to obtain
the addresses of functions containing probe sites. DTrace expects the
addresses to be relative to the base address of the final ELF file,
and the amd64 USDT implementation was relying on some unspecified and
incorrect behaviour in the base system GNU ld to achieve this.

This change reimplements the probe site relocation handling to allow
USDT to be used with lld and newer GNU binutils. Specifically, it
makes use of R_X86_64_PC64/R_386_PC32 relocations to obtain the
probe site address relative to the DOF file address, and adds and uses a
new DOF relocation type which computes the final probe site address using
these relative offsets.

Reported by and discussed with: Rafael Espíndola
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D9374
45b8e2daa4b7f6d309e6dca8239d62de846d6bfd 03-Feb-2017 gnn <gnn@FreeBSD.org> Replace the implementation of DTrace's RAND subroutine for generating
low-quality random numbers with a modern implementation (xoroshiro128+)
that is capable of generating better quality randomness without compromising performance.

Submitted by: Graeme Jenkinson
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9051
8632319c4b8cfd95c0a16024ac1d5f76fe306bb4 23-Jan-2017 markj <markj@FreeBSD.org> Remove the DTRACEHIOC_ADD ioctl.

This ioctl has been considered legacy by upstream since the DTrace code
was first imported, and is unused. The removal also allows some
simplification of dtrace_helper_slurp().

Also remove a bogus copyout in the DTRACEHIOC_ADDDOF handler. Due to a
bug, it would overwrite an in-memory copy of the DOF header rather than
the passed-in DOF helper. Moreover, DTRACEHIOC_ADDDOF already copies the
helper back out automatically since its argument has the IOC_OUT attribute.
792f2dc38d9b4252572b8e4f54f8b6e1bb4ce632 03-Jan-2017 markj <markj@FreeBSD.org> Remove the "unused" DIF subroutine index left after r308582.

These indices are input to a build-time script that generates code to
validate subroutine names.
1bd7e208ec303e9af54ef01882fb9dc8075ce919 27-Dec-2016 markj <markj@FreeBSD.org> Remove an obsolete pragma from dtrace.h.

It triggers a compiler warning and has been removed upstream.

MFC after: 1 week
64ddb8fc277a0b8a37f6d705237fb97a5f3dba28 16-Dec-2016 gnn <gnn@FreeBSD.org> Remove extra DOF_SEC_XLIMPORT from the DOF_SEC_ISLOADABLE macro

MFC after: 2 weeks
Sponsored by: DARPA, AFRL
d3cfa79d83220152aec929f0168999f654530cd7 12-Nov-2016 markj <markj@FreeBSD.org> Remove the DTrace printt and typeref actions.

These are FreeBSD-specific and were added in r178576 to provide the ability
to pretty-print instances of compound types. However, the print action has
long since been augmented to provide this functionality with a simpler

Discussed with: gnn
Differential Revision: https://reviews.freebsd.org/D8478
c573dff5b0503e03cf400af8bcaffebab17df3a4 29-Oct-2016 avg <avg@FreeBSD.org> zfsbootcfg: a simple tool to set next boot (one time) options for zfsboot

(gpt)zfsboot will read one-time boot directives from a special ZFS pool
area. The area was previously described as "Boot Block Header", but
currently it is know as Pad2, marked as reserved and is zeroed out on
pool creation. The new code interprets data in this area, if any, using
the same format as boot.config. The area is immediately wiped out.
Failure to parse the directives results in a reboot right after the
cleanup. Otherwise the boot sequence proceeds as usual.

zfsbootcfg writes zfsboot arguments specified on its command line to the
Pad2 area of a disk identified by vfs.zfs.boot.primary_pool and
vfs.zfs.boot.primary_vdev kenv variables that are set by loader during
boot. Please see the manual page for more.

Thanks to all who reviewed, contributed and made suggestions! There are
many potential improvements to the feature, please see the review for

Reviewed by: wblock (docs)
Discussed with: jhb, tsoome
MFC after: 3 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D7612
3bf1abe8c9a430eff44e60b5970f866fb8fa2708 03-Sep-2016 mav <mav@FreeBSD.org> MFV r304155: 7090 zfs should improve allocation order and throttle allocations


When write I/Os are issued, they are issued in block order but the ZIO pipelin
will drive them asynchronously through the allocation stage which can result i
blocks being allocated out-of-order. It would be nice to preserve as much of
the logical order as possible.
In addition, the allocations are equally scattered across all top-level VDEVs
but not all top-level VDEVs are created equally. The pipeline should be able t
detect devices that are more capable of handling allocations and should
allocate more blocks to those devices. This allows for dynamic allocation
distribution when devices are imbalanced as fuller devices will tend to be
slower than empty devices.
The change includes a new pool-wide allocation queue which would throttle and
order allocations in the ZIO pipeline. The queue would be ordered by issued
time and offset and would provide an initial amount of allocation of work to
each top-level vdev. The allocation logic utilizes a reservation system to
reserve allocations that will be performed by the allocator. Once an allocatio
is successfully completed it's scheduled on a given top-level vdev. Each top-
level vdev maintains a maximum number of allocations that it can handle
(mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels *
mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab
groups and round robin across all eligible metaslab groups to distribute the
work. As top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: George Wilson <george.wilson@delphix.com>
f67e4fd9fa2a974f05732015318b69a61c90f147 01-Sep-2016 mav <mav@FreeBSD.org> MFV r302993: 7104 increase indirect block size


The current default indirect block size is 16KB. We can improve
performance by increasing it to 128KB. This is especially helpful for
any workload that needs to read most of the metadata, e.g.
scrub/resilver, file deletion, filesystem deletion, and zfs send.
We also need to fix a few space estimation errors to make the tests

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
ae8d156b04255228f9b8210d720c611cb52609dc 01-Sep-2016 mav <mav@FreeBSD.org> MFV r302660: 6314 buffer overflow in dsl_dataset_name


Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but
dsl_dataset_name copies the datasets' name PLUS the snapshot name to it,
resulting in a max of 2 * ZFS_MAXNAMELEN + '@'.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
3cbf3d5ed6f6b2f966a9c41554b17b932ea3e0b8 01-Sep-2016 mav <mav@FreeBSD.org> MFV r302647: 6922 Emit ESC_ZFS_VDEV_REMOVE_AUX after removing an aux device


ZFS does not do a config_sync after removing an aux (spare, log, or cache)
device. AFAICT this isn't being done because it is slow and was deemed
unnecessary. However, it should be such a rare operation that speed doesn't
matter, and not doing it results in two problems:
1) It is theoretically possible to remove an aux device from one pool and
attach it to another, then lose power. When power is restored, both pools woul
think that they own the aux device.
2) Removal of the aux device doesn't send any useful sysevents to userland.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alan Somers <asomers@gmail.com>
b1b1e507c60c8a82e60416af94187304916512f4 16-Aug-2016 markj <markj@FreeBSD.org> MFV r301526:
7035 string-related subroutines should validate input earlier

Reviewed by: Alex Wilson <alex.wilson@joyent.com>
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Patrick Mooney <pmooney@pfmooney.com>


MFC after: 2 weeks
1167f0cf7c8bd5edd44a200ca669108af15612ec 02-Aug-2016 br <br@FreeBSD.org> Update RISC-V port to Privileged Architecture Version 1.9.

Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
b281d573a9937cf7f6f8aebd506d4b39a067209c 30-Jun-2016 mav <mav@FreeBSD.org> Revert r299454 and r299448.

Those changes were found confusing FreeBSD libc ACL code, that doesn't
differentiate ACL for directories and files, and report ACLs for all
directories created after those patches as non-trivial. On the other
side these changes were considered wrong from POSIX and NFSv4 points of
view. Until further investigation done upstream, revert those changes
locally in preparation for FreeBSD 11.0 release.

Approved by: re (hrs)
10478a3feb7f1a3aedf6ef9f8967caf7f0ff1c38 01-Jun-2016 asomers <asomers@FreeBSD.org> Improve the English in a comment

Improve the english in a comment. No functional changes

Submitted by: gibbs
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
996c4295dbb6ebc1b8025f370bf680d8242f9f8f 29-May-2016 bdrewery <bdrewery@FreeBSD.org> Avoid more literal-suffix errors with C++11
501a9d9525962ad8c39b12d142394c7aa1df0f76 24-May-2016 br <br@FreeBSD.org> Add initial DTrace support for RISC-V.

Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
a1e682042fe1826faaa599ca3e649842e63a48a9 11-May-2016 mav <mav@FreeBSD.org> MFV r299442: 6762 POSIX write should imply DELETE_CHILD on directories - and
some additional considerations

Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>

21bf786fbfe5ba284acd9884680610fe7942df64 11-May-2016 mav <mav@FreeBSD.org> MFV r299440: 6736 ZFS per-vdev ZAPs

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Joe Stein <joe.stein@delphix.com>

229f59004efd5514c40b23e3e6a5dd5d5dda54fd 05-May-2016 br <br@FreeBSD.org> Implement FBT provider (MD part) for DTrace on MIPS.
Tested on MIPS64.

Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
9db9137b3450e1bb5aa65f8045c3e1d4226d6d53 25-Apr-2016 markj <markj@FreeBSD.org> Increase DTRACE_FUNCNAMELEN from 128 to 192.

This allows for the long function components encountered in www/firefox.
This constant is part of DTrace's userland ABI, so this change may not be

PR: 207735
2a471649019b98c148ff2ee6e1ab87cc96a3a845 22-Apr-2016 avg <avg@FreeBSD.org> MFV r298471: 6052 decouple lzc_create() from the implementation details


At the moment type parameter of lzc_create() is of dmu_objset_type_t type.
That exposes an implementation detail and requires sys/fs/zfs.h to be included
in libzfs_core.h creating unnecessary coupling between libzfs_core interface
and ZFS internals.
I think that dmu_objset_type_t should be replaced with a libzfs_core
enumeration of supported dataset types.
For ABI reasons the new enumeration could be bit-compatible with
For example:
typedef enum {
} lzc_dataset_type_t;

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>

MFC after: 2 weeks
Sponsored by: ClusterHQ
b9677d1249766bc8578b5e6aa51d4ebdbc3be5a9 17-Apr-2016 markj <markj@FreeBSD.org> Make the second argument of dtrace_invop() a trapframe pointer.

Currently this argument is a pointer into the stack which is used by FBT
to fetch the first five probe arguments. On all non-x86 architectures it's
simply the trapframe address, so this change has no functional impact. On
amd64 it's a pointer into the trapframe such that stack[1 .. 5] gives the
first five argument registers, which are deliberately grouped together in
the amd64 trapframe definition.

A trapframe argument simplifies the invop handlers on !x86 and makes the
x86 FBT invop handler easier to understand. Moreover, it allows for invop
handlers that may want to modify the register set of the interrupted thread.
6dcdb61ae36b64cef3125daa63d6dc75f5bcdadd 09-Apr-2016 mav <mav@FreeBSD.org> MFV r297760: 6418 zpool should have a label clearing command

Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Author: Will Andrews <will@firepipe.net>

Closes #83
Closes #32


FreeBSD already had `zpool labelclear` functionality, so this is mostly
just a diff reduction.

MFC after: 1 month
eb06d29389f4f401f2ae07d826816af25221edca 08-Mar-2016 mav <mav@FreeBSD.org> MFV r296518: 5027 zfs large block support (add copyright)

Author: Matthew Ahrens <matt@mahrens.org>

b22a6e7b7cbc3af02b4a4bb6a5b49328d2fab47c 08-Mar-2016 markj <markj@FreeBSD.org> Fix fasttrap tracepoint locking.

Upstream, tracepoints are protected by per-CPU mutexes. An unlinked
tracepoint may be freed once all the tracepoint mutexes have been acquired
and released - this is done in fasttrap_mod_barrier(). This mechanism was
not properly ported: in some places, the proc lock is used in place of a
tracepoint lock, and in others the locking is omitted entirely. This change
implements tracepoint locking with an rmlock, where the read lock is used
in fasttrap probe context. As a side effect, this fixes a recursion on the
proc lock when the raise action is used from a userland probe.

MFC after: 1 month
778a34fa2a878f3f49f6793fe0729d5203e9539a 29-Jan-2016 br <br@FreeBSD.org> Welcome the RISC-V 64-bit kernel.

This is the final step required allowing to compile and to run RISC-V
kernel and userland from HEAD.

RISC-V is a completely open ISA that is freely available to academia
and industry.

Thanks to all the people involved! Special thanks to Andrew Turner,
David Chisnall, Ed Maste, Konstantin Belousov, John Baldwin and
Arun Thomas for their help.
Thanks to Robert Watson for organizing this project.

This project sponsored by UK Higher Education Innovation Fund (HEIF5) and
DARPA CTSRD project at the University of Cambridge Computer Laboratory.

FreeBSD/RISC-V project home: https://wiki.freebsd.org/riscv

Reviewed by: andrew, emaste, kib
Relnotes: Yes
Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
Differential Revision: https://reviews.freebsd.org/D4982
8fdcecf370f4a3c549ff1edc8228634b7d16b666 07-Dec-2015 markj <markj@FreeBSD.org> MFV r289003:
6271 dtrace caused excessive fork time

Author: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Gordon Ross <gwr@nexenta.com>

d58e8ed932a37bd1eefd67ead3dda35687d3a9f1 16-Oct-2015 mav <mav@FreeBSD.org> MFV r289310:
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>


This is only a partial merge of respective ZFS infrastructure changes.
At this moment FreeBSD kernel has no those crypto algorithms, so the
parts of the code to enable them are commented out. When they are
implemented, it will be trivial to plug them in.
c1c5997359a0319d0b6984203c5c3bb10c28274c 15-Oct-2015 mav <mav@FreeBSD.org> MFV r289312: 2605 want to resume interrupted zfs send

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>


For more info, see:
- slides http://www.slideshare.net/MatthewAhrens/openzfs-send-and-receive
- video https://www.youtube.com/watch?v=iY44jPMvxog
- manpage changes (for zfs resume -s and zfs send -t)
- upcoming talk at the OpenZFS Developer Summit

The TL;DR is:
Use "zfs receive -s" to save the partially received state on failure.
On failure, get the receive token with "zfs get receive_resume_token <fs>"
Resume the send with "zfs send -t <token_value>"

Relnotes: yes
2571010394cf1c85899a675a6bb4d59c31053614 30-Sep-2015 markj <markj@FreeBSD.org> MFV r288408:
6266 harden dtrace_difo_chunksize() with respect to malicious DIF


Reviewed by: Alex Wilson <alex.wilson@joyent.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Bryan Cantrill <bryan@joyent.com>

MFC after: 1 week
afe48155b2ef65e4e81f790242635b1d2f24cd78 13-Sep-2015 delphij <delphij@FreeBSD.org> MFV r287623: 5997 FRU field not set during pool creation and never

ZFS already supports storing the vdev FRU in a vdev property. There
is code in libzfs to work with this property, and there is code in
the zfs-retire FMA module that looks for that information. But there
is no code actually setting or updating the FRU.

To address this, ZFS is changed to send a handful of new events
whenever a vdev is added, attached, cleared, or onlined, as well
as when a pool is created or imported.

Note that syseventd is not currently available on FreeBSD and thus
some work is needed to actually support the new ZFS events (e.g. in
zfsd) to actually use this capability, this changeset is mostly a
diff reduction from upstream.


Illumos issues:

5997 FRU field not set during pool creation and never updated
db0a2c953d7c03e6526acbf76bee458a3e6543ce 04-Sep-2015 delphij <delphij@FreeBSD.org> Expose an interface to determine if an ACE is inherited.

Submitted by: sef
Reviewed by: trasz
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D3540
70d4d5a42d9036c0ff34c844811fa1aab3ab394c 01-Jul-2015 br <br@FreeBSD.org> First cut of DTrace for AArch64.

Reviewed by: andrew, emaste
Sponsored by: ARM Limited
Differential Revision: https://reviews.freebsd.org/D2738
760c76460f7eb6c928e542bf58558a780b210578 19-Jun-2015 avg <avg@FreeBSD.org> illums compat: use flsl/flsll for highbit/highbit64

Do that only when when fast inline versions are available.
At the moment that can be the case only in the kernel and not for all

The original code uses the binary search and that's kept as a fallback.
This is a micro optimization.

Differential Revision: https://reviews.freebsd.org/D2839
Reviewed by: delphij, mahrens, mav
MFC after: 17 days
cd7e51a58a1764246ac1651d361dfee32ec0ed46 17-Jun-2015 avg <avg@FreeBSD.org> Revert r284511 because it caused build failures on many platforms

The problem is that when inline versions of flsl and flsll are not
available, then libkern.h must be included for their declarations
in kernel sources.
The fix would be trivial, but I would like to figure out first if
it even makes sense to use the libkern provided implementations.

Reported by: bz
Pointyhat to: avg
1574191a793bc236ab98611af46ae42fb9a6241a 17-Jun-2015 avg <avg@FreeBSD.org> illumos compat: use flsl/flsll for highbit/highbit64

This is a micro optimization.
The upstream code uses the binary search.

Differential Revision: https://reviews.freebsd.org/D2839
Reviewed by: delphij, mav
MFC after: 15 days
1d6ffde4f49799c1d95f31875779237c214fc09c 08-Apr-2015 markj <markj@FreeBSD.org> libdtrace: add support for lazyload mode.

Passing "-x lazyload" to dtrace -G during compilation causes dtrace(1) to
not link drti.o into the output object file, so the USDT probes are not created
during process startup. Instead, dtrace(1) will automatically discover and
create probes on the process' behalf when attaching.

Differential Revision: https://reviews.freebsd.org/D2203
Reviewed by: rpaulo
MFC after: 1 month
3046fbea8a690127ea911892b988e9315dfb9e81 01-Apr-2015 andrew <andrew@FreeBSD.org> Add the arm64 defines for cddl code.

Differential Revision: https://reviews.freebsd.org/D2186
Reviewed by: emaste
Sponsored by: The FreeBSD Foundation
ff8a1038b7c85a17ecd238a8349b0d4f3e2b3143 05-Mar-2015 andrew <andrew@FreeBSD.org> Add the MD parts of dtrace needed to use fbt on ARM. For this we need to
emulate the instructions used in function entry and exit.

For function entry ARM will use a push instruction to push up to 16
registers to the stack. While we don't expect all 16 to be used we need to
handle any combination the compiler may generate, even if it doesn't make
sense (e.g. pushing the program counter).

On function return we will either have a pop or branch instruction. The
former is similar to the push instruction, but with care to make sure we
update the stack pointer and program counter correctly in the cases they
are either in the list of registers or not. For branch we need to take the
24-bit offset, sign-extend it, and add that number of 4-byte words to the
program counter. Care needs to be taken as, due to historical reasons, the
address the branch is relative to is not the current instruction, but 8
bytes later.

This allows us to use the following probes on ARM boards:
dtrace -n 'fbt::malloc:entry { stack() }'
dtrace -n 'fbt::free:return { stack() }'

Differential Revision: https://reviews.freebsd.org/D2007
Reviewed by: gnn, rpaulo
Sponsored by: ABT Systems Ltd
b9be305241681da7846cc34819d953dd56af8657 10-Feb-2015 gnn <gnn@FreeBSD.org> Initial version of DTrace on ARM32.

Submitted by: Howard Su based on work by Oleksandr Tymoshenko
Reviewed by: ian, andrew, rpaulo, markj
55c26d898b33386bf97f560dfeaf3b2cdc029c31 17-Jan-2015 smh <smh@FreeBSD.org> Mechanically convert cddl sun #ifdef's to illumos

Since the upstream for cddl code is now illumos not sun, mechanically
convert all sun #ifdef's to illumos #ifdef's which have been used in all
newer code for some time.

Also do a manual pass to correct the use if #ifdef comments as per style(9)
as well as few uses of #if defined(__FreeBSD__) vs #ifndef illumos.

MFC after: 1 month
Sponsored by: Multiplay
15bfe2d26274bcac738882fc5c83faa0d0eddabe 07-Dec-2014 avg <avg@FreeBSD.org> remove opensolaris cyclic code, replace with high-precision callouts

In the old days callout(9) had 1 tick precision and that was inadequate
for some uses, e.g. DTrace profile module, so we had to emulate cyclic
API and behavior. Now we can directly use callout(9) in the very few
places where cyclic was used.

Differential Revision: https://reviews.freebsd.org/D1161
Reviewed by: gnn, jhb, markj
MFC after: 2 weeks
4b9dde8b13f81a3d2e1a4371338b53caed6da4ee 06-Dec-2014 andrew <andrew@FreeBSD.org> Apply the same fix in r274697 to the ARM case.
9df66c4cfbcd68e625be05f247efa69e70165db8 06-Dec-2014 delphij <delphij@FreeBSD.org> MFV r275534:

Sync with Illumos. This have no effect to FreeBSD.

Illumos issue:
5285 pass in cpu_pause_func via pause_cpus

MFC after: 2 weeks
a561c01fd5a0f2a155942cf8902ee1b9712fdd31 06-Dec-2014 delphij <delphij@FreeBSD.org> MFC r275533:

Sync with Illumos. This have no effect to FreeBSD.

Illumos issue:
5100 sparc build failed after 5004

MFC after: 2 weeks
f38628840cd05515326dff20b71bf5c470d2a420 19-Nov-2014 dim <dim@FreeBSD.org> Fix the following -Werror warning from clang 3.5.0, while building cddl/lib/libctf:

In file included from cddl/contrib/opensolaris/common/ctf/ctf_create.c:31:
In file included from sys/cddl/contrib/opensolaris/uts/common/sys/sysmacros.h:34:
sys/cddl/contrib/opensolaris/uts/common/sys/isa_defs.h:334:9: warning: '_ILP32' macro redefined [-Wmacro-redefined]
#define _ILP32
<built-in>:26:9: note: previous definition is here
#define _ILP32 1
1 warning generated.

This is because clang 3.5.0 started predefining _ILP32 and __ILP32__ for
the i386 arch. (Earlier versions already predefined _LP64 and __LP64__
for the x86_64 arch.)

Reviewed by: emaste, avg, smh, delphij, markj
Differential Revision: https://reviews.freebsd.org/D1187
fe03d9b9d26fcae476fe7e7fa57516b0fa684089 10-Nov-2014 delphij <delphij@FreeBSD.org> MFV r274273:

ZFS large block support.

Please note that booting from datasets that have recordsize greater
than 128KB is not supported (but it's Okay to enable the feature on
the pool). This *may* remain unchanged because of memory constraint.

Limited safety belt is provided for mounted root filesystem but use
caution is advised.

Illumos issue:
5027 zfs large block support

MFC after: 1 month
0584337fe8ddb1beb1e8a06c8696fb4e8252aeeb 25-Oct-2014 jpaetzel <jpaetzel@FreeBSD.org> This change addresses 4 bugs in ZFS exposed by Richard Kojedzinszky's
crash.sh script attached to FreeNAS bug 4109:

Three are in the snapshot layer:
a) AVG explains in his notes: https://wiki.freebsd.org/AvgVfsSolarisVsFreeBSD

"VOP_INACTIVE must not do any destructive actions to a vnode
and its filesystem node, nor invalidate them in any way."
gfs_vop_inactive and zfsctl_snapshot_inactive did just that. In
OpenSolaris VOP_INACTIVE is much closer to FreeBSD's VOP_RECLAIM.
Rename & move them to gfs_vop_reclaim and zfsctl_snapshot_reclaim
and merge in the requisite vnode_destroy from zfsctl_common_reclaim.

b) gfs_lookup_dot and various zfsctl functions do not honor the
FreeBSD VFS convention of only locking from the root downward. When
looking up ".." the convention is to drop the current leaf vnode lock before
acquiring the directory vnode and then subsequently re-acquiring the lock on the
leaf vnode. This fixes that in all the places that our exercised by crash.sh.

c) The snapshot may already be unmounted when the directory vnode is reclaimed.
Check for this case and return.

One in the common layer:
d) Callers of traverse expect the reference to the vnode passed in to be
maintained. Don't release it.

This last one may be an unclear contract. There may in fact be some callers that
do expect the reference to be dropped on success in addition to callers that
expect it to be released. In this case a further audit of the callers is needed
and a consensus on the correct behavior.

PR: 184677
Submitted by: kmacy
Reviewed by: delphij, will, avg
MFC after: 2 weeks
Sponsored by: iXsystems
626a49e1d6e2322242208e872046c6f5d7b742b4 22-Aug-2014 delphij <delphij@FreeBSD.org> MFV r270197:

Illumos issue:
5066 remove support for non-ANSI compilation
5068 Remove SCCSID() macro from <macros.h>

MFC after: 2 weeks
b248e9b18fdaada9f61497e89c6ba18345afa577 20-Aug-2014 delphij <delphij@FreeBSD.org> MFV r270193:

Illumos issues:
5042 stop using deprecated atomic functions

MFC after: 2 weeks
1747d793b5742aae99af6a1caf438bbb5e76b892 29-Jul-2014 delphij <delphij@FreeBSD.org> MFV r269223:

Change dn->dn_dbufs from linked list to AVL tree.

Illumos issues:
4873 zvol unmap calls can take a very long time for larger datasets

MFC after: 2 weeks
0f4faf42cbb855fd26557197bdcea66a237af96c 26-Jul-2014 delphij <delphij@FreeBSD.org> MFV r269010:

Import Illumos changes to address the following Illumos issues:
4976 zfs should only avoid writing to a failing non-redundant
top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature
is enabled
4984 device selection should use fragmentation metric

MFC after: 2 weeks
9f28abd980752efcf77578cd494f1015083c2a2b 07-Jul-2014 marcel <marcel@FreeBSD.org> Remove ia64.

This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation

This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h

Discussed at: BSDcan
af9b8de02039f02f632a331e8c35a42bdad7e072 03-Jul-2014 pfg <pfg@FreeBSD.org> Merge from OpenSolaris (20-Apr-2008):

6822482 DOF validation needs to handle loadable sections flagged as unloadable

MFC after: 1 week
ff95e6ff7f15e80345f114ebf0db450c0e226fd8 01-Jul-2014 delphij <delphij@FreeBSD.org> MFV r268122:

4929 want prevsnap property


MFC after: 2 weeks
0d7d731791d57691c640cdbdae4769591b0f68eb 01-Jul-2014 delphij <delphij@FreeBSD.org> MFV r267566:

4390 i/o errors when deleting filesystem/zvol can lead to space map corruption

MFC after: 2 weeks
da67e0c76e97f98e65b74852e2cf173c344ea29f 26-Jun-2014 rpaulo <rpaulo@FreeBSD.org> MFV illumos

4471 DTrace count() with histogram
4472 DTrace full width distribution histograms
4473 DTrace frequency trails

MFC after: 2 weeks
8c0e49065f210b6d5f3f7eae268cca2d60c727ba 26-Jun-2014 rpaulo <rpaulo@FreeBSD.org> MFV illumos

4474 DTrace Userland CTF Support
4475 DTrace userland Keyword
4476 DTrace tests should be better citizens
4479 pid provider types
4480 dof emulation is missing checks

MFC after: 2 weeks
c6d679e670d0710831f1978b7d379e5cec49186a 26-Jun-2014 rpaulo <rpaulo@FreeBSD.org> MFV illumos

4477 DTrace should speak JSON

MFC after: 2 weeks
3191dbe25d77ee4d9f61070776539a6446d7d778 26-Jun-2014 rpaulo <rpaulo@FreeBSD.org> MFV illumos r266986:

2915 DTrace in a zone should see "cpu", "curpsinfo", et al
2916 DTrace in a zone should be able to access fds[]
2917 DTrace in a zone should have limited provider access

MFC after: 2 weeks
37b311bee5d13e596aac0304bd2813b6de2fd12a 26-Jun-2014 rpaulo <rpaulo@FreeBSD.org> Revert r267898.
aedbcf436f4de9b9ce6c1f3c8c59c7f227272461 26-Jun-2014 rpaulo <rpaulo@FreeBSD.org> Bring the following change from the illumos-joyent repository:

commit 78e24ab6803bbe11ba37642624e1498ede5b239d
Author: Bryan Cantrill <bryan@joyent.com>
Date: Thu Oct 31 01:20:54 2013

OS-1688 DTrace count() with histogram
OS-2360 DTrace full width distribution histograms
OS-2361 DTrace frequency trails

MFC after: 2 weeks
f4fc0f344494005046953951113833a269f39583 23-Jun-2014 markj <markj@FreeBSD.org> Fix some bugs when fetching probe arguments in i386. Firstly ensure that
the 4 byte-aligned dtrace_invop_callsite can be found and that it
immediately follows the call to dtrace_invop(). Secondly, fix some pointer
arithmetic to account for differences between struct i386_frame and illumos'
struct frame. Finally, ensure that dtrace_getarg() isn't inlined. It works
by following a fixed number of frame pointers to the probe site, so inlining
breaks it.

MFC after: 3 weeks
78ba1760897362f983b51d564ae721e7355af226 27-May-2014 delphij <delphij@FreeBSD.org> MFV r266766:

Add a new zfs property, "redundant_metadata" which can have values "all" or
"most". The default will be "all", which is the current behavior. When set
to all, ZFS stores an extra copy of all metadata. If a single on-disk block
is corrupt, at worst a single block of user data (which is recordsize bytes
long) can be lost.

Setting to "most" will cause us to only store 1 copy of level-1 indirect
blocks of user data files. This can improve performance of random writes,
because less metadata has to be written. In practice, at worst about
100 blocks (of recordsize bytes each) of user data can be lost if a single
on-disk block is corrupt.

The exact behavior of which metadata blocks are stored redundantly may change
in future releases.

Illumos issue: 3835 zfs need not store 2 copies of all metadata

MFC after: 2 weeks
4d48e09fd736c5e6c154af2a2c4af1557f6a17ad 23-Apr-2014 delphij <delphij@FreeBSD.org> MFV r264830:

4745 fix AVL code misspellings

MFC after: 2 weeks
4a1b97c2c074b95c815a8d7596b8858e2b7e3177 23-Apr-2014 delphij <delphij@FreeBSD.org> MFV r264829:

3897 zfs filesystem and snapshot limits

MFC after: 2 weeks
3344ccfa375a376875af16049d67758b63e930ef 18-Apr-2014 delphij <delphij@FreeBSD.org> MFV r264666:

4374 dn_free_ranges should use range_tree_t


MFC after: 2 weeks
801a01a4411b8ecb97fe66c0a050bf5d07d85aef 14-Apr-2014 markj <markj@FreeBSD.org> DTrace's pid provider works by inserting breakpoint instructions at probe
sites and installing a hook at the kernel's trap handler. The fasttrap code
will emulate the overwritten instruction in some common cases, but otherwise
copies it out into some scratch space in the traced process' address space
and ensures that it's executed after returning from the trap.

In Solaris and illumos, this (per-thread) scratch space comes from some
reserved space in TLS, accessible via the fs segment register. This
approach is somewhat unappealing on FreeBSD since it would require some
modifications to rtld and jemalloc (for static TLS) to ensure that TLS is
executable, and would thus introduce dependencies on their implementation
details. I think it would also be impossible to safely trace static binaries
compiled without these modifications.

This change implements the functionality in a different way, by having
fasttrap map pages into the target process' address space on demand. Each
page is divided into 64-byte chunks for use by individual threads, and
fasttrap's process descriptor struct has been extended to keep track of
any scratch space allocated for the corresponding process.

With this change it's possible to trace all libc functions in a program,
e.g. with

pid$target:libc.so.*::entry {@[probefunc] = count();}

Previously this would generally cause the victim process to crash, as
tracing memcpy on amd64 requires the functionality described above.

Tested by: Prashanth Kumar <pra_udupi@yahoo.co.in> (earlier version)
MFC after: 6 weeks
dd28e6bbc0ebb901c5b00356af4645a9bc9c7b3b 05-Apr-2014 mav <mav@FreeBSD.org> Add property and sysctl to control how ZVOLs are exposed to OS.

New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL
between three modes:
geom -- existing fully functional behavior (default);
dev -- exposing volumes only as raw disk device file in devfs;
none -- not exposing volumes outside ZFS.

The "dev" mode is less functional (can't be partitioned, mounted, etc),
but it is faster, and in some scenarios with untrusted consumers safer.
It can be useful for NAS, VM block storages, etc.
The "none" mode may be convenient for backup servers, etc. that don't
need direct data access.

Due to the way ZVOL is integrated with main ZFS code, those property
and sysctl are checked only during pool import and volume creation.

MFC after: 1 month
Sponsored by: iXsystems, Inc.
513727776172c040af6076666ba5249bdf9ed97e 02-Jan-2014 delphij <delphij@FreeBSD.org> MFV r260154 + 260182:

4369 implement zfs bookmarks
4368 zfs send filesystems from readonly pools


MFC after: 2 weeks
0d764663c2221bb1f9d8c4a5b157d2af7e8b1978 31-Dec-2013 markj <markj@FreeBSD.org> Revert r260091. The vmem calls seem to be slower than the *_unr() calls that
they replaced, which is important considering that probe IDs are allocated
during process startup for USDT probes.
27bf80971b2838a6a0a65f9c11927dfe3fcda189 30-Dec-2013 markj <markj@FreeBSD.org> Now that vmem(9) is available, use vmem arenas to allocate probe and
aggregation IDs, as is done in the upstream illumos code. This still
requires some FreeBSD-specific code, as our vmem API is not identical to the
one in illumos.

Submitted by: Mike Ma <mikemandarine@gmail.com>
d1f2a864011f62139cc89c64017b43cac5af042c 26-Nov-2013 avg <avg@FreeBSD.org> 734 taskq_dispatch_prealloc() desired

943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP

Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct struct ostask to taskq_ent_t, so that code differences will be

After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.

Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.

MFC after: 8 days
a1e8ebde3f089775556dc37214fb2b363e35a475 26-Nov-2013 avg <avg@FreeBSD.org> opensolaris taskq: some cosmetic changes

- drop trailing whitespace
- remove redundant "extern" from function declarations
- remove unused macro

MFC after: 1 week
19a7950d1dd7d19d09bbc5bcf5d6ab6f22e5005e 18-Nov-2013 markj <markj@FreeBSD.org> The fasttrap ioctl used to create probes takes a variable-sized argument.
It was not being correctly copied into the kernel on FreeBSD, and as a
result, probes with multiple probe sites were not being created properly.
To fix this, change the ioctl definition so that the fasttrap ioctl handler
is responsible for copying in userland data.

Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in>
MFC after: 1 month
2547e15155fd59763fe0345fe46b8a6735448230 05-Nov-2013 markj <markj@FreeBSD.org> Use suword32 and suword64 instead of copyout(9). This fixes a bug in the
emulation of the call instruction caused by reversing the uaddr and kaddr
arguments when copying data out to userland: the suword* functions take the
uaddr as the first argument whereas copyout(9) takes the kaddr as the first
argument. This also partially undoes the fixes from r257143.

Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in> (original version)
MFC after: 1 month
3ecc6f129801776dd571d69cf9a262a97ad23968 16-Oct-2013 markj <markj@FreeBSD.org> Add a function, memstr, which can be used to convert a buffer of
null-separated strings to a single string. This can be used to print the
full arguments of a process using execsnoop (from the DTrace toolkit) or
with the following one-liner:

dtrace -n 'syscall::execve:return {trace(curpsinfo->pr_psargs);}'

Note that this relies on the process arguments being cached via the struct
proc, which means that it will not work for argvs longer than
kern.ps_arg_cache_limit. However, the following rather non-portable
script can be used to extract any argv at exec time:

printf("%s", memstr(args[1]->begin_argv, ' ',
args[1]->begin_envv - args[1]->begin_argv));

The debug.dtrace.memstr_max sysctl limits the maximum argument size to
memstr(). Thanks to Brendan Gregg for helpful comments on freebsd-dtrace.

Tested by: Fabian Keil (earlier version)
MFC after: 2 weeks
5017a032d282b2ecdd81495733cc73657ff2b8fb 23-Aug-2013 delphij <delphij@FreeBSD.org> MFV r254422:

Illumos DTrace issues:
3089 want ::typedef
3094 libctf should support removing a dynamic type
3095 libctf does not validate arrays correctly
3096 libctf does not validate function types correctly
bd47afb2893407b47a1dc3c9165507d2aa77b5b7 21-Aug-2013 gibbs <gibbs@FreeBSD.org> Enhance the ZFS vdev layer to maintain both a logical and a physical
minimum allocation size for devices. Use this information to
automatically increase ZFS's minimum allocation size for new top-level
vdevs to a value that more closely matches the optimum device
allocation size.

Use GEOM's stripesize attribute, if set, as the physical sector
size of the GEOM.

Calculate the minimum blocksize of each metaslab class. Use the
calculated value instead of SPA_MINBLOCKSIZE (512b) when determining
the likelyhood of compression yeilding a reduction in physical space

Report devices with sub-optimal block size configuration in "zpool
status". Also properly fail attempts to attach devices with a
logical block size greater than 8kB, since this will cause corruption
to ZFS's label area.

Sponsored by: Spectra Logic Corporaion
MFC after: 2 weeks

Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

1) Existing pools created with devices that have different logical
and physical block sizes, but were configured to use the logical
block size (e.g. because the OS version used for pool construction
reported the logical block size instead of the physical block
size) will suddenly find that the vdev allocation size has
increased. This can be easily tolerated for active members of
the array, but ZFS would prevent replacement of a vdev with
another identical device because it now appears that the smaller
allocation size required by the pool is not supported by the new

2) The device's physical block size may be too large to be supported
by ZFS. The optimal allocation size for the vdev may be quite
large. For example, a RAID controller may export a vdev that
requires read-modify-write cycles unless accessed using 64k
aligned/sized requests. ZFS currently has an 8k minimum block
size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are

Add the SPA_ASHIFT constant. ZFS currently has a hard upper
limit of 13 (8k) for ashift and this constant is used to
both document and enforce this limit.

Add the VDEV_AUX_ASHIFT_TOO_BIG error code.

Add fields for exporting the configured, logical, and
physical ashift to the vdev_stat_t structure.

Add VDEV_STAT_VALID() macro which can be used to verify the
presence of required vdev_stat_t fields in nvlist data.

Provide a SYSCTL_PROC handler for "max_auto_ashift". Since
the limit is only referenced long after boot when a create
operation occurs, there's no compelling need for it to be
a boot time configurable tunable. This also allows the
validation code for the max_auto_ashift value to be contained
within the sysctl handler.

Populate the new fields in the vdev_stat_t structure.

Fail vdev opens if the vdev reports an ashift larger than

Propogate vdev_logical_ashift and vdev_physical_ashift between
child and parent vdevs as is done for vdev_ashift.

In vdev_open(), restore code that fails opens for devices
where vdev_ashift grows. This can only happen now if the
device's logical ashift grows, which means it really isn't
safe to use the device.

Update the vdev_open() API so that both logical (what was
just ashift before) and physical ashift are reported.

Add two new fields, vdev_physical_ashift and vdev_logical_ashift,
to vdev_t.

Add vdev_ashift_optimize(). Call it anytime a new top-level
vdev is allocated.

Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.

For each sub-optimally configured leaf vdev, report configured
and native block sizes.

Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT.
This status is reported on healthy pools containing vdevs
configured to use a block size smaller than their reported
physical block size.

Update find_vdev_problem() and supporting functions to
provide the full vdev_stat_t structure to problem checking
routines, and to allow decent into replacing vdevs.

Add a vdev_non_native_ashift() validator which is used on
the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.

Enhance sysctl userland stubs now that a SYSCTL_PROC handler
is used in vdev.c.

When the group membership of a metaslab class changes (i.e.
when a vdev is added or removed from a pool), walk the group
list to determine the smallest block size currently available
and record this in the metaslab class.

Add the metaslab_class_get_minblocksize() accessor.

In zio_compress_data(), take the minimum blocksize as an
input parameter instead of assuming SPA_MINBLOCKSIZE.

In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum
blocksize of the device. The l2arc code performs has it's own
code for deciding if compression is worth while, so this
effectively disables zio_compress_data() from second guessing
the original decision.

In zio_write_bp_init(), use the minimum blocksize of the
normal metaslab class when compressing data.
5a3f78714c72773b5d7b845e2cfa02dd8e9b2d94 13-Aug-2013 markj <markj@FreeBSD.org> FreeBSD's DTrace implementation has a few problems with respect to handling
probes declared in a kernel module when that module is unloaded. In

* Unloading a module with active SDT probes will cause a panic. [1]
* A module's (FBT/SDT) probes aren't destroyed when the module is unloaded;
trying to use them after the fact will generally cause a panic.

This change fixes both problems by porting the DTrace module load/unload
handlers from illumos and registering them with the corresponding
EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all
probes defined in a module when that module is unloaded, and to prevent a
module unload from proceeding if some of its probes are active. The latter
problem has already been fixed for FBT probes by checking lf->nenabled in
kern_kldunload(), but moving the check into the DTrace framework generalizes
it to all kernel providers and also fixes a race in the current
implementation (since a probe may be activated between the check and the
call to linker_file_unload()).

Additionally, the SDT implementation has been reworked to define SDT
providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT
to create and destroy SDT probes when a module is loaded or unloaded. This
simplifies things quite a bit since it means that pretty much all of the SDT
code can live in sdt.ko, and since it becomes easier to integrate SDT with
the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible
in that SDT providers spanning multiple modules can be created on the fly
when a module is loaded; at the moment it looks like illumos' SDT
implementation requires all SDT probes to be statically defined in a single
kernel table.

PR: 166927, 166926, 166928
Reported by: davide [1]
Reviewed by: avg, trociny (earlier version)
MFC after: 1 month
ccc3c4970e788379d0e1e21c2a8a5a56b6fb6d74 08-Aug-2013 delphij <delphij@FreeBSD.org> MFV r254079:

Illumos ZFS issues:
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if
physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting
c5affee6a3a59d69c44d3493fa98bec9080bb11f 30-Jul-2013 delphij <delphij@FreeBSD.org> MFV r253781 + r253871:

Illumos ZFS issues:
3894 zfs should not allow snapshot of inconsistent dataset

MFC after: 2 weeks
46c768b6a7fd9d51fc2b9385b56f30e852d00811 11-Jun-2013 delphij <delphij@FreeBSD.org> MFV r251626:

ZFS event processing should work on R/O root filesystems

Illumos ZFS issues:
3749 zfs event processing should work on R/O root filesystems

MFC after: 2 weeks
8f40d761c17b416a0ae225d630d4b7eeaf84ca6a 12-May-2013 markj <markj@FreeBSD.org> Bring back part of r249367 by adding DTrace's temporal option, which allows
users to guarantee that the output of DTrace scripts will be time-ordered.
This option is enabled by adding the line

#pragma D option temporal

to the beginning of a script, or by adding '-x temporal' to the arguments of

This change fixes a bug in the original port of the temporal option. This
bug was causing some assertions to fail, so they had been disabled; in this
revision the assertions are working properly and are enabled.

The DTrace version number has been bumped from 1.9.0 to 1.9.1 to reflect
the language change that's being introduced.

This change corresponds to part of illumos-gate commit e5803b76927480:
3021 option for time-ordered output from dtrace(1M)

Reviewed by: pfg
Obtained from: illumos
MFC after: 1 month
f9796e10e2b7bb43bf358c337b15bb183a6b1aaf 17-Apr-2013 pfg <pfg@FreeBSD.org> DTrace: Revert r249367

The following change from illumos brought caused DTrace to
pause in an interactive environment:

3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This was not detected during testing because it doesn't
affect scripts.

We shouldn't be changing the environment, especially since the
LD_NOLAZYLOAD option doesn't apply to our (GNU) ld.
Unfortunately the change from upstream was made in such a way
that it is very difficult to separate this change from the
others so, at least for now, it's better to just revert


Reported by: Navdeep Parhar and Mark Johnston
2554e81a3708e57166b2d06fe3c4fe86ad8b9336 11-Apr-2013 pfg <pfg@FreeBSD.org> DTrace: option for time-ordered output

Merge changes from illumos:

3021 option for time-ordered output from dtrace(1M)
3022 DTrace: keys should not affect the sort order when sorting by value
3023 it should be possible to dereference dynamic variables
3024 D integer narrowing needs some work
3025 register leak in D code generation
3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This brings yet another feature implemented in upstream DTrace.
A complete description is available here:

This change bumps the DT_VERS_* number to 1.9.1 in
accordance to what is done in illumos.

This change was somewhat complicated because upstream is mixed many
changes in an individual commit and some of the tests don't really
apply to us.

There are also appear to be differences in timestamping with Solaris
so we had to workaround some assertions making sure no regression

Special thanks to Fabian Keil for changes and testing.

Illumos Revisions: 13758:23432da34147


Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 months
50c97d79d92898d42c89c511f1d7b5971e269fc6 01-Apr-2013 pfg <pfg@FreeBSD.org> Dtrace: enablings on defunct providers prevent providers from unregistering

Merge change from illumos:

1368 enablings on defunct providers prevent providers from unregistering

We try to address some underlying differences between the Solaris
and FreeBSD implementations: dtrace_attach() / dtrace_detach() are
currently unimplemented in FreeBSD but the new code from illumos
makes use of taskq so some adaptations were made to dtrace_open()
and dtrace_close() to handle them appropriately.

Illumos Revision: r13430:8e6add739e38


Reviewed by: gnn
Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 3 weeks
fe83fcce74a0fc5b36b0b3aa414040f8d88c3a61 25-Mar-2013 pfg <pfg@FreeBSD.org> Dtrace: add toupper()/tolower() and enhancements to lltostr().

Merge changes from illumos:

1451 DTrace needs toupper()/tolower() subroutines
1457 lltostr() D subroutine should take an optional base

This change bumps the DT_VERS_* number to 1.8.1 in
accordance to what is done in illumos.

The test suite we currently include is outdated and
doesnt support some updates in tst.subr.d which had to
be left out for now.

Illumos Revisions: r13458 5e394d8db762
r13459 c3454574dd1a


Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 month
61cbaa2ebdd1625615b379ecfa0b9b2e369a7139 24-Mar-2013 pfg <pfg@FreeBSD.org> Dtrace: add optional size argument to tracemem().

Merge change from illumos:

1455 DTrace tracemem() should take an optional size argument

Our local enhancements to dt_print_bytes were equivalent to
those in illumos but we made it match the illumos version
to ease further code merges.

For now leave out tst.smallsize.d and tst.smallsize.d.out
since those don't seem to work cleanly on FreeBSD.

This change bumps the DT_VERS_* number to 1.7.1 in accordance
to what is done in illumos.

Illumos Revision: 13457:571b0355c2e3


Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 month
7c87858955593be19a80e57b7353b09f5587ae9b 19-Mar-2013 mm <mm@FreeBSD.org> MFV r247580:
Merge synctask code restructuring from vendor.

Modify forward and backward compatibility to support new change.

Illumos ZFS issues:
3464 zfs synctask code needs restructuring

Sponsored by: Hybrid Logic Ltd.
09f7f9e4ff3041f8e236af6af32f05f1d181e23d 18-Mar-2013 mm <mm@FreeBSD.org> MFC @248461
7b62f31cdf0ac5e5b9e97e38a21caf00e4897a45 18-Mar-2013 jhibbits <jhibbits@FreeBSD.org> Add FBT for PowerPC DTrace. Also, clean up the DTrace assembly code,
much of which is not necessary for PowerPC.

The FBT module can likely be factored into 3 separate files: common,
intel, and powerpc, rather than duplicating most of the code between
the x86 and PowerPC flavors.

All DTrace modules for PowerPC will be MFC'd together once Fasttrap is
99f883783c89d10b9d02e77393eb9d0a307f3ca0 05-Mar-2013 mm <mm@FreeBSD.org> WiP merge of libzfs_core (MFV r238590, r238592)
not yet working, ioctl handling needs to be changed
2bed8f5691b7572d6f31bfc68e9f762a938af863 01-Mar-2013 mm <mm@FreeBSD.org> MFV r247316:
Merge new read-only zfs properties from vendor (illumos)

Illumos ZFS issues:
3588 provide zfs properties for logical (uncompressed) space used and


MFC after: 2 weeks
1409e8df208328336c5d9a8943bd8b7f1d42d12e 13-Nov-2012 kib <kib@FreeBSD.org> Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month
063b8917f2cb6a041f16c02ed902756ef7313642 12-Sep-2012 mm <mm@FreeBSD.org> Merge recent zfs vendor changes, sync code and adjust userland DEBUG.

Illumos issued covered:
1884 Empty "used" field for zfs *space commands
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument
is zero
3028 zfs {group,user}space -n prints (null) instead of numeric GID/UID
3048 zfs {user,group}space [-s|-S] is broken
3049 zfs {user,group}space -t doesn't really filter the results
3060 zfs {user,group}space -H output isn't tab-delimited
3061 zfs {user,group}space -o doesn't use specified fields order
3064 usr/src/cmd/zpool/zpool_main.c misspells "successful"
3093 zfs {user,group}space's -i is noop
3098 zfs userspace/groupspace fail without saying why when run as non-root

https://www.illumos.org/issues/ + [issue_id]

Obtained from: illumos (vendor/illumos, vendor/illumos-sys)
MFC after: 2 weeks
732e623063e898acb15e1526337c4186a02c10d7 10-Sep-2012 mm <mm@FreeBSD.org> Add assfail() and assfail3() to the opensolaris module.
Remove obsoleted intermediate cddl/compat/opensolaris/sys/debug.h.

MFC after: 2 weeks
cf6d705d5b2f458f6a07c3c33eac90bca667108a 30-Jul-2012 mm <mm@FreeBSD.org> Partial MFV (illumos-gate 13753:2aba784c276b)
2762 zpool command should have better support for feature flags


MFC after: 2 weeks
bc23332ce7bed6bcbb1db9452d848c63a713e7c2 27-Jun-2012 pfg <pfg@FreeBSD.org> Bring llquantize support into Dtrace.

Bryan Cantrill implemented the equivalent of semi-log graph
paper for Dtrace so llquantize will use one logarithmic and
one linear scale.

Special thanks to Mark Peek for providing fix to an
assertion and to Fabian Keill for testing the port.

Illumos Revision: 13355:15b74a2a9a9d


Obtained from: Illumos
Tested by: Fabian Keill, mp
MFC after: 4 days
cc61ab2f133566dca51970d44cc49a4355039b5d 11-Jun-2012 mm <mm@FreeBSD.org> Introduce "feature flags" for ZFS pools (bump SPA version to 5000).
Add first feature "com.delphix:async_destroy" (asynchronous destroy
of ZFS datasets).
Implement features support in ZFS boot code.

Illumos revisions merged:
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags


Obtained from: illumos (issue #2619, #2747)
MFC after: 1 month
38ef930079cbb1df5de9df3c1064426dba3976b1 27-May-2012 mm <mm@FreeBSD.org> Import illumos changeset 13570:3411fd5f1589
1948 zpool list should show more detailed pool information

Display per-vdev information with "zpool list -v".
The added expandsize property has currently no value on FreeBSD.
This changeset allows adding expansion support to individual vdevs
in the future.


Obtained from: illumos (issue #1948)
MFC after: 2 weeks
046ff8962602e8d65b6b3fae48573513ab7e433f 10-May-2012 mm <mm@FreeBSD.org> Import illumos changeset 13686:4bc0783f6064
2703 add mechanism to report ZFS send progress

If the zfs send command is used with the -v flag, the amount of bytes
transmitted is reported in per second updates.


Obtained from: illumos (issue #2703)
MFC after: 2 weeks
ad032fccda1eefdb6355b59b79560ed68ca81e68 26-Apr-2012 rstone <rstone@FreeBSD.org> Implement the D "cpu" variable, which returns curcpu. I have chosen not
to follow the example of OpenSolaris and its descendants, which implemented
cpu as an inline that took a value out of curthread. At certain points in
the FreeBSD scheduler curthread->td_oncpu will no longer be valid (in
particukar, just before the thread gets descheduled) so instead I have
implemented this as its own built-in variable.

Sponsored by: Sandvine Inc.
MFC after: 1 week
73cab7197174153a718dedc97ff344341bcf6098 28-Nov-2011 mm <mm@FreeBSD.org> Merge new ZFS features from illumos:

1644 add ZFS "clones" property

1645 add ZFS "written" and "written@..." properties

1646 "zfs send" should estimate size of stream

1647 "zfs destroy" should determine space reclaimed by destroying multiple

1693 persistent 'comment' field for a zpool

1708 adjust size of zpool history data

1748 desire support for reguid in zfs

Obtained from: illumos (changesets 13514, 13524, 13525)
MFC after: 1 month
c5160d4717e9b92608f5ef6d4304a004dc271bfc 18-Jul-2011 mm <mm@FreeBSD.org> Resurrect the ZFS "aclmode" property
Change default of "aclmode" to "discard".

Illumos-gate changeset: 13370:8c04143bd318

Obtained from: Illumos (Feature #742)
MFC after: 2 weeks
333e34b938a2b5e7b036e02b36408880b415109c 28-Jun-2011 mm <mm@FreeBSD.org> Add a new "REFCOMPRESSRATIO" property.

For snapshots, this is the same as COMPRESSRATIO, but for
filesystems/volumes, the COMPRESSRATIO is based on the data "USED" (ie,
includes blocks in children, but not blocks shared with the origin).

This is needed to figure out how much space a filesystem would use if it
were not compressed (ignoring snapshots).

Illumos-gate revision: 13387

Obtained from: Illumos (Feature #1092)
MFC after: 2 weeks
2071e3510abcb0d23655e9ec6f21ded8a0d7fa8a 18-Jun-2011 benl <benl@FreeBSD.org> Fix clang warnings.

Approved by: philip (mentor)
1b03c5bf41222b723415638f03e00ed12cac076a 27-Feb-2011 pjd <pjd@FreeBSD.org> Finally... Import the latest open-source ZFS version - (SPA) 28.

Few new things available from now on:

- Data deduplication.
- Triple parity RAIDZ (RAIDZ3).
- zfs diff.
- zpool split.
- Snapshot holds.
- zpool import -F. Allows to rewind corrupted pool to earlier
transaction group.
- Possibility to import pool in read-only mode.

MFC after: 1 month
a2f2f93652c6110c2d126720bc696a507347df3c 09-Sep-2010 rpaulo <rpaulo@FreeBSD.org> Fix two bugs in DTrace:
* when the process exits, remove the associated USDT probes
* when the process forks, duplicate the USDT probes.

Sponsored by: The FreeBSD Foundation
a5c8d0424be1b289a1d47240266c6ad9ff009f58 28-Aug-2010 mm <mm@FreeBSD.org> Import changes from OpenSolaris that provide
- better ACL caching and speedup of ACL permission checks
- faster handling of stat()
- lowered mutex contention in the read/writer lock (rrwlock)
- several related bugfixes

Detailed information (OpenSolaris onnv changesets and Bug IDs):

6802734 Support for Access Based Enumeration (not used on FreeBSD)
6844861 inconsistent xattr readdir behavior with too-small buffer

6848431 zfs with rstchown=0 or file_chown_self privilege allows user to "take" ownership

6775100 stat() performance on files on zfs should be improved
6827779 rrwlock is overly protective of its counters

6857433 memory leaks found at: zfs_acl_alloc/zfs_acl_node_alloc
6860318 truncate() on zfsroot succeeds when file has a component of its path set without access permission

6865875 zfs sometimes incorrectly giving search access to a dir

6867395 zpool_upgrade_007_pos testcase panic'd with BAD TRAP: type=e (#pf Page fault)

6868276 zfs_rezget() can be hazardous when znode has a cached ACL

6870564 panic in zfs_getsecattr

Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 weeks
8111273b840f5fb9e4e4739f35d615a84b6e45fd 22-Aug-2010 rpaulo <rpaulo@FreeBSD.org> Port this to FreeBSD. We miss some suword functions, so we use copyout.

Sponsored by: The FreeBSD Foundation
> Description of fields to fill in above: 76 columns --|
> PR: If a GNATS PR is affected by the change.
> Submitted by: If someone else sent in the change.
> Reviewed by: If someone else reviewed your modification.
> Approved by: If you needed approval for this commit.
> Obtained from: If the change is from a third party.
> MFC after: N [day[s]|week[s]|month[s]]. Request a reminder email.
> Security: Vulnerability reference (one per line) or description.
> Empty fields above will be automatically removed.

M sys/fasttrap_impl.h
2b32a31ca3484e23ff35b71b7349d361ba187ee3 22-Aug-2010 rpaulo <rpaulo@FreeBSD.org> Kernel DTrace support for:
o uregs (sson@)
o ustack (sson@)
o /dev/dtrace/helper device (needed for USDT probes)

The work done by me was:
Sponsored by: The FreeBSD Foundation
916a9e33254f1064a5588724dd63546e3754d246 22-Aug-2010 rpaulo <rpaulo@FreeBSD.org> Add the FreeBSD definition for the fasttrap ioctls.

Sponsored by: The FreeBSD Foundation
b8de09809273e0633438b2039f3fa1d273f5dcff 21-Aug-2010 rpaulo <rpaulo@FreeBSD.org> Port the DTrace helper ioctls to FreeBSD and add a helper member to
dof_helper_t (needed by drti.o).

Sponsored by: The FreeBSD Foundation
48183e9b238a8ed57106391ce5fe952483664d5e 19-Aug-2010 imp <imp@FreeBSD.org> First cut at mips n64 ABI support
b2946e89348042300795fce8f0b12a01250541df 12-Jul-2010 mm <mm@FreeBSD.org> Merge ZFS version 15 and almost all OpenSolaris bugfixes referenced
in Solaris 10 updates 141445-09 and 142901-14.

Detailed information:
(OpenSolaris revisions and Bug IDs, Solaris 10 patch numbers)

6755435 zfs_open() and zfs_close() needs to use ZFS_ENTER/ZFS_VERIFY_ZP (141445-01)

6748436 inconsistent zpool.cache in boot_archive could panic a zfs root filesystem upon boot-up (141445-01)

6740164 zpool attach can create an illegal root pool (141909-02)

6769612 zpool_import() will continue to write to cachefile even if altroot is set (N/A)

6757430 want an option for zdb to disable space map loading and leak tracking (141445-01)

6542860 ASSERT: reason != VDEV_LABEL_REMOVE||vdev_inuse(vd, crtxg, reason, 0) (141445-01)

6761100 want zdb option to select older uberblocks (141445-01)

6774886 zfs_setattr() won't allow ndmp to restore SUNWattr_rw (141445-01)

6737463 panic while trying to write out config file if root pool import fails (141445-01)

6765294 Refactor replay (141445-01)

6572357 libzfs should do more to avoid mnttab lookups (141909-01)
6572376 zfs_iter_filesystems and zfs_iter_snapshots get objset stats twice (141909-01)

6328632 zpool offline is a bit too conservative (141445-01)
6739487 ASSERT: txg <= spa_final_txg due to scrub/export race (141445-01)
6767129 ASSERT: cvd->vdev_isspare, in spa_vdev_detach() (141445-01)
6747698 checksum failures after offline -t / export / import / scrub (141445-01)
6745863 ZFS writes to disk after it has been offlined (141445-01)
6722540 50% slowdown on scrub/resilver with certain vdev configurations (141445-01)
6759999 resilver logic rewrites ditto blocks on both source and destination (141445-01)
6758107 I/O should never suspend during spa_load() (141445-01)
6776548 codereview(1) runs off the page when faced with multi-line comments (N/A)
6761406 AMD errata 91 workaround doesn't work on 64-bit systems (141445-01)

6770866 GRUB/ZFS should require physical path or devid, but not both (141445-01)

6674216 "zfs share" doesn't work, but "zfs set sharenfs=on" does (141445-01)
6621164 $SRC/cmd/zfs/zfs_main.c seems to have a syntax error in the translation note (141445-01)
6635482 i18n problems in libzfs_dataset.c and zfs_main.c (141445-01)
6595194 "zfs get" VALUE column is as wide as NAME (141445-01)
6722991 vdev_disk.c: error checking for ddi_pathname_to_dev_t() must test for NODEV (141445-01)
6396518 ASSERT strings shouldn't be pre-processed (141445-01)

6713916 scrub/resilver needlessly decompress data (141445-01)

6739553 libzfs_status msgid table is out of sync (141445-01)
6784104 libzfs unfairly rejects numerical values greater than 2^63 (141445-01)
6784108 zfs_realloc() should not free original memory on failure (141445-01)

6788830 set large value to reservation cause core dump (141445-01)
6791064 want sysevents for ZFS scrub (141445-01)
6791066 need to be able to set cachefile on faulted pools (141445-01)
6791071 zpool_do_import() should not enable datasets on faulted pools (141445-01)
6792134 getting multiple properties on a faulted pool leads to confusion (141445-01)

6792884 Vista clients cannot access .zfs (141445-01)

6798384 It can take a village to raise a zio (141445-01)

6551866 deadlock between zfs_write(), zfs_freesp(), and zfs_putapage() (141909-01)
6504953 zfs_getpage() misunderstands VOP_GETPAGE() interface (141909-01)
6702206 ZFS read/writer lock contention throttles sendfile() benchmark (141445-01)
6780491 Zone on a ZFS filesystem has poor fork/exec performance (141445-01)
6747596 assertion failed: DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig), BP_IDENTITY(zio->io_bp))); (141445-01)

6801507 ZFS read aggregation should not mind the gap (141445-01)

6633095 creating a filesystem with many properties set is slow (141445-01)

6775697 oracle crashes when overwriting after hitting quota on zfs (141909-01)

6790687 libzfs mnttab caching ignores external changes (141445-01)
6791101 memory leak from libzfs_mnttab_init (141445-01)

6800942 smb_session_create() incorrectly stores IP addresses (N/A)
6582163 Access Control List (ACL) for shares (141445-01)
6804954 smb_search - shortname field should be space padded following the NULL terminator (N/A)
6800184 Panic at smb_oplock_conflict+0x35() (N/A)

6803822 Reboot after replacement of system disk in a ZFS mirror drops to grub> prompt (141445-01)

6789318 coredump when issue zdb -uuuu poolname/ (141445-01)
6790345 zdb -dddd -e poolname coredump (141445-01)
6797109 zdb: 'zdb -dddddd pool_name/fs_name inode' coredump if the file with inode was deleted (141445-01)
6797118 zdb: 'zdb -dddddd poolname inum' coredump if I miss the fs name (141445-01)
6803343 shareiscsi=on failed, iscsitgtd failed request to share (141445-01)

6815893 hang mounting a dataset after booting into a new boot environment (141445-01)

6809691 'zpool create -f' no longer overwrites ufs infomation (141445-01)

6790064 zfs needs to determine uid and gid earlier in create process (141445-01)

6604992 forced unmount + being in .zfs/snapshot/<snap1> = not happy (141909-01)
6810367 assertion failed: dvp->v_flag & VROOT, file: ../../common/fs/gfs.c, line: 426 (141909-01)

6807765 ztest_dsl_dataset_promote_busy needs to clean up after ENOSPC (141445-01)

6821169 offlining a device results in checksum errors (141445-01)
6821170 ZFS should not increment error stats for unavailable devices (141445-01)
6824006 need to increase issue and interrupt taskqs threads in zfs (141445-01)

6792139 recovering from a suspended pool needs some work (141445-01)
6794830 reboot command hangs on a failed zfs pool (141445-01)

6824062 System panicked in zfs_mount due to NULL pointer dereference when running btts and svvs tests (141909-01)

6816124 System crash running zpool destroy on broken zpool (141445-03)

6818183 zfs snapshot -r is slow due to set_snap_props() doing txg_wait_synced() for each new snapshot (141445-03)

6710376 log device can show incorrect status when other parts of pool are degraded (141445-03)

9396:f41cf682d0d3 (part already merged)
6501037 want user/group quotas on ZFS (141445-03)
6827260 assertion failed in arc_read(): hdr == pbuf->b_hdr (141445-03)
6815592 panic: No such hold X on refcount Y from zfs_znode_move (141445-03)
6759986 zfs list shows temporary %clone when doing online zfs recv (141445-03)

6774713 zfs ignores canmount=noauto when sharenfs property != off (141445-03)

6717022 ZFS DMU needs zero-copy support (141445-03)

6799895 spa_add_spares() needs to be protected by config lock (141445-03)
6826466 want to post sysevents on hot spare activation (141445-03)
6826468 spa 'allowfaulted' needs some work (141445-03)
6826469 kernel support for storing vdev FRU information (141445-03)
6826470 skip posting checksum errors from DTL regions of leaf vdevs (141445-03)
6826471 I/O errors after device remove probe can confuse FMA (141445-03)
6826472 spares should enjoy some of the benefits of cache devices (141445-03)

6833711 gang leaders shouldn't have to be logical (141445-03)

6764124 want zdb to be able to checksum metadata blocks only (141445-03)

6830237 zfs panic in zfs_groupmember() (141445-03)

6833162 phantom log device in zpool status (141445-03)

6824968 add ZFS userquota support to rquotad (141445-03)

6834217 godfather I/O should reexecute (141445-03)

6596237 Stop looking and start ganging (141909-02)

6623978 lwb->lwb_buf != NULL, file ../../../uts/common/fs/zfs/zil.c, line 787, function zil_lwb_commit (141445-06)

6801810 Commit of aligned streaming rewrites to ZIL device causes unwanted disk reads (N/A)

6586537 async zio taskqs can block out userland commands (142901-09)

6836768 zfs_userspace() callback has no way to indicate failure (N/A)

6838062 zfs panics when an error is encountered in space_map_load() (141909-02)

6794136 Panic BAD TRAP: type=e when importing degraded zraid pool. (141909-03)

6776104 "zfs import" deadlock between spa_unload() and spa_async_thread() (141445-06)

6664765 Unable to remove files when using fat-zap and quota exceeded on ZFS filesystem (141445-06)

6841321 zfs userspace / zfs get userused@ doesn't work on mounted snapshot (N/A)
6843069 zfs get userused@S-1-... doesn't work (N/A)

6847229 assertion failed: refcount_count(&tx->tx_space_written) + delta <= tx->tx_space_towrite in dmu_tx.c (141445-06)

6838344 kernel heap corruption detected on zil while stress testing (141445-06)

6844900 zfs_ioc_userspace_upgrade leaks (N/A)

6857012 zfs panics on zpool import (141445-06)

6848242 zdb -e no longer works as expected (N/A)

6856634 snv_117 not booting: zfs_parse_bootfs: error2 (141445-07)

6861983 zfs should use new name <-> SID interfaces (N/A)
6862984 userquota commands can hang (141445-06)

6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (N/A)

6696858 zfs receive of incremental replication stream can dereference NULL pointer and crash (fix lint) (N/A)

10575:2a8816c5173b (partial merge)
6882227 spa_async_remove() shouldn't do a full clear (142901-14)

6880764 fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached (142901-09)
6793430 zdb -ivvvv assertion failure: bp->blk_cksum.zc_word[2] == dmu_objset_id(zilog->zl_os) (N/A)

10801:e0bf032e8673 (partial merge)
6822816 assertion failed: zap_remove_int(ds_next_clones_obj) returns ENOENT (142901-09)

6892298 buf->b_hdr->b_state != arc_anon, file: ../../common/fs/zfs/arc.c, line: 2849 (142901-09)

6807339 spurious checksum errors when replacing a vdev (142901-13)

6906110 bad trap panic in zil_replay_log_record (142901-13)
6906946 zfs replay isn't handling uid/gid correctly (142901-13)

6898245 suspended zpool should not cause rest of the zfs/zpool commands to hang (142901-10)

11546:42ea6be8961b (partial merge)
6833999 3-way deadlock in dsl_dataset_hold_ref() and dsl_sync_task_group_sync() (142901-09)

Discussed with: pjd
Approved by: delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
MFC after: 2 months
05f71bf7df1979d094d54bfb3e8f1c67f6f831f4 06-Jul-2010 rpaulo <rpaulo@FreeBSD.org> Merge from vendor-sys/opensolaris:
* add fasttrap files
8f18b4f64ebac9727fec60c57b11312239967ec9 05-Feb-2010 delphij <delphij@FreeBSD.org> Remove two files that are not needed by FreeBSD.

Approved by: pjd
MFC after: 2 weeks
2414a09002c0b7080b6044e995d127de758f3e12 28-Dec-2009 delphij <delphij@FreeBSD.org> Apply OpenSolaris revision 8012 which brings our zpool to version 14,
making it possible for zpools created on OpenSolaris 2009.06 be used
on FreeBSD.

PR: kern/141800
Submitted by: mm
Reviewed by: pjd, trasz
Obtained from: OpenSolaris
MFC after: 2 weeks
aad71995319d82ad248fc09625c6f36d5749859d 30-Oct-2009 pjd <pjd@FreeBSD.org> - zfs_zaccess() can handle VAPPEND too, so map V_APPEND to VAPPEND and call
zfs_access() instead of vaccess() in this case as well.
- If VADMIN is specified with another V* flag (unlikely) call both
zfs_access() and vaccess() after spliting V* flags.

This fixes "dirtying snapshot!" panic.

PR: kern/139806
Reported by: Carl Chave <carl@chave.us>
In co-operation with: jh
MFC after: 3 days
b6fbbf626eb21066b710af86a0cd1fef7b98f66d 23-Aug-2009 pjd <pjd@FreeBSD.org> - Hide ZFS kernel threads under zfskern process.
- Use better (shorter) threads names:
'zvol:worker zvol/tank/vol00' -> 'zvol tank/vol00'
'vdev:worker da0' -> 'vdev da0'
131066d515bbb9cbb5a4fe078b182d438544ade5 17-Aug-2009 pjd <pjd@FreeBSD.org> Manage asynchronous vnode release just like Solaris.

Discussed with: kmacy
Approved by: re (kib)
ea8df6fcea7daa26626d5ebfe923d8bafdb0434a 17-Aug-2009 pjd <pjd@FreeBSD.org> Remove OpenSolaris taskq port (it performs very poorly in our kernel) and
replace it with wrappers around our taskqueue(9).
To make it possible implement taskqueue_member() function which returns 1
if the given thread was created by the given taskqueue.

Approved by: re (kib)
0bf624fc06bae6de575f74e42fdd761ee17a38a9 26-May-2009 trasz <trasz@FreeBSD.org> MFp4 changes neccessary for NFSv4 ACLs support in ZFS. This is mostly
about removing a few #ifdefs and providing compatibility wrappers and
VOP implementations to get and set an ACL; ZFS does ACL enforcement all
by itself.

Note that the VOPs are ifdefed out for now, so this change should be
a no-op.

Reviewed by: pjd
386b2c2f90a8aa2e5b8e2119040e347f2d9f4e91 07-May-2009 kmacy <kmacy@FreeBSD.org> move VN_RELE_ASYNC to the compatibility layer with the rest of the VN_* defines
54e76e600e5758a0c21cdeeca3790d960269bd4d 07-May-2009 kmacy <kmacy@FreeBSD.org> Asynchronously release vnodes to avoid blocking on range locks when calling back in to zfs.
This is based on a fix that went in to opensolaris on March 9th. However, it uses a dedicated
thread instead of a Solaris' taskq to avoid doing a blocking memory allocation with the vnode
interlock held.

This fixes a long-time deadlock in ZFS. This is not, strictly speaking, an LOR. The spa_zio
thread releases a vnode, this calls in to vn_reclaim which in turn needs to acquire range locks
to sync dirty data out to disk. The range locks are already held by a user-level process waiting
on a condition variable that it the process is waiting on a spa_zio thread to signal it on. The
process could not be signalled because the spa_zio thread could not proceed.

The nature of this problem was not apparent due to ZFS locks opting out of witness which meant
that DDB did not know about the locks that were held by ZFS.

Reviewed by: pjd
MFC after: 7 days
40f437cf7fe146b784e57ffdac3177c408dac285 04-Dec-2008 imp <imp@FreeBSD.org> Put the MIPS support back in after it was removed in r185029.
bbe899b96e388a8b82439f81ed3707e0d9c6070d 17-Nov-2008 pjd <pjd@FreeBSD.org> Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.

This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.


Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.

- slog

Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).

- vfs.zfs.super_owner

Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

Not all the flags are supported. This still needs work.

- ZFSBoot

Support to boot off of ZFS pool. Not finished, AFAIK.

Submitted by: dfr

- Snapshot properties

- New failure modes

Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.

- Sparse volumes

ZVOLs that don't reserve space in the pool.

- External attributes

Compatible with extattr(2).

- NFSv4-ACLs

Not sure about the status, might not be complete yet.

Submitted by: trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from: OpenSolaris
89db8837411413032da33754fbfefaf4a7387445 23-Aug-2008 imp <imp@FreeBSD.org> Add MIPS support.

Reviewed by: jb@
28d80fbb7a0b38fd6acd0beba42e1858ef18d9e2 01-Jun-2008 jb <jb@FreeBSD.org> Merge a recent change from the OpenSolaris source tree.
(Don't ask for a vendor import of this yet, we're in the early days of svn)

Instead of using cyclic timers to call the state clean and deadman callbacks,
use a callout on FreeBSD to avoid the deadlock on FreeBSD due to trying to
send interprocessor interrupts with interrupts disabled.

Reported by: ps, jhb, peter, thompsa
9fb08e48b31222fd438418ba7083bd046dffa9ff 23-May-2008 jb <jb@FreeBSD.org> Delete a couple of OpenSolaris headers which get in the way of our
5788c140d796d33530527e138c61b21303e8d006 22-May-2008 jb <jb@FreeBSD.org> FreeBSD changes to vendor source.
a7b5cc6647e71c6885fd4984d8b34e02b065187a 22-May-2008 jb <jb@FreeBSD.org> This commit was generated by cvs2svn to compensate for changes in r179193,
which included commits to RCS files with non-trunk default branches.
d3be9e792d84404d822007f5407640782d9ae216 22-May-2008 jb <jb@FreeBSD.org> Vendor import of the src/sys OpenSolaris bits for DTrace.
29fb737a373c6b536dea728ed0e7294c45476493 11-Apr-2008 marius <marius@FreeBSD.org> - Fix the path encoded in the multiple inclusion protection.
- GCC uses 32-byte function alignment for UltraSPARC CPUs.
- Remove code duplication.

Approved by: core, pjd
MFC after: 2 weeks
22ebcc078a4fb976178e2d0eea4c0d1387d8ccf6 28-Nov-2007 jb <jb@FreeBSD.org> * Check endianness the FreeBSD way.

* Use LBOLT rather than lbolt to avoid a clash with a FreeBSD global
22720aa033d3ec0fd7c4f18b36b27fad2b50c056 28-Nov-2007 jb <jb@FreeBSD.org> Fix a prototype definition.
7b3d6c34b9777b6157c6a696f229c5777ad9f972 05-Nov-2007 pjd <pjd@FreeBSD.org> Remove unused header.

MFC after: 3 days
9eba2904d11ed785ef1cc694b8615aeb353c2252 08-Jun-2007 pjd <pjd@FreeBSD.org> - Reduce number of atomic operations needed to be implemented in asm by
implementing some of them using existing ones.
- Allow to compile ZFS on all archs and use atomic operations surrounded
by global mutex on archs we don't have or can't have all atomic
operations needed by ZFS.
8b3bf231cc710dd95386b222c0caa64dbf802655 23-May-2007 pjd <pjd@FreeBSD.org> FreeBSD's namecache works quite well with ZFS, so remove DNLC.
afcf861a95ee4c76fa8a2141fcaf62a34cee70ba 08-Apr-2007 pjd <pjd@FreeBSD.org> Move zpool.cache from /etc/zfs/ to /boot/zfs/, so we can keep it on
dedicated /boot/ file system and use ZFS for the root file system.
2b260dcd5e356a3d85c66dff3715260105c412f7 08-Apr-2007 pjd <pjd@FreeBSD.org> MFp4: Synchronize with recent OpenSolaris changes.
3b005d330261f33318ca1ee3fef1940237fd788b 06-Apr-2007 pjd <pjd@FreeBSD.org> Please welcome ZFS - The last word in file systems.

ZFS file system was ported from OpenSolaris operating system. The code in under
CDDL license.

I'd like to thank all SUN developers that created this great piece of software.

Supported by: Wheel LTD (http://www.wheel.pl/)
Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/)
Supported by: Sentex (http://www.sentex.net/)