/*
 * CDDL HEADER START
 *
 * This file and its contents are supplied under the terms of the
 * Common Development and Distribution License ("CDDL"), version 1.0.
 * You may only use this file in accordance with the terms of version
 * 1.0 of the CDDL.
 *
 * A full copy of the text of the CDDL should have accompanied this
 * source.  A copy of the CDDL is also available via the Internet at
 * http://www.illumos.org/license/CDDL.
 *
 * CDDL HEADER END
 */

/*
 * Copyright (c) 2014, 2015 by Delphix. All rights reserved.
 */

#include <sys/zfs_context.h>
#include <sys/spa.h>
#include <sys/spa_impl.h>
#include <sys/vdev_impl.h>
#include <sys/fs/zfs.h>
#include <sys/zio.h>
#include <sys/metaslab.h>
#include <sys/refcount.h>
#include <sys/dmu.h>
#include <sys/vdev_indirect_mapping.h>
#include <sys/dmu_tx.h>
#include <sys/dsl_synctask.h>
#include <sys/zap.h>

/*
 * An indirect vdev corresponds to a vdev that has been removed.  Since
 * we cannot rewrite block pointers of snapshots, etc., we keep a
 * mapping from old location on the removed device to the new location
 * on another device in the pool and use this mapping whenever we need
 * to access the DVA.  Unfortunately, this mapping did not respect
 * logical block boundaries when it was first created, and so a DVA on
 * this indirect vdev may be "split" into multiple sections that each
 * map to a different location.  As a consequence, not all DVAs can be
 * translated to an equivalent new DVA.  Instead we must provide a
 * "vdev_remap" operation that executes a callback on each contiguous
 * segment of the new location.  This function is used in multiple ways:
 *
 *  - reads and repair writes to this device use the callback to create
 *    a child io for each mapped segment.
 *
 *  - frees and claims to this device use the callback to free or claim
 *    each mapped segment.  (Note that we don't actually need to claim
 *    log blocks on indirect vdevs, because we don't allocate to
 *    removing vdevs.  However, zdb uses zio_claim() for its leak
 *    detection.)
 */

/*
 * "Big theory statement" for how we mark blocks obsolete.
 *
 * When a block on an indirect vdev is freed or remapped, a section of
 * that vdev's mapping may no longer be referenced (aka "obsolete").  We
 * keep track of how much of each mapping entry is obsolete.  When
 * an entry becomes completely obsolete, we can remove it, thus reducing
 * the memory used by the mapping.  The complete picture of obsolescence
 * is given by the following data structures, described below:
 *  - the entry-specific obsolete count
 *  - the vdev-specific obsolete spacemap
 *  - the pool-specific obsolete bpobj
 *
 * == On disk data structures used ==
 *
 * We track the obsolete space for the pool using several objects.  Each
 * of these objects is created on demand and freed when no longer
 * needed, and is assumed to be empty if it does not exist.
 * SPA_FEATURE_OBSOLETE_COUNTS includes the count of these objects.
 *
 *  - Each vic_mapping_object (associated with an indirect vdev) can
 *    have a vimp_counts_object.  This is an array of uint32_t's
 *    with the same number of entries as the vic_mapping_object.  When
 *    the mapping is condensed, entries from the vic_obsolete_sm_object
 *    (see below) are folded into the counts.  Therefore, each
 *    obsolete_counts entry tells us the number of bytes in the
 *    corresponding mapping entry that were not referenced when the
 *    mapping was last condensed.
 *
 *  - Each indirect or removing vdev can have a vic_obsolete_sm_object.
 *    This is a space map containing an alloc entry for every DVA that
 *    has been obsoleted since the last time this indirect vdev was
 *    condensed.  We use this object in order to improve performance
 *    when marking a DVA as obsolete.  Instead of modifying an arbitrary
 *    offset of the vimp_counts_object, we only need to append an entry
 *    to the end of this object.  When a DVA becomes obsolete, it is
 *    added to the obsolete space map.  This happens when the DVA is
 *    freed, remapped and not referenced by a snapshot, or the last
 *    snapshot referencing it is destroyed.
 *
 *  - Each dataset can have a ds_remap_deadlist object.  This is a
 *    deadlist object containing all blocks that were remapped in this
 *    dataset but referenced in a previous snapshot.  Blocks can *only*
 *    appear on this list if they were remapped (dsl_dataset_block_remapped);
 *    blocks that were killed in a head dataset are put on the normal
 *    ds_deadlist and marked obsolete when they are freed.
 *
 *  - The pool can have a dp_obsolete_bpobj.  This is a list of blocks
 *    in the pool that need to be marked obsolete.  When a snapshot is
 *    destroyed, we move some of the ds_remap_deadlist to the obsolete
 *    bpobj (see dsl_destroy_snapshot_handle_remaps()).  We then
 *    asynchronously process the obsolete bpobj, moving its entries to
 *    the specific vdevs' obsolete space maps.
 *
 * == Summary of how we mark blocks as obsolete ==
 *
 * - When freeing a block: if any DVA is on an indirect vdev, append to
 *   vic_obsolete_sm_object.
 * - When remapping a block, add dva to ds_remap_deadlist (if prev snap
 *   references; otherwise append to vic_obsolete_sm_object).
 * - When freeing a snapshot: move parts of ds_remap_deadlist to
 *   dp_obsolete_bpobj (same algorithm as ds_deadlist).
 * - When syncing the spa: process dp_obsolete_bpobj, moving ranges to
 *   individual vdev's vic_obsolete_sm_object.
 */

/*
 * "Big theory statement" for how we condense indirect vdevs.
 *
 * Condensing an indirect vdev's mapping is the process of determining
 * the precise counts of obsolete space for each mapping entry (by
 * integrating the obsolete spacemap into the obsolete counts) and
 * writing out a new mapping that contains only referenced entries.
 *
 * We condense a vdev when we expect the mapping to shrink (see
 * vdev_indirect_should_condense()), but only perform one condense at a
 * time to limit the memory usage.  In addition, we use a separate
 * open-context thread (spa_condense_indirect_thread) to incrementally
 * create the new mapping object in a way that minimizes the impact on
 * the rest of the system.
 *
 * == Generating a new mapping ==
 *
 * To generate a new mapping, we follow these steps:
 *
 * 1. Save the old obsolete space map and create a new mapping object
 *    (see spa_condense_indirect_start_sync()).  This initializes the
 *    spa_condensing_indirect_phys with the "previous obsolete space map",
 *    which is now read only.  Newly obsolete DVAs will be added to a
 *    new (initially empty) obsolete space map, and will not be
 *    considered as part of this condense operation.
 *
 * 2. Construct in memory the precise counts of obsolete space for each
 *    mapping entry, by incorporating the obsolete space map into the
 *    counts.  (See vdev_indirect_mapping_load_obsolete_{counts,spacemap}().)
 *
 * 3. Iterate through each mapping entry, writing to the new mapping any
 *    entries that are not completely obsolete (i.e. which don't have
 *    obsolete count == mapping length).  (See
 *    spa_condense_indirect_generate_new_mapping().)
 *
 * 4. Destroy the old mapping object and switch over to the new one
 *    (spa_condense_indirect_complete_sync).
 *
 * == Restarting from failure ==
 *
 * To restart the condense when we import/open the pool, we must start
 * at the 2nd step above: reconstruct the precise counts in memory,
 * based on the space map + counts.  Then in the 3rd step, we start
 * iterating where we left off: at vimp_max_offset of the new mapping
 * object.
 */

boolean_t zfs_condense_indirect_vdevs_enable = B_TRUE;

/*
 * Condense if at least this percent of the bytes in the mapping is
 * obsolete.  With the default of 25%, the amount of space mapped
 * will be reduced to 1% of its original size after at most 16
 * condenses.  Higher values will condense less often (causing less
 * i/o); lower values will reduce the mapping size more quickly.
 */
int zfs_indirect_condense_obsolete_pct = 25;

/*
 * Condense if the obsolete space map takes up more than this amount of
 * space on disk (logically).  This limits the amount of disk space
 * consumed by the obsolete space map; the default of 1GB is small enough
 * that we typically don't mind "wasting" it.
 */
uint64_t zfs_condense_max_obsolete_bytes = 1024 * 1024 * 1024;

/*
 * Don't bother condensing if the mapping uses less than this amount of
 * memory.  The default of 128KB is considered a "trivial" amount of
 * memory and not worth reducing.
 */
uint64_t zfs_condense_min_mapping_bytes = 128 * 1024;

/*
 * This is used by the test suite so that it can ensure that certain
 * actions happen while in the middle of a condense (which might otherwise
 * complete too quickly).  If used to reduce the performance impact of
 * condensing in production, a maximum value of 1 should be sufficient.
 */
int zfs_condense_indirect_commit_entry_delay_ticks = 0;

/*
 * Mark the given offset and size as being obsolete in the given txg.
 */
void
vdev_indirect_mark_obsolete(vdev_t *vd, uint64_t offset, uint64_t size,
    uint64_t txg)
{
	spa_t *spa = vd->vdev_spa;
	ASSERT3U(spa_syncing_txg(spa), ==, txg);
	ASSERT3U(vd->vdev_indirect_config.vic_mapping_object, !=, 0);
	ASSERT(vd->vdev_removing || vd->vdev_ops == &vdev_indirect_ops);
	ASSERT(size > 0);
	VERIFY(vdev_indirect_mapping_entry_for_offset(
	    vd->vdev_indirect_mapping, offset) != NULL);

	if (spa_feature_is_enabled(spa, SPA_FEATURE_OBSOLETE_COUNTS)) {
		mutex_enter(&vd->vdev_obsolete_lock);
		range_tree_add(vd->vdev_obsolete_segments, offset, size);
		mutex_exit(&vd->vdev_obsolete_lock);
		vdev_dirty(vd, 0, NULL, txg);
	}
}

/*
 * Mark the DVA vdev_id:offset:size as being obsolete in the given tx.  This
 * wrapper is provided because the DMU does not know about vdev_t's and
 * cannot directly call vdev_indirect_mark_obsolete.
 */
void
spa_vdev_indirect_mark_obsolete(spa_t *spa, uint64_t vdev_id, uint64_t offset,
    uint64_t size, dmu_tx_t *tx)
{
	vdev_t *vd = vdev_lookup_top(spa, vdev_id);
	ASSERT(dmu_tx_is_syncing(tx));

	/* The DMU can only remap indirect vdevs. */
	ASSERT3P(vd->vdev_ops, ==, &vdev_indirect_ops);
	vdev_indirect_mark_obsolete(vd, offset, size, dmu_tx_get_txg(tx));
}

static spa_condensing_indirect_t *
spa_condensing_indirect_create(spa_t *spa)
{
	spa_condensing_indirect_phys_t *scip =
	    &spa->spa_condensing_indirect_phys;
	spa_condensing_indirect_t *sci = kmem_zalloc(sizeof (*sci), KM_SLEEP);
	objset_t *mos = spa->spa_meta_objset;

	for (int i = 0; i < TXG_SIZE; i++) {
		list_create(&sci->sci_new_mapping_entries[i],
		    sizeof (vdev_indirect_mapping_entry_t),
		    offsetof(vdev_indirect_mapping_entry_t, vime_node));
	}

	sci->sci_new_mapping =
	    vdev_indirect_mapping_open(mos, scip->scip_next_mapping_object);

	return (sci);
}

static void
spa_condensing_indirect_destroy(spa_condensing_indirect_t *sci)
{
	for (int i = 0; i < TXG_SIZE; i++)
		list_destroy(&sci->sci_new_mapping_entries[i]);

	if (sci->sci_new_mapping != NULL)
		vdev_indirect_mapping_close(sci->sci_new_mapping);

	kmem_free(sci, sizeof (*sci));
}

boolean_t
vdev_indirect_should_condense(vdev_t *vd)
{
	vdev_indirect_mapping_t *vim = vd->vdev_indirect_mapping;
	spa_t *spa = vd->vdev_spa;

	ASSERT(dsl_pool_sync_context(spa->spa_dsl_pool));

	if (!zfs_condense_indirect_vdevs_enable)
		return (B_FALSE);

	/*
	 * We can only condense one indirect vdev at a time.
	 */
	if (spa->spa_condensing_indirect != NULL)
		return (B_FALSE);

	if (spa_shutting_down(spa))
		return (B_FALSE);

	/*
	 * The mapping object size must not change while we are
	 * condensing, so we can only condense indirect vdevs
	 * (not vdevs that are still in the middle of being removed).
	 */
	if (vd->vdev_ops != &vdev_indirect_ops)
		return (B_FALSE);

	/*
	 * If nothing new has been marked obsolete, there is no
	 * point in condensing.
	 */
	if (vd->vdev_obsolete_sm == NULL) {
		ASSERT0(vdev_obsolete_sm_object(vd));
		return (B_FALSE);
	}

	ASSERT(vd->vdev_obsolete_sm != NULL);

	ASSERT3U(vdev_obsolete_sm_object(vd), ==,
	    space_map_object(vd->vdev_obsolete_sm));

	uint64_t bytes_mapped = vdev_indirect_mapping_bytes_mapped(vim);
	uint64_t bytes_obsolete = space_map_allocated(vd->vdev_obsolete_sm);
	uint64_t mapping_size = vdev_indirect_mapping_size(vim);
	uint64_t obsolete_sm_size = space_map_length(vd->vdev_obsolete_sm);

	ASSERT3U(bytes_obsolete, <=, bytes_mapped);

	/*
	 * If a high percentage of the bytes that are mapped have become
	 * obsolete, condense (unless the mapping is already small enough).
	 * This has a good chance of reducing the amount of memory used
	 * by the mapping.
	 */
	if (bytes_obsolete * 100 / bytes_mapped >=
	    zfs_indirect_condense_obsolete_pct &&
	    mapping_size > zfs_condense_min_mapping_bytes) {
		zfs_dbgmsg("should condense vdev %llu because obsolete "
		    "spacemap covers %d%% of %lluMB mapping",
		    (u_longlong_t)vd->vdev_id,
		    (int)(bytes_obsolete * 100 / bytes_mapped),
		    (u_longlong_t)bytes_mapped / 1024 / 1024);
		return (B_TRUE);
	}

	/*
	 * If the obsolete space map takes up too much space on disk,
	 * condense in order to free up this disk space.
	 */
	if (obsolete_sm_size >= zfs_condense_max_obsolete_bytes) {
		zfs_dbgmsg("should condense vdev %llu because obsolete sm "
		    "length %lluMB >= max size %lluMB",
		    (u_longlong_t)vd->vdev_id,
		    (u_longlong_t)obsolete_sm_size / 1024 / 1024,
		    (u_longlong_t)zfs_condense_max_obsolete_bytes /
		    1024 / 1024);
		return (B_TRUE);
	}

	return (B_FALSE);
}

/*
 * This sync task completes (finishes) a condense, deleting the old
 * mapping and replacing it with the new one.
 */
static void
spa_condense_indirect_complete_sync(void *arg, dmu_tx_t *tx)
{
	spa_condensing_indirect_t *sci = arg;
	spa_t *spa = dmu_tx_pool(tx)->dp_spa;
	spa_condensing_indirect_phys_t *scip =
	    &spa->spa_condensing_indirect_phys;
	vdev_t *vd = vdev_lookup_top(spa, scip->scip_vdev);
	vdev_indirect_config_t *vic = &vd->vdev_indirect_config;
	objset_t *mos = spa->spa_meta_objset;
	vdev_indirect_mapping_t *old_mapping = vd->vdev_indirect_mapping;
	uint64_t old_count = vdev_indirect_mapping_num_entries(old_mapping);
	uint64_t new_count =
	    vdev_indirect_mapping_num_entries(sci->sci_new_mapping);

	ASSERT(dmu_tx_is_syncing(tx));
	ASSERT3P(vd->vdev_ops, ==, &vdev_indirect_ops);
	ASSERT3P(sci, ==, spa->spa_condensing_indirect);
	for (int i = 0; i < TXG_SIZE; i++) {
		ASSERT(list_is_empty(&sci->sci_new_mapping_entries[i]));
	}
	ASSERT(vic->vic_mapping_object != 0);
	ASSERT3U(vd->vdev_id, ==, scip->scip_vdev);
	ASSERT(scip->scip_next_mapping_object != 0);
	ASSERT(scip->scip_prev_obsolete_sm_object != 0);

	/*
	 * Reset vdev_indirect_mapping to refer to the new object.
	 */
	rw_enter(&vd->vdev_indirect_rwlock, RW_WRITER);
	vdev_indirect_mapping_close(vd->vdev_indirect_mapping);
	vd->vdev_indirect_mapping = sci->sci_new_mapping;
	rw_exit(&vd->vdev_indirect_rwlock);

	sci->sci_new_mapping = NULL;
	vdev_indirect_mapping_free(mos, vic->vic_mapping_object, tx);
	vic->vic_mapping_object = scip->scip_next_mapping_object;
	scip->scip_next_mapping_object = 0;

	space_map_free_obj(mos, scip->scip_prev_obsolete_sm_object, tx);
	spa_feature_decr(spa, SPA_FEATURE_OBSOLETE_COUNTS, tx);
	scip->scip_prev_obsolete_sm_object = 0;

	scip->scip_vdev = 0;

	VERIFY0(zap_remove(mos, DMU_POOL_DIRECTORY_OBJECT,
	    DMU_POOL_CONDENSING_INDIRECT, tx));
	spa_condensing_indirect_destroy(spa->spa_condensing_indirect);
	spa->spa_condensing_indirect = NULL;

	zfs_dbgmsg("finished condense of vdev %llu in txg %llu: "
	    "new mapping object %llu has %llu entries "
	    "(was %llu entries)",
	    vd->vdev_id, dmu_tx_get_txg(tx), vic->vic_mapping_object,
	    new_count, old_count);

	vdev_config_dirty(spa->spa_root_vdev);
}

/*
 * This sync task appends entries to the new mapping object.
 */
static void
spa_condense_indirect_commit_sync(void *arg, dmu_tx_t *tx)
{
	spa_condensing_indirect_t *sci = arg;
	uint64_t txg = dmu_tx_get_txg(tx);
	spa_t *spa = dmu_tx_pool(tx)->dp_spa;

	ASSERT(dmu_tx_is_syncing(tx));
	ASSERT3P(sci, ==, spa->spa_condensing_indirect);

	vdev_indirect_mapping_add_entries(sci->sci_new_mapping,
	    &sci->sci_new_mapping_entries[txg & TXG_MASK], tx);
	ASSERT(list_is_empty(&sci->sci_new_mapping_entries[txg & TXG_MASK]));
}

/*
 * Open-context function to add one entry to the new mapping.  The new
 * entry will be remembered and written from syncing context.
 */
static void
spa_condense_indirect_commit_entry(spa_t *spa,
    vdev_indirect_mapping_entry_phys_t *vimep, uint32_t count)
{
	spa_condensing_indirect_t *sci = spa->spa_condensing_indirect;

	ASSERT3U(count, <, DVA_GET_ASIZE(&vimep->vimep_dst));

	dmu_tx_t *tx = dmu_tx_create_dd(spa_get_dsl(spa)->dp_mos_dir);
	dmu_tx_hold_space(tx, sizeof (*vimep) + sizeof (count));
	VERIFY0(dmu_tx_assign(tx, TXG_WAIT));
	int txgoff = dmu_tx_get_txg(tx) & TXG_MASK;

	/*
	 * If we are the first entry committed this txg, kick off the sync
	 * task to write to the MOS on our behalf.
	 */
	if (list_is_empty(&sci->sci_new_mapping_entries[txgoff])) {
		dsl_sync_task_nowait(dmu_tx_pool(tx),
		    spa_condense_indirect_commit_sync, sci,
		    0, ZFS_SPACE_CHECK_NONE, tx);
	}

	vdev_indirect_mapping_entry_t *vime =
	    kmem_alloc(sizeof (*vime), KM_SLEEP);
	vime->vime_mapping = *vimep;
	vime->vime_obsolete_count = count;
	list_insert_tail(&sci->sci_new_mapping_entries[txgoff], vime);

	dmu_tx_commit(tx);
}

static void
spa_condense_indirect_generate_new_mapping(vdev_t *vd,
    uint32_t *obsolete_counts, uint64_t start_index)
{
	spa_t *spa = vd->vdev_spa;
	uint64_t mapi = start_index;
	vdev_indirect_mapping_t *old_mapping = vd->vdev_indirect_mapping;
	uint64_t old_num_entries =
	    vdev_indirect_mapping_num_entries(old_mapping);

	ASSERT3P(vd->vdev_ops, ==, &vdev_indirect_ops);
	ASSERT3U(vd->vdev_id, ==, spa->spa_condensing_indirect_phys.scip_vdev);

	zfs_dbgmsg("starting condense of vdev %llu from index %llu",
	    (u_longlong_t)vd->vdev_id,
	    (u_longlong_t)mapi);

	while (mapi < old_num_entries && !spa_shutting_down(spa)) {
		vdev_indirect_mapping_entry_phys_t *entry =
		    &old_mapping->vim_entries[mapi];
		uint64_t entry_size = DVA_GET_ASIZE(&entry->vimep_dst);
		ASSERT3U(obsolete_counts[mapi], <=, entry_size);
		if (obsolete_counts[mapi] < entry_size) {
			spa_condense_indirect_commit_entry(spa, entry,
			    obsolete_counts[mapi]);

			/*
			 * This delay may be requested for testing, debugging,
			 * or performance reasons.
			 */
			delay(zfs_condense_indirect_commit_entry_delay_ticks);
		}

		mapi++;
	}
	if (spa_shutting_down(spa)) {
		zfs_dbgmsg("pausing condense of vdev %llu at index %llu",
		    (u_longlong_t)vd->vdev_id,
		    (u_longlong_t)mapi);
	}
}

static void
spa_condense_indirect_thread(void *arg)
{
	vdev_t *vd = arg;
	spa_t *spa = vd->vdev_spa;
	spa_condensing_indirect_t *sci = spa->spa_condensing_indirect;
	spa_condensing_indirect_phys_t *scip =
	    &spa->spa_condensing_indirect_phys;
	uint32_t *counts;
	uint64_t start_index;
	vdev_indirect_mapping_t *old_mapping = vd->vdev_indirect_mapping;
	space_map_t *prev_obsolete_sm = NULL;

	ASSERT3U(vd->vdev_id, ==, scip->scip_vdev);
	ASSERT(scip->scip_next_mapping_object != 0);
	ASSERT(scip->scip_prev_obsolete_sm_object != 0);
	ASSERT3P(vd->vdev_ops, ==, &vdev_indirect_ops);

	for (int i = 0; i < TXG_SIZE; i++) {
		/*
		 * The list must start out empty in order for the
		 * _commit_sync() sync task to be properly registered
		 * on the first call to _commit_entry(); so it's wise
		 * to double check and ensure we actually are starting
		 * with empty lists.
543*5cabbc6bSPrashanth Sreenivasa */ 544*5cabbc6bSPrashanth Sreenivasa ASSERT(list_is_empty(&sci->sci_new_mapping_entries[i])); 545*5cabbc6bSPrashanth Sreenivasa } 546*5cabbc6bSPrashanth Sreenivasa 547*5cabbc6bSPrashanth Sreenivasa VERIFY0(space_map_open(&prev_obsolete_sm, spa->spa_meta_objset, 548*5cabbc6bSPrashanth Sreenivasa scip->scip_prev_obsolete_sm_object, 0, vd->vdev_asize, 0)); 549*5cabbc6bSPrashanth Sreenivasa space_map_update(prev_obsolete_sm); 550*5cabbc6bSPrashanth Sreenivasa counts = vdev_indirect_mapping_load_obsolete_counts(old_mapping); 551*5cabbc6bSPrashanth Sreenivasa if (prev_obsolete_sm != NULL) { 552*5cabbc6bSPrashanth Sreenivasa vdev_indirect_mapping_load_obsolete_spacemap(old_mapping, 553*5cabbc6bSPrashanth Sreenivasa counts, prev_obsolete_sm); 554*5cabbc6bSPrashanth Sreenivasa } 555*5cabbc6bSPrashanth Sreenivasa space_map_close(prev_obsolete_sm); 556*5cabbc6bSPrashanth Sreenivasa 557*5cabbc6bSPrashanth Sreenivasa /* 558*5cabbc6bSPrashanth Sreenivasa * Generate new mapping. Determine what index to continue from 559*5cabbc6bSPrashanth Sreenivasa * based on the max offset that we've already written in the 560*5cabbc6bSPrashanth Sreenivasa * new mapping. 561*5cabbc6bSPrashanth Sreenivasa */ 562*5cabbc6bSPrashanth Sreenivasa uint64_t max_offset = 563*5cabbc6bSPrashanth Sreenivasa vdev_indirect_mapping_max_offset(sci->sci_new_mapping); 564*5cabbc6bSPrashanth Sreenivasa if (max_offset == 0) { 565*5cabbc6bSPrashanth Sreenivasa /* We haven't written anything to the new mapping yet. */ 566*5cabbc6bSPrashanth Sreenivasa start_index = 0; 567*5cabbc6bSPrashanth Sreenivasa } else { 568*5cabbc6bSPrashanth Sreenivasa /* 569*5cabbc6bSPrashanth Sreenivasa * Pick up from where we left off. _entry_for_offset() 570*5cabbc6bSPrashanth Sreenivasa * returns a pointer into the vim_entries array. 
If 571*5cabbc6bSPrashanth Sreenivasa * max_offset is greater than any of the mappings 572*5cabbc6bSPrashanth Sreenivasa * contained in the table NULL will be returned and 573*5cabbc6bSPrashanth Sreenivasa * that indicates we've exhausted our iteration of the 574*5cabbc6bSPrashanth Sreenivasa * old_mapping. 575*5cabbc6bSPrashanth Sreenivasa */ 576*5cabbc6bSPrashanth Sreenivasa 577*5cabbc6bSPrashanth Sreenivasa vdev_indirect_mapping_entry_phys_t *entry = 578*5cabbc6bSPrashanth Sreenivasa vdev_indirect_mapping_entry_for_offset_or_next(old_mapping, 579*5cabbc6bSPrashanth Sreenivasa max_offset); 580*5cabbc6bSPrashanth Sreenivasa 581*5cabbc6bSPrashanth Sreenivasa if (entry == NULL) { 582*5cabbc6bSPrashanth Sreenivasa /* 583*5cabbc6bSPrashanth Sreenivasa * We've already written the whole new mapping. 584*5cabbc6bSPrashanth Sreenivasa * This special value will cause us to skip the 585*5cabbc6bSPrashanth Sreenivasa * generate_new_mapping step and just do the sync 586*5cabbc6bSPrashanth Sreenivasa * task to complete the condense. 587*5cabbc6bSPrashanth Sreenivasa */ 588*5cabbc6bSPrashanth Sreenivasa start_index = UINT64_MAX; 589*5cabbc6bSPrashanth Sreenivasa } else { 590*5cabbc6bSPrashanth Sreenivasa start_index = entry - old_mapping->vim_entries; 591*5cabbc6bSPrashanth Sreenivasa ASSERT3U(start_index, <, 592*5cabbc6bSPrashanth Sreenivasa vdev_indirect_mapping_num_entries(old_mapping)); 593*5cabbc6bSPrashanth Sreenivasa } 594*5cabbc6bSPrashanth Sreenivasa } 595*5cabbc6bSPrashanth Sreenivasa 596*5cabbc6bSPrashanth Sreenivasa spa_condense_indirect_generate_new_mapping(vd, counts, start_index); 597*5cabbc6bSPrashanth Sreenivasa 598*5cabbc6bSPrashanth Sreenivasa vdev_indirect_mapping_free_obsolete_counts(old_mapping, counts); 599*5cabbc6bSPrashanth Sreenivasa 600*5cabbc6bSPrashanth Sreenivasa /* 601*5cabbc6bSPrashanth Sreenivasa * We may have bailed early from generate_new_mapping(), if 602*5cabbc6bSPrashanth Sreenivasa * the spa is shutting down. 
In this case, do not complete 603*5cabbc6bSPrashanth Sreenivasa * the condense. 604*5cabbc6bSPrashanth Sreenivasa */ 605*5cabbc6bSPrashanth Sreenivasa if (!spa_shutting_down(spa)) { 606*5cabbc6bSPrashanth Sreenivasa VERIFY0(dsl_sync_task(spa_name(spa), NULL, 607*5cabbc6bSPrashanth Sreenivasa spa_condense_indirect_complete_sync, sci, 0, 608*5cabbc6bSPrashanth Sreenivasa ZFS_SPACE_CHECK_NONE)); 609*5cabbc6bSPrashanth Sreenivasa } 610*5cabbc6bSPrashanth Sreenivasa 611*5cabbc6bSPrashanth Sreenivasa mutex_enter(&spa->spa_async_lock); 612*5cabbc6bSPrashanth Sreenivasa spa->spa_condense_thread = NULL; 613*5cabbc6bSPrashanth Sreenivasa cv_broadcast(&spa->spa_async_cv); 614*5cabbc6bSPrashanth Sreenivasa mutex_exit(&spa->spa_async_lock); 615*5cabbc6bSPrashanth Sreenivasa } 616*5cabbc6bSPrashanth Sreenivasa 617*5cabbc6bSPrashanth Sreenivasa /* 618*5cabbc6bSPrashanth Sreenivasa * Sync task to begin the condensing process. 619*5cabbc6bSPrashanth Sreenivasa */ 620*5cabbc6bSPrashanth Sreenivasa void 621*5cabbc6bSPrashanth Sreenivasa spa_condense_indirect_start_sync(vdev_t *vd, dmu_tx_t *tx) 622*5cabbc6bSPrashanth Sreenivasa { 623*5cabbc6bSPrashanth Sreenivasa spa_t *spa = vd->vdev_spa; 624*5cabbc6bSPrashanth Sreenivasa spa_condensing_indirect_phys_t *scip = 625*5cabbc6bSPrashanth Sreenivasa &spa->spa_condensing_indirect_phys; 626*5cabbc6bSPrashanth Sreenivasa 627*5cabbc6bSPrashanth Sreenivasa ASSERT0(scip->scip_next_mapping_object); 628*5cabbc6bSPrashanth Sreenivasa ASSERT0(scip->scip_prev_obsolete_sm_object); 629*5cabbc6bSPrashanth Sreenivasa ASSERT0(scip->scip_vdev); 630*5cabbc6bSPrashanth Sreenivasa ASSERT(dmu_tx_is_syncing(tx)); 631*5cabbc6bSPrashanth Sreenivasa ASSERT3P(vd->vdev_ops, ==, &vdev_indirect_ops); 632*5cabbc6bSPrashanth Sreenivasa ASSERT(spa_feature_is_active(spa, SPA_FEATURE_OBSOLETE_COUNTS)); 633*5cabbc6bSPrashanth Sreenivasa ASSERT(vdev_indirect_mapping_num_entries(vd->vdev_indirect_mapping)); 634*5cabbc6bSPrashanth Sreenivasa 635*5cabbc6bSPrashanth Sreenivasa 
uint64_t obsolete_sm_obj = vdev_obsolete_sm_object(vd); 636*5cabbc6bSPrashanth Sreenivasa ASSERT(obsolete_sm_obj != 0); 637*5cabbc6bSPrashanth Sreenivasa 638*5cabbc6bSPrashanth Sreenivasa scip->scip_vdev = vd->vdev_id; 639*5cabbc6bSPrashanth Sreenivasa scip->scip_next_mapping_object = 640*5cabbc6bSPrashanth Sreenivasa vdev_indirect_mapping_alloc(spa->spa_meta_objset, tx); 641*5cabbc6bSPrashanth Sreenivasa 642*5cabbc6bSPrashanth Sreenivasa scip->scip_prev_obsolete_sm_object = obsolete_sm_obj; 643*5cabbc6bSPrashanth Sreenivasa 644*5cabbc6bSPrashanth Sreenivasa /* 645*5cabbc6bSPrashanth Sreenivasa * We don't need to allocate a new space map object, since 646*5cabbc6bSPrashanth Sreenivasa * vdev_indirect_sync_obsolete will allocate one when needed. 647*5cabbc6bSPrashanth Sreenivasa */ 648*5cabbc6bSPrashanth Sreenivasa space_map_close(vd->vdev_obsolete_sm); 649*5cabbc6bSPrashanth Sreenivasa vd->vdev_obsolete_sm = NULL; 650*5cabbc6bSPrashanth Sreenivasa VERIFY0(zap_remove(spa->spa_meta_objset, vd->vdev_top_zap, 651*5cabbc6bSPrashanth Sreenivasa VDEV_TOP_ZAP_INDIRECT_OBSOLETE_SM, tx)); 652*5cabbc6bSPrashanth Sreenivasa 653*5cabbc6bSPrashanth Sreenivasa VERIFY0(zap_add(spa->spa_dsl_pool->dp_meta_objset, 654*5cabbc6bSPrashanth Sreenivasa DMU_POOL_DIRECTORY_OBJECT, 655*5cabbc6bSPrashanth Sreenivasa DMU_POOL_CONDENSING_INDIRECT, sizeof (uint64_t), 656*5cabbc6bSPrashanth Sreenivasa sizeof (*scip) / sizeof (uint64_t), scip, tx)); 657*5cabbc6bSPrashanth Sreenivasa 658*5cabbc6bSPrashanth Sreenivasa ASSERT3P(spa->spa_condensing_indirect, ==, NULL); 659*5cabbc6bSPrashanth Sreenivasa spa->spa_condensing_indirect = spa_condensing_indirect_create(spa); 660*5cabbc6bSPrashanth Sreenivasa 661*5cabbc6bSPrashanth Sreenivasa zfs_dbgmsg("starting condense of vdev %llu in txg %llu: " 662*5cabbc6bSPrashanth Sreenivasa "posm=%llu nm=%llu", 663*5cabbc6bSPrashanth Sreenivasa vd->vdev_id, dmu_tx_get_txg(tx), 664*5cabbc6bSPrashanth Sreenivasa (u_longlong_t)scip->scip_prev_obsolete_sm_object, 
665*5cabbc6bSPrashanth Sreenivasa (u_longlong_t)scip->scip_next_mapping_object); 666*5cabbc6bSPrashanth Sreenivasa 667*5cabbc6bSPrashanth Sreenivasa ASSERT3P(spa->spa_condense_thread, ==, NULL); 668*5cabbc6bSPrashanth Sreenivasa spa->spa_condense_thread = thread_create(NULL, 0, 669*5cabbc6bSPrashanth Sreenivasa spa_condense_indirect_thread, vd, 0, &p0, TS_RUN, minclsyspri); 670*5cabbc6bSPrashanth Sreenivasa } 671*5cabbc6bSPrashanth Sreenivasa 672*5cabbc6bSPrashanth Sreenivasa /* 673*5cabbc6bSPrashanth Sreenivasa * Sync to the given vdev's obsolete space map any segments that are no longer 674*5cabbc6bSPrashanth Sreenivasa * referenced as of the given txg. 675*5cabbc6bSPrashanth Sreenivasa * 676*5cabbc6bSPrashanth Sreenivasa * If the obsolete space map doesn't exist yet, create and open it. 677*5cabbc6bSPrashanth Sreenivasa */ 678*5cabbc6bSPrashanth Sreenivasa void 679*5cabbc6bSPrashanth Sreenivasa vdev_indirect_sync_obsolete(vdev_t *vd, dmu_tx_t *tx) 680*5cabbc6bSPrashanth Sreenivasa { 681*5cabbc6bSPrashanth Sreenivasa spa_t *spa = vd->vdev_spa; 682*5cabbc6bSPrashanth Sreenivasa vdev_indirect_config_t *vic = &vd->vdev_indirect_config; 683*5cabbc6bSPrashanth Sreenivasa 684*5cabbc6bSPrashanth Sreenivasa ASSERT3U(vic->vic_mapping_object, !=, 0); 685*5cabbc6bSPrashanth Sreenivasa ASSERT(range_tree_space(vd->vdev_obsolete_segments) > 0); 686*5cabbc6bSPrashanth Sreenivasa ASSERT(vd->vdev_removing || vd->vdev_ops == &vdev_indirect_ops); 687*5cabbc6bSPrashanth Sreenivasa ASSERT(spa_feature_is_enabled(spa, SPA_FEATURE_OBSOLETE_COUNTS)); 688*5cabbc6bSPrashanth Sreenivasa 689*5cabbc6bSPrashanth Sreenivasa if (vdev_obsolete_sm_object(vd) == 0) { 690*5cabbc6bSPrashanth Sreenivasa uint64_t obsolete_sm_object = 691*5cabbc6bSPrashanth Sreenivasa space_map_alloc(spa->spa_meta_objset, tx); 692*5cabbc6bSPrashanth Sreenivasa 693*5cabbc6bSPrashanth Sreenivasa ASSERT(vd->vdev_top_zap != 0); 694*5cabbc6bSPrashanth Sreenivasa VERIFY0(zap_add(vd->vdev_spa->spa_meta_objset, 
vd->vdev_top_zap, 695*5cabbc6bSPrashanth Sreenivasa VDEV_TOP_ZAP_INDIRECT_OBSOLETE_SM, 696*5cabbc6bSPrashanth Sreenivasa sizeof (obsolete_sm_object), 1, &obsolete_sm_object, tx)); 697*5cabbc6bSPrashanth Sreenivasa ASSERT3U(vdev_obsolete_sm_object(vd), !=, 0); 698*5cabbc6bSPrashanth Sreenivasa 699*5cabbc6bSPrashanth Sreenivasa spa_feature_incr(spa, SPA_FEATURE_OBSOLETE_COUNTS, tx); 700*5cabbc6bSPrashanth Sreenivasa VERIFY0(space_map_open(&vd->vdev_obsolete_sm, 701*5cabbc6bSPrashanth Sreenivasa spa->spa_meta_objset, obsolete_sm_object, 702*5cabbc6bSPrashanth Sreenivasa 0, vd->vdev_asize, 0)); 703*5cabbc6bSPrashanth Sreenivasa space_map_update(vd->vdev_obsolete_sm); 704*5cabbc6bSPrashanth Sreenivasa } 705*5cabbc6bSPrashanth Sreenivasa 706*5cabbc6bSPrashanth Sreenivasa ASSERT(vd->vdev_obsolete_sm != NULL); 707*5cabbc6bSPrashanth Sreenivasa ASSERT3U(vdev_obsolete_sm_object(vd), ==, 708*5cabbc6bSPrashanth Sreenivasa space_map_object(vd->vdev_obsolete_sm)); 709*5cabbc6bSPrashanth Sreenivasa 710*5cabbc6bSPrashanth Sreenivasa space_map_write(vd->vdev_obsolete_sm, 711*5cabbc6bSPrashanth Sreenivasa vd->vdev_obsolete_segments, SM_ALLOC, tx); 712*5cabbc6bSPrashanth Sreenivasa space_map_update(vd->vdev_obsolete_sm); 713*5cabbc6bSPrashanth Sreenivasa range_tree_vacate(vd->vdev_obsolete_segments, NULL, NULL); 714*5cabbc6bSPrashanth Sreenivasa } 715*5cabbc6bSPrashanth Sreenivasa 716*5cabbc6bSPrashanth Sreenivasa int 717*5cabbc6bSPrashanth Sreenivasa spa_condense_init(spa_t *spa) 718*5cabbc6bSPrashanth Sreenivasa { 719*5cabbc6bSPrashanth Sreenivasa int error = zap_lookup(spa->spa_meta_objset, 720*5cabbc6bSPrashanth Sreenivasa DMU_POOL_DIRECTORY_OBJECT, 721*5cabbc6bSPrashanth Sreenivasa DMU_POOL_CONDENSING_INDIRECT, sizeof (uint64_t), 722*5cabbc6bSPrashanth Sreenivasa sizeof (spa->spa_condensing_indirect_phys) / sizeof (uint64_t), 723*5cabbc6bSPrashanth Sreenivasa &spa->spa_condensing_indirect_phys); 724*5cabbc6bSPrashanth Sreenivasa if (error == 0) { 725*5cabbc6bSPrashanth 
Sreenivasa if (spa_writeable(spa)) { 726*5cabbc6bSPrashanth Sreenivasa spa->spa_condensing_indirect = 727*5cabbc6bSPrashanth Sreenivasa spa_condensing_indirect_create(spa); 728*5cabbc6bSPrashanth Sreenivasa } 729*5cabbc6bSPrashanth Sreenivasa return (0); 730*5cabbc6bSPrashanth Sreenivasa } else if (error == ENOENT) { 731*5cabbc6bSPrashanth Sreenivasa return (0); 732*5cabbc6bSPrashanth Sreenivasa } else { 733*5cabbc6bSPrashanth Sreenivasa return (error); 734*5cabbc6bSPrashanth Sreenivasa } 735*5cabbc6bSPrashanth Sreenivasa } 736*5cabbc6bSPrashanth Sreenivasa 737*5cabbc6bSPrashanth Sreenivasa void 738*5cabbc6bSPrashanth Sreenivasa spa_condense_fini(spa_t *spa) 739*5cabbc6bSPrashanth Sreenivasa { 740*5cabbc6bSPrashanth Sreenivasa if (spa->spa_condensing_indirect != NULL) { 741*5cabbc6bSPrashanth Sreenivasa spa_condensing_indirect_destroy(spa->spa_condensing_indirect); 742*5cabbc6bSPrashanth Sreenivasa spa->spa_condensing_indirect = NULL; 743*5cabbc6bSPrashanth Sreenivasa } 744*5cabbc6bSPrashanth Sreenivasa } 745*5cabbc6bSPrashanth Sreenivasa 746*5cabbc6bSPrashanth Sreenivasa /* 747*5cabbc6bSPrashanth Sreenivasa * Restart the condense - called when the pool is opened. 
748*5cabbc6bSPrashanth Sreenivasa */ 749*5cabbc6bSPrashanth Sreenivasa void 750*5cabbc6bSPrashanth Sreenivasa spa_condense_indirect_restart(spa_t *spa) 751*5cabbc6bSPrashanth Sreenivasa { 752*5cabbc6bSPrashanth Sreenivasa vdev_t *vd; 753*5cabbc6bSPrashanth Sreenivasa ASSERT(spa->spa_condensing_indirect != NULL); 754*5cabbc6bSPrashanth Sreenivasa spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER); 755*5cabbc6bSPrashanth Sreenivasa vd = vdev_lookup_top(spa, 756*5cabbc6bSPrashanth Sreenivasa spa->spa_condensing_indirect_phys.scip_vdev); 757*5cabbc6bSPrashanth Sreenivasa ASSERT(vd != NULL); 758*5cabbc6bSPrashanth Sreenivasa spa_config_exit(spa, SCL_VDEV, FTAG); 759*5cabbc6bSPrashanth Sreenivasa 760*5cabbc6bSPrashanth Sreenivasa ASSERT3P(spa->spa_condense_thread, ==, NULL); 761*5cabbc6bSPrashanth Sreenivasa spa->spa_condense_thread = thread_create(NULL, 0, 762*5cabbc6bSPrashanth Sreenivasa spa_condense_indirect_thread, vd, 0, &p0, TS_RUN, 763*5cabbc6bSPrashanth Sreenivasa minclsyspri); 764*5cabbc6bSPrashanth Sreenivasa } 765*5cabbc6bSPrashanth Sreenivasa 766*5cabbc6bSPrashanth Sreenivasa /* 767*5cabbc6bSPrashanth Sreenivasa * Gets the obsolete spacemap object from the vdev's ZAP. 768*5cabbc6bSPrashanth Sreenivasa * Returns the spacemap object, or 0 if it wasn't in the ZAP or the ZAP doesn't 769*5cabbc6bSPrashanth Sreenivasa * exist yet. 
770*5cabbc6bSPrashanth Sreenivasa */ 771*5cabbc6bSPrashanth Sreenivasa int 772*5cabbc6bSPrashanth Sreenivasa vdev_obsolete_sm_object(vdev_t *vd) 773*5cabbc6bSPrashanth Sreenivasa { 774*5cabbc6bSPrashanth Sreenivasa ASSERT0(spa_config_held(vd->vdev_spa, SCL_ALL, RW_WRITER)); 775*5cabbc6bSPrashanth Sreenivasa if (vd->vdev_top_zap == 0) { 776*5cabbc6bSPrashanth Sreenivasa return (0); 777*5cabbc6bSPrashanth Sreenivasa } 778*5cabbc6bSPrashanth Sreenivasa 779*5cabbc6bSPrashanth Sreenivasa uint64_t sm_obj = 0; 780*5cabbc6bSPrashanth Sreenivasa int err = zap_lookup(vd->vdev_spa->spa_meta_objset, vd->vdev_top_zap, 781*5cabbc6bSPrashanth Sreenivasa VDEV_TOP_ZAP_INDIRECT_OBSOLETE_SM, sizeof (sm_obj), 1, &sm_obj); 782*5cabbc6bSPrashanth Sreenivasa 783*5cabbc6bSPrashanth Sreenivasa ASSERT(err == 0 || err == ENOENT); 784*5cabbc6bSPrashanth Sreenivasa 785*5cabbc6bSPrashanth Sreenivasa return (sm_obj); 786*5cabbc6bSPrashanth Sreenivasa } 787*5cabbc6bSPrashanth Sreenivasa 788*5cabbc6bSPrashanth Sreenivasa boolean_t 789*5cabbc6bSPrashanth Sreenivasa vdev_obsolete_counts_are_precise(vdev_t *vd) 790*5cabbc6bSPrashanth Sreenivasa { 791*5cabbc6bSPrashanth Sreenivasa ASSERT0(spa_config_held(vd->vdev_spa, SCL_ALL, RW_WRITER)); 792*5cabbc6bSPrashanth Sreenivasa if (vd->vdev_top_zap == 0) { 793*5cabbc6bSPrashanth Sreenivasa return (B_FALSE); 794*5cabbc6bSPrashanth Sreenivasa } 795*5cabbc6bSPrashanth Sreenivasa 796*5cabbc6bSPrashanth Sreenivasa uint64_t val = 0; 797*5cabbc6bSPrashanth Sreenivasa int err = zap_lookup(vd->vdev_spa->spa_meta_objset, vd->vdev_top_zap, 798*5cabbc6bSPrashanth Sreenivasa VDEV_TOP_ZAP_OBSOLETE_COUNTS_ARE_PRECISE, sizeof (val), 1, &val); 799*5cabbc6bSPrashanth Sreenivasa 800*5cabbc6bSPrashanth Sreenivasa ASSERT(err == 0 || err == ENOENT); 801*5cabbc6bSPrashanth Sreenivasa 802*5cabbc6bSPrashanth Sreenivasa return (val != 0); 803*5cabbc6bSPrashanth Sreenivasa } 804*5cabbc6bSPrashanth Sreenivasa 805*5cabbc6bSPrashanth Sreenivasa /* ARGSUSED */ 806*5cabbc6bSPrashanth 
Sreenivasa static void 807*5cabbc6bSPrashanth Sreenivasa vdev_indirect_close(vdev_t *vd) 808*5cabbc6bSPrashanth Sreenivasa { 809*5cabbc6bSPrashanth Sreenivasa } 810*5cabbc6bSPrashanth Sreenivasa 811*5cabbc6bSPrashanth Sreenivasa /* ARGSUSED */ 812*5cabbc6bSPrashanth Sreenivasa static void 813*5cabbc6bSPrashanth Sreenivasa vdev_indirect_io_done(zio_t *zio) 814*5cabbc6bSPrashanth Sreenivasa { 815*5cabbc6bSPrashanth Sreenivasa } 816*5cabbc6bSPrashanth Sreenivasa 817*5cabbc6bSPrashanth Sreenivasa /* ARGSUSED */ 818*5cabbc6bSPrashanth Sreenivasa static int 819*5cabbc6bSPrashanth Sreenivasa vdev_indirect_open(vdev_t *vd, uint64_t *psize, uint64_t *max_psize, 820*5cabbc6bSPrashanth Sreenivasa uint64_t *ashift) 821*5cabbc6bSPrashanth Sreenivasa { 822*5cabbc6bSPrashanth Sreenivasa *psize = *max_psize = vd->vdev_asize + 823*5cabbc6bSPrashanth Sreenivasa VDEV_LABEL_START_SIZE + VDEV_LABEL_END_SIZE; 824*5cabbc6bSPrashanth Sreenivasa *ashift = vd->vdev_ashift; 825*5cabbc6bSPrashanth Sreenivasa return (0); 826*5cabbc6bSPrashanth Sreenivasa } 827*5cabbc6bSPrashanth Sreenivasa 828*5cabbc6bSPrashanth Sreenivasa typedef struct remap_segment { 829*5cabbc6bSPrashanth Sreenivasa vdev_t *rs_vd; 830*5cabbc6bSPrashanth Sreenivasa uint64_t rs_offset; 831*5cabbc6bSPrashanth Sreenivasa uint64_t rs_asize; 832*5cabbc6bSPrashanth Sreenivasa uint64_t rs_split_offset; 833*5cabbc6bSPrashanth Sreenivasa list_node_t rs_node; 834*5cabbc6bSPrashanth Sreenivasa } remap_segment_t; 835*5cabbc6bSPrashanth Sreenivasa 836*5cabbc6bSPrashanth Sreenivasa remap_segment_t * 837*5cabbc6bSPrashanth Sreenivasa rs_alloc(vdev_t *vd, uint64_t offset, uint64_t asize, uint64_t split_offset) 838*5cabbc6bSPrashanth Sreenivasa { 839*5cabbc6bSPrashanth Sreenivasa remap_segment_t *rs = kmem_alloc(sizeof (remap_segment_t), KM_SLEEP); 840*5cabbc6bSPrashanth Sreenivasa rs->rs_vd = vd; 841*5cabbc6bSPrashanth Sreenivasa rs->rs_offset = offset; 842*5cabbc6bSPrashanth Sreenivasa rs->rs_asize = asize; 843*5cabbc6bSPrashanth 
Sreenivasa rs->rs_split_offset = split_offset; 844*5cabbc6bSPrashanth Sreenivasa return (rs); 845*5cabbc6bSPrashanth Sreenivasa } 846*5cabbc6bSPrashanth Sreenivasa 847*5cabbc6bSPrashanth Sreenivasa /* 848*5cabbc6bSPrashanth Sreenivasa * Goes through the relevant indirect mappings until it hits a concrete vdev 849*5cabbc6bSPrashanth Sreenivasa * and issues the callback. On the way to the concrete vdev, if any other 850*5cabbc6bSPrashanth Sreenivasa * indirect vdevs are encountered, then the callback will also be called on 851*5cabbc6bSPrashanth Sreenivasa * each of those indirect vdevs. For example, if the segment is mapped to 852*5cabbc6bSPrashanth Sreenivasa * segment A on indirect vdev 1, and then segment A on indirect vdev 1 is 853*5cabbc6bSPrashanth Sreenivasa * mapped to segment B on concrete vdev 2, then the callback will be called on 854*5cabbc6bSPrashanth Sreenivasa * both vdev 1 and vdev 2. 855*5cabbc6bSPrashanth Sreenivasa * 856*5cabbc6bSPrashanth Sreenivasa * While the callback passed to vdev_indirect_remap() is called on every vdev 857*5cabbc6bSPrashanth Sreenivasa * the function encounters, certain callbacks only care about concrete vdevs. 858*5cabbc6bSPrashanth Sreenivasa * These types of callbacks should return immediately and explicitly when they 859*5cabbc6bSPrashanth Sreenivasa * are called on an indirect vdev. 860*5cabbc6bSPrashanth Sreenivasa * 861*5cabbc6bSPrashanth Sreenivasa * Because there is a possibility that a DVA section in the indirect device 862*5cabbc6bSPrashanth Sreenivasa * has been split into multiple sections in our mapping, we keep track 863*5cabbc6bSPrashanth Sreenivasa * of the relevant contiguous segments of the new location (remap_segment_t) 864*5cabbc6bSPrashanth Sreenivasa * in a stack. This way we can call the callback for each of the new sections 865*5cabbc6bSPrashanth Sreenivasa * created by a single section of the indirect device. 
Note though, that in 866*5cabbc6bSPrashanth Sreenivasa * this scenario the callbacks in each split block won't occur in-order in 867*5cabbc6bSPrashanth Sreenivasa * terms of offset, so callers should not make any assumptions about that. 868*5cabbc6bSPrashanth Sreenivasa * 869*5cabbc6bSPrashanth Sreenivasa * For callbacks that don't handle split blocks and immediately return when 870*5cabbc6bSPrashanth Sreenivasa * they encounter them (as is the case for remap_blkptr_cb), the caller can 871*5cabbc6bSPrashanth Sreenivasa * assume that its callback will be applied from the first indirect vdev 872*5cabbc6bSPrashanth Sreenivasa * encountered to the last one and then the concrete vdev, in that order. 873*5cabbc6bSPrashanth Sreenivasa */ 874*5cabbc6bSPrashanth Sreenivasa static void 875*5cabbc6bSPrashanth Sreenivasa vdev_indirect_remap(vdev_t *vd, uint64_t offset, uint64_t asize, 876*5cabbc6bSPrashanth Sreenivasa void (*func)(uint64_t, vdev_t *, uint64_t, uint64_t, void *), void *arg) 877*5cabbc6bSPrashanth Sreenivasa { 878*5cabbc6bSPrashanth Sreenivasa list_t stack; 879*5cabbc6bSPrashanth Sreenivasa spa_t *spa = vd->vdev_spa; 880*5cabbc6bSPrashanth Sreenivasa 881*5cabbc6bSPrashanth Sreenivasa list_create(&stack, sizeof (remap_segment_t), 882*5cabbc6bSPrashanth Sreenivasa offsetof(remap_segment_t, rs_node)); 883*5cabbc6bSPrashanth Sreenivasa 884*5cabbc6bSPrashanth Sreenivasa for (remap_segment_t *rs = rs_alloc(vd, offset, asize, 0); 885*5cabbc6bSPrashanth Sreenivasa rs != NULL; rs = list_remove_head(&stack)) { 886*5cabbc6bSPrashanth Sreenivasa vdev_t *v = rs->rs_vd; 887*5cabbc6bSPrashanth Sreenivasa 888*5cabbc6bSPrashanth Sreenivasa /* 889*5cabbc6bSPrashanth Sreenivasa * Note: this can be called from open context 890*5cabbc6bSPrashanth Sreenivasa * (eg. zio_read()), so we need the rwlock to prevent 891*5cabbc6bSPrashanth Sreenivasa * the mapping from being changed by condensing. 
892*5cabbc6bSPrashanth Sreenivasa */ 893*5cabbc6bSPrashanth Sreenivasa rw_enter(&v->vdev_indirect_rwlock, RW_READER); 894*5cabbc6bSPrashanth Sreenivasa vdev_indirect_mapping_t *vim = v->vdev_indirect_mapping; 895*5cabbc6bSPrashanth Sreenivasa ASSERT3P(vim, !=, NULL); 896*5cabbc6bSPrashanth Sreenivasa 897*5cabbc6bSPrashanth Sreenivasa ASSERT(spa_config_held(spa, SCL_ALL, RW_READER) != 0); 898*5cabbc6bSPrashanth Sreenivasa ASSERT(rs->rs_asize > 0); 899*5cabbc6bSPrashanth Sreenivasa 900*5cabbc6bSPrashanth Sreenivasa vdev_indirect_mapping_entry_phys_t *mapping = 901*5cabbc6bSPrashanth Sreenivasa vdev_indirect_mapping_entry_for_offset(vim, rs->rs_offset); 902*5cabbc6bSPrashanth Sreenivasa ASSERT3P(mapping, !=, NULL); 903*5cabbc6bSPrashanth Sreenivasa 904*5cabbc6bSPrashanth Sreenivasa while (rs->rs_asize > 0) { 905*5cabbc6bSPrashanth Sreenivasa /* 906*5cabbc6bSPrashanth Sreenivasa * Note: the vdev_indirect_mapping can not change 907*5cabbc6bSPrashanth Sreenivasa * while we are running. It only changes while the 908*5cabbc6bSPrashanth Sreenivasa * removal is in progress, and then only from syncing 909*5cabbc6bSPrashanth Sreenivasa * context. While a removal is in progress, this 910*5cabbc6bSPrashanth Sreenivasa * function is only called for frees, which also only 911*5cabbc6bSPrashanth Sreenivasa * happen from syncing context. 
912*5cabbc6bSPrashanth Sreenivasa */ 913*5cabbc6bSPrashanth Sreenivasa 914*5cabbc6bSPrashanth Sreenivasa uint64_t size = DVA_GET_ASIZE(&mapping->vimep_dst); 915*5cabbc6bSPrashanth Sreenivasa uint64_t dst_offset = 916*5cabbc6bSPrashanth Sreenivasa DVA_GET_OFFSET(&mapping->vimep_dst); 917*5cabbc6bSPrashanth Sreenivasa uint64_t dst_vdev = DVA_GET_VDEV(&mapping->vimep_dst); 918*5cabbc6bSPrashanth Sreenivasa 919*5cabbc6bSPrashanth Sreenivasa ASSERT3U(rs->rs_offset, >=, 920*5cabbc6bSPrashanth Sreenivasa DVA_MAPPING_GET_SRC_OFFSET(mapping)); 921*5cabbc6bSPrashanth Sreenivasa ASSERT3U(rs->rs_offset, <, 922*5cabbc6bSPrashanth Sreenivasa DVA_MAPPING_GET_SRC_OFFSET(mapping) + size); 923*5cabbc6bSPrashanth Sreenivasa ASSERT3U(dst_vdev, !=, v->vdev_id); 924*5cabbc6bSPrashanth Sreenivasa 925*5cabbc6bSPrashanth Sreenivasa uint64_t inner_offset = rs->rs_offset - 926*5cabbc6bSPrashanth Sreenivasa DVA_MAPPING_GET_SRC_OFFSET(mapping); 927*5cabbc6bSPrashanth Sreenivasa uint64_t inner_size = 928*5cabbc6bSPrashanth Sreenivasa MIN(rs->rs_asize, size - inner_offset); 929*5cabbc6bSPrashanth Sreenivasa 930*5cabbc6bSPrashanth Sreenivasa vdev_t *dst_v = vdev_lookup_top(spa, dst_vdev); 931*5cabbc6bSPrashanth Sreenivasa ASSERT3P(dst_v, !=, NULL); 932*5cabbc6bSPrashanth Sreenivasa 933*5cabbc6bSPrashanth Sreenivasa if (dst_v->vdev_ops == &vdev_indirect_ops) { 934*5cabbc6bSPrashanth Sreenivasa list_insert_head(&stack, 935*5cabbc6bSPrashanth Sreenivasa rs_alloc(dst_v, dst_offset + inner_offset, 936*5cabbc6bSPrashanth Sreenivasa inner_size, rs->rs_split_offset)); 937*5cabbc6bSPrashanth Sreenivasa 938*5cabbc6bSPrashanth Sreenivasa } 939*5cabbc6bSPrashanth Sreenivasa 940*5cabbc6bSPrashanth Sreenivasa if ((zfs_flags & ZFS_DEBUG_INDIRECT_REMAP) && 941*5cabbc6bSPrashanth Sreenivasa IS_P2ALIGNED(inner_size, 2 * SPA_MINBLOCKSIZE)) { 942*5cabbc6bSPrashanth Sreenivasa /* 943*5cabbc6bSPrashanth Sreenivasa * Note: This clause exists only solely for 944*5cabbc6bSPrashanth Sreenivasa * testing purposes. 
We use it to ensure that 945*5cabbc6bSPrashanth Sreenivasa * split blocks work and that the callbacks 946*5cabbc6bSPrashanth Sreenivasa * using them yield the same result if issued 947*5cabbc6bSPrashanth Sreenivasa * in reverse order. 948*5cabbc6bSPrashanth Sreenivasa */ 949*5cabbc6bSPrashanth Sreenivasa uint64_t inner_half = inner_size / 2; 950*5cabbc6bSPrashanth Sreenivasa 951*5cabbc6bSPrashanth Sreenivasa func(rs->rs_split_offset + inner_half, dst_v, 952*5cabbc6bSPrashanth Sreenivasa dst_offset + inner_offset + inner_half, 953*5cabbc6bSPrashanth Sreenivasa inner_half, arg); 954*5cabbc6bSPrashanth Sreenivasa 955*5cabbc6bSPrashanth Sreenivasa func(rs->rs_split_offset, dst_v, 956*5cabbc6bSPrashanth Sreenivasa dst_offset + inner_offset, 957*5cabbc6bSPrashanth Sreenivasa inner_half, arg); 958*5cabbc6bSPrashanth Sreenivasa } else { 959*5cabbc6bSPrashanth Sreenivasa func(rs->rs_split_offset, dst_v, 960*5cabbc6bSPrashanth Sreenivasa dst_offset + inner_offset, 961*5cabbc6bSPrashanth Sreenivasa inner_size, arg); 962*5cabbc6bSPrashanth Sreenivasa } 963*5cabbc6bSPrashanth Sreenivasa 964*5cabbc6bSPrashanth Sreenivasa rs->rs_offset += inner_size; 965*5cabbc6bSPrashanth Sreenivasa rs->rs_asize -= inner_size; 966*5cabbc6bSPrashanth Sreenivasa rs->rs_split_offset += inner_size; 967*5cabbc6bSPrashanth Sreenivasa mapping++; 968*5cabbc6bSPrashanth Sreenivasa } 969*5cabbc6bSPrashanth Sreenivasa 970*5cabbc6bSPrashanth Sreenivasa rw_exit(&v->vdev_indirect_rwlock); 971*5cabbc6bSPrashanth Sreenivasa kmem_free(rs, sizeof (remap_segment_t)); 972*5cabbc6bSPrashanth Sreenivasa } 973*5cabbc6bSPrashanth Sreenivasa list_destroy(&stack); 974*5cabbc6bSPrashanth Sreenivasa } 975*5cabbc6bSPrashanth Sreenivasa 976*5cabbc6bSPrashanth Sreenivasa static void 977*5cabbc6bSPrashanth Sreenivasa vdev_indirect_child_io_done(zio_t *zio) 978*5cabbc6bSPrashanth Sreenivasa { 979*5cabbc6bSPrashanth Sreenivasa zio_t *pio = zio->io_private; 980*5cabbc6bSPrashanth Sreenivasa 981*5cabbc6bSPrashanth Sreenivasa 
mutex_enter(&pio->io_lock); 982*5cabbc6bSPrashanth Sreenivasa pio->io_error = zio_worst_error(pio->io_error, zio->io_error); 983*5cabbc6bSPrashanth Sreenivasa mutex_exit(&pio->io_lock); 984*5cabbc6bSPrashanth Sreenivasa 985*5cabbc6bSPrashanth Sreenivasa abd_put(zio->io_abd); 986*5cabbc6bSPrashanth Sreenivasa } 987*5cabbc6bSPrashanth Sreenivasa 988*5cabbc6bSPrashanth Sreenivasa static void 989*5cabbc6bSPrashanth Sreenivasa vdev_indirect_io_start_cb(uint64_t split_offset, vdev_t *vd, uint64_t offset, 990*5cabbc6bSPrashanth Sreenivasa uint64_t size, void *arg) 991*5cabbc6bSPrashanth Sreenivasa { 992*5cabbc6bSPrashanth Sreenivasa zio_t *zio = arg; 993*5cabbc6bSPrashanth Sreenivasa 994*5cabbc6bSPrashanth Sreenivasa ASSERT3P(vd, !=, NULL); 995*5cabbc6bSPrashanth Sreenivasa 996*5cabbc6bSPrashanth Sreenivasa if (vd->vdev_ops == &vdev_indirect_ops) 997*5cabbc6bSPrashanth Sreenivasa return; 998*5cabbc6bSPrashanth Sreenivasa 999*5cabbc6bSPrashanth Sreenivasa zio_nowait(zio_vdev_child_io(zio, NULL, vd, offset, 1000*5cabbc6bSPrashanth Sreenivasa abd_get_offset(zio->io_abd, split_offset), 1001*5cabbc6bSPrashanth Sreenivasa size, zio->io_type, zio->io_priority, 1002*5cabbc6bSPrashanth Sreenivasa 0, vdev_indirect_child_io_done, zio)); 1003*5cabbc6bSPrashanth Sreenivasa } 1004*5cabbc6bSPrashanth Sreenivasa 1005*5cabbc6bSPrashanth Sreenivasa static void 1006*5cabbc6bSPrashanth Sreenivasa vdev_indirect_io_start(zio_t *zio) 1007*5cabbc6bSPrashanth Sreenivasa { 1008*5cabbc6bSPrashanth Sreenivasa spa_t *spa = zio->io_spa; 1009*5cabbc6bSPrashanth Sreenivasa 1010*5cabbc6bSPrashanth Sreenivasa ASSERT(spa_config_held(spa, SCL_ALL, RW_READER) != 0); 1011*5cabbc6bSPrashanth Sreenivasa if (zio->io_type != ZIO_TYPE_READ) { 1012*5cabbc6bSPrashanth Sreenivasa ASSERT3U(zio->io_type, ==, ZIO_TYPE_WRITE); 1013*5cabbc6bSPrashanth Sreenivasa ASSERT((zio->io_flags & 1014*5cabbc6bSPrashanth Sreenivasa (ZIO_FLAG_SELF_HEAL | ZIO_FLAG_INDUCE_DAMAGE)) != 0); 1015*5cabbc6bSPrashanth Sreenivasa } 
1016*5cabbc6bSPrashanth Sreenivasa 1017*5cabbc6bSPrashanth Sreenivasa vdev_indirect_remap(zio->io_vd, zio->io_offset, zio->io_size, 1018*5cabbc6bSPrashanth Sreenivasa vdev_indirect_io_start_cb, zio); 1019*5cabbc6bSPrashanth Sreenivasa 1020*5cabbc6bSPrashanth Sreenivasa zio_execute(zio); 1021*5cabbc6bSPrashanth Sreenivasa } 1022*5cabbc6bSPrashanth Sreenivasa 1023*5cabbc6bSPrashanth Sreenivasa vdev_ops_t vdev_indirect_ops = { 1024*5cabbc6bSPrashanth Sreenivasa vdev_indirect_open, 1025*5cabbc6bSPrashanth Sreenivasa vdev_indirect_close, 1026*5cabbc6bSPrashanth Sreenivasa vdev_default_asize, 1027*5cabbc6bSPrashanth Sreenivasa vdev_indirect_io_start, 1028*5cabbc6bSPrashanth Sreenivasa vdev_indirect_io_done, 1029*5cabbc6bSPrashanth Sreenivasa NULL, 1030*5cabbc6bSPrashanth Sreenivasa NULL, 1031*5cabbc6bSPrashanth Sreenivasa NULL, 1032*5cabbc6bSPrashanth Sreenivasa vdev_indirect_remap, 1033*5cabbc6bSPrashanth Sreenivasa VDEV_TYPE_INDIRECT, /* name of this vdev type */ 1034*5cabbc6bSPrashanth Sreenivasa B_FALSE /* leaf vdev */ 1035*5cabbc6bSPrashanth Sreenivasa }; 1036