/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright (c) 2004, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2011, 2016 by Delphix. All rights reserved.
 * Copyright 2013 Nexenta Systems, Inc. All rights reserved.
 * Copyright 2014 Josef "Jeff" Sipek <jeffpc@josefsipek.net>
 */
/*
 * Copyright (c) 2010, Intel Corporation.
 * All rights reserved.
 */
/*
 * Portions Copyright 2009 Advanced Micro Devices, Inc.
 */
/*
 * Copyright 2019 Joyent, Inc.
 */

/*
 * CPU Identification logic
 *
 * The purpose of this file and its companion, cpuid_subr.c, is to help deal
 * with the identification of CPUs, their features, and their topologies. More
 * specifically, this file helps drive the following:
 *
 * 1. Enumeration of features of the processor which are used by the kernel to
 *    determine what features to enable or disable. These may be instruction set
 *    enhancements or features that we use.
 *
 * 2. Enumeration of instruction set architecture (ISA) additions that userland
 *    will be told about through the auxiliary vector.
 *
 * 3. Understanding the physical topology of the CPU such as the number of
 *    caches, how many cores it has, whether or not it supports simultaneous
 *    multi-threading (SMT), etc.
 *
 * ------------------------
 * CPUID History and Basics
 * ------------------------
 *
 * The cpuid instruction was added by Intel roughly around the time that the
 * original Pentium was introduced. The purpose of cpuid was to provide, in a
 * programmatic fashion, information about the CPU that previously had to be
 * guessed at. For example, an important part of cpuid is that we can know what
 * extensions to the ISA exist. If you use an invalid opcode you would get a
 * #UD, so this method allows a program (whether a user program or the kernel)
 * to determine what exists without crashing or getting a SIGILL. Of course,
 * this was also during the era of the clones and the AMD Am5x86. The vendor
 * name shows up first in cpuid for a reason.
 *
 * cpuid information is broken down into units called 'leaves'. Each leaf puts
 * unique values into the registers %eax, %ebx, %ecx, and %edx and each leaf has
 * its own meaning. The different leaves are broken down into different regions:
 *
 *	[ 0, 7fffffff ]			This region is called the 'basic'
 *					region. This region is generally defined
 *					by Intel, though some of the original
 *					portions have different meanings based
 *					on the manufacturer. These days, Intel
 *					adds most new features to this region.
 *					AMD adds non-Intel compatible
 *					information in the third, extended
 *					region. Intel uses this for everything
 *					including ISA extensions, CPU
 *					features, cache information, topology,
 *					and more.
 *
 *					There is a hole carved out of this
 *					region which is reserved for
 *					hypervisors.
 *
 *	[ 40000000, 4fffffff ]		This region, which is found in the
 *					middle of the previous region, is
 *					explicitly promised to never be used by
 *					CPUs. Instead, it is used by hypervisors
 *					to communicate information about
 *					themselves to the operating system. The
 *					values and details are unique for each
 *					hypervisor.
 *
 *	[ 80000000, ffffffff ]		This region is called the 'extended'
 *					region. Some of the low leaves mirror
 *					parts of the basic leaves. This region
 *					has generally been used by AMD for
 *					various extensions. For example, AMD-
 *					specific information about caches,
 *					features, and topology are found in this
 *					region.
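 *
 * As a rough illustration of the three regions above (a sketch only, not code
 * used by this file; the enum and function names are hypothetical), a leaf
 * number can be classified with simple range checks:
 *
```c
#include <stdint.h>

// Hypothetical names used only for this illustration.
typedef enum {
	LEAF_REGION_BASIC,
	LEAF_REGION_HYPERVISOR,
	LEAF_REGION_EXTENDED
} leaf_region_t;

// Classify a cpuid leaf number into one of the regions in the table above.
// The hypervisor hole is checked first because it sits inside the basic range.
static leaf_region_t
leaf_region(uint32_t leaf)
{
	if (leaf >= 0x40000000U && leaf <= 0x4fffffffU)
		return (LEAF_REGION_HYPERVISOR);
	if (leaf >= 0x80000000U)
		return (LEAF_REGION_EXTENDED);
	return (LEAF_REGION_BASIC);
}
```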
 *
 * To specify a leaf, you place the desired leaf into %eax, zero %ebx, %ecx,
 * and %edx, and then issue the cpuid instruction. At the first leaf in each of
 * the ranges, one of the primary things returned is the maximum valid leaf in
 * that range. This allows for discovery of what range of CPUID is valid.
 *
 * The CPUs have potentially surprising behavior when using an invalid leaf or
 * unimplemented leaf. If the requested leaf is within the valid basic or
 * extended range, but is unimplemented, then %eax, %ebx, %ecx, and %edx will be
 * set to zero. However, if you specify a leaf that is outside of a valid range,
 * then the registers will instead be filled with the data of the last valid
 * _basic_ leaf. For example, if the maximum basic leaf is 0x3, then issuing a
 * cpuid for leaf 4 or an invalid extended leaf will return the information for
 * leaf 3.
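 *
 * The out-of-range rule above can be modeled as follows. This is a sketch of
 * the described behavior only, not something this file implements: the
 * function name is hypothetical, max_basic and max_ext stand for the maximum
 * leaves reported by the first leaf of the basic and extended ranges, and the
 * hypervisor region is ignored for simplicity.
 *
```c
#include <stdint.h>

// Return the leaf whose data would be reported for a request, following the
// rule described above: in-range requests are answered as asked, while
// out-of-range requests fall back to the last valid basic leaf. An in-range
// but unimplemented leaf (which reads as all zeros) is not modeled here.
static uint32_t
effective_leaf(uint32_t leaf, uint32_t max_basic, uint32_t max_ext)
{
	if (leaf <= max_basic)
		return (leaf);
	if (leaf >= 0x80000000U && leaf <= max_ext)
		return (leaf);
	return (max_basic);
}
```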
 *
 * Some leaves are broken down into sub-leaves. This means that the value
 * depends on both the leaf asked for in %eax and a secondary register. For
 * example, Intel uses the value in %ecx on leaf 7 to indicate a sub-leaf to get
 * additional information. Or when getting topology information in leaf 0xb, the
 * initial value in %ecx changes which level of the topology you are getting
 * information about.
 *
 * cpuid values are always kept to 32 bits regardless of whether or not the
 * program is in 64-bit mode. When executing in 64-bit mode, the upper
 * 32 bits of the register are always set to zero so that the values are the
 * same regardless of execution mode.
 *
 * ----------------------
 * Identifying Processors
 * ----------------------
 *
 * We can identify a processor in two steps. The first step looks at cpuid leaf
 * 0. Leaf 0 contains the processor's vendor information. This is done by
 * putting a 12 character string in %ebx, %edx, and %ecx, in that order.
 * On AMD, it is 'AuthenticAMD' and on Intel it is 'GenuineIntel'.
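 *
 * A minimal sketch of reassembling the vendor string from the leaf 0 register
 * values (the function name is hypothetical and this is not the code used in
 * this file). Each register carries four characters, least significant byte
 * first, and the registers are read in the order %ebx, %edx, %ecx:
 *
```c
#include <stdint.h>

// Copy the 12 vendor characters plus a terminating NUL into out, which must
// have room for at least 13 bytes.
static void
vendor_from_regs(uint32_t ebx, uint32_t ecx, uint32_t edx, char out[13])
{
	const uint32_t regs[3] = { ebx, edx, ecx };

	for (int r = 0; r < 3; r++) {
		for (int b = 0; b < 4; b++)
			out[r * 4 + b] = (char)((regs[r] >> (8 * b)) & 0xff);
	}
	out[12] = '\0';
}
```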
 *
 * From there, a processor is identified by a combination of three different
 * values:
 *
 *  1. Family
 *  2. Model
 *  3. Stepping
 *
 * Each vendor uses the family and model to uniquely identify a processor. The
 * way that family and model are changed depends on the vendor. For example,
 * Intel has been using family 0x6 for almost all of their processors since the
 * Pentium Pro/Pentium II era, often called the P6. The model is used to
 * identify the exact processor. Different models are often used for the client
 * (consumer) and server parts. Even though each processor often has major
 * architectural differences, they still are considered the same family by
 * Intel.
 *
 * On the other hand, each major AMD architecture generally has its own family.
 * For example, the K8 is family 0xf, Bulldozer 0x15, and Zen 0x17. Within a
 * family, the model number is used to help identify specific processors.
 *
 * The stepping is used to refer to a revision of a specific microprocessor. The
 * term comes from equipment used to produce masks that are used to create
 * integrated circuits.
 *
 * The information is present in leaf 1, %eax. In technical documentation you
 * will see the terms extended model and extended family. The original family,
 * model, and stepping fields were each 4 bits wide. If the base family is 0xf,
 * then one is to consult the extended family and add it to the base family.
 * The extended model, in turn, supplies the upper four bits of the model (on
 * Intel this also applies to family 0x6). Both take previously reserved bits
 * and allow for a larger number of families and models.
 *
 * When we process this information, we store the full family, model, and
 * stepping in the struct cpuid_info members cpi_family, cpi_model, and
 * cpi_step, respectively. Whenever you are performing comparisons with the
 * family, model, and stepping, you should use these members and not the raw
 * values from cpuid. If you must use the raw values from cpuid directly, you
 * must make sure that you add the extended model and family to the base model
 * and family.
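 *
 * To make the combination of base and extended fields concrete, here is a
 * sketch of the decoding as vendors commonly document it. It is not the logic
 * from this file, and the struct and function names are hypothetical. The
 * extended family is added only when the base family is 0xf; the extended
 * model supplies the upper model bits when the base family is 0xf or, as an
 * Intel-specific rule that this sketch applies unconditionally, 0x6:
 *
```c
#include <stdint.h>

// Hypothetical container for the decoded values.
typedef struct {
	uint32_t family;
	uint32_t model;
	uint32_t stepping;
} cpu_sig_t;

// Decode the leaf 1 %eax signature into full family, model, and stepping.
static cpu_sig_t
decode_signature(uint32_t eax)
{
	uint32_t base_family = (eax >> 8) & 0xf;
	uint32_t base_model = (eax >> 4) & 0xf;
	uint32_t ext_family = (eax >> 20) & 0xff;
	uint32_t ext_model = (eax >> 16) & 0xf;
	cpu_sig_t sig;

	sig.stepping = eax & 0xf;
	sig.family = base_family;
	sig.model = base_model;

	if (base_family == 0xf)
		sig.family += ext_family;
	if (base_family == 0x6 || base_family == 0xf)
		sig.model += ext_model << 4;

	return (sig);
}
```
 *
 * For instance, a raw value of 0x00800f11 decodes to family 0x17, model 0x1,
 * and stepping 0x1 under these rules.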
 *
 * In general, we do not use information about the family, model, and stepping
 * to determine whether or not a feature is present; that is generally driven by
 * specific leaves. However, when something we care about on the processor is
 * not considered 'architectural', meaning that it is specific to a set of
 * processors and not promised in the architecture model to be consistent from
 * generation to generation, then we will fall back on this information. The
 * most common cases where this comes up are when we have to work around errata
 * in the processor, are dealing with processor-specific features such as CPU
 * performance counters, or want to provide additional information for things
 * such as fault management.
 *
 * While processors also have a brand string, which is the name that people
 * are familiar with when buying the processor, it is not meant for
 * programmatic consumption. That is what the family, model, and stepping are
 * for.
 *
 * ------------
 * CPUID Passes
 * ------------
 *
 * As part of performing feature detection, we break this into several different
 * passes. The passes are as follows:
 *
 *	Pass 0		This is a primordial pass done in locore.s to deal with
 *			Cyrix CPUs that don't support cpuid. The reality is that
 *			we likely don't run on them any more, but there is still
 *			logic for handling them.
 *
 *	Pass 1		This is the primary pass and is responsible for doing a
 *			large number of different things:
 *
 *			1. Determining which vendor manufactured the CPU and
 *			determining the family, model, and stepping information.
 *
 *			2. Gathering a large number of feature flags to
 *			determine which features the CPU supports and which
 *			indicate things that we need to do other work in the OS
 *			to enable. Features detected this way are added to the
 *			x86_featureset which can be queried to
 *			determine what we should do. This includes processing
 *			all of the basic and extended CPU features that we care
 *			about.
 *
 *			3. Determining the CPU's topology. This includes
 *			information about how many cores and threads are present
 *			in the package. It also is responsible for figuring out
 *			which logical CPUs are potentially part of the same core
 *			and what other resources they might share. For more
 *			information see the 'Topology' section.
 *
 *			4. Determining the set of CPU security-specific features
 *			that we need to worry about and determining the
 *			appropriate set of workarounds.
 *
 *			Pass 1 on the boot CPU occurs before KMDB is started.
 *
 *	Pass 2		The second pass is done after startup(). Here, we check
 *			other miscellaneous features. Most of this is gathering
 *			additional basic and extended features that we'll use in
 *			later passes or for debugging support.
 *
 *	Pass 3		The third pass occurs after the kernel memory allocator
 *			has been fully initialized. This gathers information
 *			where we might need dynamic memory available for our
 *			uses. This includes several varying width leaves that
 *			have cache information and the processor's brand string.
 *
 *	Pass 4		The fourth and final normal pass is performed after the
 *			kernel has brought most everything online. This is
 *			invoked from post_startup(). In this pass, we go through
 *			the set of features that we have enabled and turn that
 *			into the hardware auxiliary vector features that
 *			userland receives. This is used by userland, primarily
 *			by the run-time link-editor (RTLD), though userland
 *			software could also refer to it directly.
 *
 *	Microcode	After a microcode update, we do a selective rescan of
 *			the cpuid leaves to determine what features have
 *			changed. Microcode updates can provide more details
 *			about security related features to deal with issues like
 *			Spectre and L1TF. On occasion, vendors have violated
 *			their contract and removed bits. However, we don't try
 *			to detect that because that puts us in a situation that
 *			we really can't deal with. As such, the only things we
 *			rescan today are security related features. See
 *			cpuid_pass_ucode().
 *
 * All of the passes (except pass 0) are run on all CPUs. However, for the most
 * part we only care about what the boot CPU says about this information and use
 * the other CPUs as a rough guide to sanity check that we have the same feature
 * set.
 *
 * We do not support running multiple logical CPUs with different, let alone
 * disjoint, feature sets.
 *
 * ------------------
 * Processor Topology
 * ------------------
 *
 * One of the important things that we need to do is to understand the topology
 * of the underlying processor. When we say topology in this case, we're trying
 * to understand the relationship between the logical CPUs that the operating
 * system sees and the underlying physical layout. Different logical CPUs may
 * share different resources which can have important consequences for the
 * performance of the system. For example, they may share caches, execution
 * units, and more.
 *
 * The topology of the processor changes from generation to generation and
 * vendor to vendor. Along with that, different vendors use different
 * terminology, and the operating system itself uses occasionally overlapping
 * terminology. It's important to understand what this topology looks like so
 * one can understand the different things that we try to calculate and
 * determine.
 *
 * To get started, let's define some terminology that we've used so far, that
 * is used throughout this file, and that is fairly generic across multiple
 * vendors:
 *
 * CPU
 *
 *	A central processing unit (CPU) refers to a logical and/or virtual
 *	entity that the operating system can execute instructions on. The
 *	underlying resources for this CPU may be shared between multiple
 *	entities; however, to the operating system it is a discrete unit.
 *
 * PROCESSOR and PACKAGE
 *
 *	Generally, when we use the term 'processor' on its own, we are referring
 *	to the physical entity that one buys and plugs into a board. However,
 *	because processor has been overloaded and one might see it used to mean
 *	multiple different levels, we will instead use the term 'package' for
 *	the rest of this file. The term package comes from the electrical
 *	engineering side and refers to the physical entity that encloses the
 *	electronics inside. Strictly speaking the package can contain more than
 *	just the CPU; for example, on many processors it may also have what's
 *	called an 'integrated graphics processing unit (GPU)'. Because the
 *	package can encapsulate multiple units, it is the largest physical unit
 *	that we refer to.
 *
 * SOCKET
 *
 *	A socket refers to a unit on a system board (generally the motherboard)
 *	that can receive a package. A single package, or processor, is plugged
 *	into a single socket. A system may have multiple sockets. Oftentimes,
 *	the term socket is used interchangeably with package and refers to the
 *	electrical component that has been plugged in, and not the receptacle
 *	itself.
 *
 * CORE
 *
 *	A core refers to the physical instantiation of a CPU, generally with a
 *	full set of hardware resources available to it. A package may contain
 *	multiple cores inside of it or it may just have a single one. A
 *	processor with more than one core is often referred to as 'multi-core'.
 *	In illumos, we will use the feature X86FSET_CMP to refer to a system
 *	that has 'multi-core' processors.
 *
 *	A core may expose a single logical CPU to the operating system, or it
 *	may expose multiple CPUs, which we call threads, defined below.
 *
 *	Some resources may still be shared by cores in the same package. For
 *	example, many processors will share the level 3 cache between cores.
 *	Some AMD generations share hardware resources between cores. For more
 *	information on that see the section 'AMD Topology'.
 *
 * THREAD and STRAND
 *
 *	In this file, generally a thread refers to a hardware resource and not
 *	the operating system's logical abstraction. A thread is always exposed
 *	as an independent logical CPU to the operating system. A thread belongs
 *	to a specific core. A core may have more than one thread. When that is
 *	the case, the threads that are part of the same core are often referred
 *	to as 'siblings'.
 *
 *	When multiple threads exist, this is generally referred to as
 *	simultaneous multi-threading (SMT). When Intel introduced this in their
 *	processors they called it hyper-threading (HT). When multiple threads
 *	are active in a core, they split the resources of the core. For example,
 *	two threads may share the same set of hardware execution units.
 *
 *	The operating system often uses the term 'strand' to refer to a thread.
 *	This helps disambiguate it from the software concept.
 *
 * CHIP
 *
 *	Unfortunately, the term 'chip' is dramatically overloaded. At its most
 *	base meaning, it is used to refer to a single integrated circuit, which
 *	may or may not be the only thing in the package. In illumos, when you
 *	see the term 'chip' it is almost always referring to the same thing as
 *	the 'package'. However, many vendors may use chip to refer to one of
 *	many integrated circuits that have been placed in the package. As an
 *	example, see the subsequent definition.
 *
 *	To try and keep things consistent, we will only use chip when referring
 *	to the entire integrated circuit package, with the exception of the
 *	definition of multi-chip module (because it is in the name) and use the
 *	term 'die' when we want the more general, potential sub-component
 *	definition.
 *
 * DIE
 *
 *	A die refers to an integrated circuit. Inside of the package there may
 *	be a single die or multiple dies. This is sometimes called a 'chip' in
 *	vendors' parlance, but in this file, we use the term die to refer to a
 *	subcomponent.
 *
 * MULTI-CHIP MODULE
 *
 *	A multi-chip module (MCM) refers to putting multiple distinct chips that
 *	are connected together in the same package. When a multi-chip design is
 *	used, generally each chip is manufactured independently and then joined
 *	together in the package. For example, on AMD's Zen microarchitecture
 *	(family 0x17), the package contains several dies (the second meaning of
 *	chip from above) that are connected together.
 *
 * CACHE
 *
 *	A cache is a part of the processor that maintains copies of recently
 *	accessed memory. Caches are split into levels and then into types.
 *	Commonly there are one to three levels, called level one, two, and
 *	three. The lower the level, the smaller it is, the closer it is to the
 *	execution units of the CPU, and the faster it is to access. The layout
 *	and design of the cache come in many different flavors; consult other
 *	resources for a discussion of those.
 *
 *	Caches are generally split into two types, the instruction and data
 *	cache. The caches contain what their names suggest: the instruction
 *	cache has executable program text, while the data cache has all other
 *	memory that the processor accesses. As of this writing, data is kept
 *	coherent between all of the caches on x86, so if one modifies program
 *	text before it is executed, that will be in the data cache, and the
 *	instruction cache will be synchronized with that change when the
 *	processor actually executes those instructions. This coherency also
 *	covers the fact that data could show up in multiple caches.
 *
 *	Generally, the lowest level caches are specific to a core. However, the
 *	last level cache is shared between some number of cores. The number of
 *	CPUs sharing this last level cache is important. This has implications
 *	for the choices that the scheduler makes, as accessing memory that might
 *	be in a remote cache after thread migration can be quite expensive.
 *
 *	Sometimes, the word cache is abbreviated with a '$', because in US
 *	English the word cache is pronounced the same as cash. So L1D$ refers to
 *	the L1 data cache, and L2$ would be the L2 cache. This will not be used
 *	in the rest of this theory statement for clarity.
 *
 * MEMORY CONTROLLER
 *
 *	The memory controller is a component that provides access to DRAM. Each
 *	memory controller can access a set number of DRAM channels. Each channel
 *	can have a number of DIMMs (sticks of memory) associated with it. A
 *	given package may have more than one memory controller. The association
 *	of the memory controller to a group of cores is important as it is
 *	cheaper to access memory on the controller that you are associated with.
 *
4350ce813fRobert Mustacchi * NUMA
4360ce813fRobert Mustacchi *
 *	NUMA, or non-uniform memory access, describes a way that systems are
 *	built. On x86, any processor core can address all of the memory in the
 *	system. However, when using multiple sockets or possibly within a
4400ce813fRobert Mustacchi *	multi-chip module, some of that memory is physically closer and some of
4410ce813fRobert Mustacchi *	it is further. Memory that is further away is more expensive to access.
4420ce813fRobert Mustacchi *	Consider the following image of multiple sockets with memory:
4430ce813fRobert Mustacchi *
4440ce813fRobert Mustacchi *	+--------+                                                +--------+
4450ce813fRobert Mustacchi *	| DIMM A |         +----------+      +----------+         | DIMM D |
4460ce813fRobert Mustacchi *	+--------+-+       |          |      |          |       +-+------+-+
4470ce813fRobert Mustacchi *	  | DIMM B |=======| Socket 0 |======| Socket 1 |=======| DIMM E |
4480ce813fRobert Mustacchi *	  +--------+-+     |          |      |          |     +-+------+-+
4490ce813fRobert Mustacchi *	    | DIMM C |     +----------+      +----------+     | DIMM F |
4500ce813fRobert Mustacchi *	    +--------+                                        +--------+
4510ce813fRobert Mustacchi *
 *	In this example, Socket 0 is closer to DIMMs A-C while Socket 1 is
 *	closer to DIMMs D-F. This means that it is cheaper for Socket 0 to
 *	access DIMMs A-C and more expensive to access D-F as it has to go
 *	through Socket 1 to get there. The inverse is true for Socket 1: DIMMs
 *	D-F are cheaper than A-C. While the multi-socket form is the most
 *	common, this can also sometimes occur within a multi-chip module. For a
 *	more involved example of this, see the AMD topology section.
4590ce813fRobert Mustacchi *
4600ce813fRobert Mustacchi *
4610ce813fRobert Mustacchi * Intel Topology
4620ce813fRobert Mustacchi * --------------
4630ce813fRobert Mustacchi *
 * Most Intel processors since Nehalem (as of this writing the current gen
 * is Skylake / Cannon Lake) follow a fairly similar pattern. The CPU portion of
4660ce813fRobert Mustacchi * the package is a single monolithic die. MCMs currently aren't used. Most
4670ce813fRobert Mustacchi * parts have three levels of caches, with the L3 cache being shared between
4680ce813fRobert Mustacchi * all of the cores on the package. The L1/L2 cache is generally specific to
4690ce813fRobert Mustacchi * an individual core. The following image shows at a simplified level what
 * this looks like. The memory controller is commonly part of something called
 * the 'Uncore', which used to be separate physical chips that were not a part
 * of the package, but is now part of the same chip.
4730ce813fRobert Mustacchi *
4740ce813fRobert Mustacchi *  +-----------------------------------------------------------------------+
4750ce813fRobert Mustacchi *  | Package                                                               |
4760ce813fRobert Mustacchi *  |  +-------------------+  +-------------------+  +-------------------+  |
4770ce813fRobert Mustacchi *  |  | Core              |  | Core              |  | Core              |  |
4780ce813fRobert Mustacchi *  |  |  +--------+ +---+ |  |  +--------+ +---+ |  |  +--------+ +---+ |  |
4790ce813fRobert Mustacchi *  |  |  | Thread | | L | |  |  | Thread | | L | |  |  | Thread | | L | |  |
4800ce813fRobert Mustacchi *  |  |  +--------+ | 1 | |  |  +--------+ | 1 | |  |  +--------+ | 1 | |  |
4810ce813fRobert Mustacchi *  |  |  +--------+ |   | |  |  +--------+ |   | |  |  +--------+ |   | |  |
4820ce813fRobert Mustacchi *  |  |  | Thread | |   | |  |  | Thread | |   | |  |  | Thread | |   | |  |
4830ce813fRobert Mustacchi *  |  |  +--------+ +---+ |  |  +--------+ +---+ |  |  +--------+ +---+ |  |
4840ce813fRobert Mustacchi *  |  |  +--------------+ |  |  +--------------+ |  |  +--------------+ |  |
4850ce813fRobert Mustacchi *  |  |  | L2 Cache     | |  |  | L2 Cache     | |  |  | L2 Cache     | |  |
4860ce813fRobert Mustacchi *  |  |  +--------------+ |  |  +--------------+ |  |  +--------------+ |  |
4870ce813fRobert Mustacchi *  |  +-------------------+  +-------------------+  +-------------------+  |
4880ce813fRobert Mustacchi *  | +-------------------------------------------------------------------+ |
4890ce813fRobert Mustacchi *  | |                         Shared L3 Cache                           | |
4900ce813fRobert Mustacchi *  | +-------------------------------------------------------------------+ |
4910ce813fRobert Mustacchi *  | +-------------------------------------------------------------------+ |
4920ce813fRobert Mustacchi *  | |                        Memory Controller                          | |
4930ce813fRobert Mustacchi *  | +-------------------------------------------------------------------+ |
4940ce813fRobert Mustacchi *  +-----------------------------------------------------------------------+
4950ce813fRobert Mustacchi *
 * A side effect of this current architecture is that what we care about from a
 * scheduling and topology perspective is simplified. In general, we care about
 * understanding which logical CPUs are part of the same core and socket.
4990ce813fRobert Mustacchi *
 * To determine the relationship between threads and cores, Intel initially used
 * the identifier in the advanced programmable interrupt controller (APIC). They
 * also added cpuid leaf 4 to give additional information about the number of
 * threads and cores in the processor. With the addition of x2apic (which
 * widened the ID used to address logical CPUs from 8 bits to 32 bits), an
 * additional cpuid topology leaf 0xB was added.
5060ce813fRobert Mustacchi *
5070ce813fRobert Mustacchi * AMD Topology
5080ce813fRobert Mustacchi * ------------
5090ce813fRobert Mustacchi *
5100ce813fRobert Mustacchi * When discussing AMD topology, we want to break this into three distinct
5110ce813fRobert Mustacchi * generations of topology. There's the basic topology that has been used in
5120ce813fRobert Mustacchi * family 0xf+ (Opteron, Athlon64), there's the topology that was introduced
5130ce813fRobert Mustacchi * with family 0x15 (Bulldozer), and there's the topology that was introduced
5140ce813fRobert Mustacchi * with family 0x17 (Zen). AMD also has some additional terminology that's worth
5150ce813fRobert Mustacchi * talking about.
5160ce813fRobert Mustacchi *
5170ce813fRobert Mustacchi * Until the introduction of family 0x17 (Zen), AMD did not implement something
5180ce813fRobert Mustacchi * that they considered SMT. Whether or not the AMD processors have SMT
5190ce813fRobert Mustacchi * influences many things including scheduling and reliability, availability,
5200ce813fRobert Mustacchi * and serviceability (RAS) features.
5210ce813fRobert Mustacchi *
5220ce813fRobert Mustacchi * NODE
5230ce813fRobert Mustacchi *
5240ce813fRobert Mustacchi *	AMD uses the term node to refer to a die that contains a number of cores
5250ce813fRobert Mustacchi *	and I/O resources. Depending on the processor family and model, more
5260ce813fRobert Mustacchi *	than one node can be present in the package. When there is more than one
5270ce813fRobert Mustacchi *	node this indicates a multi-chip module. Usually each node has its own
5280ce813fRobert Mustacchi *	access to memory and I/O devices. This is important and generally
5290ce813fRobert Mustacchi *	different from the corresponding Intel Nehalem-Skylake+ processors. As a
5300ce813fRobert Mustacchi *	result, we track this relationship in the operating system.
5310ce813fRobert Mustacchi *
5320ce813fRobert Mustacchi *	In processors with an L3 cache, the L3 cache is generally shared across
5330ce813fRobert Mustacchi *	the entire node, though the way this is carved up varies from generation
5340ce813fRobert Mustacchi *	to generation.
5350ce813fRobert Mustacchi *
5360ce813fRobert Mustacchi * BULLDOZER
5370ce813fRobert Mustacchi *
 *	Starting with the Bulldozer family (0x15) and continuing until the
 *	introduction of the Zen microarchitecture, AMD used the idea of a
5400ce813fRobert Mustacchi *	compute unit. In a compute unit, two traditional cores share a number of
5410ce813fRobert Mustacchi *	hardware resources. Critically, they share the FPU, L1 instruction
5420ce813fRobert Mustacchi *	cache, and the L2 cache. Several compute units were then combined inside
5430ce813fRobert Mustacchi *	of a single node.  Because the integer execution units, L1 data cache,
5440ce813fRobert Mustacchi *	and some other resources were not shared between the cores, AMD never
5450ce813fRobert Mustacchi *	considered this to be SMT.
5460ce813fRobert Mustacchi *
5470ce813fRobert Mustacchi * ZEN
5480ce813fRobert Mustacchi *
 *	The Zen family (0x17) uses a multi-chip module (MCM) design. The dies
 *	in the module are called Zeppelin. These dies are similar to the idea
 *	of nodes used previously. Each of these nodes has two DRAM channels
 *	which all of the cores in the node can access uniformly. These nodes
 *	are linked together in the package, creating a NUMA environment.
5540ce813fRobert Mustacchi *
5550ce813fRobert Mustacchi *	The Zeppelin die itself contains two different 'core complexes'. Each
5560ce813fRobert Mustacchi *	core complex consists of four cores which each have two threads, for a
5570ce813fRobert Mustacchi *	total of 8 logical CPUs per complex. Unlike other generations,
5580ce813fRobert Mustacchi *	where all the logical CPUs in a given node share the L3 cache, here each
5590ce813fRobert Mustacchi *	core complex has its own shared L3 cache.
5600ce813fRobert Mustacchi *
5610ce813fRobert Mustacchi *	A further thing that we need to consider is that in some configurations,
5620ce813fRobert Mustacchi *	particularly with the Threadripper line of processors, not every die
5630ce813fRobert Mustacchi *	actually has its memory controllers wired up to actual memory channels.
5640ce813fRobert Mustacchi *	This means that some cores have memory attached to them and others
5650ce813fRobert Mustacchi *	don't.
5660ce813fRobert Mustacchi *
5670ce813fRobert Mustacchi *	To put Zen in perspective, consider the following images:
5680ce813fRobert Mustacchi *
5690ce813fRobert Mustacchi *      +--------------------------------------------------------+
5700ce813fRobert Mustacchi *      | Core Complex                                           |
5710ce813fRobert Mustacchi *      | +-------------------+    +-------------------+  +---+  |
5720ce813fRobert Mustacchi *      | | Core       +----+ |    | Core       +----+ |  |   |  |
5730ce813fRobert Mustacchi *      | | +--------+ | L2 | |    | +--------+ | L2 | |  |   |  |
5740ce813fRobert Mustacchi *      | | | Thread | +----+ |    | | Thread | +----+ |  |   |  |
5750ce813fRobert Mustacchi *      | | +--------+-+ +--+ |    | +--------+-+ +--+ |  | L |  |
5760ce813fRobert Mustacchi *      | |   | Thread | |L1| |    |   | Thread | |L1| |  | 3 |  |
5770ce813fRobert Mustacchi *      | |   +--------+ +--+ |    |   +--------+ +--+ |  |   |  |
5780ce813fRobert Mustacchi *      | +-------------------+    +-------------------+  | C |  |
5790ce813fRobert Mustacchi *      | +-------------------+    +-------------------+  | a |  |
5800ce813fRobert Mustacchi *      | | Core       +----+ |    | Core       +----+ |  | c |  |
5810ce813fRobert Mustacchi *      | | +--------+ | L2 | |    | +--------+ | L2 | |  | h |  |
5820ce813fRobert Mustacchi *      | | | Thread | +----+ |    | | Thread | +----+ |  | e |  |
5830ce813fRobert Mustacchi *      | | +--------+-+ +--+ |    | +--------+-+ +--+ |  |   |  |
5840ce813fRobert Mustacchi *      | |   | Thread | |L1| |    |   | Thread | |L1| |  |   |  |
5850ce813fRobert Mustacchi *      | |   +--------+ +--+ |    |   +--------+ +--+ |  |   |  |
5860ce813fRobert Mustacchi *      | +-------------------+    +-------------------+  +---+  |
5870ce813fRobert Mustacchi *      |                                                        |
5880ce813fRobert Mustacchi *	+--------------------------------------------------------+
5890ce813fRobert Mustacchi *
5900ce813fRobert Mustacchi *  This first image represents a single Zen core complex that consists of four
5910ce813fRobert Mustacchi *  cores.
5920ce813fRobert Mustacchi *
5930ce813fRobert Mustacchi *
5940ce813fRobert Mustacchi *	+--------------------------------------------------------+
5950ce813fRobert Mustacchi *	| Zeppelin Die                                           |
5960ce813fRobert Mustacchi *	|  +--------------------------------------------------+  |
5970ce813fRobert Mustacchi *	|  |         I/O Units (PCIe, SATA, USB, etc.)        |  |
5980ce813fRobert Mustacchi *	|  +--------------------------------------------------+  |
5990ce813fRobert Mustacchi *      |                           HH                           |
6000ce813fRobert Mustacchi *	|          +-----------+    HH    +-----------+          |
6010ce813fRobert Mustacchi *	|          |           |    HH    |           |          |
6020ce813fRobert Mustacchi *	|          |    Core   |==========|    Core   |          |
6030ce813fRobert Mustacchi *	|          |  Complex  |==========|  Complex  |          |
6040ce813fRobert Mustacchi *	|          |           |    HH    |           |          |
6050ce813fRobert Mustacchi *	|          +-----------+    HH    +-----------+          |
6060ce813fRobert Mustacchi *      |                           HH                           |
6070ce813fRobert Mustacchi *	|  +--------------------------------------------------+  |
6080ce813fRobert Mustacchi *	|  |                Memory Controller                 |  |
6090ce813fRobert Mustacchi *	|  +--------------------------------------------------+  |
6100ce813fRobert Mustacchi *      |                                                        |
6110ce813fRobert Mustacchi *	+--------------------------------------------------------+
6120ce813fRobert Mustacchi *
 *  This image represents a single Zeppelin Die. Note how both core complexes
 *  are connected to the same memory controller and I/O units. While each core
 *  complex has its own L3 cache as seen in the first image, they both have
 *  uniform access to memory.
6170ce813fRobert Mustacchi *
6180ce813fRobert Mustacchi *
6190ce813fRobert Mustacchi *                      PP                     PP
6200ce813fRobert Mustacchi *                      PP                     PP
6210ce813fRobert Mustacchi *           +----------PP---------------------PP---------+
6220ce813fRobert Mustacchi *           |          PP                     PP         |
6230ce813fRobert Mustacchi *           |    +-----------+          +-----------+    |
6240ce813fRobert Mustacchi *           |    |           |          |           |    |
6250ce813fRobert Mustacchi *       MMMMMMMMM|  Zeppelin |==========|  Zeppelin |MMMMMMMMM
6260ce813fRobert Mustacchi *       MMMMMMMMM|    Die    |==========|    Die    |MMMMMMMMM
6270ce813fRobert Mustacchi *           |    |           |          |           |    |
6280ce813fRobert Mustacchi *           |    +-----------+ooo    ...+-----------+    |
6290ce813fRobert Mustacchi *           |          HH      ooo  ...       HH         |
6300ce813fRobert Mustacchi *           |          HH        oo..         HH         |
6310ce813fRobert Mustacchi *           |          HH        ..oo         HH         |
6320ce813fRobert Mustacchi *           |          HH      ...  ooo       HH         |
6330ce813fRobert Mustacchi *           |    +-----------+...    ooo+-----------+    |
6340ce813fRobert Mustacchi *           |    |           |          |           |    |
6350ce813fRobert Mustacchi *       MMMMMMMMM|  Zeppelin |==========|  Zeppelin |MMMMMMMMM
6360ce813fRobert Mustacchi *       MMMMMMMMM|    Die    |==========|    Die    |MMMMMMMMM
6370ce813fRobert Mustacchi *           |    |           |          |           |    |
6380ce813fRobert Mustacchi *           |    +-----------+          +-----------+    |
6390ce813fRobert Mustacchi *           |          PP                     PP         |
6400ce813fRobert Mustacchi *           +----------PP---------------------PP---------+
6410ce813fRobert Mustacchi *                      PP                     PP
6420ce813fRobert Mustacchi *                      PP                     PP
6430ce813fRobert Mustacchi *
6440ce813fRobert Mustacchi *  This image represents a single Zen package. In this example, it has four
6450ce813fRobert Mustacchi *  Zeppelin dies, though some configurations only have a single one. In this
6460ce813fRobert Mustacchi *  example, each die is directly connected to the next. Also, each die is
 *  represented as being connected to memory by the 'M' character and connected
 *  to PCIe devices and other I/O by the 'P' character. Because each Zeppelin
6490ce813fRobert Mustacchi *  die is made up of two core complexes, we have multiple different NUMA
6500ce813fRobert Mustacchi *  domains that we care about for these systems.
6510ce813fRobert Mustacchi *
6520ce813fRobert Mustacchi * CPUID LEAVES
6530ce813fRobert Mustacchi *
6540ce813fRobert Mustacchi * There are a few different CPUID leaves that we can use to try and understand
6550ce813fRobert Mustacchi * the actual state of the world. As part of the introduction of family 0xf, AMD
 * added CPUID leaf 0x80000008. This leaf tells us the number of logical
 * processors that are in the package. Because families before Zen didn't have
 * SMT, this was always the number of cores that were in the package. However,
 * it should always be thought of as the number of logical CPUs to be consistent
 * between generations. In addition, we also get the number of APIC ID bits that
 * are used to represent the logical processors. This is important for deriving
 * topology information.
6630ce813fRobert Mustacchi *
6640ce813fRobert Mustacchi * In the Bulldozer family, AMD added leaf 0x8000001E. The information varies a
6650ce813fRobert Mustacchi * bit between Bulldozer and later families, but it is quite useful in
6660ce813fRobert Mustacchi * determining the topology information. Because this information has changed
6670ce813fRobert Mustacchi * across family generations, it's worth calling out what these mean
6680ce813fRobert Mustacchi * explicitly. The registers have the following meanings:
6690ce813fRobert Mustacchi *
6700ce813fRobert Mustacchi *	%eax	The APIC ID. The entire register is defined to have a 32-bit
6710ce813fRobert Mustacchi *		APIC ID, even though on systems without x2apic support, it will
6720ce813fRobert Mustacchi *		be limited to 8 bits.
6730ce813fRobert Mustacchi *
6740ce813fRobert Mustacchi *	%ebx	On Bulldozer-era systems this contains information about the
6750ce813fRobert Mustacchi *		number of cores that are in a compute unit (cores that share
6760ce813fRobert Mustacchi *		resources). It also contains a per-package compute unit ID that
6770ce813fRobert Mustacchi *		identifies which compute unit the logical CPU is a part of.
6780ce813fRobert Mustacchi *
6790ce813fRobert Mustacchi *		On Zen-era systems this instead contains the number of threads
6800ce813fRobert Mustacchi *		per core and the ID of the core that the logical CPU is a part
 *		of. Note that this ID is unique only within the package; it is
 *		not globally unique across the entire system.
6830ce813fRobert Mustacchi *
6840ce813fRobert Mustacchi *	%ecx	This contains the number of nodes that exist in the package. It
6850ce813fRobert Mustacchi *		also contains an ID that identifies which node the logical CPU
6860ce813fRobert Mustacchi *		is a part of.
6870ce813fRobert Mustacchi *
 * Finally, we also use cpuid leaf 0x8000001D to learn about the cache layout
 * and determine which logical CPUs are sharing which caches.
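 *
 * As an illustration only, the Zen-era layout of leaf 0x8000001E described
 * above could be decoded roughly as follows. This is a sketch, not the code
 * used in this file: the struct and function names are hypothetical, and the
 * bit positions (core ID in %ebx[7:0], threads per core minus one in
 * %ebx[15:8], node ID in %ecx[7:0], nodes per package minus one in
 * %ecx[10:8]) are assumptions that should be checked against AMD's
 * documentation for the family in question.
 *
```c
#include <assert.h>
#include <stdint.h>

// Hypothetical container for the decoded fields; not a structure in this file.
typedef struct amd_topo_1e {
	uint32_t at_apicid;		// %eax: APIC ID
	uint32_t at_coreid;		// %ebx[7:0]: per-package core ID
	uint32_t at_threads_per_core;	// %ebx[15:8] + 1
	uint32_t at_nodeid;		// %ecx[7:0]: node ID
	uint32_t at_nodes_per_pkg;	// %ecx[10:8] + 1
} amd_topo_1e_t;

// Extract bits [hi:lo] of a 32-bit value. Only fields narrower than 32 bits
// are supported, which keeps the mask computation well defined.
uint32_t
bitx32(uint32_t val, unsigned hi, unsigned lo)
{
	assert(hi >= lo && hi - lo < 31);
	return ((val >> lo) & ((UINT32_C(1) << (hi - lo + 1)) - 1));
}

// Decode the raw register values under the assumed Zen-era layout.
amd_topo_1e_t
amd_decode_1e_zen(uint32_t eax, uint32_t ebx, uint32_t ecx)
{
	amd_topo_1e_t t;

	t.at_apicid = eax;
	t.at_coreid = bitx32(ebx, 7, 0);
	t.at_threads_per_core = bitx32(ebx, 15, 8) + 1;
	t.at_nodeid = bitx32(ecx, 7, 0);
	t.at_nodes_per_pkg = bitx32(ecx, 10, 8) + 1;
	return (t);
}
```
 *
 * The counts are stored minus one under this assumed layout, which is why the
 * sketch adds one to them before reporting them.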
6900ce813fRobert Mustacchi *
6910ce813fRobert Mustacchi * illumos Topology
6920ce813fRobert Mustacchi * ----------------
6930ce813fRobert Mustacchi *
6940ce813fRobert Mustacchi * Based on the above we synthesize the information into several different
6950ce813fRobert Mustacchi * variables that we store in the 'struct cpuid_info'. We'll go into the details
6960ce813fRobert Mustacchi * of what each member is supposed to represent and their uniqueness. In
6970ce813fRobert Mustacchi * general, there are two levels of uniqueness that we care about. We care about
6980ce813fRobert Mustacchi * an ID that is globally unique. That means that it will be unique across all
6990ce813fRobert Mustacchi * entities in the system. For example, the default logical CPU ID is globally
7000ce813fRobert Mustacchi * unique. On the other hand, there is some information that we only care about
7010ce813fRobert Mustacchi * being unique within the context of a single package / socket. Here are the
7020ce813fRobert Mustacchi * variables that we keep track of and their meaning.
7030ce813fRobert Mustacchi *
 * Several of the identifier values, with the exception of cpi_apicid, are
 * allowed to be synthetic.
7060ce813fRobert Mustacchi *
7070ce813fRobert Mustacchi *
7080ce813fRobert Mustacchi * cpi_apicid
7090ce813fRobert Mustacchi *
7100ce813fRobert Mustacchi *	This is the value of the CPU's APIC id. This should be the full 32-bit
7110ce813fRobert Mustacchi *	ID if the CPU is using the x2apic. Otherwise, it should be the 8-bit
7120ce813fRobert Mustacchi *	APIC ID. This value is globally unique between all logical CPUs across
7130ce813fRobert Mustacchi *	all packages. This is usually required by the APIC.
7140ce813fRobert Mustacchi *
7150ce813fRobert Mustacchi * cpi_chipid
7160ce813fRobert Mustacchi *
7170ce813fRobert Mustacchi *	This value indicates the ID of the package that the logical CPU is a
7180ce813fRobert Mustacchi *	part of. This value is allowed to be synthetic. It is usually derived by
7190ce813fRobert Mustacchi *	taking the CPU's APIC ID and determining how many bits are used to
7200ce813fRobert Mustacchi *	represent CPU cores in the package. All logical CPUs that are part of
7210ce813fRobert Mustacchi *	the same package must have the same value.
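 *
 *	As a rough illustration of the derivation described above, the sketch
 *	below shifts the per-package bits out of an APIC ID. It is not the
 *	code used by this file: bits_for_count() and chipid_from_apicid() are
 *	hypothetical names, and it assumes the low-order APIC ID bits cover
 *	exactly the logical CPUs of one package.
 *
```c
#include <stdint.h>

// Number of bits needed to represent the IDs 0 .. count - 1.
unsigned
bits_for_count(uint32_t count)
{
	unsigned bits = 0;

	while (bits < 32 && (UINT32_C(1) << bits) < count)
		bits++;
	return (bits);
}

// Synthesize a package ID by discarding the APIC ID bits that are assumed
// to identify logical CPUs within the package.
uint32_t
chipid_from_apicid(uint32_t apicid, uint32_t ncpu_per_chip)
{
	unsigned shift = bits_for_count(ncpu_per_chip);

	return (shift >= 32 ? 0 : apicid >> shift);
}
```
 *
 *	With 16 logical CPUs per package, four bits are shifted out, so APIC
 *	ID 0x23 would map to package 2 under these assumptions.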
7220ce813fRobert Mustacchi *
7230ce813fRobert Mustacchi * cpi_coreid
7240ce813fRobert Mustacchi *
7250ce813fRobert Mustacchi *	This represents the ID of a CPU core. Two logical CPUs should only have
7260ce813fRobert Mustacchi *	the same cpi_coreid value if they are part of the same core. These
7270ce813fRobert Mustacchi *	values may be synthetic. On systems that support SMT, this value is
 *	usually derived from the APIC ID. Otherwise, it is often synthetic and
 *	just set to the value of the cpu_id in the cpu_t.
7300ce813fRobert Mustacchi *
7310ce813fRobert Mustacchi * cpi_pkgcoreid
7320ce813fRobert Mustacchi *
7330ce813fRobert Mustacchi *	This is similar to the cpi_coreid in that logical CPUs that are part of
7340ce813fRobert Mustacchi *	the same core should have the same ID. The main difference is that these
7350ce813fRobert Mustacchi *	values are only required to be unique to a given socket.
7360ce813fRobert Mustacchi *
7370ce813fRobert Mustacchi * cpi_clogid
7380ce813fRobert Mustacchi *
 *	This represents the logical ID of a logical CPU. This value should be
 *	unique within a given socket for each logical CPU. This is allowed to be
 *	synthetic, though it is usually based off of the CPU's APIC ID. The
 *	broader system expects that logical CPUs that are part of the same core
 *	have contiguous numbers. For example, if there were two threads per
 *	core, then the two IDs divided by two should be the same, the first ID
 *	modulo two should be zero, and the second ID modulo two should be one.
 *	So IDs 4 and 5 indicate two logical CPUs that are part of the same
 *	core, but IDs 5 and 6 represent two logical CPUs that are part of
 *	different cores.
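 *
 *	The contiguity expectation above can be sketched as follows. The
 *	function names are hypothetical, and the sketch assumes that every
 *	core in the socket has the same number of threads.
 *
```c
#include <assert.h>
#include <stdint.h>

// Returns nonzero when two per-socket logical IDs land on the same core,
// assuming the IDs within a core are contiguous.
int
clogid_same_core(uint32_t a, uint32_t b, uint32_t threads_per_core)
{
	assert(threads_per_core != 0);
	return (a / threads_per_core == b / threads_per_core);
}

// Index of the thread within its core (0 for the first thread).
uint32_t
clogid_thread_index(uint32_t id, uint32_t threads_per_core)
{
	assert(threads_per_core != 0);
	return (id % threads_per_core);
}
```
 *
 *	With two threads per core, IDs 4 and 5 compare as the same core and
 *	IDs 5 and 6 do not, matching the example above.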
7480ce813fRobert Mustacchi *
7490ce813fRobert Mustacchi *	While it is common for the cpi_coreid and the cpi_clogid to be derived
7500ce813fRobert Mustacchi *	from the same source, strictly speaking, they don't have to be and the
7510ce813fRobert Mustacchi *	two values should be considered logically independent. One should not
7520ce813fRobert Mustacchi *	try to compare a logical CPU's cpi_coreid and cpi_clogid to determine
7530ce813fRobert Mustacchi *	some kind of relationship. While this is tempting, we've seen cases on
7540ce813fRobert Mustacchi *	AMD family 0xf where the system's cpu id is not related to its APIC ID.
7550ce813fRobert Mustacchi *
7560ce813fRobert Mustacchi * cpi_ncpu_per_chip
7570ce813fRobert Mustacchi *
7580ce813fRobert Mustacchi *	This value indicates the total number of logical CPUs that exist in the
7590ce813fRobert Mustacchi *	physical package. Critically, this is not the number of logical CPUs
7600ce813fRobert Mustacchi *	that exist for just the single core.
7610ce813fRobert Mustacchi *
7620ce813fRobert Mustacchi *	This value should be the same for all logical CPUs in the same package.
7630ce813fRobert Mustacchi *
7640ce813fRobert Mustacchi * cpi_ncore_per_chip
7650ce813fRobert Mustacchi *
7660ce813fRobert Mustacchi *	This value indicates the total number of physical CPU cores that exist
7670ce813fRobert Mustacchi *	in the package. The system compares this value with cpi_ncpu_per_chip to
7680ce813fRobert Mustacchi *	determine if simultaneous multi-threading (SMT) is enabled. When
7690ce813fRobert Mustacchi *	cpi_ncpu_per_chip equals cpi_ncore_per_chip, then there is no SMT and
7700ce813fRobert Mustacchi *	the X86FSET_HTT feature is not set. If this value is greater than one,
 *	then we consider the processor to have the feature X86FSET_CMP, to
7720ce813fRobert Mustacchi *	indicate that there is support for more than one core.
7730ce813fRobert Mustacchi *
7740ce813fRobert Mustacchi *	This value should be the same for all logical CPUs in the same package.
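 *
 *	The comparison described above for cpi_ncore_per_chip can be sketched
 *	as follows. The flag macros and the function are hypothetical
 *	stand-ins rather than the feature set interface used by this file,
 *	and the sketch assumes that SMT is reported exactly when there are
 *	more logical CPUs than cores in the package.
 *
```c
#include <stdint.h>

#define	TOPO_FEAT_HTT	0x1	// hypothetical stand-in for X86FSET_HTT
#define	TOPO_FEAT_CMP	0x2	// hypothetical stand-in for X86FSET_CMP

// Derive the two topology-related features from the per-package counts.
uint32_t
topo_features(uint32_t ncpu_per_chip, uint32_t ncore_per_chip)
{
	uint32_t feat = 0;

	// More logical CPUs than cores: some core has more than one thread.
	if (ncpu_per_chip > ncore_per_chip)
		feat |= TOPO_FEAT_HTT;

	// More than one core in the package.
	if (ncore_per_chip > 1)
		feat |= TOPO_FEAT_CMP;

	return (feat);
}
```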
7750ce813fRobert Mustacchi *
7760ce813fRobert Mustacchi * cpi_procnodes_per_pkg
7770ce813fRobert Mustacchi *
7780ce813fRobert Mustacchi *	This value indicates the number of 'nodes' that exist in the package.
 *	When processors are actually a multi-chip module, this represents the
 *	number of dies in that module. Currently, on Intel based systems this
 *	member is always set to 1.
7820ce813fRobert Mustacchi *
7830ce813fRobert Mustacchi *	This value should be the same for all logical CPUs in the same package.
7840ce813fRobert Mustacchi *
7850ce813fRobert Mustacchi * cpi_procnodeid
7860ce813fRobert Mustacchi *
7870ce813fRobert Mustacchi *	This value indicates the ID of the node that the logical CPU is a part
7880ce813fRobert Mustacchi *	of. All logical CPUs that are in the same node must have the same value
7890ce813fRobert Mustacchi *	here. This value must be unique across all of the packages in the
7900ce813fRobert Mustacchi *	system.  On Intel based systems, this is currently set to the value in
7910ce813fRobert Mustacchi *	cpi_chipid because there is only one node.
7920ce813fRobert Mustacchi *
7930ce813fRobert Mustacchi * cpi_cores_per_compunit
7940ce813fRobert Mustacchi *
7950ce813fRobert Mustacchi *	This value indicates the number of cores that are part of a compute
7960ce813fRobert Mustacchi *	unit. See the AMD topology section for this. This member only has real
7970ce813fRobert Mustacchi *	meaning currently for AMD Bulldozer family processors. For all other
7980ce813fRobert Mustacchi *	processors, this should currently be set to 1.
7990ce813fRobert Mustacchi *
8000ce813fRobert Mustacchi * cpi_compunitid
8010ce813fRobert Mustacchi *
8020ce813fRobert Mustacchi *	This indicates the compute unit that the logical CPU belongs to. For
8030ce813fRobert Mustacchi *	processors without AMD Bulldozer-style compute units this should be set
8040ce813fRobert Mustacchi *	to the value of cpi_coreid.
8050ce813fRobert Mustacchi *
8060ce813fRobert Mustacchi * cpi_ncpu_shr_last_cache
8070ce813fRobert Mustacchi *
8080ce813fRobert Mustacchi *	This indicates the number of logical CPUs that are sharing the same last
8090ce813fRobert Mustacchi *	level cache. This value should be the same for all CPUs that are sharing
8100ce813fRobert Mustacchi *	that cache. The last cache refers to the cache that is closest to memory
8110ce813fRobert Mustacchi *	and furthest away from the CPU.
8120ce813fRobert Mustacchi *
8130ce813fRobert Mustacchi * cpi_last_lvl_cacheid
8140ce813fRobert Mustacchi *
8150ce813fRobert Mustacchi *	This indicates the ID of the last cache that the logical CPU uses. This
8160ce813fRobert Mustacchi *	cache is often shared between multiple logical CPUs and is the cache
8170ce813fRobert Mustacchi *	that is closest to memory and furthest away from the CPU. This value
8180ce813fRobert Mustacchi *	should be the same for a group of logical CPUs only if they actually
8190ce813fRobert Mustacchi *	share the same last level cache. IDs should not overlap between
8200ce813fRobert Mustacchi *	packages.
8210ce813fRobert Mustacchi *
822d6517bbRobert Mustacchi * cpi_ncore_bits
823d6517bbRobert Mustacchi *
824d6517bbRobert Mustacchi *	This indicates the number of bits that are required to represent all of
825d6517bbRobert Mustacchi *	the cores in the system. As cores are derived based on their APIC IDs,
826d6517bbRobert Mustacchi *	we aren't guaranteed a run of APIC IDs starting from zero. It's OK for
827d6517bbRobert Mustacchi *	this value to be larger than the actual number of IDs that are present
828d6517bbRobert Mustacchi *	in the system. This is used to size tables by the CMI framework. It is
829d6517bbRobert Mustacchi *	only filled in for Intel and AMD CPUs.
830d6517bbRobert Mustacchi *
831d6517bbRobert Mustacchi * cpi_nthread_bits
832d6517bbRobert Mustacchi *
833d6517bbRobert Mustacchi *	This indicates the number of bits required to represent all of the IDs
834d6517bbRobert Mustacchi *	that cover the logical CPUs that exist on a given core. It's OK for this
835d6517bbRobert Mustacchi *	value to be larger than the actual number of IDs that are present in the
836d6517bbRobert Mustacchi *	system.  This is used to size tables by the CMI framework. It is
837d6517bbRobert Mustacchi *	only filled in for Intel and AMD CPUs.
 *
 * -----------
 * Hypervisors
 * -----------
 *
 * If trying to manage the differences between vendors wasn't bad enough, it can
 * get worse thanks to our friend hardware virtualization. Hypervisors are given
 * the ability to interpose on all cpuid instructions and change them to suit
 * their purposes. In general, this is necessary as the hypervisor wants to be
 * able to present a more uniform set of features, or to withhold knowledge of
 * some features from the guest operating system kernel so that the guest can
 * be more easily migrated between systems.
 *
 * When it comes to trying to determine topology information, this can be a
 * double-edged sword. When a hypervisor doesn't actually implement a cpuid
 * leaf, it'll often return all zeros. Because of that, you'll often see checks
 * scattered about that verify that fields are non-zero before we assume we can
 * use them.
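 *
 * As a rough sketch of the kind of check this leads to (the function name is
 * hypothetical, and the sketch assumes the struct cpuid_regs register layout
 * with members cp_eax through cp_edx), a leaf is only trusted once at least
 * one of its registers is non-zero:
 *
 *	static boolean_t
 *	leaf_has_data(const struct cpuid_regs *cp)
 *	{
 *		return ((cp->cp_eax | cp->cp_ebx | cp->cp_ecx |
 *		    cp->cp_edx) != 0 ? B_TRUE : B_FALSE);
 *	}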
 *
 * The hypervisor is also often incentivized to lie to you about topology. This
 * is because it doesn't always actually guarantee that topology at all. The
 * topology path we take in the system depends on how the CPU advertises itself.
 * If it advertises itself as an Intel or AMD CPU, then we basically do our
 * normal path. However, when it doesn't advertise an actual vendor, that
 * usually turns into multiple one-core CPUs that we enumerate, often on
 * different sockets. The actual behavior depends greatly on what the
 * hypervisor actually exposes to us.
 *
 * --------------------
 * Exposing Information
 * --------------------
 *
 * We expose CPUID information in three different forms in the system.
 *
 * The first is through the x86_featureset variable. This is used in conjunction
 * with the is_x86_feature() function. This is queried by x86-specific functions
 * to determine which features are or aren't present in the system and to make
 * decisions based upon them. For example, users of this include everything from
 * parts of the system dedicated to reliability, availability, and
 * serviceability (RAS), to making decisions about how to handle security
 * mitigations, to various x86-specific drivers. General-purpose or
 * architecture-independent drivers should never call this function.
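 *
 * A typical query looks roughly like the following sketch. X86FSET_SMEP is
 * used purely as an illustration, and the surrounding function is
 * hypothetical:
 *
 *	static void
 *	example_check_smep(void)
 *	{
 *		if (is_x86_feature(x86_featureset, X86FSET_SMEP)) {
 *			// Take the path that relies on SMEP.
 *		}
 *	}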
 *
 * The second means is through the auxiliary vector. The auxiliary vector is a
 * series of tagged data that the kernel passes down to a user program when it
 * begins executing. This information is used to indicate to programs what
 * instruction set extensions are present. For example, information about the
 * CPU supporting the machine check architecture (MCA) wouldn't be passed down
 * since user programs cannot make use of it. However, things like the AVX
 * instruction sets are. Programs use this information to make run-time
 * decisions about what features they should use. As an example, the run-time
 * link-editor (rtld) can relocate different functions depending on the hardware
 * support available.
 *
 * The final form is through a series of accessor functions that all have the
 * form cpuid_get*. These are used by a number of different subsystems in the
 * kernel to determine more detailed information about what we're running on,
 * topology information, etc. Some of these subsystems include processor groups
 * (uts/common/os/pg.c), CPU Module Interface (uts/i86pc/os/cmi.c), ACPI,
 * microcode, and performance monitoring. These functions all ASSERT that the
 * CPU they're being called on has reached a certain cpuid pass. If the passes
 * are rearranged, then these assertions need to be adjusted.
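 *
 * The general shape of such an accessor is sketched below. The accessor name
 * and the specific pass number are illustrative assumptions, not a statement
 * of what any particular function requires:
 *
 *	uint_t
 *	cpuid_get_example(cpu_t *cpu)
 *	{
 *		ASSERT(cpuid_checkpass(cpu, 1));
 *		return (cpu->cpu_m.mcpu_cpi->cpi_family);
 *	}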
 *
 * -----------------------------------------------
 * Speculative Execution CPU Side Channel Security
 * -----------------------------------------------
 *
 * With the advent of the Spectre and Meltdown attacks, which exploit
 * speculative execution in the CPU to create side channels, there have been a
 * number of different attacks and corresponding issues that the operating
 * system needs to mitigate. The following is a common, but not exhaustive, set
 * of issues that we know about and have either done some work on or need to do
 * more work on in the system to mitigate:
 *
 *   - Spectre v1
 *   - swapgs (Spectre v1 variant)
 *   - Spectre v2
 *   - Meltdown (Spectre v3)
 *   - Rogue System Register Read (Spectre v3a)
 *   - Speculative Store Bypass (Spectre v4)
 *   - ret2spec, SpectreRSB
 *   - L1 Terminal Fault (L1TF)
 *   - Microarchitectural Data Sampling (MDS)
 *
 * Each of these requires different sets of mitigations and has different attack
 * surfaces. For the most part, this discussion is about protecting the kernel
 * from non-kernel executing environments such as user processes and hardware
 * virtual machines. Unfortunately, there are a number of user vs. user
 * scenarios that exist with these. The rest of this section will describe the
 * overall approach that the system has taken to address these as well as their
 * shortcomings. Unfortunately, not all of the above have been handled today.
 *
 * SPECTRE v2, ret2spec, SpectreRSB
 *
 * The second variant of the Spectre attack focuses on performing branch target
 * injection. This generally impacts indirect call instructions in the system.
 * There are three different ways to mitigate this issue that are commonly
 * described today:
 *
 *  1. Using Indirect Branch Restricted Speculation (IBRS).
 *  2. Using Retpolines and RSB Stuffing.
 *  3. Using Enhanced Indirect Branch Restricted Speculation (EIBRS).
 *
 * IBRS uses a feature added to microcode to restrict speculation, among other
 * things. This form of mitigation has not been used as it has been generally
 * seen as too expensive and requires reactivation upon various transitions in
 * the system.
 *
 * As a less impactful alternative to IBRS, retpolines were developed by
 * Google. These basically require one to replace indirect calls with a specific
 * trampoline that will cause speculation to fail and break the attack.
 * Retpolines require compiler support. We always build with retpolines in the
 * external thunk mode. This means that a traditional indirect call is replaced
 * with a call to one of the __x86_indirect_thunk_<reg> functions. A side effect
 * of this is that all indirect function calls are performed through a register.
 *
 * We have to use a common external location for the thunk and not inline it
 * into the callsite so that we have a single place to patch these functions.
 * As it turns out, we actually have three different forms of retpolines that
 * exist in the system:
 *
 *  1. A full retpoline
 *  2. An AMD-specific optimized retpoline
 *  3. A no-op version
 *
 * The first one is used in the general case. The second one is used if we can
 * determine that we're on an AMD system and we can successfully toggle the
 * lfence serializing MSR that exists on the platform. Basically with this
 * present, an lfence is sufficient and we don't need to do anywhere near as
 * complicated a dance to successfully use retpolines.
 *
 * The third form described above is the most curious. Retpolines rely on how
 * speculation is performed on a 'ret' instruction. Intel has continued to
 * optimize this process (which is partly why we need to have return stack
 * buffer stuffing, but more on that in a bit) and in processors starting with
 * Cascade Lake on the server side, it's dangerous to rely on retpolines.
 * Instead, a new mechanism has been introduced called Enhanced IBRS (EIBRS).
 *
 * Unlike IBRS, EIBRS is designed to be enabled once at boot on each physical
 * core and then left on. However, if this is the case, we don't want to use
 * retpolines any more. Therefore if EIBRS is present, we end up turning each
 * retpoline function (called a thunk) into a jmp instruction. This means that
 * we're still paying the cost of an extra jump to the external thunk, but it
 * gives us flexibility and the ability to have a single kernel image that works
 * across a wide variety of systems and hardware features.
 *
 * Unfortunately, this alone is insufficient. First, Skylake systems have
 * additional speculation for the Return Stack Buffer (RSB), which is used to
 * predict the targets of return instructions and which retpolines take
 * advantage of. However, this problem is not just limited to Skylake and is
 * actually more pernicious. The SpectreRSB paper describes several more
 * problems that can arise here. The RSB can be poisoned just like the indirect
 * branch predictor. This means that one needs to clear the RSB when
 * transitioning between two different privilege domains. Some examples include:
 *
 *  - Switching between two different user processes
 *  - Going between user land and the kernel
 *  - Returning to the kernel from a hardware virtual machine
 *
 * Mitigating this involves combining a couple of different things. The first is
 * SMEP (supervisor mode execution protection) which was introduced in Ivy
 * Bridge. When an RSB entry refers to a user address and we're executing in the
 * kernel, speculation through it will be stopped when SMEP is enabled. This
 * protects against a number of the different cases that we would normally be
 * worried about such as when we enter the kernel from user land.
 *
 * To protect against additional manipulation of the RSB from other contexts,
 * such as a non-root VMX context attacking the kernel, we first look to
 * enhanced IBRS. When EIBRS is present and enabled, then there is nothing else
 * that we need to do to protect the kernel at this time.
 *
 * On CPUs without EIBRS we need to manually overwrite the contents of the
 * return stack buffer. We do this through the x86_rsb_stuff() function.
 * Currently this is employed on context switch. The x86_rsb_stuff() function is
 * disabled when enhanced IBRS is present because Intel claims on such systems
 * it will be ineffective. Stuffing the RSB in context switch helps prevent user
 * to user attacks via the RSB.
 *
 * If SMEP is not present, then we would have to stuff the RSB every time we
 * transitioned from user mode to the kernel, which isn't very practical right
 * now.
 *
 * To fully protect against user to user and vmx to vmx attacks from these
 * classes of issues, we would also need to allow them to opt into performing an
 * Indirect Branch Prediction Barrier (IBPB) on switch. This is not currently
 * wired up.
 *
 * By default, the system will enable RSB stuffing and the required variant of
 * retpolines and store that information in the x86_spectrev2_mitigation value.
 * This will be evaluated after a microcode update as well, though it is
 * expected that microcode updates will not take away features. This may mean
 * that a late loaded microcode may not end up in the optimal configuration
 * (though this should be rare).
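 *
 * The selection described above can be sketched as follows. The enum, its
 * values, and the boolean inputs are hypothetical names that stand in for the
 * real feature checks, and the precedence of the checks is an assumption of
 * this sketch:
 *
 *	typedef enum {
 *		V2_RETPOLINE,		// full retpoline
 *		V2_RETPOLINE_AMD,	// lfence-based AMD retpoline
 *		V2_ENHANCED_IBRS	// thunks patched to a plain jmp
 *	} v2_mitigation_t;
 *
 *	static v2_mitigation_t
 *	v2_select(boolean_t has_eibrs, boolean_t amd_lfence_ok)
 *	{
 *		if (has_eibrs)
 *			return (V2_ENHANCED_IBRS);
 *		if (amd_lfence_ok)
 *			return (V2_RETPOLINE_AMD);
 *		return (V2_RETPOLINE);
 *	}
 *
 * RSB stuffing is then left enabled in every case except the enhanced IBRS
 * one.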
 *
 * Currently we do not build kmdb with retpolines or perform any additional side
 * channel security mitigations for it. One complication with kmdb is that it
 * requires its own retpoline thunks and it would need to adjust itself based on
 * what the kernel does. The threat model of kmdb is more limited and therefore
 * it may make more sense to investigate using prediction barriers as the whole
 * system is only executing a single instruction at a time while in kmdb.
 *
 * SPECTRE v1, v4
 *
 * The v1 and v4 variants of Spectre are not currently mitigated in the
 * system and require other classes of changes to occur in the code.
 *
 * SPECTRE v1 (SWAPGS VARIANT)
 *
 * The class of Spectre v1 vulnerabilities isn't all about bounds checks, but
 * can generally affect any branch-dependent code. The swapgs issue is one
 * variant of this. If we are coming in from userspace, we can have code like
 * this:
 *
 *	cmpw	$KCS_SEL, REGOFF_CS(%rsp)
1052