
21 %For some reason, this doesn't work inside \item so we can't have this as part of \myfunc
42 \input{structure} % Insert the structure.tex file which contains the majority of the structure behind the template
129 % Upper part of the page. The '~' is needed because \\
159 % Bottom of the page
167 \tableofcontents % Print the table of contents itself
181 The Everest architecture, with programmable fastpath processors (Storms), host-based device-dedicated memory (ILT), and minimal on-chip management, presents a device which requires a driver with significant portions of device-specific code.
186 A large mass of code for operating and interacting with the Everest 4 device, to be incorporated into and used by OS drivers.
190 Ecore flows may be driven by the HW (e.g. by an interrupt) or by the OS specific portion of the driver (e.g. driver load/unload).
194 \item Slowpath flows tend to reside largely in ecore and less so in OS specific layers. As much of the functionality as possible is placed in the ecore to leverage it across multiple platforms. \\
196 \item Fastpath flows tend to be in the OS specific layer, as heavy layering and abstraction is out of place in the fastpath.
205 All of these may exist in the same system.
206 All of these will have an ecore instance incorporated in them.
207 Any one of the drivers might use the ecore to initialize the device, or only the sections of the device pertaining to that driver’s operation.
208 A storage driver may use the ecore for storage specific purposes, such as the initialization and allocation of task context.
211 The ecore is not a driver in its own capacity, but only code which is used by other drivers. Thus, separate drivers, including separate instances of the same driver within an OS, have separate instances of the ecore within them, which are concurrently active.
215 The first parts of the document deal with the concept of the ecore, and its place in the software layers between the device and the OS.
216 The rest of the document deals with the content of the ecore.
217 This document does not deal with the needs and use cases of any specific OS or tool, but only with the common ground which is the ecore.
219 The document sometimes delves in-depth into the inner workings of the ecore; since the programmer coming to utilize the ecore might not need [or want] to know those inner workings, such a person should look into specific sections in each chapter, specifically:
221 \item Chapter \ref{cha:overview}'s introduction and section \ref{sec:overview-api} for a listing of the ecore API files and their locations.
227 \item Initialization and De-initialization of the HW [ \ref{sec:init-init}, \ref{sec:init-de-init}].
251 Most of the ecore consists of the \textit{inner} parts, i.e., HW-oriented implementation to which the upper-layer driver writer is oblivious.
256 The Ecore API contains two types of files:
258 \item Files of the format \texttt{ecore\_<module>\_api.h} -- these files are the SW API between the ecore and the upper-layer driver:
272 \item Files of the format \texttt{ecore\_hsi\_<protocol>.h} -- these files contain the API between FW/HW and the ecore/upper-layer driver:
283 The upper-layer driver should not include any other ecore header file, as the rest of the header files are internal, with the following exceptions:
287 \item \texttt{ecore\_status.h} -- contains \texttt{enum \_ecore\_status\_t}. Many of the ecore return values are of this type.
303 \item \texttt{ecore\_cxt\_api.[ch]} -- Handles the allocation, configuration and distribution of contexts to the various clients.
309 \item \texttt{ecore\_dev.[ch]} -- Contains much of the functionality of starting/stopping the hardware. See chapter \ref{cha:hwinit}.
313 \item \texttt{ecore.h} -- contains the definition of the most \textit{elementary} structures in the ecore, the \texttt{ecore\_dev} and the \texttt{ecore\_hwfn}.
315 \item \texttt{ecore\_init\_defs.h, ecore\_init\_fw\_funcs.[ch], ecore\_init\_ops.[ch], \\ ecore\_init\_values.h, ecore\_rt\_defs} -- Code responsible for initialization and configuration of the HW and loading of the FW, mostly in relation with the init-tool. See chapter \ref{cha:hwinit}.
317 Chapter \ref{cha:hwinit} doesn't really give a thorough explanation of the init tool -- at most it mentions it. Do we want a section/chapter on it somewhere?
335 %As the ecore contains most of the lowlevel code operating the non-fastpath parts of the working with the HW and FW, it can be thought of as some sort of library – it contains bits of code meant to be operated from an outside source. Each OS needs to implement its own driver, calling the various functions in the ecore API in a place fitting for that OS driver flows.
336 %Each OS will independently need to create a driver that incorporates the ecore, both filling the OS dependent callbacks required by the ecore to perform and supply an upper level of abstraction which best suits that OS. Notice this upper layer is sometimes also, mistakenly, referred to as ecore [e.g., bnx2c for linux drivers] but there’s an important distinction:
342 %It’s possible [and likely] that an operating system will break the various protocols into different sub-drivers, where each sub-driver will be designated for a specific protocol. Notice that if such separation is made, the preferred implementation is that the OS will implement a ‘core’ driver consisting of the Ecore and an upper-layer, and define an API through which the various protocol drivers communicate with the OS core driver\footnote{Although notice there should be no inter-dependencies between HW-functions in the ecore, so the alternative method where each contains the ecore is also feasible}.
344 The ecore utilizes various functions which should be implemented by the upper layer. There are two main ‘types’ of functions:
347 \item Hooks by which the upper-layer can run additional OS specific code, or make decisions affecting the work of the ecore. E.g., in the SRIOV flows, the mechanism for passing messages from VF to PF is implemented in the ecore but the decision whether a request is valid or not might be OS specific – as in the case of unicast filters.
355 In order to support this, the verbosity mechanism contains two distinct values \myindex{\texttt{DP\_LEVEL}} and \myindex{\texttt{DP\_MODULE}} [both can be found in \texttt{ecore.h}]. Since the printing scheme in the ecore was defined with the Linux limitations in mind – that is, the API [via ethtool] allowing the setting of the debug message level is only 32 bits long – both \texttt{DP\_MODULE} and \texttt{DP\_LEVEL} together occupy only 32 bits.
357 The \texttt{DP\_MODULE} is relevant only when level is set to verbose, and it defines which of the verbose prints should reach system logs, based mostly on component/flow. When setting the module level, a bit mask of the requested components/flows is set.
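As a rough illustration of how the two values share the 32-bit debug word, consider the sketch below. The bit positions and module names here are assumptions for illustration only; the authoritative definitions are the \texttt{DP\_LEVEL} and \texttt{DP\_MODULE} values in \texttt{ecore.h}.
\begin{verbatim}
/* Illustrative sketch only -- consult ecore.h for the real layout.
 * Assume the low bits select the level and the remaining bits form
 * the per-module/flow verbosity mask. */
#define DBG_LEVEL_MASK     0x3            /* hypothetical level field  */
#define DBG_LEVEL_VERBOSE  0x3
#define DBG_MODULE_LINK    (1 << 2)       /* hypothetical module bits  */
#define DBG_MODULE_SRIOV   (1 << 3)

/* Verbose prints, but only for the link and SRIOV flows: */
unsigned int dbg_value = DBG_LEVEL_VERBOSE | DBG_MODULE_LINK |
                         DBG_MODULE_SRIOV;
\end{verbatim}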
363 \item ASIC\_ONLY -- By default, this is `off'. Setting this would remove content that is relevant only for simulations of the hardware, i.e., emulations and FPGAs.
367 \item REMOVE\_DBG -- By default, this is `off'. There are several structures and fields in ecore which aren't functional; their sole purpose is to store interesting data for memory dumps in case of failures. Setting this would remove all such data items.
378 To access the entire range, windows are defined that can be configured to point to a certain address within the device and allow reading and writing of registers / memory from that address.
379 There are two types of windows, \textbf{PTT} (per PF Translation Table) and \textbf{GTT} (Global Translation Table).
390 For more details on the E4 BAR access scheme the reader is referred to the “Reduced PF BAR0 size” section of \cite{doc:PXP}. \\
397 Ecore requires an OSAL implementation of the macros:
405 Several ecore interface functions require a PTT. There is a pool of PTTs maintained by ecore.
407 The PTT is reserved per flow, and it is the responsibility of the upper layer to make sure it does not use the same PTT in flows that can run concurrently. The upper layer requests a PTT entry using \myfunc{ptt\_acquire}{ptt_acquire}.
408 However, to avoid running out of this resource, it is also the responsibility of the upper layer not to acquire too many PTTs without releasing them. Returning a PTT entry back to the pool is done via \myfunc{ptt\_release}{ptt_release}.
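A minimal usage sketch of the acquire/release pattern follows; the prototypes are assumed here to take only the HW-function (plus the entry, for release), so consult the ecore headers for the exact signatures.
\begin{verbatim}
/* Sketch: reserve a PTT entry for the duration of a single flow. */
struct ecore_ptt *p_ptt;

p_ptt = ecore_ptt_acquire(p_hwfn);
if (!p_ptt) {
        /* Pool exhausted -- back off and retry, or fail the flow. */
        return;
}

/* ... register accesses belonging to this flow go here ... */

/* Always return the entry to the pool once the flow completes. */
ecore_ptt_release(p_hwfn, p_ptt);
\end{verbatim}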
418 \section{Basic concepts -- inner-working of the ecore}
420 \item \myindex{ILT} – one of the features of our device is that the memories used by various HW blocks are allocated on the host memory [as opposed to a large embedded memory segment on chip]. The driver is responsible for allocating the memory needed for those HW blocks [DMA-coherent memory] and for configuring both the HW blocks themselves and a communal sub-block known as ILT. The ecore contains complicated code that decides exactly how much memory each such block needs, allocates it in an ‘ilt\_shadow’, and then uses that shadow to configure the ILT itself with all the allocated chunks of memory.
424 \item \myindex{RT array} – when the ecore initializes the HW, it utilizes a common, tool-generated code known as the init-tool. Since there are quite a few values which depend upon actual setup configuration and thus must receive feedback during the initialization from the ecore, instead of adding many such hooks there’s the concept of the RunTime array – an array of values filled by the ecore prior to the init-tool run based on the complex ecore logic. The init-tool will then utilize the values in that array to configure the HW according to the correct order of configuration [i.e., writing the values set by ecore in the array in the correct place in the initialization flow where they’re required/the block that contains them is configured].
429 The functions required for initializing the HW/FW mostly reside in \texttt{ecore\_dev.[ch]}; more accurately, most of the outside API [toward the upper-layer] is in \texttt{ecore\_dev.h} – the functions themselves utilize many other ecore files.
430 This section gives a brief description of the functions that need to be called, what they do, requirements, etc., in order to successfully initialize the ecore structs and load the HW/FW.
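As a rough orientation before the per-function descriptions below (the precise prototypes live in \texttt{ecore\_dev.h}; the ordering here is a summary of what this chapter describes), a typical load sequence looks roughly as follows:
\begin{enumerate}
\item Allocate and zero the \texttt{ecore\_dev} [upper-layer responsibility].
\item \texttt{ecore\_init\_struct()} -- early initialization of the structure.
\item \texttt{ecore\_hw\_prepare()} -- early HW/MFW access, learning the chip configuration.
\item Set the \texttt{pf\_params} of each HW-function [upper-layer responsibility].
\item \texttt{ecore\_resc\_alloc()} followed by \texttt{ecore\_resc\_setup()} -- allocation and configuration of the ecore-related resources.
\item \texttt{ecore\_hw\_init()} -- loading the FW and initializing the HW.
\end{enumerate}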
438 \item \myfunc{init\_struct}{init_struct} – After allocating and zeroing the ecore\_dev [the upper layer's responsibility], a pointer to it should be passed to this function for some early initialization of the data structure. \\
449 \item \myfunc{resc\_alloc}{resc_alloc} – Allocates the various ecore-related memory, e.g., contexts, slowpath queue, SRIOV information, etc. Notice that before calling this function, each HW-function of the \texttt{ecore\_dev} should have its `pf\_params’ set, as the function depends upon the protocol-specific resources for its calculations. \\
451 \item \myfunc{resc\_setup}{resc_setup} – Configures the various slowpath elements. Notice that since there’s no guarantee the chip is alive at this point [i.e., it’s very likely the chip is reset at this point], it fills the configuration in the runtime array instead of actually writing it to the chip. \\
454 As part of the slowpath interrupt enablement, ecore invokes the OSAL\_SLOWPATH\_IRQ\_REQ() callback for each HW function. The client implementation should set up the IRQ handlers for slowpath interrupt handling.
455 This is required since, as part of the flow, the \texttt{function\_start} ramrod will be sent to FW; once FW finishes handling it, an \myindex{EQE} [Event Queue Element] will be placed in the slowpath event queue and an interrupt will be fired. The flow is dependent on the EQE being processed.
457 Some sub-functions of the \texttt{ecore\_hw\_init()} method are worth knowing, at least for debugging purposes, as many possible errors can be caught there:
461 % \item ecore\_mcp\_load\_req() – the MFW [assuming it is present] will answer with one of 3 possible answers: ENGINE, PORT or FUNCTION.
474 \item \myindex{Zipped Firmware} - There are two types of firmware files generated in ecore.\\
476 ecore\_init\_values\_zipped.bin] files. Each type of file is generated in two formats, a C header file and a binary file,
477 where each has all relevant data needed to initialize the firmware. Either of these file types can be used for firmware initialization.
478 The difference is that zipped firmware files have much of the DMAE firmware data compressed, which is beneficial in reducing the code size.\\
480 By default, the non-zipped variant is used. If ecore clients want to use the zipped version of the firmware, then they need to have
485 \item \myindex{Binary Firmware} - As explained above, there are two formats of firmware files
487 binary firmware files [ecore\_init\_values.bin and ecore\_init\_values\_zipped.bin]. Either of those file formats
488 can be used by ecore clients to utilize firmware data. By default, ecore uses the .h files which are compiled as part of the ecore,
511 \item \myfunc{int\_disable\_post\_isr\_release}{ecore_int_disable_post_isr_release} – this function performs the required IRQ-related cleanup after the ISR release. The function needs to be called after releasing all slowpath IRQs of the device.
536 \section{Status blocks - host point of view}
537 The \myindex{status block} structures are allocated on host memory. The status block is an array of indices which are updated by firmware (mainly ring consumer values).
540 When one of the indices on a status block is updated (because some event occurred at the device), the status block is copied from internal device memory to host memory, and an interrupt is generated.
541 The CAU unit may aggregate several events and generate a single update of the status block and a single interrupt, in order to lower the number of interrupts sent to host CPU.
543 The indices of the status blocks are referred to as \myindex{protocol indices} (abbreviated to \textit{pi}).
546 Multiple indices are used for L2 to differentiate between RX / TX and different class-of-service operations.
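Conceptually, the driver keeps a shadow copy of each index and detects new work by comparing it with the value the firmware last wrote to the status block in host memory. The sketch below is illustrative only -- the real layouts, counts and types come from the HSI headers.
\begin{verbatim}
#define PIS_PER_SB 12                    /* assumed number of indices   */

struct sb_shadow {
        volatile unsigned short *pi_array;      /* updated by FW (DMA)  */
        unsigned short last_seen[PIS_PER_SB];   /* driver's shadow copy */
};

static int sb_index_changed(struct sb_shadow *sb, int pi)
{
        unsigned short hw_val = sb->pi_array[pi];

        if (hw_val == sb->last_seen[pi])
                return 0;                /* nothing new on this index   */
        sb->last_seen[pi] = hw_val;      /* remember it and handle work */
        return 1;
}
\end{verbatim}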
564 The device can work in one of the following interrupt modes:
568 \item MSIX – Large number of interrupts (up to 2048) and each one gets a separate target address, making it possible to designate different interrupts to different processors.
570 \item POLL – HW increments producers on status blocks in case of interrupts, but it doesn't generate any message nor does it assert any physical line. It is the upper layer's responsibility to periodically poll for those changes to identify interrupts. \\
573 Enabling and disabling interrupts is OS specific and done differently by the OS specific layer of the driver.
585 In MSIX mode, each status block should generate its own interrupt message, meaning in reasonable OSes it should be possible to connect each interrupt with the specific handler of that interrupt's source.
586 The \textit{sb\_id} passed as value to \textit{ecore\_int\_sb\_init()} will indicate the index of the vector in the MSI-X table that would be used to generate interrupts for this specific SB.
589 When working in INTA / MSI we work in single-ISR multiple-DPC mode; the same interrupt line can signify interrupts from many possible status blocks. In this case the information of which status block generated an interrupt needs to be read from a register in the IGU. Use \myfunc{int\_igu\_read\_sisr\_reg}{int_igu_read_sisr_reg} to get the information [the returned value is a bitmask of status blocks which asserted the interrupt].
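A sketch of the resulting demultiplexing loop is shown below; it assumes the function returns a bitmask in which bit $i$ being set means status block $i$ asserted the interrupt, and \texttt{schedule\_dpc\_for\_sb()} stands for whatever OS-specific dispatch the upper layer uses.
\begin{verbatim}
/* Illustrative fragment from an INTA/MSI ISR. */
unsigned long long sisr = ecore_int_igu_read_sisr_reg(p_hwfn);
int sb;

for (sb = 0; sisr != 0; sb++, sisr >>= 1) {
        if (sisr & 1)
                schedule_dpc_for_sb(sb);   /* hypothetical OS dispatch */
}
\end{verbatim}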
594 The IGU block has a mapping of status blocks to interrupts.
596 In case of INTA / MSI, each function has a register in the IGU stating which status block gave the interrupt.
603 CAU also handles coalescing of status block writes and interrupt generation.
604 The CAU unit may aggregate several events and generate a single update of the status block and a single interrupt, in order to lower the number of interrupts sent to host CPU.
608 The flow of handling an interrupt in the device and driver is as follows:
618 \item (Possible upper-half handling and bottom-half scheduling, or other OS-specifics which are outside the scope of this document).
620 \item Driver identifies a producer update on the status block (as the producer is written as part of the status block on host memory) using \myfunc{sb\_update\_sb\_idx}{sb_update_sb_idx}.
634 Assume an Rx packet is received by device. After FW places the packet in the Rx rings, it updates the status block of that Rx ring; This in turn is copied into host memory and an MSI-X interrupt for the appropriate Rx queue's status block is triggered.
645 HW functions and the \myindex{MFW} may interact with each other in both directions – the driver may send messages to the MFW in the form of commands on a buffer, while the MFW generates attentions for the driver and posts messages in a designated mailbox in the SHMEM. The implementation of the interface resides in \texttt{ecore\_mcp.[ch]}, with the addition of \texttt{.h} files generated by the MFW owners, e.g., \texttt{mcp\_public.h} which contains the SHMEM structure and the list of commands.
648 The interface between driver and MFW is initialized as early as possible in the initial initialization flow [specifically as part of \texttt{ecore\_hw\_prepare()}], as this initializes the driver access to SHMEM, which is used later during initialization to learn about the chip configuration [which was read from NVRAM by MFW and written into SHMEM].
649 The upper layer doesn’t need to take care of allocating/releasing of this interface – it’s part of the greater initialization/de-initialization of the ecore.
652 The \myindex{shared memory} is a segment of memory accessible to all functions as well as the MFW. The memory is used for various purposes:
661 An upper-layer driver is not supposed to access the SHMEM directly; it should only do so by using ecore functions and accessing ecore structs. The ecore \textit{mcp\_info} struct contains as one of its fields \textit{func\_info} which is filled by the ecore during early device initialization with all the function-specific static\footnote{i.e., data that shouldn't change while driver is running} data. The upper-layer driver can read those values for its own usage.
666 A message is a u32 consisting of a command bit-mask, which indicates the message the HW-function sends, and a cyclic sequential number.
672 Per HW-function there’s an array of message producers in SHMEM, of which the ecore maintains a copy.
673 Before sending the attention, the MFW will increment the producer of the message it wishes to inform the driver about, and the driver will recognize the message by noticing the difference in producers.
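A simplified sketch of that comparison is shown below; the array size and helper names are illustrative, not the actual SHMEM definitions.
\begin{verbatim}
#define MFW_MSG_MAX 32                       /* illustrative count      */

extern void handle_mfw_message(int msg);     /* hypothetical handler    */

static void check_mfw_messages(const unsigned short *shmem_prod,
                               unsigned short *cached_prod)
{
        int i;

        for (i = 0; i < MFW_MSG_MAX; i++) {
                if (shmem_prod[i] == cached_prod[i])
                        continue;            /* no new message          */
                cached_prod[i] = shmem_prod[i];
                handle_mfw_message(i);
        }
}
\end{verbatim}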
678 Notice the commands’ content varies -- some of the commands will require additional parameters to be filled in specific fields in the SHMEM before the commands are passed.
681 \myfunc{mcp\_cmd}{mcp_cmd} -- this is the very core of message-passing from driver to MFW. Upper-layer driver should pass the command (FW\_MSG\_CODE\_* from \texttt{mcp\_public.h}) and a parameter, as well as pointers for the MFW response and additional possible parameter. The function will pass the command for MFW and await [sleep] for its reply. \\
683 A ‘special’ instance of this function is \texttt{ecore\_mcp\_load\_req()} [which isn’t an API function] - that function sends an indication to the MCP that the HW-function is being loaded.
684 The MFW is used as both a book-keeper and synchronization mechanism for the loading of PFs, as there are communal resources. The response will be (FW\_MSG\_CODE\_DRV\_LOAD\_<X>), where X can be either ENGINE, PORT or FUNCTION:
693 Depending on the exact message received from the MFW, it’s possible that this will eventually call some OSAL which will need to be implemented by the upper-layer driver, e.g., in case of a link change indication [the upper layer needs to be notified and should decide on its own what to do with that information].
707 \item Link\_output – The ecore fills the structure from attention handling context whenever the MFW indicates that a link change has occurred. Upper layer driver can read this to get information about the current state of the physical link. It can access this struct by calling \myfunc{mcp\_get\_link\_state}{mcp_get_link_state}.\\
715 Notice the logic for link-flap-avoidance should be contained in MFW, e.g., in multi-function mode there’s no need for the upper-layer driver to count the number of functions loaded in order to decide whether during unload it should request a link reset; it should do it regardless.
719 The EEE feature enables the device to put its transistors in sleep mode when there is no data activity on the wire, thus achieving a significant reduction in the power consumption of the device. It's a Base-T feature, more details of which are captured in the IEEE 802.3az standard. MFW negotiates the EEE parameters with the peer device and the results are shared with the ecore as part of the link notification. Following are the negotiated parameters, which are encapsulated in the struct \texttt{ecore\_mcp\_link\_state}.
730 \item eee\_tx\_lpi\_timer – EEE delay timer value, i.e., amount of time device should stay in idle mode prior to asserting its Tx LPI (in microseconds).
732 The upper-layer driver can configure one or more of the following EEE parameters.
747 \item \myfunc{dcbx\_query\_params}{dcbx_query_params} – The API returns the current dcbx configuration. It expects a type (i.e., local/remote/operational) and the buffer for storing the dcbx parameters of that type.\\
759 MFW needs various bits of information from the driver, and it gathers those in one of two methods:
764 In some cases, ‘Push’ is done without involvement of the ecore-client. If that’s not possible, it becomes more risky, as the responsibility of doing things correctly passes to the ecore-client. The ecore-client shouldn’t presume to do ‘push’ only for calls which match the configured management mode. Instead, it should always do them and let the ecore be the arbiter of whether those are needed by MFW or not. Ecore provides the following APIs for updating the configuration attributes; it is the client's responsibility to invoke these APIs at the appropriate time.
773 \item \myfunc{mcp\_ov\_update\_current\_config}{mcp_ov_update_current_config} – Drivers need to call this API when the user updates one (or more) of the following: MTU, primary MAC or Wake-on-LAN settings (to a non-default value). In addition, the driver also needs to call a unique API for each:
791 \item DISABLED - After firmware is successfully loaded on the device, ecore updates the driver state as DISABLED (as part of ecore\_hw\_init() implementation). \\
792 \item NOT\_LOADED - Ecore sets this state when the protocol driver is unloaded (as part of ecore\_hw\_remove()).\\
803 Ecore also provides the TLV request interface for MFW for querying the driver/device attributes. MFW uses the mailbox interface to notify ecore of the required TLV information. Ecore parses the request, populates the required information with the help of ecore clients and sends it to the MFW. The ecore client needs to provide the necessary infrastructure and the OSALs for implementing this interface.
805 \item OSAL\_MFW\_TLV\_REQ - The call indicates that ecore has received a TLV request notification from the MFW. The execution context is interrupt mode, hence the ecore client needs to schedule a thread/bottom-half context to handle this task, and return control immediately. The bottom-half thread will need to invoke \myfunc{mfw\_process\_tlv\_req}{mfw_process_tlv_req} for further processing of the TLV request.\\
806 \item OSAL\_MFW\_FILL\_TLV\_DATA - Ecore invokes this callback to get the TLV values of a given type. The ecore client needs to fill in the values for all the fields that it's aware of, and also needs to set the flags associated with the respective fields. For instance, if the client sets a value for the 'npiv\_enabled' field, it needs to set the flag 'npiv\_enabled\_set' to true.\\
817 This section describes in a very high-level manner several FW objects which are related to L2. Developers implementing L2 support over the ecore should be familiar with these\footnote{Probably even more than is in the scope of this document.}.
819 \item Virtual port [\myindex{VPORT}] -- Can simply be seen as a collection of queues, each HW-function will have at least one VPORT configured\footnote{And in most scenarios one will suffice.}. Classifications are configured per-VPORT. \\
829 Although VPORTs' and queues' indices are shared between all HW-functions on the same engine, the resource allocation scheme determines a range of VPORTs per-HW-function to use for configuration [i.e., the developer can assume the starting index is always 0 per-HW-function].
843 There are two identifiers of the queue -- the queue index to add and the VPORT index to add it to. The queue-index should be unique for the Rx-queue; no two Rx-queues of the same PF should use the same id.
844 There are quite a few parameters that need to be supplied, e.g., status block, physical addresses of rings, etc.
856 Very similar to the Rx queue start method, with some slight differences in the parameters [BD ring address instead of Rx rings, etc.]. For Tx-queues, the same queue-id can be shared between 2 different queues. That would cause those queues to share the same coalescing configuration.
859 Doorbell addresses are on a different BAR than that of other memories/registers accessed by the driver, and the PTT/GTT scheme does not apply to it; thus the address can simply be accessed using the necessary memory barriers.
862 \item \myfunc{sp\_vport\_update}{sp_vport_update} -- This is required to enable the VPORT. It should be called after the Tx/Rx queues were already added, and this will enable the VPORT to send and receive packets\footnote{Notice that without classification configuration Rx won't actually work. Also notice this function can do a lot of things; Enabling the VPORT is only one of them.}.
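Taken together, a rough bring-up order for a single HW-function is: \texttt{ecore\_sp\_vport\_start()} to create the VPORT, then starting the Rx and Tx queues against it, then \texttt{ecore\_sp\_vport\_update()} to enable it for traffic, and finally configuring the Rx mode and filters as described in the classification section below.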
874 Classification configuration consists [mostly] of three things:
876 \item Configuration of the \myindex{Rx mode} -- This defines which datagrams [unicast, multicast, broadcast] should be accepted by the VPORT, and whether all such datagrams or only if a filter is configured for them.
877 \item Configuration of unicast / multicast filters -- defining filters for specific unicast / multicast addresses which should be matched, given that Rx mode agrees.
878 \item Configuration of vlan filters -- by default, all vlans will be accepted. If at least one vlan [or vlan-mac] filter is configured, only traffic which matches one of the configured vlan filters will pass through.
887 \item \myfunc{filter\_accept\_cmd}{filter_accept_cmd} -- configures the Rx mode of the device.
888 \item \myfunc{sp\_vport\_update}{sp_vport_update} -- although not exactly a classification function, calling this will re-set the Rx mode [this calls \texttt{ecore\_filter\_accept\_cmd()} as part of its work].
907 This is pretty straightforward, and works in reverse order to the initialization of the L2 device.
926 Following the completion of the \texttt{vport\_stop}, no further traffic should be flowing. Interrupts can be released, and resources can be freed.
927 Notice this on its own doesn't return the device into a 'clean-slate' state; there are still several non-L2 things that need to be done [e.g., cleaning the status blocks of the queues].
934 Our device supports a \myindex{100G} link. However, the fastpath pipeline of each HW engine isn't fast enough for that line-rate. The term Hardware function is shorthand for the HW resources and identifications normally required by a single pci function. In 100G mode, the device will enumerate as a single pci function\footnote{Or more in multi-function mode; but we will stick with single-function mode for simplicity here.}, but the driver running over this pci function will utilize multiple HW functions.
935 From pci standpoint, the distinction between the HW functions (and thus the HW engines) is done via the bar address. Access to the first half of each of the pci function's bars will be translated into an access into a HW function on the first engine, while access to the second half will be translated into an access into a HW function on the second engine.
940 After the early initialization phase of the ecore (i.e., following ecore\_hw\_prepare()), the \textit{ecore\_dev} field \myindex{num\_hwfns} will be filled with the correct number of HW-functions under the PCI device. The ecore and its client should access only the first num\_hwfns entries in the \textit{hwfns} array.
944 Each path has its own IGU CAM, meaning it has its own set of available status blocks. But as both HW-functions share the same PCI function, there is a single MSI-X table for that device.
952 Then for $\forall n \in \mathbb{N}_{+}$, $\text{MSIX}_{2n}$ is connected to $\text{hwfn}_0$'s status block of index $n$, and $\text{MSIX}_{2n+1}$ is connected to $hwfn_1$'s status block of index $n$.
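In code form, the mapping from (HW-function, status block) to MSI-X vector is simply an interleaving; the helper below restates the relation above (the name is illustrative).
\begin{verbatim}
/* CMT: MSI-X vectors interleave between the two HW-functions. */
static int cmt_msix_vector(int hwfn_index, int sb_index)
{
        return 2 * sb_index + hwfn_index;    /* hwfn_index is 0 or 1 */
}
\end{verbatim}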
956 Ecore handles almost all the differences between CMT and regular mode on its own, i.e., it reads the number of HW-functions under the device and iterates when needed to configure both engines correctly (whereas in non-CMT mode it would have simply configured one).
965 following Example [\ref{ex:CMT1}], $\text{MSIX}_0$ should be enabled and connected to the DPC of $\text{hwfn}_0$ and $\text{MSIX}_1$ should be enabled and connected to the DPC of $\text{hwfn}_1$.
972 Since each HW-function is running on a different path and is an independent entity (as perceived by FW/HW), configuration should be almost symmetric for both HW-functions. E.g., Following the flow of section \ref{sec:l2-start}, ecore\_sp\_vport\_start() should be called separately for each HW-function, queues should be opened separately for each, etc..
981 There is an issue between the user control of the number of queues and the actual configuration of queues - e.g., assume user wants $X$ queues. If we use a symmetric configuration what we actually do is open $X$ queues on each path, meaning we actually open $2X$ queues.
983 We can either open only $X/2$ queues on each engine, in which case we lose some abilities, e.g., controlling the keys of the RSS hash-function, or open $2X$ queues and try to hide this fact from the user, but this most likely will incur either a performance penalty, hard-to-maintain code, or both.
990 This chapter describes the ecore interface for the upper-layer driver of the iSCSI protocol.
1006 \texttt{num\_sq\_pages\_in\_ring} & Number of outstanding tasks on the connection * 8B / page-size. \newline The suggested default value for the number of outstanding tasks on the connection is 256 (see the worked example after this table) \\ \hline
1008 \texttt{num\_uhq\_pages\_in\_ring} & Number of outstanding un-ACKed PDUs, suggested default value -- same as \texttt{num\_sq\_pages\_in\_ring} \\ \hline
1009 \texttt{num\_queues} & Number of global queues (CQ / CmdQ / RQ). \newline This should be $\leq$ number of available MSIX vectors for the PF \\ \hline
1011 \texttt{glbl\_q\_params\_addr} & The physical address of the list of pointers to the arrays of pointers to global queues pages. \newline The list is built as follows: CQ\#0 PBL pointer, RQ\#0 PBL pointer, CmdQ\#0 PBL pointer, CQ\#1 PBL pointer, RQ\#1 PBL pointer, CmdQ\#1 PBL pointer, etc. \newline Each PBL pointer points to the physical address which contains an array of pointers to the physical addresses of the specific queue pages. \\ \hline
1013 \texttt{rq\_num\_entries} & Number of RQ entries, suggested value for Initiator 16 (4KB RQ), for Target 128 \\ \hline
1015 \texttt{cmdq\_num\_entries} & Number of CmdQ entries, suggested default value \texttt{num\_tasks} \\ \hline
1018 \texttt{max\_fin\_rt} & Number of FIN retransmits before abortive close, suggested default value 3 \\ \hline
1042 \texttt{sq\_pbl\_addr} & The physical address of the array of pointers to the physical addresses of the SQ pages \\ \hline
1050 \texttt{ka\_max\_probe\_cnt} & Number of keep-alive probe retransmits before indicating connection error, suggested default value 10 \\ \hline
1065 \texttt{tos\_or\_tc} & Type-of-service for IPv4, traffic-class for IPv6 \\ \hline
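As a worked example of the \texttt{num\_sq\_pages\_in\_ring} formula above: with the suggested default of 256 outstanding tasks and a 4KB page size, $256 \times 8\text{B} / 4096\text{B} = 0.5$, which (assuming the result is rounded up to whole pages) gives a single SQ page in the ring.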
1109 \item The rest of the de-initialization process is described in section \ref{sec:init-de-init} for all protocols.
1121 This chapter describes the ecore interface for the upper-layer driver of the FCoE protocol.
1136 \texttt{glbl\_q\_params\_addr} & The physical address of the list of pointers to the arrays of pointers to global queues pages. \newline The list is built as follows: CQ\#0 PBL pointer, RQ\#0 PBL pointer, CmdQ\#0 PBL pointer, CQ\#1 PBL pointer, RQ\#1 PBL pointer, CmdQ\#1 PBL pointer, etc. \newline Each PBL pointer points to the physical address which contains an array of pointers to the physical addresses of the specific queue pages. \\ \hline
1137 \texttt{sq\_num\_pbl\_pages} & Number of outstanding tasks on the connection * 8B / page-size. \newline Suggested default value for number of outstanding tasks on the connection 256 \\ \hline
1138 \texttt{rq\_num\_entries} & Number of RQ entries, suggested value for Initiator 16 (4KB RQ), for Target 128 \\ \hline
1140 \texttt{cmdq\_num\_entries} & Number of CmdQ entries, suggested value \texttt{num\_tasks} \\ \hline
1142 \texttt{num\_cqs} & Number of global queues (CQ / CmdQ / RQ). This should be $\leq$ number of available MSIX vectors for the PF \\ \hline
1166 \texttt{sq\_pbl\_addr} & The physical address of the array of pointers to the physical addresses of the SQ pages \\ \hline
1167 \texttt{sq\_curr\_page\_addr} & The physical address of the first SQ page \\ \hline
1168 \texttt{sq\_next\_page\_addr} & The physical address of the second SQ page (or the first in case of a single page SQ) \\ \hline
1205 \item The rest of the de-initialization process is described in section \ref{sec:init-de-init} for all protocols.
1217 This chapter describes the ecore interface for the upper-layer driver of the RDMA protocol. The interface aims at sharing as much as possible between RoCE and iWARP. This chapter is not complete, and currently only details changes for iWARP (except for dcqcn, which was already detailed before).
1226 The ecore per-function context has a field \texttt{personality} which is set based on the protocol being iWARP/FCoE/iSCSI/RoCE/Ethernet. In the context of a network driver, the ecore personality could be ECORE\_PCI\_ETH\_ROCE, ECORE\_PCI\_IWARP, or ECORE\_PCI\_ETH (enum ecore\_pci\_personality).
1244 \texttt{min\_dpis} & the minimum number of device contexts required (i.e. the number of times open device can be called simultaneously) \\ \hline
1245 \texttt{num\_qps} & requested number of queue pairs\\ \hline
1247 \texttt{num\_mrs} & Number of supported MRs. Up to 4K are supported, suggested default value 1K \\ \hline
1249 \texttt{roce\_enable\_dcqcn} & If enabled, the maximum number of rate limiters will be allocated during hardware initialization, and they can later be initialized and configured during roce start. Must be set to enable dcqcn during roce initialization. This field is relevant to RoCE only.\\ \hline
1253 The values of num\_qps and num\_mrs will impact the amount of memory allocated in the ILT. Note that although these parameters are rdma specific, they are actually used during the common hw initialization phase. The amount of ilt memory will differ between RoCE and iWARP, as iWARP requires only one cid per QP and RoCE requires two.
1262 \texttt{desired\_cnq} & desired number of cnqs to be used. The upper-layer driver needs to make sure enough resources are available for this number (number of msix vectors and cnq resources)\\ \hline
1263 \texttt{cnq\_pbl\_list} & Array of pbls used per cnq. The array should be initialized according to the value set in desired\_cnq\\ \hline
1274 \texttt{iwarp\_rcv\_wnd\_size} & ecore will calculate the receive window scale from this. This number should be provided in bytes. There is a minimum of 64K; any number below this will result in the default window size of 1MB being set \\ \hline
1279 \item \myfunc{rdma\_query\_device}{rdma_query_device} -- this function returns a struct of type ecore\_rdma\_device which contains the capabilities and set options for the given device.
1285 When configuring the device to act as a notification point, the ecore will initialize the NIG block accordingly and pass the priority vlan and cnp send timeout values to FW. When configuring the device to act as a reaction point, the ecore will send a ramrod to FW that configures the rate limiters allocated for dcqcn support with the values received from the upper-layer driver (such as maximum rate, byte counter limit, active increase rate, etc.; full detail in the ecore\_roce\_api.h file). At this point all rate limiters will be configured with the same values. If in the future there is a need to configure different rate limiters with different values, an additional API function will be provided. During initialization, ecore will map between physical queues used for RoCE and rate limiters. The number of rate limiters allocated is handled by resource management and is currently divided equally between the functions. During modify\_qp, ecore will configure the responder and requester to work with a unique physical queue, which is configured to work with a unique rate limiter. QPs that are opened after all rate limiters are used up will be configured to run on a default physical queue which does not have a rate limiter. FW assumes that the qp\_id is equal to the physical queue id. For simplicity, the implementation assumes that Ethernet is not run simultaneously with RoCE (i.e., RoCE-only personality). If dcqcn is enabled and Ethernet is run, Ethernet will run on the same physical queue as the first qp that is allocated.
1289 Unlike RoCE, in which connection management is implemented completely in the host, connection management for iWARP, which involves the TCP 3-way handshake and MPA exchanges, is implemented in FW. The host is nevertheless involved in offloading TCP and MPA and exchanging connection parameters as part of the connection establishment/teardown process.
1305 For both passive and active connect, basic information on host and peer is required. We define a structure called \texttt{ecore\_iwarp\_cm\_info} which will be passed between driver and ecore on both downcalls and upcalls. Throughout the rest of the chapter we'll refer to this as the cm\_info.
1310 \texttt{remote\_ip} & IP address of the remote side. In host byte order. \\ \hline
1311 \texttt{local\_ip} & IP address of the local side. In host byte order. \\ \hline
1312 \texttt{remote\_port} & Port of the remote side. In host byte order. \\ \hline
1313 \texttt{local\_port} & Port of the local side. In host byte order. \\ \hline
1316 \texttt{private\_data\_len} & Length of the private data. \\ \hline
1320 Ecore sends the same parameters for all the different event types, though the required parameters for a given event type are a subset of the complete set of parameters. It does this so that it can pass a pointer to a data structure instead of reformatting the parameters in a different structure.
1323 On the active side of iWARP connection establishment, it is assumed that create\_qp will be called prior to calling connect. The created QP will be passed as a parameter to the connect function call.
1325 This function will take care of initiating the TCP 3-way handshake and MPA negotiation. Once the MPA response is received, the event EVENT\_ACTIVE\_COMPLETE will be issued to the upper-layer driver. This function is asynchronous. The function will receive cm\_info (detailed in \ref{sec:cminfo}), mss, and the local and remote MAC addresses. The MAC addresses will be acquired by the upper-layer driver using OS IP routing functions (such as find\_route in linux). In addition, it will require a pointer to the associated QP and a pointer to a callback function and callback context which will be used to indicate events to the driver which are related to this connection. \newline
1329 \texttt{ECORE\_NO\_MEM} & Memory is required for the driver context of a connection. If it can't be allocated, this failure will be returned. \\ \hline
1330 \texttt{ECORE\_SUCCESS} & Means TCP offload was performed. It does not mean the connection was established; the status of connection establishment will be passed with the EVENT\_ACTIVE\_COMPLETE. \\ \hline
1346 This function will receive socket local and remote addresses (port, ip and vlan) and add them to its listening database. In addition, a callback function and callback context will be provided, which will be used by ecore to send events of connection requests to the driver.
1368 \texttt{private\_data\_len} & length of private data.\\ \hline
1376 This function should be called when the upper-layer driver / App wants to reject a connection request for whatever reason, as a result of ecore's EVENT\_MPA\_REQUEST.
1387 \texttt{private\_data\_len} & length of private data.\\ \hline
1397 \texttt{cm\_info} & \ref{sec:cminfo} with values negotiated. Don't care in case of mpa\_reject \\ \hline
1417 The interface into ecore is done with the states of RoCE and translated internally to iwarp states. This was done
1434 To initiate a graceful disconnect sequence, the active side will perform a modify\_qp to ECORE\_ROCE\_QP\_STATE\_SQD. This will be translated to ECORE\_IWARP\_QP\_STATE\_CLOSING and initiate a graceful teardown sequence with FW. Currently, due to the existing FW implementation, a modify qp to error will be sent to FW before closing the connection. In the future, the FW HSI will be changed so that a CLOSING state is added to FW as well. Once the disconnect is complete, whether graceful or abortive (in some cases a graceful disconnect will turn into an abortive one due to timeouts, errors in close, etc.), an ECORE\_IWARP\_EVENT\_CLOSE event will be sent to the upper-layer driver. Ecore will transition to ERROR state in any case at the end of the flow.
1437 To initiate an abortive disconnect sequence, the active side will perform a modify\_qp to ECORE\_ROCE\_QP\_STATE\_ERR. This will be translated to ECORE\_IWARP\_QP\_STATE\_ERROR and initiate an abortive teardown sequence with FW. Once the disconnect is completed, an ECORE\_IWARP\_EVENT\_CLOSE event will be sent to upper layer driver. Ecore will transition to ERROR state in any case at the end of the flow.
1451 On the passive side, the teardown sequence is initiated once a graceful / abortive request is received from the peer. In this case ecore will send an ECORE\_IWARP\_EVENT\_DISCONNECT to the upper-layer driver. Ecore will transition to ERROR state in any case at the end of the flow.
1483 \item \myfunc{rdma\_create\_qp}{rdma_create_qp} -- This function will create the qp object in ecore, and for iWARP also in FW. In RoCE no FW ramrods are sent during the call to this function. The main change from the existing create\_qp function for iWARP is that instead of providing addresses to rq and sq separately and allocating memory for FW queues in ecore, FW requires contiguous memory for the pbl of all FW queues (RQ, SQ, ORQ, IRQ, HQ). Therefore the interface will change, and instead of the upper-layer driver providing the pbl address to create\_qp, these will be provided as out\_parameters after being allocated in ecore. The upper-layer driver will be required to pass the number of pages required for SQ / RQ. Populating the pbls will be done after calling create\_qp and not before as done today. For ease of code sharing between iWARP and RoCE, FW will modify the RoCE implementation to work the same as iWARP.
1484 \item \myfunc{rdma\_modify\_qp}{rdma_modify_qp} -- The API will remain the same; however, for iWARP not all fields are relevant. A naming convention of RDMA/iWARP/RoCE is used in ecore\_roce\_api to distinguish between what is required and what is not. Modify QP is used in iWARP for part of the teardown flow detailed in \ref{sec:iwarp_teardown}
1488 The ecore client has the ability to signal ecore that a specific tcp port in the app tlv should be recognized as pertaining to the iwarp offloaded connections. If an app tlv which matches this port is indicated by MFW, all offloaded iwarp traffic of the PF will abide by this configuration (regardless of the actual tcp port of the offloaded connections). The app tlv can be set by the ecore client via the regular APIs for setting ``locally administered params''. The ecore client communicates the tcp port value via the \texttt{rdma\_pf\_params} structure; the value needs to be populated before invoking \myfunc{resc\_alloc}{resc_alloc}. To configure the iwarp app tlv in the locally administered dcbx parameters, the ecore client needs to use the Dcbx APIs described in the ``Dcbx Interface'' section. The relevant APIs are \myfunc{dcbx\_get\_config\_params}{dcbx_get_config_params} and \myfunc{dcbx\_config\_params}{dcbx_config_params}.
1497 The LL2 is a simplified version of L2 for which both slowpath and fastpath flows reside in ecore, and it is being used by the upper-layer drivers of the storage protocols.
1510 \texttt{rx\_num\_desc} & maximal number of entries in the Rx ring. \\ \hline
1511 \texttt{tx\_num\_desc} & maximal number of entries in the Tx ring (each packet-buffer occupies an entry). \\ \hline
1512 \texttt{rx\_num\_ooo\_buffers} & Relevant only for the OOO connection (if 0, a default value of 2*rx\_num\_desc will be used). \\ \hline
1514 \texttt{rx\_vlan\_removal\_en} & can be set if it is desired to get the VLAN stripped and out-of-band. \\ \hline
1516 \texttt{tx\_dest} & Destination of Tx: Network or Loopback. \\ \hline
1521 \texttt{cbs} & Callback functions that should be called on completion or release of rx / tx packets. \\ \hline
1542 \item \myfunc{ll2\_prepare\_tx\_packet}{ll2_prepare_tx_packet} -- this function adds a new packet to the transmit ring. If the packet is composed of more than a single buffer, then the address and length of the additional buffers are provided to ecore by calling \texttt{ecore\_ll2\_set\_fragment\_of\_tx\_packet} for each additional buffer. \\
1543 \texttt{num\_of\_bds} is the number of buffers that compose the packet (including the first buffer), and is limited to CORE\_LL2\_TX\_LOCAL\_RING\_SIZE.
1546 \item \myfunc{ll2\_set\_fragment\_of\_tx\_packet}{ll2_set_fragment_of_tx_packet} -- this function provides the next buffer of a packet. \texttt{addr} should be a DMA-mapped address, and \texttt{nbytes} is the buffer length in bytes.
1548 \item \texttt{complete\_tx\_packet} -- this is a callback function that should be implemented in the upper driver. Ecore calls this function when the transmission of the packet is completed (it is called once per-packet). \texttt{cookie} and \texttt{first\_frag\_addr} are echoed from the call that posted that first fragment of the packet. \texttt{b\_last\_fragment} and \texttt{b\_last\_packet} can be ignored.
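For example, to transmit a packet built from three buffers, the upper driver would call \texttt{ecore\_ll2\_prepare\_tx\_packet()} once with \texttt{num\_of\_bds} set to 3 and the first buffer's address and length, then call \texttt{ecore\_ll2\_set\_fragment\_of\_tx\_packet()} twice, once for each remaining buffer; when transmission completes, \texttt{complete\_tx\_packet()} is invoked a single time for the whole packet, echoing the cookie and the first fragment's address.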
1569 \myindex{SRIOV} is a PCIe functionality which allows Physical functions (also termed \myindex{PF}s) to spawn Virtual functions (also termed \myindex{VF}s), with a limited set of registers in their PCI configuration space and bars, but that should supply roughly the same basic functionality
1573 With some exceptions, it could be said that the upper driver doesn’t need to be aware of driving a VF instead of a PF.
1578 Sections \ref{sec:sriov-hw-channel}, \ref{sec:sriov-tlv} mostly give a glimpse of the mechanism used by the ecore to support the feature, while the rest of the sections are of more interest to the upper-driver implementer since they contain the howtos.
1584 Upper driver can read values in this struct [instead of accessing PCI configuration space] if needed,
1585 but there is a single field it 'owns', b\_hw\_channel -- in most distros VFs will communicate with PFs using the HW-channel [see section \ref{sec:sriov-hw-channel}], and the upper-driver should set it to `true'. However, if the upper-driver utilizes a designated SW-channel which it can use instead of the HW-channel, it should let this field remain 'false'. \\
1587 An additional important field is the \myindex{total\_vfs} which represents the maximal number of VFs the current PF can possibly have. The macro \myindex{IS\_ECORE\_SRIOV} can be used to determine whether the PF has $\text{total\_vfs} > 0$, and therefore whether IOV is relevant to the PF or not. \\
1589 Important terminology when talking about VFs is \myindex{relative\_vfid} versus \myindex{absolute\_vfid}. The relative vfid is the zero-based index of the VF relative to its parent PF, i.e., the first VF of a given PF is always 0, second is 1, etc.
1590 The absolute vfid is the zero-based index of the VF relative to all the VFs on the same path, i.e., it's possible the first VF of a given PF will have an absolute vfid which is greater than zero.
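For example, if the first PF on a path has 4 VFs and the next PF's VFs follow them in the path-wide enumeration, then the second PF's first VF has a relative\_vfid of 0 but an absolute\_vfid of 4.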
1595 When a PF is about to initialize its VFs, it should enable the access of each VF to the HW by calling \myfunc{iov\_init\_hw\_for\_vf}{iov_init_hw_for_vf} for each VF [passing its relative vfid].
1597 At this point upper-layer driver has to know the number of interrupts to assign to each VF, since the MSIX table in each VF configuration space must be arranged by ecore prior to VF activation.
1601 Afterwards, the upper-driver can initialize the VF the same as it would have the PF, i.e., the difference in initialization logic is 'hidden' inside the ecore. Upper-layer code doesn't need to contain all sorts of if-else clauses to differentiate between VF and PF [at least, not as far as the ecore initialization is concerned].
1603 Closing the VF should operate smoothly without need of any special calls. I.e., the regular closure sequence for PFs should be translated by ecore transparently into a sequence closing the VF.
1607 If the upper-driver has the option, during sriov-disable, prior to the VFs being unset in the PF's PCI configuration space, the upper-driver should call \myfunc{iov\_set\_vf\_to\_disable}{iov_set_vf_to_disable} for each one of its active VFs. This will result in a cleaner FW/HW after closure is complete.
1611 The VF's PCI bar is very different from the PF bar, and with much more limited access toward chip; see \cite{doc:iov-sys} for details about the VF bar. As a result, most of the slowpath configuration that needs to be done for the VF actually has to be done by the PF.
1613 To support this, there is a mechanism of \myindex{TLV}\footnote{Type-Length-Value} message passing from VF to PF, in which the VF can request the PF to either perform services for it or supply it with information which is otherwise inaccessible for the VF.
1614 This message passing is usually done via the HW channel [see section \ref{sec:sriov-hw-channel}], but assuming the existence of an alternative method [i.e., SW-channel] it can be done via it just as well.
1617 During \textit{ecore\_hw\_prepare()} ecore gathers information about the chip from various locations - HW, shared memory with Management FW, etc.. However, almost all of that information is inaccessible to the VF. Thus the VF has an alternative flow by which it sends an ACQUIRE message to the PF, notifying it that it's up and requesting information about the device - e.g., number of resources such as status blocks and queues available to the VF.
1635 \item VF prepares a message on DMA-able memory, which also contains an address of an additional DMA-able memory upon which the VF will poll for the PF reply.
1639 \item PF driver's ISR wakes. It recognizes the message and calls OSAL\_PF\_VF\_MSG to notify the upper-layer driver of the message; this is mostly because the slowpath context isn't the proper place to handle VF messages.
1647 One optional auxiliary function that can be used by the ecore-client is \myfunc{iov\_pf\_get\_pending\_events}{iov_pf_get_pending_events}. It will return a bitmask of all the VFs belonging to the PF for which there's a message yet to be processed.
1656 The Message passing from VF to PF benefits from the PF's slowpath status-block, i.e., the ability of the PF to receive slowpath interrupts.
1657 Since the VF hasn't got such a status block allocated for it, the message passing between the PF and the VF consists of polling on the VF side.
1662 PF can use bulletin boards to notify VF of the current link state. Notice that the link state doesn't necessarily have to reflect the physical link state.
1663 E.g., Hypervisor tools might be used to configure VF link state as always up regardless of physical state, so that VFs could communicate using Tx-Switching.
1668 If such a change occurs, since the bulletin doesn't contain deltas from previous messages but rather the entire data [due to lack of handshake the PF can't know if VF read previous bulletin boards], the upper-driver has a wide assortment of functions-per-feature which are defined in ecore\_vf\_api.h and can be used to learn of the current state. E.g., \myfunc{vf\_get\_link\_state}{vf_get_link_state},
1673 [Inner workings of the ecore; Not necessary for the upper-driver implementer]
1676 \item During \textit{ecore\_hw\_prepare()} of the VF, ecore allocates a DMA-able buffer for the bulletin board.
1677 \item During the ACQUIRE message sent from VF to PF, VF posts the physical address of the bulletin board as well as its size.
1678 \item During the ACQUIRE response sent from PF to VF, PF agrees upon the size of bulletin board which will be used [forward-backward compatibility].
1679 \item Whenever any of the bulletin fields the PF wants to post changes, the PF increments a counter, calculates a CRC and uses DMAE to copy its local buffer into the VF's bulletin buffer.
1680 \item On the VF-side, the polled \textit{ecore\_vf\_read\_bulletin()} samples the buffer, verifies the CRC [to make sure it has a consistent image of the buffer] and, if the bulletin index has incremented since last seen, gets updated according to the new bulletin board.
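A VF-side sketch of that check follows; the structure layout and CRC helper here are illustrative, not the actual bulletin definitions.
\begin{verbatim}
struct bulletin_hdr {
        unsigned int crc;          /* CRC over the bulletin contents     */
        unsigned int version;      /* incremented by the PF on each post */
        /* feature fields follow ... */
};

extern unsigned int compute_crc(const struct bulletin_hdr *b); /* hypothetical */

static int bulletin_updated(const struct bulletin_hdr *b,
                            unsigned int *last_version)
{
        if (compute_crc(b) != b->crc)        /* torn / partial copy      */
                return 0;
        if (b->version == *last_version)     /* nothing new was posted   */
                return 0;
        *last_version = b->version;          /* consume the new image    */
        return 1;
}
\end{verbatim}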
1686 In addition, FLR might be used internally by driver/FW in case of malicious VFs, where that VF's database should be cleared before re-enablement.
1693 [See \cite{doc:iov-sys} for more details]. From the driver point-of-view, management FW notifies the driver of FLR after it and the storm FW have already done some work [the storm FW has done what's called `initial cleanup'].
1708 SR-IOV is exposed to complex versioning challenges. Specifically, a given PF driver may be working with VF drivers of older and/or newer versions at the same time.
1711 The receiver of a message (be it PF receiving a request or VF receiving a response) will parse the message, process the parts it is aware of and be able to skip over parts which it doesn't recognize.
1715 The compatibility requirements of fastpath flows have to be lenient, as we can't afford to penalize the performance.
1716 If the fastpath API changes in a non backward compatible fashion (assumed to be a rare occurrence) the VF will either fail to load, or else have to carry with it several alternate implementations for fastpath. The VF driver learns of the fastpath version from the slowpath interaction with the PF.
1724 This chapter describes the ecore interfaces for selftests. The scope of the selftests is to sample various aspects of device functionality and verify that it is operational. It is not intended and does not lay claim to perform full coverage of any functionality. \\
1727 \myfunc{selftest\_register}{selftest_register} -- this test verifies the data integrity of the registers. It writes a predefined value to the register, reads it back and verifies that the contents are correctly saved. It saves the register's original content before performing the test and restores its value after the test. This test is performed via MFW and accesses registers from both engines as well as registers from engine common blocks.
1730 \myfunc{selftest\_clock}{selftest_clock} -- it measures the clock frequencies of the E4 modules. The clocks verified in this test are,
1741 \myfunc{selftest\_memory}{selftest_memory} -- this test samples some of the memories in the device. Ecore employs its most basic flow which exercises memories, again the heartbeat ramrod. In this flow context is loaded to the context manager memory and is verified by the storm FW (otherwise the ramrod would fail).
1751 This chapter provides a high-level overview of PTP and describes the ecore interfaces for it. PTP, also known as Time Sync, allows the synchronization of clocks in distributed systems. The protocol selects one clock in the network as the master clock and all other clocks (slave clocks) synchronize with the master. The driver's responsibilities include enable/disable of the PTP feature on the device, register/un-register of the hardware clock and its operations to the OS, and configuration of the required Rx/Tx PTP filters. HW/FW does the timestamping of Tx/Rx PTP packets; the driver needs to read these timestamp values and present them to upper-layer protocols (e.g., IPv4). Rx timestamping will be available during the Rx interrupt processing of the driver. FW does the Tx timestamping when the first byte of the PTP packet is placed on the wire; the driver has to poll for the availability of this timestamp value when processing the PTP Tx packet. \\
1761 As part of feature clean up, ecore client should call \myfunc{ptp\_disable}{ptp_disable} API to disable the PTP feature on the hardware. \\