Reference Implementation of the Process Management Interface Exascale (PMIx) standard
Please note that the role of the fabric manager within PMIx is a subject of current RFC development. Thus, the description on this page should be considered a “draft” at this time and is provided to help stimulate discussion.
The fabric manager (FM) has several areas of interaction with PMIx, providing support for operations spanning the scheduler, the resource manager (RM), debuggers, and applications. Some of these represent new capabilities, while others are alternative (possibly more abstracted) methods for current interactions. In the PMIx reference library, support for each fabric type is provided via a fabric-specific plugin. Thus, fabric vendors are free to provide whatever level of support they deem of value to their customers for the identified operations.
Schedulers generally require knowledge of the fabric topology as part of their scheduling procedure to support optimized communication between processes on different nodes. Current methods utilize files, read by the scheduler when it starts, to communicate this information. However, the format of these files tends to be both scheduler and fabric vendor specific, making them more difficult to maintain as cluster configurations evolve (e.g., as a cluster reconfigures its network topology, or vendors offer new fabrics).
PMIx defines two modes for supporting inventory collection:
Regardless of which mechanism is used, the final inventory needs to include both a topology description of switch connectivity, and a description of the NICs on each node. The latter must include at least all fabric-related contact information (e.g., GID/LID) as well as the network plane to which the NIC is attached. Additional information on NIC capabilities (e.g., firmware version, memory size) is desirable, but not required.
Inventory collection is expected to be a rare event – occurring at system startup and upon command from a system administrator. Inventory updates are expected to be smaller operations involving only the changed information. For example, replacement of a node would generate an event notifying the scheduler with an inventory update without invoking the global inventory operation.
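As a rough sketch of how a scheduler or system-level daemon hosting a PMIx server might trigger such a collection, recent versions of the reference library expose a PMIx_server_collect_inventory entry point. The helper name, the lack of directives, and the busy-wait completion below are simplifications for illustration only:

```c
#include <stdbool.h>
#include <stdio.h>
#include <pmix_server.h>

/* Receives whatever inventory the local fabric/resource plugins gathered */
static void inv_cbfunc(pmix_status_t status,
                       pmix_info_t info[], size_t ninfo,
                       void *cbdata,
                       pmix_release_cbfunc_t release_fn, void *release_cbdata)
{
    if (PMIX_SUCCESS == status) {
        fprintf(stderr, "Collected %zu inventory entries\n", ninfo);
        /* a scheduler would roll these entries into its resource database */
    }
    if (NULL != release_fn) {
        release_fn(release_cbdata);   /* let the library free its copy */
    }
    *(volatile bool *)cbdata = true;  /* signal completion to the caller */
}

void collect_fabric_inventory(void)
{
    volatile bool done = false;
    /* No directives in this sketch - a real caller might restrict collection
     * to specific fabric planes or request more verbose NIC data */
    pmix_status_t rc = PMIx_server_collect_inventory(NULL, 0, inv_cbfunc, (void *)&done);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "Inventory collection could not be started: %d\n", rc);
        return;
    }
    while (!done) {
        ;   /* a real host would progress its event loop here */
    }
}
```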
Once an allocation has been made, the FM is involved in two steps of the application launch procedure. Each of these will occur at least once per application. Optimizations (e.g., caching of local NIC addressing info on compute nodes) are left to the discretion of the PMIx plugin implementor for each specific fabric.
Prior to starting the application, the RM may need to pass configuration information for the job to the FM (e.g., defining/setting QoS levels). The precise pre-launch settings will depend upon the abilities of the RM as well as the supported options of the FM. In order to coordinate this operation, the FM must support an ability for the RM to query (via the PMIx_Query_nb API) its levels of support (e.g., available QoS levels). This should include obtaining a list of configurable options, and any predefined set points.
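For illustration, the following sketches how an RM might issue such a query via PMIx_Query_nb. The helper name and the attribute string are placeholders, since the actual fabric configuration keys are still subject to the RFC:

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pmix.h>

/* Invoked when the answer (relayed from the FM) arrives */
static void support_cbfunc(pmix_status_t status,
                           pmix_info_t info[], size_t ninfo,
                           void *cbdata,
                           pmix_release_cbfunc_t release_fn, void *release_cbdata)
{
    if (PMIX_SUCCESS == status) {
        for (size_t n = 0; n < ninfo; n++) {
            fprintf(stderr, "Configurable option: %s\n", info[n].key);
        }
    }
    if (NULL != release_fn) {
        release_fn(release_cbdata);   /* release the library's copy of the data */
    }
    *(volatile bool *)cbdata = true;  /* signal completion to the caller */
}

pmix_status_t query_fabric_support(void)
{
    volatile bool done = false;
    pmix_query_t query;

    PMIX_QUERY_CONSTRUCT(&query);
    /* Placeholder key: the attribute naming the FM's configurable options
     * (e.g., available QoS levels) is still being defined by the RFC */
    query.keys = (char **)calloc(2, sizeof(char *));
    query.keys[0] = strdup("pmix.fabric.qos.levels");

    pmix_status_t rc = PMIx_Query_nb(&query, 1, support_cbfunc, (void *)&done);
    while (PMIX_SUCCESS == rc && !done) {
        ;   /* a real caller would progress its event loop here */
    }
    PMIX_QUERY_DESTRUCT(&query);
    return rc;
}
```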
Allocation of resources typically occurs when the scheduler is notified of job completion and just prior to initiation of the epilog portion of the prior job. However, many schedulers also utilize anticipatory scheduling to identify one or more jobs that will be run under different termination scenarios, and the need to aggregate resources for larger jobs often results in at least some portion of the allocation becoming available on a rolling basis. Thus, there frequently is some amount of time available for pre-conditioning of the allocation prior to start of a newly scheduled job.
RMs can take advantage of this time to pass both the allocation and job description to subsystems such as the fabric manager via the appropriate PMIx API, receiving back a “blob” containing all information each subsystem has determined will be required by its local resources within the allocation to support the described job. Values returned by the FM must be marked to identify which are to be given directly to the local resource (e.g., network driver) on each node, and which are to be passed to application processes as environment variables that are recognized by the resource when initialized by that specific process. For example, some networks may “preload” the local network library with the location of all processes in that application, thus allowing the library to compute the required address information for any process, and/or include a security token to protect inter-process communications.
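A minimal sketch of that system-level exchange, assuming the PMIx_server_setup_application server API is used to gather the blob (the helper name is illustrative, and the directives describing the job are omitted):

```c
#include <stdbool.h>
#include <pmix_server.h>

/* Receives the pre-conditioning "blob" (as an array of pmix_info_t) assembled
 * by the fabric plugin and any other contributing subsystems */
static void setup_app_cbfunc(pmix_status_t status,
                             pmix_info_t info[], size_t ninfo,
                             void *provided_cbdata,
                             pmix_op_cbfunc_t cbfunc, void *cbdata)
{
    if (PMIX_SUCCESS == status) {
        /* pack the ninfo entries into the launch message that will be sent
         * to the RM daemons on the allocated nodes */
    }
    if (NULL != cbfunc) {
        cbfunc(PMIX_SUCCESS, cbdata);   /* tell the library we are done with the data */
    }
    *(volatile bool *)provided_cbdata = true;
}

/* Called by the system-level RM once the allocation and job description are known */
pmix_status_t precondition_job(const char *nspace)
{
    volatile bool done = false;
    /* a real RM would pass the allocation/job description as directives
     * instead of NULL/0 */
    pmix_status_t rc = PMIx_server_setup_application(nspace, NULL, 0,
                                                     setup_app_cbfunc, (void *)&done);
    while (PMIX_SUCCESS == rc && !done) {
        ;   /* progress the event loop until the callback fires */
    }
    return rc;
}
```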
Once the initialization blob has been obtained, the RM passes it to its remote daemons on the allocated nodes as part of the launch message. Prior to spawning any local processes, each RM daemon utilizes a PMIx interface to deliver the configuration information to the local resources. This allows the resources to prepare any internal state tables for the job, an operation which may require elevated privileges and therefore cannot be performed by the application processes themselves. Upon completion of this “instant-on” initialization, processes have all information required to communicate with their peers – no further peer-to-peer exchange of information is required.
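A corresponding sketch for the daemon side, assuming PMIx_server_setup_local_support is the delivery path; here the blob is presumed to arrive as an array of pmix_info_t unpacked from the launch message, and the helper name is again illustrative:

```c
#include <stdbool.h>
#include <pmix_server.h>

static void op_cbfunc(pmix_status_t status, void *cbdata)
{
    *(volatile bool *)cbdata = true;   /* local resources are now configured */
}

/* Called by each RM daemon after unpacking the launch message and before
 * spawning any local processes for the job */
pmix_status_t deliver_local_setup(const char *nspace,
                                  pmix_info_t blob[], size_t nblob)
{
    volatile bool done = false;
    pmix_status_t rc = PMIx_server_setup_local_support(nspace, blob, nblob,
                                                       op_cbfunc, (void *)&done);
    while (PMIX_SUCCESS == rc && !done) {
        ;   /* progress the event loop until the callback fires */
    }
    return rc;
}
```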
Accomplishing this procedure requires that the FM and associated network driver provide support for:
When a connection is to be formed (most libraries do that lazily nowadays), the library may be able to simply send the message (if the network driver has been preloaded with connection information), or it can obtain the remote port/plane assignment for the target process by calling PMIx_Get to retrieve the info from the shared memory region and transferring it to the network driver via an appropriate interface.
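A sketch of that lookup, assuming the endpoint information is published under a fabric-specific key (the key string and helper name below are placeholders, not defined attributes):

```c
#include <pmix.h>

/* Resolve the fabric contact info published for a peer process.
 * "pmix.fabric.endpt" is a placeholder - the actual attribute name is
 * defined by the fabric plugin / evolving standard. */
pmix_status_t lookup_peer_endpoint(const char *nspace, pmix_rank_t rank)
{
    pmix_proc_t peer;
    pmix_value_t *val = NULL;

    PMIX_PROC_LOAD(&peer, nspace, rank);
    pmix_status_t rc = PMIx_Get(&peer, "pmix.fabric.endpt", NULL, 0, &val);
    if (PMIX_SUCCESS == rc && NULL != val) {
        /* hand the value (e.g., a byte object holding GID/LID and plane) to
         * the network driver via its own interface, then release it */
        PMIX_VALUE_RELEASE(val);
    }
    return rc;
}
```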
PMIx provides a mechanism by which various system management subsystems can notify applications and other subsystems of events that might require their attention. This can include failures (broadly defined as anything that impacts the ability of that subsystem to perform as expected) or warnings of impending issues (e.g., a temperature excursion that could lead to shutdown of a resource). Events are generated on an asynchronous basis – i.e., the various subsystems are not expected to “poll” each other to obtain reports of events – and are transferred by PMIx to the RM for distribution to applications that have requested notification through the PMIx event registration interface. The FM is expected to provide notification of fabric-related events, including changes in routing (e.g., due to traffic or link failures) and node replacement.
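A sketch of how an application might register for such notifications using PMIx_Register_event_handler; since the fabric-specific status codes are still being defined, this example simply registers a default handler for all events:

```c
#include <stdio.h>
#include <pmix.h>

/* Handler invoked when the RM relays a fabric-related event to this process */
static void fabric_evhandler(size_t evhdlr_registration_id,
                             pmix_status_t status,
                             const pmix_proc_t *source,
                             pmix_info_t info[], size_t ninfo,
                             pmix_info_t results[], size_t nresults,
                             pmix_event_notification_cbfunc_fn_t cbfunc,
                             void *cbdata)
{
    fprintf(stderr, "Event %d received from %s:%u\n",
            status, source->nspace, source->rank);
    /* e.g., recompute routes or mark a peer as unreachable */
    if (NULL != cbfunc) {
        /* tell PMIx we are done and the event may be passed to other handlers */
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

static void reg_cbfunc(pmix_status_t status, size_t evhdlr_ref, void *cbdata)
{
    /* evhdlr_ref can later be used to deregister the handler */
}

void register_for_fabric_events(void)
{
    /* NULL/0 registers a default handler for all events; a real application
     * would pass the specific fabric-related status codes once the RFC
     * defines them */
    PMIx_Register_event_handler(NULL, 0, NULL, 0,
                                fabric_evhandler, reg_cbfunc, NULL);
}
```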
As the FM is the source of these events, the impact on the FM is dictated solely by the frequency of events to report.
Applications, tools, and debuggers would all benefit from access to FM-based information. These requests are expected to be relatively infrequent, except perhaps during a dedicated debugging session when more frequent requests for network loading information might be appropriate. Two areas of support have been identified (note that more will undoubtedly arise over time):
* Debuggers currently have limited, if any, access to network information. Given appropriate query support, a debugger could display an application layout from a network (as opposed to compute node) viewpoint, showing the user how the application is wired together across the fabric. In addition, traffic reports in support of debugging operations would aid in the identification of “hot spots” during execution.
In general, debugger queries should be infrequent except for the traffic reporting scenario. Reducing the burden on the FM in this latter use-case might require adoption of a distributed notification architecture – e.g., allowing the debugger to register for “traffic report” events, and enabling individual switches to generate them.
* Application developers have indicated interest in an ability to query the current network topology of their application, request QoS changes, split their network resource allocation into multiple QoS “channels”, etc. These requests are expected to be infrequent (a sketch of one possible form follows below).
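One possible shape for such a QoS-change request, assuming it is routed through PMIx_Allocation_request_nb; both the directive choice and the attribute key below are placeholders chosen for illustration, as is the helper name:

```c
#include <stdbool.h>
#include <pmix.h>

static void alloc_cbfunc(pmix_status_t status,
                         pmix_info_t info[], size_t ninfo,
                         void *cbdata,
                         pmix_release_cbfunc_t release_fn, void *release_cbdata)
{
    /* status reports whether the requested change was accepted */
    if (NULL != release_fn) {
        release_fn(release_cbdata);
    }
    *(volatile bool *)cbdata = true;
}

/* Ask the RM (and, through it, the FM) to adjust the network portion of the
 * existing allocation. The PMIX_ALLOC_EXTEND directive and the
 * "pmix.fabric.qos" key are illustrative placeholders. */
pmix_status_t request_qos_change(const char *qos_level)
{
    volatile bool done = false;
    pmix_info_t info;

    PMIX_INFO_CONSTRUCT(&info);
    PMIX_INFO_LOAD(&info, "pmix.fabric.qos", qos_level, PMIX_STRING);

    pmix_status_t rc = PMIx_Allocation_request_nb(PMIX_ALLOC_EXTEND, &info, 1,
                                                  alloc_cbfunc, (void *)&done);
    while (PMIX_SUCCESS == rc && !done) {
        ;   /* progress the event loop until the callback fires */
    }
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}
```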