OpenPMIx

Reference Implementation of the Process Management Interface Exascale (PMIx) standard


UNDER CONSTRUCTION

Please note that tiered storage support is a subject of current RFC development. Thus, the description on this page should be considered a “draft” at this time and is provided to help stimulate discussion.

Next-generation systems are designed to access information that resides in an array of storage media, ranging from offline archives to streaming data flows. This tiered storage architecture presents a challenge to application developers and system managers striving to achieve high system efficiency and performance. Multiple vendor-specific solutions have been proposed, each with its own API and associated data structures. This diversity reduces application portability and raises the cost of migrating customers across procurements. Likewise, it would be burdensome for scheduler/work load manager (WLM) vendors to interface directly with storage managers from numerous vendors. Accordingly, the PMIx Storage working group seeks to define a flexible, vendor-independent solution: an abstraction layer based on PMIx APIs and data structures in accordance with the PMIx standard architecture:

The current effort focuses on two distinct stages: application startup time and run-time control for workflow steering. Underpinning the effort are several key premises:

Application Startup

Application startup is significantly impacted by the time required to instantiate the executable and its dependent libraries on all of the active compute nodes. Understanding the time needed by the storage system to obtain access to the necessary bits on storage media and relocate them to the compute nodes is therefore essential to efficient use of the system’s resources.

[Figure: Discovery]

The startup process can be separated into two phases. The first phase, shown at right, centers on identifying the bits required to execute the application. The most accurate approach is for the user to specify the binaries and their dependencies (both data and libraries) when submitting the job. However, users often cannot (or will not) provide a complete list of dependencies, especially when using third-party dynamically linked libraries.

To support these situations, PMIx provides an API by which the WLM can obtain a list of the libraries and files required by the submitted job using some combination of:

Once the dependencies have been identified, the WLM can query the storage manager for an estimate of the time required to acquire the executables, libraries, and data files specified by the scheduler and to position them at the locations the scheduler specifies. The WLM can then use that estimate in its scheduling algorithm. This allows, for example, the time to retrieve data from cold storage (e.g., an offline tape archive) to be factored into the schedule.
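To make the scheduling use of these estimates concrete, here is a minimal sketch in C. It does not use any PMIx API: the `stage_estimate_t` type and the three helper functions are hypothetical names invented for illustration, times are assumed to be in seconds, and the sketch assumes all files are staged in parallel so that the slowest transfer determines readiness.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record (not a PMIx type): one file the job needs, plus the
 * storage manager's estimate of the seconds needed to stage it. */
typedef struct {
    const char *path;
    double stage_seconds;
} stage_estimate_t;

/* Assumption: files are staged in parallel, so the job's data is ready
 * once the slowest individual transfer has finished. */
double total_stage_seconds(const stage_estimate_t *est, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(est[i].stage_seconds >= 0.0);
        if (est[i].stage_seconds > worst) {
            worst = est[i].stage_seconds;
        }
    }
    return worst;
}

/* Earliest feasible start: the later of "nodes become free" and
 * "data is in place if staging begins now". */
double earliest_start(double now, double nodes_free_at, double stage_seconds)
{
    double data_ready_at = now + stage_seconds;
    return data_ready_at > nodes_free_at ? data_ready_at : nodes_free_at;
}

/* Latest moment staging can begin and still finish by the planned start. */
double prestage_deadline(double planned_start, double stage_seconds)
{
    return planned_start - stage_seconds;
}
```

With a one-hour retrieval from a tape archive and nodes that free up in ten minutes, `earliest_start` reports that the job cannot begin for an hour, which lets a scheduler backfill the nodes in the meantime instead of leaving them idle while the data arrives.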

[Figure: Prestage]

The second phase of the startup process, shown at right, begins when the time window for job execution approaches, i.e., when the WLM anticipates that the allocation will be granted. At this point, the WLM alerts the storage manager to the upcoming allocation, passing it a list of files to be retrieved and the locations where those files are to be cached. The precise timing of the caching operation is a function of the system management stack (SMS) environment. For example, some systems may initiate pre-staging while an existing job is executing its epilog, allowing the operation to continue in parallel with the prolog for the new allocation. Others may choose to pre-stage only during the prolog phase, or to make pre-staging contingent upon allocation of cache storage resources. The role of PMIx in this phase remains the same: to provide an API by which the WLM can direct the storage manager to move bits to their target destinations. A corresponding PMIx event has been defined by which the storage manager can alert the WLM when the bits have been cached into position.

Note that the WLM is not required to support these features – they are offered solely as an optional method for optimizing launch performance.
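The pre-staging handshake described above (the WLM hands over a file list with cache locations, and the storage manager later signals completion) can be sketched as a request plus a completion callback. Everything in this C sketch is hypothetical and stands in for the real interfaces: `stage_item_t`, `storage_prestage`, and `wlm_on_staged` are illustrative names, not PMIx functions, and the stub invokes the callback immediately, whereas a real storage manager would deliver its notification asynchronously.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request entry: where a file lives now and where it should
 * be cached before the allocation starts. */
typedef struct {
    const char *source;
    const char *cache_location;
} stage_item_t;

/* Hypothetical completion callback, standing in for the event by which
 * the storage manager tells the WLM that the bits are in position. */
typedef void (*staged_cb_t)(size_t n_staged, void *cbdata);

/* Stub storage manager. It "stages" every well-formed entry and then
 * notifies the caller. Returns 0 on acceptance, -1 on a bad request. */
int storage_prestage(const stage_item_t *items, size_t n,
                     staged_cb_t on_done, void *cbdata)
{
    if (items == NULL || on_done == NULL) {
        return -1;
    }
    size_t staged = 0;
    for (size_t i = 0; i < n; i++) {
        if (items[i].source != NULL && items[i].cache_location != NULL) {
            staged++;
        }
    }
    on_done(staged, cbdata);   /* synchronous here only for simplicity */
    return 0;
}

/* Hypothetical WLM-side job record updated by the callback. */
typedef struct {
    int data_ready;
    size_t n_staged;
} wlm_job_t;

void wlm_on_staged(size_t n_staged, void *cbdata)
{
    assert(cbdata != NULL);
    wlm_job_t *job = cbdata;
    job->n_staged = n_staged;
    job->data_ready = 1;       /* the job may now be launched */
}
```

The WLM would issue `storage_prestage` at whatever point its SMS environment chooses (during the previous job's epilog, during the prolog, or once cache resources are allocated) and hold the launch until `data_ready` is set.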

Run-Time Control

Run-time control over storage options is likewise important, particularly for dynamically-steered workflows. This includes the ability for applications to influence the location, relocation, and storage strategies (e.g., striping across multiple locations, hot/warm/cold storage) of checkpoints and other data. In this case, storage directives can be issued both by the application itself and by tools executed by the user on (for example) a login node while the application is executing. Likewise, a mechanism is needed by which the SMS can alert the application to impending changes in resource availability, data movement, and other changes that might impact execution.

[Figure: Runtime Control]

PMIx supports run-time storage control by providing:

A couple of key issues remain under discussion:

As with launch support, WLMs are not required to support run-time storage control. They are, however, required to at least return a “not supported” error in response to requests for unsupported services. Providing a NULL function pointer in the server callback function structure is considered equivalent to returning the “not supported” response.
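The equivalence between a NULL function pointer and an explicit “not supported” response can be illustrated with a small dispatch sketch in C. The names below (`sketch_server_module_t`, `dispatch_storage_request`, and the `SKETCH_` status codes) are hypothetical stand-ins chosen for this example; the actual PMIx headers define their own server callback structure and status constants, which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch only. */
#define SKETCH_SUCCESS            0
#define SKETCH_ERR_NOT_SUPPORTED (-1)

/* Hypothetical signature for a host-provided storage-control entry point. */
typedef int (*storage_directive_fn_t)(const char *directive);

/* Hypothetical callback structure: the host fills in the entries it
 * implements and leaves the rest NULL. */
typedef struct {
    storage_directive_fn_t storage_control;
} sketch_server_module_t;

/* Library-side dispatch: a NULL entry is answered with "not supported"
 * on the host's behalf, so the requester sees the same result as if the
 * host had returned that status itself. */
int dispatch_storage_request(const sketch_server_module_t *module,
                             const char *directive)
{
    assert(module != NULL);
    if (module->storage_control == NULL) {
        return SKETCH_ERR_NOT_SUPPORTED;
    }
    return module->storage_control(directive);
}

/* A host that provides the entry point but declines every request. */
int decline_storage_control(const char *directive)
{
    (void)directive;
    return SKETCH_ERR_NOT_SUPPORTED;
}

/* A host that accepts every request. */
int accept_storage_control(const char *directive)
{
    (void)directive;
    return SKETCH_SUCCESS;
}
```

From the requester's point of view, a module with a NULL `storage_control` entry and a module using `decline_storage_control` are indistinguishable, which is the behavior the paragraph above describes.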