OpenPMIx

Reference Implementation of the Process Management Interface Exascale (PMIx) standard

RFC0012

Title

Add APIs and internal support for RM-network library interactions

Abstract

Add a network support framework and appropriate APIs so that the RM can:

* precondition an application (e.g., adding a security token to the
  app's environment) prior to launch

* set up the local network driver to support an application (e.g., for
  "instant on" address resolution) prior to spawning the local processes

* pass directives in the environment of client processes prior to forking

* clean up after each child process terminates

* clean up after all local children for a given application have
  terminated

Labels

[EXTENSION][SERVER-API]

Action

[APPROVED]

Copyright 2016 Intel, Inc. All rights reserved.

This document is subject to all provisions relating to code contributions to the PMIx community as defined in the community’s LICENSE file. Code Components extracted from this document must include the License text as described in that file.

Description

Part of the “Instant On” initiative relies on establishing a partnership between the resource manager (RM) and the networking library that allows the combination to fully set up the messaging environment before an application’s processes are spawned. Completing this procedure enables applications to communicate without discovering and exchanging network endpoint information.

There are two new APIs required to enable this support:

  1. Preconditioning the application for network operations. This typically involves obtaining a security token that each application process must use when communicating with another process in the same job. Some network libraries generate this token algorithmically, while others may need to obtain a token from a central server. Precondition values are typically passed to application processes as environment variables that the network library recognizes when the process initializes it. Thus, the PMIx_server_setup_application API takes the application’s nspace (plus whatever pmix_info_t directives the RM chooses to provide), and returns an array of pmix_info_t structures. The return is done via a callback function so that the API will not block should the library need to obtain the token from a remote server.

The structures may consist of any combination of key-value pairs, and the RM shall:

Note: this API is not network-specific. Thus, as other precondition data is identified in the future, internal support can be extended to ensure all precondition data is included without changing this API.
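The callback-based return described above can be illustrated with a small mock. This is only a sketch under stated assumptions: kv_t is a hypothetical stand-in for pmix_info_t, and mock_setup_application, mock_progress, on_setup_done, and the MOCK_NET_TOKEN key are invented for illustration. None of them are PMIx functions or attributes, and the signatures here are not the ones defined by the PMIx library. The point is simply that the call returns immediately and the key-value array arrives later through the callback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for pmix_info_t: one key-value pair. */
typedef struct {
    char key[32];
    char value[64];
} kv_t;

/* Callback type: receives a status plus the array of returned pairs. */
typedef void (*setup_app_cbfunc_t)(int status, const kv_t info[], size_t ninfo,
                                   void *cbdata);

/* One pending request; this sketch tracks only a single outstanding call. */
typedef struct {
    int active;
    char nspace[32];
    setup_app_cbfunc_t cbfunc;
    void *cbdata;
} pending_req_t;

static pending_req_t pending;

/* Mock of the non-blocking call: record the request and return at once. */
int mock_setup_application(const char *nspace, setup_app_cbfunc_t cbfunc,
                           void *cbdata)
{
    assert(nspace != NULL && cbfunc != NULL);
    if (pending.active) {
        return -1; /* only one outstanding request in this sketch */
    }
    pending.active = 1;
    strncpy(pending.nspace, nspace, sizeof(pending.nspace) - 1);
    pending.nspace[sizeof(pending.nspace) - 1] = '\0';
    pending.cbfunc = cbfunc;
    pending.cbdata = cbdata;
    return 0;
}

/* Mock progress step: stands in for the token arriving from a remote server. */
void mock_progress(void)
{
    kv_t result[1];
    if (!pending.active) {
        return;
    }
    strcpy(result[0].key, "MOCK_NET_TOKEN"); /* hypothetical key name */
    strcpy(result[0].value, "token-for-");
    strncat(result[0].value, pending.nspace,
            sizeof(result[0].value) - strlen(result[0].value) - 1);
    pending.active = 0;
    pending.cbfunc(0, result, 1, pending.cbdata);
}

/* Caller-side state filled in by the callback. */
typedef struct {
    int done;
    int status;
    size_t ninfo;
    kv_t first;
} setup_result_t;

/* Example callback: copy out what was returned and mark completion. */
void on_setup_done(int status, const kv_t info[], size_t ninfo, void *cbdata)
{
    setup_result_t *res = (setup_result_t *)cbdata;
    res->status = status;
    res->ninfo = ninfo;
    if (ninfo > 0) {
        res->first = info[0];
    }
    res->done = 1;
}
```

In this mock the caller would invoke mock_setup_application, continue with other work, and only read the returned pairs once on_setup_done has run.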

The expected flow of operation is that the workload manager will call PMIx_server_setup_application from its head node (or system management node) once for each job to be launched. The returned information will then be included in the launch message containing the job description sent to each compute node. The compute node PMIx server will subsequently include the information in its call to PMIx_server_register_nspace so that the local client processes will receive it.
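The flow above (obtain the data once on the head node, carry it in the launch message, register it on each compute node) might be modeled roughly as follows. All types and functions here (kv_t, launch_msg_t, node_nspace_t, build_launch_msg, node_register_nspace, node_lookup) are hypothetical illustrations and not part of PMIx. The sketch does not call any PMIx API; it only shows how the same setup data ends up on every node.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_INFO 4

/* Hypothetical stand-in for pmix_info_t. */
typedef struct {
    char key[32];
    char value[64];
} kv_t;

/* Hypothetical launch message: job identity plus the setup data. */
typedef struct {
    char nspace[32];
    kv_t info[MAX_INFO];
    size_t ninfo;
} launch_msg_t;

/* Hypothetical per-node record of what was registered for an nspace. */
typedef struct {
    char nspace[32];
    kv_t info[MAX_INFO];
    size_t ninfo;
} node_nspace_t;

/* Head node: attach the data returned by application setup to the message. */
int build_launch_msg(launch_msg_t *msg, const char *nspace,
                     const kv_t info[], size_t ninfo)
{
    assert(msg != NULL && nspace != NULL);
    if (ninfo > MAX_INFO || strlen(nspace) >= sizeof(msg->nspace)) {
        return -1;
    }
    memset(msg, 0, sizeof(*msg));
    strcpy(msg->nspace, nspace);
    for (size_t i = 0; i < ninfo; i++) {
        msg->info[i] = info[i];
    }
    msg->ninfo = ninfo;
    return 0;
}

/* Compute node: stands in for passing the data along at nspace registration. */
void node_register_nspace(node_nspace_t *node, const launch_msg_t *msg)
{
    assert(node != NULL && msg != NULL);
    memset(node, 0, sizeof(*node));
    strcpy(node->nspace, msg->nspace);
    for (size_t i = 0; i < msg->ninfo; i++) {
        node->info[i] = msg->info[i];
    }
    node->ninfo = msg->ninfo;
}

/* Local client view: look up a value by key, or NULL if it is absent. */
const char *node_lookup(const node_nspace_t *node, const char *key)
{
    for (size_t i = 0; i < node->ninfo; i++) {
        if (strcmp(node->info[i].key, key) == 0) {
            return node->info[i].value;
        }
    }
    return NULL;
}
```

Because the data is produced once and then copied, every node in this model hands its local clients the same values without contacting the head node again.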

  2. Preparing the local network driver to support an application’s processes that are being spawned on that node. The “Instant On” initiative requires that each process know how to communicate with every other process in the application at startup. One method of accomplishing this is to “preload” the local network library with the location of all processes in that application, thus allowing the library to compute the required address information for any process. The PMIx_server_setup_local_support API is called by the RM prior to fork/exec of any local process from the given application. This is defined as a non-blocking call to allow for operations that might not immediately complete. The RM is not allowed to fork/exec any local process from the specified application until the provided callback function has been executed.

Note: some components executed by the PMIx_server_setup_local_support call may require elevated (e.g., root) privileges.

Note: this API is not network-specific. Thus, as other setup operations are identified in the future, internal support can be extended to ensure all setup is accomplished without changing this API.
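The ordering rule for local setup (no fork/exec until the callback has executed) can be expressed as a simple state gate on the RM side. The following is a minimal sketch, assuming a hypothetical local_app_t record and hypothetical helper names (begin_local_setup, local_setup_complete, try_spawn_local_proc). It does not invoke PMIx and does not actually fork; treating a failed setup as a reason to refuse the spawn is also an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states an RM might track per application on a node. */
typedef enum {
    SETUP_IDLE,    /* setup not yet requested */
    SETUP_PENDING, /* non-blocking setup call issued, callback not yet run */
    SETUP_READY,   /* callback ran and reported success */
    SETUP_FAILED   /* callback ran and reported an error */
} setup_state_t;

typedef struct {
    setup_state_t state;
    int nspawned;
} local_app_t;

/* Stands in for issuing the non-blocking local-support call. */
void begin_local_setup(local_app_t *app)
{
    assert(app != NULL);
    app->state = SETUP_PENDING;
}

/* Completion callback: a status plus the caller's own data pointer. */
void local_setup_complete(int status, void *cbdata)
{
    local_app_t *app = (local_app_t *)cbdata;
    app->state = (status == 0) ? SETUP_READY : SETUP_FAILED;
}

/* Returns 0 if a spawn is permitted, -1 if setup has not completed. */
int try_spawn_local_proc(local_app_t *app)
{
    assert(app != NULL);
    if (app->state != SETUP_READY) {
        return -1; /* callback has not run, or setup failed */
    }
    app->nspawned++; /* a real RM would fork/exec the process here */
    return 0;
}
```

An RM structured this way would queue spawn requests while the state is SETUP_PENDING and release them from the completion callback.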

Several other operations are also required by this RFC, but are not done as part of exposed APIs – i.e., they are simple additions to internal procedures. These include:

Prototype Implementation

The PMIx library implementation is covered in the Add network support APIs pull request.

Author(s)

Ralph H. Castain
Intel, Inc.
Github: rhc54