MRNet is a customizable, high-throughput communication software system for parallel tools and applications with a master/slave architecture. MRNet reduces the cost of such tools' activities by incorporating a tree-based overlay network (TBON) of processes between the tool's front-end and back-ends. MRNet uses the TBON to distribute many important tool communication and computation activities, reducing analysis time and keeping tool front-end loads manageable.
MRNet-based tools send data between front-end and back-ends on logical flows of data called streams. MRNet internal processes use filters to synchronize and aggregate data sent to the tool's front-end. Using filters to manipulate data in parallel as it passes through the network, MRNet can efficiently compute averages, sums, and other more complex aggregations on back-end data.
The MRNet distribution has two main components: libmrnet, a library that is linked into a tool's front-end and back-end components, and mrnet_commnode, a program that runs on intermediate nodes interposed between the application front-end and back-ends. libmrnet exports an API (see C++ API Reference) that enables I/O interaction between the front-end and groups of back-ends via MRNet. The primary purpose of mrnet_commnode is to distribute data processing functionality across multiple hosts and to implement efficient and scalable group communications. In addition, there is another component, libmrnet_lightweight, which exports an API (see C API Reference) that enables I/O interaction between the front-end and groups of "lightweight" back-ends via MRNet. Lightweight back-ends provide a pure C implementation of the MRNet API. They do not support loading custom filters, and by default the API cannot be used by multiple threads concurrently; a separately built component, libmrnet_lightweight_r, is thread-safe. The following sub-sections describe the lower-level components of the MRNet API in more detail.
An MRNet end-point represents a tool or application process. Specifically, end-points represent the back-end (i.e., leaf) processes of the overlay tree. The front-end can communicate in a unicast or multicast fashion with these end-points, as described below.
MRNet uses communicators to represent groups of end-points. Like communicators in MPI, MRNet communicators provide a handle that identifies a set of end-points for point-to-point, multicast or broadcast communications. MPI applications typically have a non-hierarchical layout of potentially identical processes. In contrast, MRNet enforces a tree-like layout of all processes, rooted at the front-end. Accordingly, MRNet communicators are created and managed by the front-end, and communication is only allowed between a front-end and its back-ends. Thus, back-ends cannot interact with each other directly using the MRNet API.
A stream is a logical channel that connects the front-end to the end-points of a communicator. All MRNet communication uses the stream abstraction. Streams carry data packets downstream, from the front-end toward the back-ends, and upstream, from the back-ends toward the front-end. Streams are expected to carry data of a specific type, allowing data aggregation operations to be associated with a stream. The type is specified using a format string (see Format Strings) similar to those used in C formatted I/O primitives (e.g., a packet whose data is described by the format string " %d %d %f %s " contains two integers followed by a float then a character string). MRNet expands the standard format string specification to allow for description of arrays.
Data aggregation is the process of merging multiple input data packets and transforming them into one or more output packets. Though it is not necessary for the aggregation to result in less or even different data, aggregations that reduce or modify data values are most common. MRNet uses data filters to aggregate data packets. Filters specify an operation to perform and the type of the data expected on the bound stream. Filter instances are bound to a stream at stream creation. MRNet uses two types of filters: synchronization filters and transformation filters. Synchronization filters organize data packets from downstream nodes into synchronized waves of data packets, while transformation filters operate on the synchronized data packets yielding one or more output packets. A distinction between synchronization and transformation filters is that synchronization filters are independent of the packet data type, but transformation filters operate on packets of a specific type.
Synchronization filters operate on data flowing upstream in the network, receiving packets one at a time and outputting packets only when the specified synchronization criterion has been met. Synchronization filters provide a mechanism to deal with the asynchronous arrival of packets from child nodes. The synchronizer collects packets and typically aligns them into waves, passing an entire wave onward at the same time. Therefore, synchronization filters do no data transformation and can operate on packets in a type-independent fashion. MRNet currently supports three synchronization modes:
Transformation filters can be used on both upstream and downstream data flows. Transformation filters input a group of synchronized packets, and combine data from multiple packets by performing an aggregation that yields one or more new data packets. Data packets produced by a transformation filter can be forwarded in either direction on a Stream by placing them in the appropriate output set. Since transformation filters are expected to perform computational operations on data packets, there is a type requirement for the data packets to be passed to this type of filter: the data format string of the stream's packets and the filter must be the same. Transformation operations must be synchronous, but are able to maintain state from one execution to the next. MRNet provides several transformation filters that should be of general use:
See Adding New Filters for a description of the facilities for adding new user-defined transformation and synchronization filters.
A complete description of the MRNet API is in C++ API Reference and C API Reference. This section offers a brief overview only. Using libmrnet, a tool can leverage a system of internal processes, instances of the mrnet_commnode program, as a communication substrate. After instantiation of the MRNet network (discussed in MRNet Instantiation), the front-end and back-end processes are connected by the internal processes. The connection topology and host assignment of these processes are determined by a configuration file, thus the geometry of MRNet's process tree can be customized to suit the physical topology of the underlying hardware resources. While MRNet can generate a variety of standard topologies, users can easily specify their own topologies; see Process-Tree Topologies for further discussion.
The MRNet API contains Network, EndPoint, Communicator, and Stream objects that a tool's front-end and back-end use for communication. The Network object is used to instantiate the MRNet network and access EndPoint objects that represent available tool back-ends. The Communicator object is a container for groups of end-points, and Stream objects are used to send data to the EndPoints in a Communicator.
A simplified version of code from an example tool front-end is shown in MRNet Front-End Sample Code. In the front-end code, after some variable definitions in lines 2-6, an instance of the MRNet network is created on line 9 using the topology specified in topology_file. In line 10, the newly created Network object is queried for an auto-generated broadcast communicator that contains all available end-points. In line 11, this Communicator is used to establish a Stream that will use a built-in filter to compute the sum of the data sent upstream. The front-end then sends one or more initialization messages to the back-ends; in our example code on line 12, we broadcast an integer initializer on the new stream. The tag parameter is an application-specific value denoting the nature of the message being transmitted. After the send operation, the front-end performs a blocking stream receive at line 13. This call returns a tag and a packet. Finally, line 14 calls unpack to deserialize the floating point value contained in packet.
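The sequence described above can be condensed into a minimal front-end sketch. This is illustrative only, not the exact referenced listing: the use of FirstApplicationTag as the tag value, the command-line handling, and the omission of error checking are assumptions.

```cpp
#include "mrnet/MRNet.h"
using namespace MRN;

int main( int argc, char** argv )
{
    const char* topology_file = argv[1];   // process-tree topology
    const char* backend_exe   = argv[2];   // back-end executable
    const char* be_argv[]     = { NULL };  // no extra back-end arguments
    int tag;
    PacketPtr packet;
    float sum = 0.0;

    // Instantiate the MRNet process tree from the topology file.
    Network* net = Network::CreateNetworkFE( topology_file, backend_exe, be_argv );

    // Auto-generated broadcast communicator containing all end-points.
    Communicator* comm = net->get_BroadcastCommunicator();

    // Stream whose upstream data passes through the built-in summation filter.
    Stream* stream = net->new_Stream( comm, TFILTER_SUM );

    // Broadcast an integer initializer; the tag is application-specific
    // and must be >= FirstApplicationTag.
    stream->send( FirstApplicationTag, "%d", 1 );
    stream->flush();

    // Blocking receive of the aggregated result, then unpack the float.
    stream->recv( &tag, packet );
    packet->unpack( "%f", &sum );

    delete net;
    return 0;
}
```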
MRNet Back-End Sample Code shows the code for the back-end that reciprocates the actions of the front-end. Each tool back-end first connects to the MRNet network in line 5, using the back-end version of the Network constructor, which receives its arguments via the program argument vector ( argc/argv ). While the front-end makes a stream-specific receive call, the back-ends use a stream-anonymous network receive that returns the tag sent by the front-end, the packet containing the actual data sent, and a Stream object representing the stream that the front-end has established. Finally, each back-end sends a scalar floating point value upstream toward the front-end.
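A matching back-end sketch follows. As above, this is illustrative rather than the exact referenced listing; the float value sent and the omission of error checking are assumptions.

```cpp
#include "mrnet/MRNet.h"
using namespace MRN;

int main( int argc, char** argv )
{
    int tag;
    PacketPtr packet;
    Stream* stream;

    // Attach using the arguments MRNet appended to the argument vector.
    Network* net = Network::CreateNetworkBE( argc, argv );

    // Stream-anonymous receive: returns the front-end's tag, the data
    // packet, and the Stream object the front-end established.
    net->recv( &tag, packet, &stream );

    int init_val;
    packet->unpack( "%d", &init_val );

    // Send a scalar float upstream; the stream's filter sums it with
    // the other back-ends' values on the way to the front-end.
    stream->send( tag, "%f", 17.0f );
    stream->flush();

    delete net;
    return 0;
}
```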
A complete example of MRNet code can be found below in A Complete Example: Integer Addition.
While conceptually simple, creating and connecting the internal processes is complicated by interactions with the various job scheduling systems. In the simplest environments, MRNet can launch processes directly using facilities like rsh or ssh . In more complex environments, it is necessary to submit all requests to a job management system. In this case, MRNet is constrained by the operations provided by the job manager (and these vary from system to system). Currently, two modes of instantiating MRNet-based tools are supported.
In the first mode of process instantiation, MRNet creates the internal and back-end processes, using the specified MRNet topology configuration to determine the hosts on which the components should be located. First, the front-end consults the configuration and uses a remote shell program to create internal processes for the first level of the communication tree on the appropriate hosts. Upon instantiation, each newly created process establishes a network connection to the process that created it. The first activity on this connection is a message from parent to child containing the portion of the configuration relevant to that child. The child then uses this information to begin instantiation of the sub-tree rooted at that child. When a sub-tree has been established, the root of that sub-tree sends a report to its parent containing the end-points accessible via that sub-tree. Each internal node establishes its child processes and their respective connections sequentially. However, since the various processes are expected to run on different compute nodes, sub-trees in different branches of the network are created concurrently, maximizing the efficiency of network instantiation.
In the second mode of process instantiation, MRNet relies on a process management system to create some of the MRNet processes. This mode accommodates tools that require their back-ends to create, monitor, and control other processes. For example, IBM's POE uses environment variables to pass information, such as the process' rank within the application's global MPI communicator, to the MPI run-time library in each application process. In cases like this, MRNet cannot provide back-end processes with the environment necessary to start MPI application processes. As a result, MRNet creates its internal processes recursively as in the first instantiation mode, but does not instantiate any back-end processes. MRNet then waits for the tool back-ends to be started by the process management system to ensure they have the environment needed to create application processes successfully. To allow back-ends to connect to the MRNet network, information such as process host names and connection port numbers must be provided to the back-ends. This information can be provided via the environment, using shared filesystems or other information services as available on the target system. To collect the necessary information, the front-end can use the MRNet API methods for discovering the network topology details. This mode of process instantiation is referred to as "back-end attach mode". We show how to construct a tool that requires back-end attach in $MRNET_ROOT/Examples/NoBackEndInstantiation .
Standard MRNet relies on the back-end nodes supporting C++ libraries. However, we have also created a lightweight back-end library with a pure C interface. The instantiation process is the same and both modes of process instantiation are supported, although the API is slightly different.
All classes are included in the MRN namespace. For this discussion, we do not explicitly include reference to the namespace; for example, when we reference the class Network , we are implying the class MRN::Network .
In MRNet, there are five top-level classes: Network , NetworkTopology , Communicator , Stream , and Packet . The Network class primarily contains methods for instantiating and destroying MRNet process trees. The NetworkTopology class represents the interface for discovering details about the topology of an instantiated network. Application back-ends are referred to as end-points, and the Communicator class is used to reference a group of end-points. A communicator is used to establish a Stream for unicast, multicast, or broadcast communications via the MRNet infrastructure. The Packet class encapsulates the data packets that are sent on a stream. The public members of these classes are detailed below.
The corresponding lightweight backend API class is Class Network.
backend_exe is the path to the executable to be used for the application's back-end processes. backend_argv is a null-terminated list of arguments to pass to the back-end application upon creation (NOTE: backend_argv should not contain the executable). If backend_exe is NULL, no back-end processes will be started, and the leaves of the topology specified by topology will be instances of mrnet_commnode .
attrs is a pointer to a map of attribute-value string pairs. attrs allows front-ends to programmatically set the values to use for the MRNet and XPlat environment variables (see Environment Variables and Network Attributes); MRNet will only query the environment for settings not given via attrs. On Cray XT, when communication or back-end processes of the MRNet tree are to be co-located with application processes, attrs must contain a string pair that provides either the "CRAY_ALPS_APID" or "CRAY_ALPS_APRUN_PID" attribute value (see Environment Variables and Network Attributes for a description of these attributes).
When this function completes without error, all MRNet processes specified in the topology will have been instantiated. You may use Network::has_error to check for successful completion. The explicit use of the Network constructor is now deprecated.
The back-end constructor method that is used when the process is started due to a front-end network instantiation. MRNet automatically passes the necessary information to the back-end process using the program argument vector ( argc/argv ) by inserting it after the user-specified arguments. The explicit use of the Network constructor is now deprecated.
In the "back-end attach" mode of network instantiation, where the back-end is not launched directly by MRNet, the back-end program must construct a suitable argument vector. Typically, the front-end program will obtain information about the leaf mrnet_commnode processes using the NetworkTopology class, and pass this information to back-ends using external communication channels (e.g., a shared file system). The back-ends choose a leaf process as a parent, and use that parent's host, port, and rank information to attach. Each back-end must choose a unique value for its local rank; this value must be larger than any of the ranks of the processes in the existing network. The following code shows how to construct a valid argument vector:
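A self-contained sketch of assembling that information follows. The argument order used here (parent host, parent port, parent rank, local host, local rank) is an assumption for illustration; the authoritative convention is shown in $MRNET_ROOT/Examples/NoBackEndInstantiation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Build the extra arguments an attaching back-end appends to its
// argument vector before calling the back-end Network constructor.
// The ordering below is illustrative; consult the
// NoBackEndInstantiation example for the exact layout.
std::vector<std::string> build_attach_args( const std::string& parent_host,
                                            int parent_port,
                                            int parent_rank,
                                            const std::string& my_host,
                                            int my_rank )
{
    std::vector<std::string> args;
    args.push_back( parent_host );                   // leaf mrnet_commnode host
    args.push_back( std::to_string(parent_port) );   // its listening port
    args.push_back( std::to_string(parent_rank) );   // its rank
    args.push_back( my_host );                       // this back-end's host
    args.push_back( std::to_string(my_rank) );       // unique local rank, larger
                                                     // than any rank in the network
    return args;
}
```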
Network::set_FailureRecovery is used by a front-end to control whether internal communication processes and back-ends will automatically re-connect to a new parent when their parent terminates unexpectedly. By default, failure recovery is enabled and processes will re-connect. Call this method with enable set to false to turn off automatic failure recovery. This method returns true if the setting has been applied successfully, false otherwise.
Network::print_error prints a message to stderr describing the last error encountered during a Network method. It first prints the null-terminated string error_msg followed by a colon, then the actual error message followed by a newline.
This method loads a new filter operation for use in the Network, and is conveniently similar to the conventional dlopen facilities for opening a shared object and dynamically loading symbols defined within.
so_file is the path to a shared object file that contains the filter function to be loaded and functions is a vector of function names to be loaded. filter_ids is an output vector of filter ids, where the id for the function at index i in functions will be stored at index i in filter_ids .
Network::load_FilterFuncs returns the number of filter functions that were successfully loaded, and populates the filter_ids vector with the ids of the newly loaded filters (or -1 if a function could not be loaded).
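A hedged usage sketch, assuming an existing Network* net; the shared-object path and function names are hypothetical:

```cpp
// Load two user-defined filter functions from a shared object.
// "./my_filters.so" and the function names are hypothetical.
std::vector< std::string > functions;
functions.push_back( "IntMaxFilter" );
functions.push_back( "IntMinFilter" );
std::vector< int > filter_ids;

int nloaded = net->load_FilterFuncs( "./my_filters.so", functions, filter_ids );
if( nloaded != (int)functions.size() ) {
    // entries equal to -1 in filter_ids mark functions that failed to load
}
```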
otag will be filled in with the integer tag value that was passed by the corresponding Stream::send operation. packet is the packet that was received. A pointer to the stream to which the packet was addressed will be returned in stream .
A return value of -1 indicates that the Network has experienced a terminal failure, and further attempts to send or receive data on the Network will fail. A return value of 0 indicates no packets were available for a non-blocking receive, or a stream has been closed for a blocking receive. The return value 1 indicates a packet has been received successfully.
Network::send is used to singlecast a packet from the front-end to a specific back-end. be is the rank of the back-end process. tag is an integer that identifies the data in the packet. format_string is a format string describing the data in the packet (see Format Strings for a full description).
A return value of -1 indicates that the Network has experienced a terminal failure, and further attempts to send or receive data on the Network will fail. The return value 0 indicates a packet has been sent successfully.
NOTE: tag must have a value greater than or equal to the constant FirstApplicationTag defined by MRNet ( #include "mrnet/Types.h" ). Tag values less than FirstApplicationTag are reserved for internal MRNet use.
Network::enable_PerformanceData uses Stream::enable_PerformanceData to start the recording of performance data of the specified metric type for the given context on all streams. Returns true on success, false otherwise. The supported metric and context types are described in MRNet Stream Performance Data; see Stream::enable_PerformanceData for additional details.
Network::disable_PerformanceData stops the recording of performance data of the specified metric type for the given context on all streams. Returns true on success, false otherwise. See Stream::disable_PerformanceData for additional details.
Network::collect_PerformanceData collects the performance data of the specified metric type for the given context on all streams. The performance data of each stream is passed through the transformation filter identified by aggr_filter_id . The data for all streams is stored within the map results , keyed by stream identifier. Returns true on success, false otherwise. See Stream::collect_PerformanceData for additional details.
Network::print_PerformanceData uses Stream::print_PerformanceData to print recorded performance data of the specified metric type for the given context on all streams. Data is printed to the MRNet log files. See Stream::print_PerformanceData for additional details.
This method returns a pointer to the next pending Event, or NULL if no events are available. Each event has an associated EventClass, one of Event::DATA_EVENT , Event::TOPOLOGY_EVENT , or Event::ERROR_EVENT , that can be queried using Event::get_Class. Similarly, each event has an associated EventType that can be queried using Event::get_Type.
etyp should be set to either Event::EVENT_TYPE_ALL , to have the function called when any event within the specified EventClass occurs, or one of the valid class-specific EventType values (see the classes DataEvent, TopologyEvent, and ErrorEvent in "mrnet/Event.h" for the class-specific types).
The type evt_cb_func is defined as void (*evt_cb_func)( Event* e, void* cb_data ). All user-defined callback functions must use this function prototype. When an event occurs, all callbacks registered for that type of event will be called. Each function is passed a pointer to the Event and the value of the auxiliary data pointer cb_func_data given at registration, which may be NULL.
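A sketch of defining and registering such a callback, assuming an existing Network* net; the callback name and its body are illustrative:

```cpp
// An illustrative callback matching the evt_cb_func prototype.
void topology_cb( MRN::Event* evt, void* cb_data )
{
    // e.g., inspect evt->get_Type() and update tool state;
    // cb_data is the pointer supplied at registration (may be NULL)
}

void setup_event_callbacks( MRN::Network* net )
{
    // Register for every kind of topology event.
    net->register_EventCallback( MRN::Event::TOPOLOGY_EVENT,
                                 MRN::Event::EVENT_TYPE_ALL,
                                 topology_cb, NULL );
}
```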
This method removes cb_func from the list of functions to be called for the specified EventClass and EventType. If eclass is given as Event::EVENT_CLASS_ALL , the function will be removed for all events. etyp can be given as Event::EVENT_TYPE_ALL to remove the function for all types of events in the given eclass .
This method removes all functions to be called for the specified EventClass and EventType. If eclass is given as Event::EVENT_CLASS_ALL , all callback functions will be removed for all events. etyp can be given as Event::EVENT_TYPE_ALL to remove all functions registered for all types of events in the given eclass .
eclass should be set to one of Event::DATA_EVENT , Event::TOPOLOGY_EVENT , or Event::ERROR_EVENT . Event::DATA_EVENT can be used by both front-end and back-end processes to provide notification that one or more data packets have been received. Event::TOPOLOGY_EVENT and Event::ERROR_EVENT can only be used by front-end processes, and provide notification when the front-end observes a change in network topology or an error, respectively.
When the file descriptor has data available (for reading), you should call Network::clear_EventNotificationFd before taking action on the notification. When notifications are no longer needed, use Network::close_EventNotificationFd .
This method resets the event notification file descriptor returned from Network::get_EventNotificationFd . eclass should be set to one of Event::DATA_EVENT , Event::TOPOLOGY_EVENT , or Event::ERROR_EVENT .
This method closes the event notification file descriptor returned from Network::get_EventNotificationFd . eclass should be set to one of Event::DATA_EVENT , Event::TOPOLOGY_EVENT , or Event::ERROR_EVENT .
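The notification descriptor can be used with standard I/O multiplexing calls. A minimal sketch using select(2), with error handling elided:

```cpp
#include <sys/select.h>
#include "mrnet/MRNet.h"

// Block until MRNet signals a topology event, then drain the
// notification before handling it.
void wait_for_topology_event( MRN::Network* net )
{
    int fd = net->get_EventNotificationFd( MRN::Event::TOPOLOGY_EVENT );

    fd_set rfds;
    FD_ZERO( &rfds );
    FD_SET( fd, &rfds );

    if( select( fd + 1, &rfds, NULL, NULL, NULL ) > 0 ) {
        // Clear the notification before acting on it.
        net->clear_EventNotificationFd( MRN::Event::TOPOLOGY_EVENT );
        // ... retrieve and handle pending events (e.g., via next_Event) ...
    }
}
```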
The corresponding lightweight backend API class is Class NetworkTopology.
This method fills the leaves vector with pointers to the leaf nodes in the topology. In the case where back-end processes are not started when the network is instantiated, a front-end process can use this function to retrieve information about the leaf internal processes to which the back-ends should attach.
This method fills a set with pointers to all back-end process tree nodes. Note that this method is unsafe to use while the network topology is in flux, as is the case during the "back-end attach" mode of MRNet tree instantiation.
num_nodes is the total number of tree nodes (same as the value returned by NetworkTopology::get_NumNodes ), depth is the depth of the tree (i.e., the maximum path length from root to any leaf), min_fanout is the minimum number of children of any parent node, max_fanout is the maximum number of children of any parent node, avg_fanout is the average number of children across all parent nodes, and stddev_fanout is the standard deviation in number of children across all parent nodes.
Multiple calls to this method return the same pointer to the Communicator object created at network instantiation. If the network topology changes, as can occur when starting back-ends separately, the object will be updated to reflect the additions or deletions. This object should not be deleted.
If the set of end-points in the communicator already contains the new end-point, the function returns success. This method fails if no end-point with rank ep_rank exists. It returns true on success, false on failure.
This method is similar to add_EndPoint above except that it takes a pointer to a CommunicationNode object instead of a rank. Success and failure conditions are exactly as stated above. This method returns true on success and false on failure.
Instances of Stream are network specific, so their creation methods are functions of an instantiated Network object. The corresponding lightweight backend API class is Class Stream.
MRNet provides two types of streams, homogeneous and heterogeneous. Homogeneous streams use the same filters at every process participating in the stream, while heterogeneous streams allow different filters to be used at different processes.
This version of Network::new_Stream is used to create a heterogeneous Stream object. Users specify where packet filters are placed within the tree. Like the homogeneous version of Network::new_Stream , the end-points are specified by the comm argument.
Strings are used to specify the filter placements, with the following syntax: "filter_id => rank; [filter_id => rank; ...]". If "*" is specified as the rank for an assignment, the filter will be assigned to all ranks that have not already been assigned. If a rank within comm is not assigned a filter, it will use the default filter. See $MRNET_ROOT/Examples/HeterogeneousFilters for an example of using Network::new_Stream to specify different filter types to be used within the same stream.
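Placement strings can also be assembled programmatically. The sketch below builds one in the syntax given above; the filter ids and ranks are illustrative (real ids come from the TFILTER_* constants or from Network::load_FilterFuncs).

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Compose a filter-placement string of the form
// "filter_id => rank; filter_id => rank; ". Ranks are passed as
// strings so that the wildcard "*" can be used.
std::string make_placement( const std::vector< std::pair<int, std::string> >& assigns )
{
    std::ostringstream spec;
    for( size_t i = 0; i < assigns.size(); i++ )
        spec << assigns[i].first << " => " << assigns[i].second << "; ";
    return spec.str();
}
```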
Returns a pointer to the Stream identified by id , or NULL on failure. Back-ends may pass their local rank as the id to retrieve a singlecast stream that can be used for non-filtered communication directly with the front-end.
When used by back-ends, this method returns true if the front-end has closed this Stream by deleting the corresponding object, false otherwise. On the front-end, this method can be used to determine if the stream has been disabled due to a non-recoverable failure (e.g., a back-end process has died or a sub-tree containing stream end-points has become unreachable).
format_string is a format string describing the data in the packet (see Format Strings for a full description).
NOTE: tag must have a value greater than or equal to the constant FirstApplicationTag defined by MRNet ( #include "mrnet/Types.h" ). Tag values less than FirstApplicationTag are reserved for internal MRNet use.
Commits a flush of all packets currently buffered by this Stream. A successful return value of 0 indicates that all buffered packets have been passed to the operating system for network transmission. A return value of -1 indicates that the stream has experienced a terminal failure, and further attempts to send or receive data on the stream will fail.
A return value of -1 indicates that the stream has experienced a terminal failure, and further attempts to send or receive data on the stream will fail. A return value of 0 indicates no packets were available for a non-blocking receive, or the stream has been closed. The return value 1 indicates a packet has been received successfully.
When the file descriptor has data available (for reading), you should call Stream::clear_DataNotificationFd before taking action on the notification. When notifications are no longer needed, use Stream::close_DataNotificationFd .
Stream::set_FilterParameters allows users to dynamically configure the operation of a stream transformation filter by passing arbitrary data in a similar fashion to Stream::send . When the filter executes, the passed data is available as a PacketPtr parameter to the filter, and the filter can extract the configuration settings.
ftype should be given as FILTER_UPSTREAM_SYNC to configure the synchronization filter, FILTER_UPSTREAM_TRANS for the upstream transformation filter, or FILTER_DOWNSTREAM_TRANS for the downstream transformation filter.
Stream::enable_PerformanceData starts recording performance data for the specified metric type for the given context . Returns true on success, false otherwise. The metric and context types are described in MRNet Stream Performance Data.
Stream::disable_PerformanceData stops recording performance data for the specified metric type for the given context . Previously recorded data is not discarded, so that it can be retrieved with Stream::collect_PerformanceData . Users can enable/disable recording for a particular metric and context any number of times before collecting the results. Returns true on success, false otherwise.
Stream::collect_PerformanceData collects the recorded performance data for the specified metric type for the given context . The collected data is returned in a rank_perfdata_map , which associates individual node ranks to a std::vector< perf_data_t > containing the recorded data instances. After collection, the recorded data at each node is discarded. Returns true on success, false otherwise.
Users can aggregate the recorded data across nodes by specifying a transformation filter with aggr_filter_id . Note that only the built-in filter types of TFILTER_SUM, TFILTER_MIN, TFILTER_MAX, TFILTER_AVG, and TFILTER_ARRAY_CONCAT are supported. By default, performance data from each node is concatenated, and results contains every recorded data instance for each node. If the summary aggregation filters are used, results will contain a single associated pair. The rank for this pair is equal to -1 × (number of aggregated ranks), and the vector contains one or more aggregated instances.
A Packet encapsulates a set of formatted data elements sent on a stream. Packets are created using a format string (e.g., " %s %d " describes a null-terminated string followed by a 32-bit integer, and the packet is said to contain two data elements). MRNet front-end and back-end processes typically do not create Packet instances, as they are automatically produced from the formatted data passed to Stream::send or Network::send . Each Packet is allocated using malloc of the C standard library, and therefore has the same byte alignment guarantees. See Format Strings contains the full listing of data types that can be sent in a Packet.
When receiving a packet via Stream::recv , Network::recv , or in a filter's input vector, the Packet instance is stored within a PacketPtr object. PacketPtr is a class based on the Boost shared_ptr class, and helps with memory management of packets. A PacketPtr can be assumed to be equivalent to "Packet *", and all operations on packets require use of PacketPtr. Packets can be created explicitly using the constructors shown below; the resulting Packet is then used to instantiate a PacketPtr.
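For example, a Packet wrapped in a PacketPtr might be created as follows; the stream id, tag, and data values are illustrative, and the array conversion shown is described in Format Strings:

```cpp
// Explicit Packet creation wrapped in a PacketPtr; stream_id, tag,
// and the data values are illustrative.
int iarray[] = { 1, 2, 3 };
PacketPtr pkt( new Packet( stream_id, tag,
                           "%d %s %ad",       // int, string, int array
                           42, "hello",
                           iarray, 3 ) );     // arrays also pass a length
```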
The corresponding lightweight backend API class is Class Packet.
Constructs a Packet that can be sent on the stream with the given stream_id. The packet is associated with a tag that can be used by receivers to identify the contents. format_str is a format string describing the data elements in the packet. The variable arguments following format_str should have the appropriate types for each data element. Note that for array data elements, an extra argument must be passed to hold each array's length. (See Format Strings for a full description.)
Works the same as the first Packet constructor, but allows for passing a va_list in place of the variable arguments. This constructor is useful for libraries built on top of MRNet that allow users to specify packet format strings.
Works the same as the first Packet constructor, but allows for passing an array of data element pointers instead of the variable arguments. The data_elems array must contain the same number of elements as indicated by format_str .
Extracts the data contained within this Packet according to format_str, a format string that must match that of the packet (see Format Strings for a full description).
For the first version, the function arguments following format_str should be pointers to the appropriate types of each data item. For string and array data types, new memory buffers to hold the data will be allocated using malloc , and it is the user's responsibility to free these strings and arrays. Note that for array data elements, an extra argument must be passed to hold each array's length.
For the second version, the va_list list should contain the arguments corresponding to the varargs from the first version. The third parameter is a dummy parameter required by some compilers to distinguish the second version from the first version, and its value is ignored.
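An extraction sketch for the first version (assuming the method is Packet::unpack, a received PacketPtr pkt, and a packet known to contain "%s %d"):

```cpp
// name and value are filled in by the extraction; the string buffer
// is allocated with malloc by MRNet, so the caller must free it.
char* name = NULL;
int32_t value = 0;
if( pkt->unpack( "%s %d", &name, &value ) == 0 ) {
    /* ... use name and value ... */
    free( name );
}
```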
This method can be used to tell MRNet to deliver the packet to a specific set of back-ends, rather than all the back-ends addressed by the stream on which the packet is sent. bes should point to an array of back-end ranks, and num_bes is the number of entries in the array.
This method can be used to tell MRNet whether or not to deallocate the string and array data members of a Packet. If destroy is true, string and array data members will be deallocated using free when the Packet destructor is executed; this assumes they were allocated using malloc. The default behavior for user-generated packets is not to deallocate (false). Turning on deallocation is useful in filter code that must allocate strings or arrays for output packets, since these cannot be freed before the filter function returns.
In the MRNet lightweight back-end library, the MRNet C++ classes are mimicked for ease of use. With the exception of constructors/destructors, API calls in standard MRNet can be translated to their lightweight versions according to the following pattern:
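The pattern is that a C++ call of the form obj->method( args ) corresponds to a C call of the form ClassName_method( obj, args ). For example (a sketch; stream, tag, and value are assumed to exist):

```c
/* Standard MRNet (C++):       stream->send( tag, "%d", value );     */
Stream_send( stream, tag, "%d", value );  /* lightweight C equivalent */
```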
The back-end constructor method. MRNet automatically passes the necessary information to the back-end process in the program argument vector ( argc/argv ) by inserting it after the user-specified arguments. See Network * Network::CreateNetworkBE( int argc, char ** argv ); for more information on the required arguments.
otag will be filled in with the integer tag value that was passed by the corresponding Stream_send operation. packet is the packet that was received. A pointer to the Stream_t to which the packet was addressed will be returned in stream .
otag will be filled in with the integer tag value that was passed by the corresponding Stream_send operation. packet is the packet that was received. A pointer to the Stream_t to which the packet was addressed will be returned in stream .
Network_get_NetworkTopology is used to retrieve a pointer to the underlying NetworkTopology_t instance within network . Note that in the lightweight back-end library, the information available in the NetworkTopology_t may be a subset of the full topology.
Network_get_Stream returns a pointer to a Stream_t identified by id , or NULL on failure. Back-ends may pass their local rank as the id to retrieve a singlecast stream that can be used for non-filtered communication directly with the front-end.
This method sends data on stream . tag is an integer that identifies the data to be sent on the stream. format_string is a format string describing the types of the data elements (see Format Strings for a full description). On success, Stream_send returns 0; on failure, -1.
NOTE: tag must have a value greater than or equal to the constant FirstApplicationTag defined by MRNet ( #include "mrnet_lightweight/Types.h" ). Tag values less than FirstApplicationTag are reserved for internal MRNet use.
Stream_recv invokes a stream-specific receive operation. Packets addressed to the passed stream will be returned, one-at-a-time, in FIFO order. If blocking is true, the operation will block until a packet is available for this stream; if false, the operation will return immediately.
This method extracts data elements contained within packet according to the format_string , which must match that of packet . The function arguments following format_string should be pointers to the appropriate types of each data element. For string and array data types, new memory buffers to hold the data will be allocated using malloc , and it is the user's responsibility to free these strings and arrays. Note that for array data elements, an extra argument must be passed to hold each array's length.
This method extracts data elements contained within packet according to the format_string , which must match that of packet . The function arguments contained in the va_list arg_list should be pointers to the appropriate types of each data element. For string and array data types, new memory buffers to hold the data will be allocated using malloc , and it is the user's responsibility to free these strings and arrays. Note that for array data elements, an extra argument must be passed to hold each array's length. The fourth parameter is a dummy parameter required by some compilers to distinguish this version from the variadic version, and its value is ignored.
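Putting the lightweight calls together, a minimal back-end might look like the following sketch (PROT_EXIT is a hypothetical application-defined tag; error handling and network teardown are omitted):

```c
#include <stdlib.h>
#include "mrnet_lightweight/MRNet.h"

#define PROT_EXIT (FirstApplicationTag)

int main( int argc, char** argv )
{
    Network_t* net = Network_CreateNetworkBE( argc, argv );
    Packet_t* pkt = (Packet_t*) malloc( sizeof(Packet_t) );
    Stream_t* stream = NULL;
    int tag = 0;

    do {
        if( Network_recv( net, &tag, pkt, &stream ) == -1 )
            break;                                  /* receive error */
        int32_t value = 0;
        Packet_unpack( pkt, "%d", &value );
        Stream_send( stream, tag, "%d", value );    /* echo to front-end */
    } while( tag != PROT_EXIT );

    return 0;
}
```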
For this discussion, $MRNET_ROOT is the location of the top-level directory of the MRNet distribution and $MRNET_ARCH is a string describing the platform (OS and architecture) as discovered by the configure process.
MRNet has been developed to be highly portable; we expect it to run properly on all common Unix-based as well as Windows platforms. Please refer to the README document included with the MRNet distribution for the list of currently supported platforms.
MRNet requires GNU make for building on Unix/Linux systems. Our build system attempts to use native system compilers where available. For building on Windows systems, Visual Studio 2010 solution/project files are available, as are pre-built libraries and binaries.
The shell script mrnet_tests.sh is placed in the binary build/installation directory along with the other executables. This script can be used to run the MRNet test programs and check their output for errors. The script is used as follows:
One of the -l , -r , or -a flags is required. The -l flag is used to run all tests using only topologies that create processes on the local machine (note: running the tests locally can take quite a while, up to 15 minutes depending on machine capabilities). The -r flag runs tests using remote machines specified in the file whose name immediately follows this flag. To run tests both locally and remotely, use the -a flag and provide a hostfile.
To test MRNet's ability to dynamically load shared libraries containing filter functions, you must specify the -f flag. The -lightweight flag is used to run the tests using lightweight back-ends in addition to the standard back-ends.
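For example (a sketch based on the flags described above; the hostfile path is illustrative, and the exact option spellings follow the script's usage message):

```
# Run all tests locally, including shared-object filter tests
# and lightweight back-end tests:
./mrnet_tests.sh -l -f -lightweight

# Run tests on the remote hosts listed in ./my_hosts:
./mrnet_tests.sh -r ./my_hosts
```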
MRNet allows a tool to specify a node allocation and process connectivity tailored to its computation and communication requirements and to the system where the tool will run. Choosing an appropriate MRNet configuration can be difficult due to the complexity of the tool's own activity and its interaction with the system. This section describes how users define their own process topologies, and the mrnet_topgen utility provided by MRNet to facilitate generation of topology specification files.
The first parameter to Network::CreateNetworkFE is the name of an MRNet topology file. This file defines the topological layout of the front-end, internal, and back-end MRNet processes. In the syntax of the topology file, the hostname:id tuple represents a process with instance id running on hostname. It is important to note that the instance is used to distinguish processes on the same host, and does not reflect a port or process rank. A line in the topology file has the form:
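For example, a parent with two children is declared using the arrow syntax (the line below is illustrative and matches the interpretation in the following paragraph):

```
hostname1:0 => hostname1:1 hostname1:2 ;
```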
meaning a process on hostname1 with instance id 0 has two children, with instance ids 1 and 2, running on the same host. MRNet will parse the topology file without error if the file properly defines a tree in the mathematical sense (i.e., the tree has a single root, is fully connected, contains no cycles, and no node is its own descendant). Please note that the hostname associated with the root of the topology must match the host where the front-end process is run, or a run-time error will occur.
MRNet provides a topology generator program that supports generation of balanced and k-nomial topologies using simple specifications, and arbitrary topologies with a more complex specification that fully enumerates the topology fan-outs at each level of the tree. After MRNet is built, this program can be found at $MRNET_ROOT/build/$MRNET_ARCH/bin/mrnet_topgen . The usage can be obtained by running mrnet_topgen without arguments.
The generator program uses host lists that specify the maximum number of processes to place on each host. The format for the host list is one host specification per line, where each specification is of the form hostname[:num_slots]. If the number of process slots is not given with the host, the generator program assumes only one process should be placed on the host. Additionally, if the same hostname is given on multiple lines, the number of processes that can be placed on the host is the summation of the process slot counts for all lines. An example host list file follows:
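A host list consistent with the description might look like this (hostnames are illustrative; host1 appears twice, so its slot counts are summed, and the second host1 line omits the slot count and therefore defaults to one):

```
host1:3
host1
host2:2
host3:2
```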
The above host list file results in three hosts being available for topology process placement, with host1 having four available slots, and host2 and host3 each having two available slots. The generator program also allows users to specify different host lists for the placement of internal communication processes and back-end processes (see the mrnet_topgen usage for more information).
Some MRNet front-end programs may wish to generate a topology at run-time. To support this requirement, MRNet provides three API classes, BalancedTree, KnomialTree, and GenericTree, that front-end programs may use directly to generate any topology that can be produced by mrnet_topgen . Not surprisingly, mrnet_topgen is built upon these classes, and its source code ( $MRNET_ROOT/tests/config_generator.C ) can serve as a reference for front-end programs wishing to use them.
packets_in is a reference to a vector of packets serving as input to the filter function. packets_out is a reference to a vector into which output packets should be placed. When packets need to be sent in the reverse direction on the stream, packets_out_reverse can be used instead of packets_out . Both packets_out and packets_out_reverse can be used simultaneously. filter_state may be used to define and maintain state specific to a filter instance. config_params is a reference to a PacketPtr containing the current configuration settings for the filter instance, as can be set using Stream::set_FilterParameters . Finally, topol_info provides information that can be used by filters to determine the local process's placement in the topology, as well as access to the local Network object.
For each filter function defined in a shared object file, there must be a const char * symbol whose name is formed by concatenating the filter function name and the suffix "_format_string". For instance, if the filter function is named my_filter_func , the shared object must define a symbol const char* my_filter_func_format_string . The value of this string will be the MRNet format string describing the format of data that the filter can operate on. A value of "" denotes that the filter can operate on packets containing arbitrary data.
MRNet automatically recovers from failures of internal tree processes (i.e., those processes that are not the front-end (root) or back-ends (leaves)). As part of the recovery, MRNet will extract filter state from the children of a failed process and pass that state as input to each child's newly chosen parent. If you have a filter that maintains persistent state using filter_state , you can provide an additional function within the shared object for your filter that MRNet may use to extract the state. The name of this extraction function should be the same as the filter name with the suffix " _get_state " appended. For instance, if the filter function is named my_filter_func, the extraction function should be named my_filter_func_get_state .
filter_state is a pointer to the state defined by the filter for the stream identified by stream_id . This function should extract the necessary state and return a packet that can be passed as input to the filter function. Since the packet will be processed as a normal input packet for the filter, its format must match that expected by the filter. A fault-tolerant filter example is provided in $MRNET_ROOT/Examples/FaultRecovery .
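Putting the pieces together, a filter shared object might look like the following sketch (an integer-sum filter; the parameter list follows the filter prototype described above, the names are illustrative, and the state-extraction function is omitted):

```cpp
#include <vector>
#include "mrnet/MRNet.h"
using namespace MRN;

extern "C" {

// Format of packets this filter accepts: a single 32-bit integer.
const char* IntSum_format_string = "%d";

// Sum the integer from each input packet into one output packet.
void IntSum( std::vector<PacketPtr>& packets_in,
             std::vector<PacketPtr>& packets_out,
             std::vector<PacketPtr>& /* packets_out_reverse */,
             void** /* filter_state */,
             PacketPtr& /* config_params */,
             const TopologyLocalInfo& /* topol_info */ )
{
    int32_t sum = 0;
    for( unsigned i = 0; i < packets_in.size(); i++ ) {
        int32_t val;
        packets_in[i]->unpack( "%d", &val );
        sum += val;
    }
    PacketPtr out( new Packet( packets_in[0]->get_StreamId(),
                               packets_in[0]->get_Tag(), "%d", sum ) );
    packets_out.push_back( out );
}

} /* extern "C" */
```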
Since we use the C facility dlopen to dynamically load new filter functions, all filter-related symbols must be exported with C linkage. That is, the filter function, format string, and state extraction function definitions must fall within the statements:
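In outline:

```cpp
extern "C" {
    /* filter function, format string, and (optional)
       state extraction function definitions go here */
}
```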
The file that contains the filter functions and format strings must be compiled into a valid shared object. For example, with the GNU C++ compiler on ELF systems, the options " -fPIC -shared -rdynamic " can be used. Please refer to your compiler documentation for the appropriate options for other compilers. You can also refer to the setting of the SOFLAGS variable in $MRNET_ROOT/build/$MRNET_ARCH/Makefile.examples to see the options chosen during the configure process for compiling the Example filter libraries.
Additionally, front-end and back-end programs that will dynamically load filters must be built with compiler options that notify the linker to export global symbols (for GNU compilers, you can use "- Wl,-E ").
Arrays of specific types are also supported. An array can be specified by prepending `a' to the appropriate type conversion (e.g., "%ac" for an array of 8-bit signed characters). Array conversions require an implicit length parameter of type uint32_t to be passed to send() methods: e.g., send("%af", float_array_pointer, float_array_length) . When the number of elements in the array exceeds the maximum value of uint32_t (roughly 4 billion elements), MRNet provides a large-array conversion that can be specified by prepending `A' to the appropriate type conversion (e.g., "%Auc" for an array of 8-bit unsigned characters). Large-array conversions require an implicit length parameter of type uint64_t .
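For instance, sending a float array might look like the following sketch (assuming an existing stream and tag):

```cpp
float vals[3] = { 1.0f, 2.0f, 3.0f };
uint32_t len = 3;
stream->send( tag, "%af", vals, len );   // array plus explicit length

// Large arrays (more than UINT32_MAX elements) use the 'A' conversion
// with a 64-bit length:
// stream->send( tag, "%Af", big_vals, (uint64_t) big_len );
```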
All data is recorded as instances of a perf_data_t, which is simply a union type that can hold a 64-bit signed integer, a 64-bit unsigned integer, or a double precision float. As shown below, the data values can be accessed using the i , u , or d union fields.
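A minimal sketch of such a union, with the i/u/d fields described above (the accessor function is illustrative, not part of the MRNet API):

```cpp
#include <cstdint>
#include <cassert>

// Sketch of the perf_data_t union as described: a single 64-bit slot
// read through the field matching the metric's type.
typedef union {
    int64_t  i;   // signed integer data
    uint64_t u;   // unsigned integer data
    double   d;   // double-precision floating-point data
} perf_data_t;

// Example accessor: interpret a recorded value as a double (e.g., seconds).
inline double perf_as_double( perf_data_t v ) { return v.d; }
```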
The Metric-Context Compatibility Matrix shows which metrics are valid for a given context. When a metric is valid only for CTX_FILT_OUT , the metric is actually recorded through a combination of measurements at CTX_FILT_IN and CTX_FILT_OUT . When a metric is valid only for CTX_NONE , the data is only recorded at the time it is collected. An example MRNet application that makes use of the Stream performance data collection facilities is provided in $MRNET_ROOT/Examples/PerformanceData .
Set the debug output level (valid values are 1-5, default is 1). Level 1 will only log warning/error messages, level 3 provides fairly detailed function execution logging, and level 5 enables all log messages.
As an alternative to CRAY_ALPS_APID, you may use this attribute to specify the process id (pid) of the aprun process used to start the target application, and MRNet will obtain the corresponding apid.
Specify a colon-separated list of file pathnames (e.g., "/path/to/file/a:/path/to/file/b"). MRNet will use the ALPS tool helper library to stage the specified files to Cray compute nodes hosting the application identified using either CRAY_ALPS_APID or CRAY_ALPS_APRUN_PID.
MRNet uses a timeout on its topology update stream; the default is 250 ms. This can be changed by setting MRNET_TOPOLOGY_UPDATE_TIMEOUT_MSEC. The setting only affects the first timeout period after a topology update is received; after that, the default event wait timeout is used.