Stack Trace Analysis Tool

Motivation

The scale of today’s fastest supercomputers surpasses the capabilities of even the most advanced debuggers. Most notably, Lawrence Livermore National Laboratory’s BlueGene/L boasts 212,992 processors -- far beyond the reach of the most advanced, full-featured parallel debuggers. With future architectures promising millions of cores on the horizon, this gap will only grow wider. To help fill this gap, we developed the Stack Trace Analysis Tool (STAT) to help identify groups of processes in a parallel application that exhibit similar behavior. A single representative of these groups can then be examined with a full-featured debugger like TotalView or DDT for more indepth analysis.

Description

The Stack Trace Analysis Tool gathers and merges stack traces from a parallel application’s processes. The tool produces call graphs: 2D spatial and 3D spatial-temporal; the graphs encode calling behavior of the application programs in the form of a prefix tree. The 2D spatial call prefix tree represents a single snapshot of the entire application (Figure 1). The 3D spatial-temporal call prefix tree represents a series of snapshots from the application taken over time (Figure 2). In these graphs, the nodes are labeled by function names. The directed edges, showing the calling sequence from caller to callee, are labeled by the set of tasks that follow that call path. Nodes that are visited by the same set of tasks are assigned the same color, giving a visual reference to the various equivalence classes.


Figure 1: A 2D spatial call prefix tree.


Figure 2: A 3D spatial-temporal call prefix tree.

Our STAT daemons use DynInst's StackWalker API to attach to the application processes and to gather the stack traces. As the STAT daemons collect the traces, they merge them into a local representation of the 2D and 3D trees.  To address scalability, we leverage the tree-based overlay network (TBON) model, which exploits the logarithmic scaling properties of trees for tool control and data analyses; specifically, STAT leverages the MRNet TBON implementation. MRNet creates a tree topology of networked processes that allows tool computation to be distributed and reduces the network bottleneck experienced by the traditional 1-to-N topology where a tool front-end directly connects to N back-end daemons. As call prefix trees are propagated from the daemons through the MRNet tree of processes, a data aggregation filter applies our merging algorithm. Each communication process in the MRNet tree receives call prefix trees from all of its children and then merges them all into a single 2D and 3D representation. These traces are then forwarded up the tree, until the STAT front-end creates the fully merged call prefix trees.  STAT also uses the LaunchMON daemon launching infrastructure to scalably launch daemons, identify target application processes, and pass MRNet connection information to the daemons.

Performance

We tested the performance and scalability of STAT's merging algorithm on BG/L up to the full 212,992 process count.  For our experiment, each STAT daemon, running on one of BG/L's 1,664 I/O nodes, produced a 2D spatial and 3D spatial-temporal call prefix tree for its 64 (Co-Processor mode) or 128 (Virtual Node mode) application tasks. We then measured the time it took for the front-end to broadcast a request to the daemons for these locally merged traces and for STAT to create a globally merged 2D and 3D call prefix tree representing the entire application.  At each scale, we tested a tool topology with one layer of communication processes between the front-end and daemons.  These communication processes were launched on the 14 BG/L login nodes, each node with 2 cores, allowing for a maximum of 28 concurrent communication processes.


Figure 3: STAT stack trace merging performance on BG/L

STAT's merging algorithm, its core operation, scales very well up to 208K tasks on BG/L, taking under half a second at that scale.  This performance was achieved despite having limited resources for the communication processes.  For BG/L's 1664 I/O nodes, a balanced 2-deep tree would require a layer of 41 communication processes and a balanced 3-deep tree would require layers of 12 and 144 communication processes.  Given additional compute resources we would expect to see even better scalability.

STAT Dependencies

GraphLib - a highly efficient graph generation library.

LaunchMON - a scalable tool infrastructure to co-locate tool daemons with application processes.

MRNet - a communication library that deploys a tree based overlay network.

StackWalker API - a lightweight API for gathering stack traces.

STAT GUI

We have also built a GUI to help visualize STAT's outputted call prefix trees.  This GUI allows for manipulation of the call prefix tree to the user focus on particular paths and tasks of interest.  It can also be used to identify various equivalence classes, which can be used to launch a heavyweight debugger.

Figure 4: A screenshot of the STAT call prefix tree visualizer.

References

Gregory L. Lee, Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Matthew Legendre, Barton P. Miller, Martin Schulz, and Ben Liblit, “Lessons Learned at 208K: Towards Debugging Millions of Cores”, SuperComputing 2008, Austin, Texas, November 2008.

Gregory L. Lee, Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Barton P. Miller, and Martin Schulz, “Benchmarking the Stack Trace Analysis Tool for BlueGene/L”, International Conference on Parallel Computing (ParCo), Aachen and Jülich, Germany, September 2007.

Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Gregory L. Lee, Barton P. Miller, and Martin Schulz, "Stack Trace Analysis for Large Scale Applications", International Parallel & Distributed Processing Symposium, Long Beach, California, March 2007.

Contacts

LLNL: Dong Ahn, Bronis de Supinski, Greg Lee, Martin Schulz.  [ahn1, bronis, lee218, schulz6] @llnl.gov

UW: Matthew Legendre, Barton Miller, Ben Liblit.  [legendre, bart, liblit] @cs.wisc.edu

UNM: Dorian Arnold. [darnold] @cs.unm.edu

LLNL-MI-408819