Stack Trace Analysis Tool



Motivation

The scale of today’s fastest supercomputers surpasses the capabilities of even the most advanced debuggers. For instance, Lawrence Livermore National Laboratory’s BlueGene/L boasts 212,992 processors, and with future architectures such as Sequoia promising millions of cores, this gap will only grow wider. To help close it, we developed the Stack Trace Analysis Tool (STAT), which identifies groups of processes in a parallel application that exhibit similar behavior. A single representative of each group can then be examined with a full-featured debugger, such as TotalView or DDT, for more in-depth analysis.


Description

The Stack Trace Analysis Tool gathers and merges stack traces from a parallel application’s processes and encodes the processes’ calling behavior as call prefix trees, in both 2D spatial and 3D spatial-temporal form. The 2D spatial call prefix tree represents a single snapshot of the entire application (Figure 1); the 3D spatial-temporal call prefix tree represents a series of snapshots from the application taken over time (Figure 2). In these graphs, nodes are labeled by function names, and directed edges, showing the calling sequence from caller to callee, are labeled by the set of tasks that follow that call path. Nodes visited by the same set of tasks are assigned the same color, giving a visual reference to the various equivalence classes.

Figure 1: A 2D spatial call prefix tree.

Figure 2: A 3D spatial-temporal call prefix tree.
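
To make this encoding concrete, the sketch below shows one way a call prefix tree can be represented and how the equivalence classes fall out of it.  The class and function names are illustrative only, not STAT's internal API.

    # A minimal call prefix tree sketch; names are illustrative, not
    # STAT's actual internals.
    class Node:
        def __init__(self, function):
            self.function = function  # node label: a function name
            self.tasks = set()        # MPI ranks whose traces reach this node
            self.children = {}        # function name -> child Node

    def insert_trace(root, trace, task):
        """Add one task's stack trace (outermost frame first) to the tree."""
        node = root
        node.tasks.add(task)
        for function in trace:
            node = node.children.setdefault(function, Node(function))
            node.tasks.add(task)

    def equivalence_classes(root):
        """Group nodes by task set; tasks that share every node along a
        call path form one equivalence class (one color in the GUI)."""
        classes = {}
        stack = [root]
        while stack:
            node = stack.pop()
            classes.setdefault(frozenset(node.tasks), []).append(node.function)
            stack.extend(node.children.values())
        return classes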

STAT is also capable of gathering stack traces with more fine-grained information, such as the program counter or the source file and line number of each frame.  The latter information can be fed into a static code analysis engine to derive the temporal order of the MPI tasks.  This analysis traverses from the root of the tree towards the leaves, at each step determining the partial ordering of sibling nodes by analyzing the control flow of the source code.  For straight-line code, one task has made more progress than another if it has executed past that task's stopping point, i.e., if it has reached a greater line number.  The ordering is partial because two tasks in different branches of an if-else are incomparable.  When the program points fall within a loop, STAT can extract the loop ordering variable from the application processes and further rank tasks by execution progress.  This analysis is useful for identifying the culprit in a deadlocked or livelocked application, where the problematic task has often made either the least or the most progress through the code, leaving the remaining tasks stuck in a barrier or blocked pending a message.  More information about the temporal ordering analysis is available in the Supercomputing 2009 paper listed in the References below.

Figure 3: STAT's temporal ordering analysis engine indicates that task 1 has made the least progress.  In this example task 1 is stuck in a compute cycle, while the other tasks are blocked in MPI communication, waiting for task 1.
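
The straight-line-code case of this ordering reduces to comparing line numbers among sibling frames.  The sketch below illustrates just that case with hypothetical names and made-up line numbers; the real analysis reasons about control flow rather than raw line numbers alone.

    def order_siblings(frames):
        """frames: (task_set, source_file, line_number) tuples for sibling
        nodes under one parent.  Sorts from least to most progress; valid
        only for straight-line code within a single source file."""
        return sorted(frames, key=lambda frame: frame[2])

    # Example mirroring Figure 3: task 1 is still at line 42 while tasks
    # 2-8 have executed past it to line 57, so task 1 has made the least
    # progress.  (Line numbers here are invented for illustration.)
    siblings = [({2, 3, 4, 5, 6, 7, 8}, "compute.c", 57),
                ({1}, "compute.c", 42)]
    least_tasks = order_siblings(siblings)[0][0]
    print("least progress:", sorted(least_tasks))  # -> [1]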

Our STAT daemons use DynInst's StackWalker API to attach to the application processes and to gather the stack traces and stack variables. As the STAT daemons collect the traces, they merge them into a local representation of the 2D and 3D trees.  To address scalability, we leverage the tree-based overlay network (TBON) model, which exploits the logarithmic scaling properties of trees for tool control and data analyses; specifically, STAT uses the MRNet TBON implementation. MRNet creates a tree topology of networked processes that allows tool computation to be distributed and reduces the network bottleneck of the traditional 1-to-N topology, in which a tool front-end directly connects to N back-end daemons. As call prefix trees are propagated from the daemons through the MRNet tree of processes, a data aggregation filter applies our merging algorithm: each communication process in the MRNet tree receives call prefix trees from all of its children and merges them into a single 2D and 3D representation. These merged traces are then forwarded up the tree until the STAT front-end creates the fully merged call prefix trees.  STAT also uses the LaunchMON daemon launching infrastructure to scalably launch daemons, identify target application processes, and pass MRNet connection information to the daemons.
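
The merge that the aggregation filter applies is a recursive fold over shared call path prefixes.  Continuing the Node sketch from above, the following shows the idea; the actual filter operates on a serialized graph format inside MRNet, not on Python objects.

    def merge(dst, src):
        """Merge call prefix tree src into dst in place."""
        dst.tasks |= src.tasks
        for name, src_child in src.children.items():
            if name in dst.children:
                merge(dst.children[name], src_child)  # shared prefix: recurse
            else:
                dst.children[name] = src_child        # new subtree: graft it

    # A communication process folds together the trees from all children:
    #     merged = trees[0]
    #     for tree in trees[1:]:
    #         merge(merged, tree)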


Performance

We tested the performance and scalability of STAT's merging algorithm on BG/L up to the full 212,992-process count.  For our experiment, each STAT daemon, running on one of BG/L's 1,664 I/O nodes, produced a 2D spatial and a 3D spatial-temporal call prefix tree for its 64 (Co-Processor mode) or 128 (Virtual Node mode) application tasks. We then measured the time it took for the front-end to broadcast a request to the daemons for these locally merged traces and for STAT to create globally merged 2D and 3D call prefix trees representing the entire application.  At each scale, we tested a tool topology with one layer of communication processes between the front-end and the daemons.  These communication processes were launched on the 14 BG/L login nodes, each with 2 cores, allowing for a maximum of 28 concurrent communication processes.

Figure 4: STAT stack trace merging performance on BG/L

STAT's merging algorithm, its core operation, scales very well up to 208K tasks on BG/L, taking under half a second at that scale.  This performance was achieved despite the limited resources available for communication processes: for BG/L's 1,664 I/O nodes, a balanced 2-deep tree would require a layer of 41 communication processes, yet only 28 were available.  Given additional compute resources, we would expect even better scalability.


GUI

We have also built a GUI to run STAT and to help visualize the call prefix trees it outputs.  The GUI is written in Python with PyGTK and uses the XDot package for dot file layout.  It provides a variety of operations to help the user focus on particular call paths and tasks of interest, can identify the various equivalence classes, and includes an interface for attaching a heavyweight debugger to a representative subset of tasks.

Figure 5: A screenshot of the STAT GUI.
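
As an illustration of what the GUI consumes, the sketch below serializes a merged tree (again reusing the earlier Node sketch) into Graphviz dot form, the format that XDot lays out.  The real GUI's node coloring and compact task-set notation are omitted.

    def to_dot(root):
        """Emit a call prefix tree as Graphviz dot, with edges labeled by
        the task set that follows each call path."""
        lines = ["digraph stat {"]
        def walk(node, node_id):
            lines.append('  n%d [label="%s"];' % (node_id, node.function))
            next_id = node_id + 1
            for child in node.children.values():
                child_id = next_id
                tasks = ",".join(str(t) for t in sorted(child.tasks))
                lines.append('  n%d -> n%d [label="%s"];'
                             % (node_id, child_id, tasks))
                next_id = walk(child, child_id)
            return next_id
        walk(root, 0)
        lines.append("}")
        return "\n".join(lines)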

STAT Source, Dependencies and Portability

STAT source code is available for download from the SciDAC Outreach Center.

STAT is built on a portable, open source infrastructure:

- DynInst's StackWalker API, for gathering stack traces
- MRNet, for scalable tool communication and data aggregation
- LaunchMON, for scalable daemon launching
- Python, PyGTK, and XDot, for the GUI

STAT has been successfully run on a wide range of HPC platforms, including BlueGene/L and Linux clusters.


Documentation

STAT Quick Start Guide

STAT User Guide

STAT Description and Usage Video


References

Dong H. Ahn, Bronis R. de Supinski, Ignacio Laguna, Gregory L. Lee, Ben Liblit, Barton P. Miller, and Martin Schulz, "Scalable Temporal Order Analysis for Large Scale Debugging", Supercomputing 2009, Portland, Oregon, November 2009.

Gregory L. Lee, Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Matthew Legendre, Barton P. Miller, Martin Schulz, and Ben Liblit, "Lessons Learned at 208K: Towards Debugging Millions of Cores", Supercomputing 2008, Austin, Texas, November 2008.

Gregory L. Lee, Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Barton P. Miller, and Martin Schulz, "Benchmarking the Stack Trace Analysis Tool for BlueGene/L", International Conference on Parallel Computing (ParCo), Aachen and Jülich, Germany, September 2007.

Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Gregory L. Lee, Barton P. Miller, and Martin Schulz, "Stack Trace Analysis for Large Scale Applications", International Parallel & Distributed Processing Symposium, Long Beach, California, March 2007.


Contacts

LLNL: Dong Ahn, Bronis de Supinski, Greg Lee, Matt Legendre, Martin Schulz.  [ahn1, bronis, lee218, legendre1, schulz6]

UW: Barton Miller, Ben Liblit.  [bart, liblit]

UNM: Dorian Arnold.  [darnold]

Last Modified November 2, 2011