Stack Trace Analysis Tool
Motivation
The scale of today’s fastest supercomputers surpasses the capabilities of even the most advanced debuggers. Most notably, Lawrence Livermore National Laboratory’s BlueGene/L boasts 131,072 processors -- far beyond the 8192 process capability reach of TotalView, arguably the most scalable parallel debugger. With peta-scale machines on the horizon, this gap will only grow wider. To help fill this gap, we developed the Stack Trace Analysis Tool (STAT) to help identify groups of processes in a parallel application that exhibit similar behavior. Individual (or a small number of) representatives of these groups can then be examined with a full-feature debugger like TotalView for more in-depth analysis.
Description
The Stack Trace Analysis Tool gathers and merges stack traces from a parallel application’s processes. The tool produces two kinds of call graphs: 2D spatial and 3D spatial-temporal; the graphs encode calling behavior of the application programs in the form of a prefix tree. The 2D spatial call graph represents a single snapshot of the entire application (Figure 1). The 3D spatial-temporal call graph represents a series of snapshots from the application taken over time (Figure 2). In these graphs, the nodes are labeled by function names. The directed edges, showing the calling sequence from caller to callee, are labeled by the set of tasks that follow that call path. Nodes that are visited by the same set of tasks are assigned the same color, giving a visual reference to the various equivalence classes.

Figure 1: A 2D spatial call prefix tree.

Figure 2: A 3D spatial-temporal call prefix tree.
Our STAT daemons use DynInst to attach to the application processes and to gather the stack traces. As the STAT daemons collect the traces, they merge them into a local representation of the 2D and 3D trees. To address scalability, we leverage the tree-based overlay network (TBON) model, which exploits the logarithmic scaling properties of trees for tool control and data analyses; specifically, STAT leverages the MRNet TBON implementation. MRNet creates a tree topology of networked processes that allows tool computation to be distributed and reduces the network bottleneck experienced by the traditional 1-to-N topology where a tool front-end directly connects to N back-end daemons. As call prefix trees are propagated from the daemons through the MRNet tree of processes, a data aggregation filter applies our merging algorithm. Each communication process in the MRNet tree receives call prefix trees from all of its children and then merges them all into a single 2D and 3D representation. These trees are then forwarded up the tree, until the STAT front-end creates the fully merged call prefix trees.
Performance
We tested the performance of STAT on Atlas, a Linux cluster with 1,152 four-way, dual-core 2.4 GHz AMD Opteron nodes, totaling 9,216 cores. Each node has 16GB of memory and nodes are connected with DDR Infiniband. For our experiment, each STAT daemon produced a 2D spatial and 3D spatial-temporal call prefix tree for its eight application processes. We then measured the time it took for the front-end to broadcast a request to the daemons for these traces and for STAT to create a globally merged 2D and 3D call prefix tree representing the entire application. At each scale, we tested a traditional 1-to-N tool topology (1-deep), a tool topology with one layer of communication processes between the front-end and daemons (2-deep), and a tool topology with two layers of communication processes (3-deep).

Figure 3: STAT performance on Atlas comparing a 1-deep, 2-deep, and 3-deep tree topologies.
As the process count scales up, the 1-deep tree shows the linear scaling characteristics of a traditional tool hierarchy. By employing a layer of communication processes between the front-end and the daemons (2-deep), we see much stronger scaling. Adding a second layer of communication processes (3-deep) provided even better performance at large scales. At greater scales, a 3-deep topology may prove even more beneficial.To determine the feasibility of running STAT on the full 131,072 processor BlueGene/L, we developed STATBench, a benchmark that can emulate the performance of STAT. While STAT launches a single daemon per SMP node (or I/O node) to analyze multiple application processes, STATBench can utilize all of a compute node’s cores for its daemon emulators. Instead of gathering stack traces from an application, STATBench generates artificial traces that can model the characteristics of a real application traces. By varying the number of traces generated per daemon emulator, STATBench can model various machine architectures. For this test, we generated 64 and 128 traces per daemon emulator to model coprocessor mode and virtual node mode respectively. We ran tests with a 2-deep topology as well as two 3-deep topologies. The 3-deep topology was tested with 2 communication processes connected to the front-end as well as 4 communication processes connected to the front-end. All tests were performed on a single rack, 1024 compute node BG/L system.

Figure 4: STATBench predicted performance of STAT on BG/L using a 2-deep and various 3-deep tree topologies.
At BG/L scales, there is a significant performance improvement when going from a 2-deep to a 3-deep tree topology. While less significant, there is an additional benefit from increasing the fan-out at the front-end from 2 to 4 in the 3-deep topology. At the full BG/L scale, 131,072 processes, STATBench predicts performance of 2.6 seconds.
References
Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Gregory L. Lee, Barton P. Miller, and Martin Schulz, “Stack Trace Analysis for Large Scale Applications”, International Parallel & Distributed Processing Symposium, Long Beach, California, March 2007.
Gregory L. Lee, Dorian C.
Contacts
LLNL: Dong Ahn, Bronis de Supinski, Greg Lee, Martin Schulz. [ahn1,bronis,lee218,schulz6] @llnl.gov
UCRL-WEB-236186