Stack Trace Analysis Tool
- March 2013: STAT 2.0 release, now on GitHub.
- September 2012: STAT used to debug 1 million MPI tasks!
- November 2011: STAT 1.2.1 release.
- June 2011: STAT wins an R&D 100 award!
- May 2011: STAT 1.1 release.
- March 2011: We have posted a video about STAT and its usage on YouTube.
- November 2010: STAT 1.0 release.
- May 2010: STAT successfully run on 216,000 cores of ORNL's Cray XT5 Jaguar.
- November 2009: STAT's source code is now hosted at The Office of Science's SciDAC Outreach Center.
- November 2009: A paper on STAT's temporal analysis presented at SuperComputing 2009.
- December 2008: STAT featured in a Government Computer News article.
- November 2008: A paper on STAT's scalability presented at SuperComputing 2008.
The scale of today's fastest supercomputers surpasses the capabilities of even the most advanced debuggers. For instance, Lawrence Livermore National Laboratory's BlueGene/Q-based Sequoia boasts 1,572,864 cores. To help debug problems that arise at such extreme scales, we developed the Stack Trace Analysis Tool (STAT) to identify groups of processes in a parallel application that exhibit similar behavior. A single representative of each group can then be examined with a full-featured debugger like TotalView for more in-depth analysis.
The Stack Trace Analysis Tool gathers and merges stack traces from a parallel application's processes. The tool produces two kinds of call graphs, 2D spatial and 3D spatial-temporal, which encode the calling behavior of the application processes in the form of a prefix tree. The 2D spatial call prefix tree represents a single snapshot of the entire application (Figure 1). The 3D spatial-temporal call prefix tree represents a series of snapshots from the application taken over time (Figure 2). In these graphs, the nodes are labeled by function names. The directed edges, showing the calling sequence from caller to callee, are labeled by the set of tasks that follow that call path. Nodes that are visited by the same set of tasks are assigned the same color, giving a visual reference to the various equivalence classes.
Figure 1: A 2D spatial call prefix tree.
Figure 2: A 3D spatial-temporal call prefix tree.
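The merging described above can be sketched as a small prefix-tree insertion. This is an illustration only, with an assumed trace format and class names, not STAT's actual implementation:

```python
# Minimal sketch of merging per-task stack traces into a 2D call prefix
# tree. Each node records the set of tasks whose call paths pass through
# it, mirroring the task-set edge labeling described above.

class Node:
    def __init__(self, name):
        self.name = name
        self.tasks = set()       # tasks whose call path reaches this node
        self.children = {}       # callee function name -> Node

    def insert(self, trace, task):
        """Insert one task's trace, listed outermost frame first."""
        self.tasks.add(task)
        if trace:
            child = self.children.setdefault(trace[0], Node(trace[0]))
            child.insert(trace[1:], task)

def merge_traces(traces):
    """traces: {task_id: [frame, ...]} -> root of the call prefix tree."""
    root = Node("_start")
    for task, trace in traces.items():
        root.insert(trace, task)
    return root

# Tasks 0-2 are blocked in MPI_Barrier; task 3 is still computing.
traces = {
    0: ["main", "do_work", "MPI_Barrier"],
    1: ["main", "do_work", "MPI_Barrier"],
    2: ["main", "do_work", "MPI_Barrier"],
    3: ["main", "do_work", "compute"],
}
root = merge_traces(traces)
work = root.children["main"].children["do_work"]
print(sorted(work.children["MPI_Barrier"].tasks))  # prints [0, 1, 2]
```

Tasks that end up sharing a leaf node here form one equivalence class, which is what the node coloring in the figures visualizes.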
STAT is also capable of gathering stack traces with more fine-grained information, such as the program counter or the source file and line number of each frame. The latter information can be fed into a static code analysis engine to derive the temporal order of the MPI tasks. This analysis traverses from the root of the tree towards the leaves, at each step determining the partial ordering of sibling nodes in the tree by analyzing the control flow of the source code. For straight-line code, this means that one task has made more progress than another if it has executed past the other task's point, i.e., if it has a greater line number. This ordering is partial, since two tasks in different branches of an if-else are incomparable. In cases where program points are within a loop, STAT can extract the loop ordering variable from the application processes and further delineate tasks by execution progress. This analysis can be useful for identifying the culprit in a deadlocked or livelocked application, where the problematic task has often made either the least or the most progress through the code, leaving the remaining tasks stuck in a barrier or blocked pending a message. More information about the temporal ordering analysis is available here.
Figure 3: STAT's temporal ordering analysis engine indicates that task 1 has made the least progress. In this example, task 1 is stuck in a compute cycle, while the other tasks are blocked in MPI communication, waiting for task 1.
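As a rough illustration of the straight-line-code rule above, progress between two tasks stopped in the same function can be compared by line number, with incomparable results for different call paths or different if-else branches. The function and its signature are hypothetical; STAT's real engine analyzes the control flow of the source, not just line numbers:

```python
# Hypothetical sketch of the partial-order comparison described above.
# A frame is (function, line). Returns 1 if frame_a is further along
# than frame_b, -1 if behind, 0 if equal, and None if incomparable.

def compare_progress(frame_a, frame_b, branch_of=None):
    func_a, line_a = frame_a
    func_b, line_b = frame_b
    if func_a != func_b:
        return None  # different call paths: incomparable
    if branch_of is not None and branch_of(line_a) != branch_of(line_b):
        return None  # opposite arms of an if-else: incomparable
    # Straight-line code: a greater line number means more progress.
    return (line_a > line_b) - (line_a < line_b)

print(compare_progress(("solve", 120), ("solve", 98)))  # prints 1
```

The optional `branch_of` callback stands in for the control-flow analysis that decides which branch a program point belongs to.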
Our STAT daemons use Dyninst's StackWalker API to attach to the application processes and to gather the stack traces and stack variables. As the STAT daemons collect the traces, they merge them into a local representation of the 2D and 3D call prefix trees. To address scalability, we leverage the tree-based overlay network (TBON)
model, which exploits the logarithmic scaling properties of trees for
tool control and data analyses; specifically, STAT leverages the MRNet
TBON implementation. MRNet creates a tree topology of networked
processes that allows tool computation to be distributed and reduces
the network bottleneck experienced by the traditional 1-to-N topology
where a tool front-end directly connects to N back-end daemons. As call
prefix trees are propagated from the daemons through the MRNet tree of
processes, a data aggregation filter applies our merging algorithm.
Each communication process in the MRNet tree receives call prefix trees
from all of its children and then merges them all into a single 2D and
3D representation. These traces are then forwarded up the tree until the STAT front-end creates the fully merged call prefix trees.
STAT also uses the LaunchMON daemon launching infrastructure to
scalably launch daemons, identify target application processes, and
pass MRNet connection information to the daemons.
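The reduction performed at each communication process can be sketched as follows. The nested-dict tree representation and function names are assumptions for illustration, not MRNet's real filter API:

```python
# Sketch of the data aggregation step described above: a communication
# process merges the call prefix trees received from its children into
# one tree before forwarding it toward the front-end. Trees are nested
# dicts of the form {function: (task_set, children)}.

def merge_trees(dest, src):
    """Merge prefix tree `src` into `dest` in place."""
    for name, (tasks, children) in src.items():
        if name in dest:
            dest_tasks, dest_children = dest[name]
            dest_tasks |= tasks                  # union the task sets
            merge_trees(dest_children, children)
        else:
            dest[name] = (set(tasks), children)
    return dest

def aggregation_filter(child_trees):
    """Reduce the trees from all children into a single merged tree."""
    merged = {}
    for tree in child_trees:
        merge_trees(merged, tree)
    return merged

# Two children report overlapping call paths with disjoint task sets.
merged = aggregation_filter([
    {"main": ({0, 1}, {"MPI_Barrier": ({0, 1}, {})})},
    {"main": ({2, 3}, {"MPI_Barrier": ({2}, {}), "compute": ({3}, {})})},
])
print(sorted(merged["main"][0]))  # prints [0, 1, 2, 3]
```

Because the same filter runs at every level of the tree, the front-end receives a single, fully merged tree after a logarithmic number of merge rounds.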
The chart below shows the performance and scalability of STAT's merging algorithm on BG/L up to the full 212,992 process count. For our experiment, each STAT daemon, running on one of BG/L's 1,664 I/O nodes, generated a 2D spatial and 3D spatial-temporal call prefix tree for its 64 (Co-Processor mode) or 128 (Virtual Node mode) application tasks. We
then measured the time it took for the
front-end to broadcast a request to the daemons for these locally
merged traces and
for STAT to create a globally merged 2D and 3D call prefix tree
representing the entire application. At
each scale, we tested a
tool topology with one layer of communication processes between the
front-end and daemons. These communication processes
were launched on the 14 BG/L login nodes, each node with 2 cores,
allowing for a maximum of 28 concurrent communication processes.
Figure 4: STAT stack trace merging performance on BG/L.
STAT's merging algorithm, its core operation, scales very well up to 208K tasks on BG/L, taking under half a second at that scale. This performance was achieved despite having limited resources for the communication processes: for BG/L's 1,664 I/O nodes, a balanced 2-deep tree would require a layer of 41 communication processes. Given additional compute resources, we would expect to see even better performance.
We have also built a GUI to run STAT and visualize STAT's outputted call prefix trees. It is written in Python and uses the XDot library for dot file layout. This GUI provides a variety of operations to help the user focus on particular call paths and tasks of interest. It can also be used to identify the various equivalence classes and includes an interface to attach a heavyweight debugger to the representative subset of tasks.
Figure 5: A screenshot of the STAT GUI.
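Selecting the representative subset can be as simple as taking one task from each equivalence class. The function below is a toy sketch under that assumption, not the GUI's actual code:

```python
# Hypothetical sketch: pick one representative task per equivalence
# class, where each class is the set of tasks sharing a call path.

def representatives(task_sets):
    """task_sets: iterable of task-id sets -> one task per class."""
    return sorted(min(s) for s in task_sets)

print(representatives([{0, 1, 2}, {3}, {4, 7, 9}]))  # prints [0, 3, 4]
```

A heavyweight debugger then only needs to attach to this handful of tasks instead of the full job.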
STAT Source, Dependencies and Portability
STAT source code is available for download on GitHub.
STAT is built on a portable, open source infrastructure:
- GraphLib: a highly efficient graph generation library.
- LaunchMON: a scalable tool infrastructure to co-locate tool daemons with application processes.
- MRNet: a library that deploys a tree-based overlay network.
- StackWalker API: a Dyninst API for gathering stack traces.
STAT has been successfully run on a wide range of HPC platforms.
Quick Start Guide and Usage Video
Dong H. Ahn, Bronis R. de Supinski, Ignacio Laguna, Gregory L. Lee, Ben Liblit, Barton P. Miller, and Martin Schulz, "Scalable Temporal Order Analysis for Large Scale Debugging", SuperComputing 2009, Portland, OR, November 2009.

Gregory L. Lee, Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Matthew Legendre, Barton P. Miller, Martin Schulz, and Ben Liblit, "Lessons Learned at 208K: Towards Debugging Millions of Cores", SuperComputing 2008, November 2008.

Gregory L. Lee, Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Barton P. Miller, and Martin Schulz, "Benchmarking the Stack Trace Analysis Tool for BlueGene/L", International Conference on Parallel Computing (ParCo), Aachen and Jülich, Germany, September 2007.

Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Gregory L. Lee, Barton P. Miller, and Martin Schulz, "Stack Trace Analysis for Large Scale Applications", International Parallel & Distributed Processing Symposium, Long Beach, California, March 2007.
LLNL: Dong Ahn, Bronis de Supinski, Greg Lee, Matt Legendre. [ahn1, bronis, lee218, ...] @llnl.gov
UW: Barton Miller, Ben Liblit.
UNM: Dorian Arnold. [darnold] @cs.unm.edu
Last Modified March 8, 2013