Stack Trace Analysis Tool
- March 2013: STAT 2.0 release, now on GitHub.
- September 2012: STAT used to debug 1 million MPI tasks!
- November 2011: STAT 1.2.1 release.
- June 2011: STAT wins an R&D 100 award!
- May 2011: STAT 1.1 release.
- March 2011: We have posted a video about STAT and its usage on YouTube.
- November 2010: STAT 1.0 release.
- May 2010: STAT successfully run on 216,000 cores of ORNL's Cray XT5 Jaguar.
- November 2009: STAT's source code is now hosted at The Office of Science's SciDAC Outreach Center.
- November 2009: A paper on STAT's temporal analysis presented at SuperComputing 2009.
- December 2008: STAT featured in a Government Computer News article.
- November 2008: A paper on STAT's scalability presented at SuperComputing 2008.
The scale of today's fastest supercomputers surpasses the capabilities of even the most advanced debuggers. For instance, Lawrence Livermore National Laboratory's BlueGene/Q-based Sequoia boasts 1,572,864 cores. To help debug problems that arise at such extreme scales, we developed the Stack Trace Analysis Tool (STAT) to identify groups of processes in a parallel application that exhibit similar behavior. A single representative of each group can then be examined with a full-featured debugger like TotalView for more in-depth analysis.
The Stack Trace Analysis Tool gathers and merges stack traces from a parallel application's processes. The tool produces two kinds of call graphs, 2D spatial and 3D spatial-temporal, which encode the calling behavior of the application processes in the form of a prefix tree. The 2D spatial call prefix tree represents a single snapshot of the entire application (Figure 1). The 3D spatial-temporal call prefix tree represents a series of snapshots from the application taken over time (Figure 2). In these graphs, the nodes are labeled by function names. The directed edges, showing the calling sequence from caller to callee, are labeled by the set of tasks that follow that call path. Nodes that are visited by the same set of tasks are assigned the same color, giving a visual reference to the various equivalence classes.
Figure 1: A 2D spatial call prefix tree.
Figure 2: A 3D spatial-temporal call prefix tree.
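The merging described above can be sketched as a small prefix-tree insertion. This is an illustration only, with an assumed trace format and class names, not STAT's actual implementation:

```python
# Minimal sketch of merging per-task stack traces into a 2D call prefix
# tree. Each node records the set of tasks whose call paths pass through
# it, mirroring the task-set edge labeling described above.

class Node:
    def __init__(self, name):
        self.name = name
        self.tasks = set()       # tasks whose call path reaches this node
        self.children = {}       # callee function name -> Node

    def insert(self, trace, task):
        """Insert one task's trace, listed outermost frame first."""
        self.tasks.add(task)
        if trace:
            child = self.children.setdefault(trace[0], Node(trace[0]))
            child.insert(trace[1:], task)

def merge_traces(traces):
    """traces: {task_id: [frame, ...]} -> root of the call prefix tree."""
    root = Node("_start")
    for task, trace in traces.items():
        root.insert(trace, task)
    return root

# Tasks 0-2 are blocked in MPI_Barrier; task 3 is still computing.
traces = {
    0: ["main", "do_work", "MPI_Barrier"],
    1: ["main", "do_work", "MPI_Barrier"],
    2: ["main", "do_work", "MPI_Barrier"],
    3: ["main", "do_work", "compute"],
}
root = merge_traces(traces)
work = root.children["main"].children["do_work"]
print(sorted(work.children["MPI_Barrier"].tasks))  # prints [0, 1, 2]
```

Tasks that end up sharing a leaf node here form one equivalence class, which is what the node coloring in the figures visualizes.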
STAT is also capable of gathering stack traces with more fine-grained information, such as the program counter or the source file and line number of each frame. The latter information can be fed into a static code analysis engine to derive the temporal order of the MPI tasks. This analysis traverses from the root of the tree towards the leaves, at each step determining the partial ordering of sibling nodes in the tree by analyzing the control flow of the source code. For straight-line code, this means that one task has made more progress than another if it has executed past the other task's point, i.e., if it has a greater line number. This ordering is partial, since two tasks in different branches of an if-else are incomparable. In cases where program points are within a loop, STAT can extract the loop ordering variable from the application processes and further delineate tasks by execution progress. This analysis can be useful for identifying the culprit in a deadlocked or livelocked application, where the problematic task has often made either the least or the most progress through the code, leaving the remaining tasks stuck in a barrier or blocked pending a message. More information about the temporal ordering analysis is available here.
Figure 3: STAT's temporal ordering analysis engine indicates that task 1 has made the least progress. In this example, task 1 is stuck in a compute cycle, while the other tasks are blocked in MPI communication, waiting for task 1.
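As a rough illustration of the straight-line-code rule above, progress between two tasks stopped in the same function can be compared by line number, with incomparable results for different call paths or different if-else branches. The function and its signature are hypothetical; STAT's real engine analyzes the control flow of the source, not just line numbers:

```python
# Hypothetical sketch of the partial-order comparison described above.
# A frame is (function, line). Returns 1 if frame_a is further along
# than frame_b, -1 if behind, 0 if equal, and None if incomparable.

def compare_progress(frame_a, frame_b, branch_of=None):
    func_a, line_a = frame_a
    func_b, line_b = frame_b
    if func_a != func_b:
        return None  # different call paths: incomparable
    if branch_of is not None and branch_of(line_a) != branch_of(line_b):
        return None  # opposite arms of an if-else: incomparable
    # Straight-line code: a greater line number means more progress.
    return (line_a > line_b) - (line_a < line_b)

print(compare_progress(("solve", 120), ("solve", 98)))  # prints 1
```

The optional `branch_of` callback stands in for the control-flow analysis that decides which branch a program point belongs to.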
Our STAT daemons use Dyninst's StackWalker API to attach to the application processes and to gather the stack traces and stack variables. As the STAT daemons collect the traces, they merge them into a local representation of the 2D and 3D call prefix trees. To address scalability, we leverage the tree-based overlay network (TBON)
model, which exploits the logarithmic scaling properties of trees for
tool control and data analyses; specifically, STAT leverages the MRNet
TBON implementation. MRNet creates a tree topology of networked
processes that allows tool computation to be distributed and reduces
the network bottleneck experienced by the traditional 1-to-N topology
where a tool front-end directly connects to N back-end daemons. As call
prefix trees are propagated from the daemons through the MRNet tree of
processes, a data aggregation filter applies our merging algorithm.
Each communication process in the MRNet tree receives call prefix trees
from all of its children and then merges them all into a single 2D and
3D representation. These traces are then forwarded up the tree until the STAT front-end creates the fully merged call prefix trees.
STAT also uses the LaunchMON daemon launching infrastructure to
scalably launch daemons, identify target application processes, and
pass MRNet connection information to the daemons.
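The reduction performed at each communication process can be sketched as follows. The nested-dict tree representation and function names are assumptions for illustration, not MRNet's real filter API:

```python
# Sketch of the data aggregation step described above: a communication
# process merges the call prefix trees received from its children into
# one tree before forwarding it toward the front-end. Trees are nested
# dicts of the form {function: (task_set, children)}.

def merge_trees(dest, src):
    """Merge prefix tree `src` into `dest` in place."""
    for name, (tasks, children) in src.items():
        if name in dest:
            dest_tasks, dest_children = dest[name]
            dest_tasks |= tasks                  # union the task sets
            merge_trees(dest_children, children)
        else:
            dest[name] = (set(tasks), children)
    return dest

def aggregation_filter(child_trees):
    """Reduce the trees from all children into a single merged tree."""
    merged = {}
    for tree in child_trees:
        merge_trees(merged, tree)
    return merged

# Two children report overlapping call paths with disjoint task sets.
merged = aggregation_filter([
    {"main": ({0, 1}, {"MPI_Barrier": ({0, 1}, {})})},
    {"main": ({2, 3}, {"MPI_Barrier": ({2}, {}), "compute": ({3}, {})})},
])
print(sorted(merged["main"][0]))  # prints [0, 1, 2, 3]
```

Because the same filter runs at every level of the tree, the front-end receives a single, fully merged tree after a logarithmic number of merge rounds.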
The chart below shows the performance and scalability of STAT's merging algorithm on BG/L up to the full 212,992 process count. For our experiment, each STAT daemon, running on one of BG/L's 1,664 I/O nodes, generated a 2D spatial and 3D spatial-temporal call prefix tree for its 64 (Co-Processor mode) or 128 (Virtual Node mode) application tasks. We
then measured the time it took for the
front-end to broadcast a request to the daemons for these locally
merged traces and
for STAT to create a globally merged 2D and 3D call prefix tree
representing the entire application. At
each scale, we tested a
tool topology with one layer of communication processes between the
front-end and daemons. These communication processes
were launched on the 14 BG/L login nodes, each node with 2 cores,
allowing for a maximum of 28 concurrent communication processes.
Figure 4: STAT stack trace merging performance on BG/L.
STAT's merging algorithm, its core operation, scales very well up to 208K tasks on BG/L, taking under half a second at that scale. This performance was achieved despite having limited resources for the communication processes: for BG/L's 1,664 I/O nodes, a balanced 2-deep tree would require a layer of 41 communication processes. Given additional compute resources, we would expect to see even better performance.
We have also built a GUI to run STAT and visualize STAT's outputted call prefix trees. It is written in Python and uses the XDot library for dot file layout. This GUI provides a variety of operations to help the user focus on particular call paths and tasks of interest. It can also be used to identify the various equivalence classes and includes an interface to attach a heavyweight debugger to the representative subset of tasks.
Figure 5: A screenshot of the STAT GUI.
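Selecting the representative subset can be as simple as taking one task from each equivalence class. The function below is a toy sketch under that assumption, not the GUI's actual code:

```python
# Hypothetical sketch: pick one representative task per equivalence
# class, where each class is the set of tasks sharing a call path.

def representatives(task_sets):
    """task_sets: iterable of task-id sets -> one task per class."""
    return sorted(min(s) for s in task_sets)

print(representatives([{0, 1, 2}, {3}, {4, 7, 9}]))  # prints [0, 3, 4]
```

A heavyweight debugger then only needs to attach to this handful of tasks instead of the full job.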
STAT Source, Dependencies and Portability
STAT source code is available for download on GitHub.
STAT is built on a portable, open source infrastructure:
- GraphLib: a highly efficient graph generation library.
- LaunchMON: a scalable tool infrastructure to co-locate tool daemons with application processes.
- MRNet: a library that deploys a tree-based overlay network.
- StackWalker API: a Dyninst API for gathering stack traces.
STAT has been successfully run on a wide range of HPC platforms.
Quick Start Guide and Usage Video
Dong H. Ahn, Bronis R. de Supinski, Ignacio Laguna, Gregory L. Lee, Ben Liblit, Barton P. Miller, and Martin Schulz, "Scalable Temporal Order Analysis for Large Scale Debugging", SuperComputing 2009, Portland, OR, November 2009.

Gregory L. Lee, Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Matthew Legendre, Barton P. Miller, Martin Schulz, and Ben Liblit, "Lessons Learned at 208K: Towards Debugging Millions of Cores", SuperComputing 2008, November 2008.

Gregory L. Lee, Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Barton P. Miller, and Martin Schulz, "Benchmarking the Stack Trace Analysis Tool for BlueGene/L", International Conference on Parallel Computing (ParCo), Aachen and Jülich, Germany, September 2007.

Dorian C. Arnold, Dong H. Ahn, Bronis R. de Supinski, Gregory L. Lee, Barton P. Miller, and Martin Schulz, "Stack Trace Analysis for Large Scale Applications", International Parallel & Distributed Processing Symposium, Long Beach, California, March 2007.
LLNL: Dong Ahn, Bronis de Supinski, Greg Lee, Matt Legendre. [ahn1, bronis, lee218, ...] @llnl.gov
UW: Barton Miller, Ben Liblit.
UNM: Dorian Arnold. [darnold] @cs.unm.edu
Last Modified March 8, 2013