Stack Trace Analysis Tool
Motivation
The scale of today’s
fastest supercomputers surpasses the capabilities of even the most
advanced debuggers. Most notably, Lawrence Livermore National
Laboratory’s BlueGene/L boasts 212,992 processors -- far beyond the
reach of the most advanced, full-featured parallel debuggers. With future architectures promising millions of cores
on the horizon, this gap will only grow wider. To help fill
this gap, we developed the Stack Trace Analysis
Tool (STAT) to help identify groups of processes in a parallel
application that exhibit similar behavior. A single representative of
these groups can then be examined with a
full-featured debugger like TotalView or DDT
for
more indepth analysis.
Description
The Stack Trace
Analysis Tool gathers and merges stack traces from a parallel
application’s processes. The tool produces call graphs: 2D
spatial and 3D spatial-temporal; the graphs encode calling behavior of
the application programs in the form of a prefix tree. The 2D spatial
call prefix tree represents a single snapshot of the entire application
(Figure 1). The 3D spatial-temporal call prefix tree represents a
series of
snapshots from the application taken over time (Figure 2). In these
graphs, the nodes are labeled by function names. The directed edges,
showing the calling sequence from caller to callee,
are labeled by the set of tasks that follow that call path. Nodes that
are visited by the same set of tasks are assigned the same color,
giving a visual reference to the various equivalence classes.
Figure
1: A 2D spatial call prefix tree.
Figure
2: A 3D spatial-temporal call prefix tree.
Our STAT daemons use DynInst's StackWalker API to attach to the
application processes
and to gather the stack traces. As the STAT daemons collect the traces,
they merge them into a local representation of the 2D and 3D
trees. To
address scalability, we leverage the tree-based overlay network (TBON)
model, which exploits the logarithmic scaling properties of trees for
tool control and data analyses; specifically, STAT leverages the MRNet
TBON implementation. MRNet creates a tree topology of networked
processes that allows tool computation to be distributed and reduces
the network bottleneck experienced by the traditional 1-to-N topology
where a tool front-end directly connects to N back-end daemons. As call
prefix trees are propagated from the daemons through the MRNet tree of
processes, a data aggregation filter applies our merging algorithm.
Each communication process in the MRNet tree receives call prefix trees
from all of its children and then merges them all into a single 2D and
3D representation. These traces are then forwarded up the tree, until
the STAT front-end creates the fully merged call prefix trees.
STAT also uses the LaunchMON daemon launching infrastructure to
scalably launch daemons, identify target application processes, and
pass MRNet connection information to the daemons.
Performance
We
tested the
performance and scalability of STAT's merging algorithm on BG/L up to
the full 212,992
process count. For
our experiment, each STAT daemon, running on one of BG/L's 1,664 I/O
nodes, produced
a 2D spatial and 3D spatial-temporal call prefix tree for its 64
(Co-Processor mode) or 128 (Virtual Node mode) application tasks. We
then measured the time it took for the
front-end to broadcast a request to the daemons for these locally
merged traces and
for STAT to create a globally merged 2D and 3D call prefix tree
representing the entire application. At
each scale, we tested a
tool topology with one layer of communication processes between the
front-end and daemons. These communication processes
were launched on the 14 BG/L login nodes, each node with 2 cores,
allowing for a maximum of 28 concurrent communication processes.
Figure
3: STAT stack trace merging performance on BG/L
STAT's merging algorithm, its core
operation, scales very well up to
208K tasks on BG/L,
taking under half a second at that scale. This performance was
achieved despite having limited resources for the communication
processes. For BG/L's 1664 I/O nodes, a balanced 2-deep tree
would
require a layer of 41 communication processes and a balanced 3-deep
tree would require layers of 12 and 144 communication processes.
Given additional compute resources we would expect to see even better
scalability.
STAT Dependencies
GraphLib - a highly efficient graph generation library.
LaunchMON
- a scalable tool infrastructure to co-locate tool
daemons with application processes.
MRNet - a communication
library that deploys a tree based overlay
network.
StackWalker API - a lightweight
API for gathering stack traces.
STAT GUI
We have also built a GUI to help
visualize STAT's outputted
call prefix trees. This GUI allows for manipulation of the call
prefix tree to the user focus on particular paths and tasks of
interest. It can also be used to identify various equivalence
classes, which can be used to launch a heavyweight debugger.
Figure 4: A screenshot of the
STAT call prefix tree visualizer.
References
Gregory L. Lee, Dorian C.
Arnold,
Dong H. Ahn, Bronis R. de
Supinski, Matthew Legendre, Barton P. Miller, Martin Schulz, and Ben
Liblit, “
Lessons
Learned at 208K: Towards Debugging Millions of Cores”,
SuperComputing 2008,
Austin,
Texas, November
2008.
Gregory L. Lee, Dorian C.
Arnold,
Dong H. Ahn, Bronis R. de
Supinski, Barton P. Miller, and Martin Schulz, “
Benchmarking
the Stack
Trace Analysis Tool for BlueGene/L”,
International
Conference on Parallel Computing (ParCo),
Aachen and
Jülich, Germany, September
2007.
Dorian C. Arnold, Dong H. Ahn, Bronis R. de
Supinski, Gregory L. Lee, Barton P. Miller, and Martin Schulz, "
Stack
Trace Analysis for Large Scale Applications",
International
Parallel & Distributed Processing Symposium, Long Beach,
California, March 2007.
Contacts
LLNL: Dong Ahn, Bronis de Supinski, Greg Lee,
Martin Schulz.
[ahn1, bronis, lee218,
schulz6]
@llnl.gov
UW: Matthew Legendre, Barton Miller, Ben Liblit.
[
legendre, bart, liblit] @cs.wisc.edu
UNM: Dorian Arnold. [darnold] @cs.unm.edu
LLNL-MI-408819