We have extended the
Stack
Trace Analysis Tool (STAT) to determine
temporal orders of MPI tasks. Our evaluation demonstrates that this
combined static and dynamic analysis technique addresses two key
challenges for contemporary large scale debugging tools: refining task
behavior equivalence classes beyond simple metrics such as similar call
path profiles and, identifying those tasks in which errors most likely
originated. Our evaluation shows that this temporal order analysis
technique isolates bugs in the NPB BT with randomly injected faults as
well as a real world hang case with AMG2006. Our performance
evaluation verifies that our approach provides such a fine-grain
analysis domain to STAT with only moderate performance overheads.
Case
Study on AMG2006
This case study describes our experience applying the prototype to
Algebraic MultiGrid (AMG) 2006. AMG2006 is a scalable iterative solver
and preconditioner for solving large unstructured sparse linear systems
that arise in a wide range of science and engineering
applications. In preparation for extending it to unprecedented numbers
of
multicore compute nodes, the user was testing performance enhancing
code modifications at increasingly large scales when a hang occurred at
4,096 tasks.
In order to diagnose this issue, we first examined the first level of
detail with STAT, which merged stack traces based on the function
names,
which indicated that the hang occurred in the preconditioner setup
phase during creation of the mesh hierarchy. However, no equivalence
class was clearly the cause of the hang. Thus, we examined the line
number-based, chronology-unaware tree and began the adaptive refinement
process. As figure 1 shows, this refinement quickly identified a group
of twelve tasks that had ceased progressing due to a type coercion
error at a function parameter in the
big_insert_new_nodes function.
Figure 1: Chronology-aware
prefix tree for a code hang exhibited by AMG2006 at 4,096 MPI tasks
At the first refinement step, our method evaluated all frames leading
up to the
hypre_BoomerAMGSetup
function and found that they were temporally equal. During this
evaluation, we found one active loop, the while loop that tests the
completion of the coarsening process within hypre_BoomerAMGSetup. Our
novel Loop Order Variable (LOV) analysis identified the
level variable, which keeps track
of multigrid levels. We found that its values were four in all 4,096
tasks.
The next refinement determined that a group of twelve tasks had made
the least progress as indicated by edges in blue in Figure 1.
Because the next refinement step for these twelve tasks found all
frames preceding the next branching point to be equal and the branching
point was in the MPI layer, we manually inspected the code for relevant
execution flows in and around the
hypre_BoomerAMGBuildExtPIInterp
->
big_insert_new_nodes
->
hypre_ParCSRCommHandleDestroy
path. We quickly found a type coercion problem for int offset, a
function parameter of
big_insert_new_nodes.
The
application team recently widened key integer variables to 64 bit to
support matrix indices that grow with scale. However, they overlooked
the definition of this function, causing the type coercion. We theorize
that at this particular scale and input, the 64-bit integers were
truncated when coerced into 32-bits during parameter passing for the
twelve tasks, which in turn caused the tasks to send corrupted MPI
messages. Ultimately, this incorrect communication caused these tasks
to hang in the
MPI_Waitall
call. Once the team corrected this error, they confirmed correct
execution at this scale.
Dong H. Ahn, Bronis R. de Supinski, Ignacio Laguna, Gregory L. Lee, Ben
Liblit, Barton P. Miller, and Martin Schulz,
, Portland, OR, November 2009.
LLNL: Dong Ahn, Bronis de Supinski, Greg Lee,
Martin Schulz.
UW: Matthew Legendre, Barton Miller, Ben Liblit.