Scalable Temporal Order Analysis
for Large Scale Debugging



Description | Case Study | References | Contacts



Description

We developed a scalable temporal order analysis technique that supports debugging of large scale applications by classifying MPI tasks based on their logical program execution order. Our approach combines static analysis techniques with dynamic analysis to determine this temporal order scalably. It uses scalable stack trace analysis techniques to guide selection of critical program execution points in anomalous application runs. Our novel temporal ordering engine, built on the Rose compiler, then leverages this information along with the application’s static control structure to apply data flow analysis techniques to determine key application data such as loop control variables in exactly the program regions of interest. We then use lightweight techniques to gather the dynamic data that determines the temporal order of the MPI tasks when used as the basis of a lexicographical ordering.

We have extended the Stack Trace Analysis Tool (STAT) to determine temporal orders of MPI tasks. Our evaluation demonstrates that this combined static and dynamic analysis technique addresses two key challenges for contemporary large scale debugging tools: refining task behavior equivalence classes beyond simple metrics such as similar call path profiles and, identifying those tasks in which errors most likely originated. Our evaluation shows that this temporal order analysis technique isolates bugs in the NPB BT with randomly injected faults as well as a real world hang case with AMG2006. Our performance evaluation verifies that our approach provides such a fine-grain analysis domain to STAT with only moderate performance overheads.

Case Study on AMG2006

This case study describes our experience applying the prototype to Algebraic MultiGrid (AMG) 2006. AMG2006 is a scalable iterative solver and preconditioner for solving large unstructured sparse linear systems that arise in a wide range of science and engineering applications. In preparation for extending it to unprecedented numbers of multicore compute nodes, the user was testing performance enhancing code modifications at increasingly large scales when a hang occurred at 4,096 tasks.

In order to diagnose this issue, we first examined the first level of detail with STAT, which merged stack traces based on the function names, which indicated that the hang occurred in the preconditioner setup phase during creation of the mesh hierarchy. However, no equivalence class was clearly the cause of the hang. Thus, we examined the line number-based, chronology-unaware tree and began the adaptive refinement process. As figure 1 shows, this refinement quickly identified a group of twelve tasks that had ceased progressing due to a type coercion error at a function parameter in the big_insert_new_nodes function.


Figure 1: Chronology-aware prefix tree for a code hang exhibited by AMG2006 at 4,096 MPI tasks

At the first refinement step, our method evaluated all frames leading up to the hypre_BoomerAMGSetup function and found that they were temporally equal. During this evaluation, we found one active loop, the while loop that tests the completion of the coarsening process within hypre_BoomerAMGSetup. Our novel Loop Order Variable (LOV) analysis identified the level variable, which keeps track of multigrid levels. We found that its values were four in all 4,096 tasks.

The next refinement determined that a group of twelve tasks had made the least progress as indicated by edges in blue in Figure 1. Because the next refinement step for these twelve tasks found all frames preceding the next branching point to be equal and the branching point was in the MPI layer, we manually inspected the code for relevant execution flows in and around the hypre_BoomerAMGBuildExtPIInterp -> big_insert_new_nodes -> hypre_ParCSRCommHandleDestroy path. We quickly found a type coercion problem for int offset, a function parameter of big_insert_new_nodes. The application team recently widened key integer variables to 64 bit to support matrix indices that grow with scale. However, they overlooked the definition of this function, causing the type coercion. We theorize that at this particular scale and input, the 64-bit integers were truncated when coerced into 32-bits during parameter passing for the twelve tasks, which in turn caused the tasks to send corrupted MPI messages. Ultimately, this incorrect communication caused these tasks to hang in the MPI_Waitall call. Once the team corrected this error, they confirmed correct execution at this scale.

References

Dong H. Ahn, Bronis R. de Supinski, Ignacio Laguna, Gregory L. Lee, Ben Liblit, Barton P. Miller, and Martin Schulz, "Scalable Temporal Order Analysis for Large Scale Debugging", Supercomputing 2009, Portland, OR, November 2009.

Contacts


LLNL: Dong Ahn, Bronis de Supinski, Greg Lee, Martin Schulz.  [ahn1, bronis, lee218, schulz6] @llnl.gov

UW: Matthew Legendre, Barton Miller, Ben Liblit.  [legendre, bart, liblit] @cs.wisc.edu



LLNL-MI-419806
Last Modified November 9, 2009