[Colloquium] Fault Aware MPI: A transaction-oriented approach to Resilient MPI-3 and MPI-4

April 7, 2015

  • Date: Tuesday, 4/7/15
  • Time: 11:00 AM - 12:15 PM
  • Place: Mechanical Engineering, Room 218


Anthony Skjellum,
Auburn University

Title: Fault Aware MPI: A transaction-oriented approach to Resilient MPI-3 and MPI-4
Anthony Skjellum (speaker, Auburn), Amin Hassani (UAB), Ron Brightwell (Sandia), Purushotham Bangalore (UAB)

Faults are prevalent in large-scale, long-running parallel computers and programs. Hardware and other faults in large petascale and future exascale systems place an increasing burden on MPI programs, which nominally fail as soon as the first process, processor, network, or other fault occurs and becomes active. Several efforts in the late 1990s and early 2000s failed to produce an effective solution for a fault-aware, resilient, or fault-tolerant MPI model. In the absence of such fault tolerance, fail-stop programs can still make progress through progressive checkpoint/restart, but that may be extremely inefficient at scale. From the systems of the 1990s through today, checkpoint/restart has been the primary solution, but it is not the only possibility.

After many years of relative dormancy on this topic, two alternatives have been initiated and expanded: User Level Fault Mitigation (started about 2009-2010) and FA-MPI. These two approaches offer largely complementary strategies for detecting, isolating, and mitigating faults in MPI programs and for recovering from them. In this talk, we focus on FA-MPI and explain our progress in API design, in the semantics of parallel programming with fault-aware MPI extensions, and in our practical implementation efforts over Open MPI. FA-MPI emphasizes the subset of MPI that we also consider most relevant to exascale: the non-blocking APIs, which are also growing in the MPI-3 and MPI-4 standards. We mention our work on porting specific compact applications to FA-MPI, and our emphasis on building community experience and practice before standardizing a solution to faults in MPI parallel programs.

This Auburn-UAB-Sandia collaborative project involves the definition of Fault Aware MPI (FA-MPI), which is the current PhD work of Amin Hassani at UAB.

Brief Bio

Tony Skjellum received his BS, MS, and PhD degrees from Caltech. His PhD work emphasized portable, parallel software for simulation, with a specific emphasis on message-passing systems. After graduating in 1990, he worked at LLNL for 2.5 years as a computer scientist emphasizing performance-portable message passing and portable parallel math libraries. From 1993 to 2003, he was on the faculty at Mississippi State University, where he and his students co-developed MPICH, the first implementation of the MPI-1 standard, with Argonne National Laboratory. Skjellum was also a leading participant in the MPI-1 and MPI-2 standards, with specific contributions to the concepts of "groups, contexts, and communicators," which stemmed from his PhD research. From 2003 to 2013, he was professor and chair of the Dept. of Computer and Information Sciences at the University of Alabama at Birmingham, where he continued work on high performance computing and cyber. In July 2014, he became the Lead Cyber Scientist for Auburn University and Cyber Center director. He currently leads R&D in HPC and cyber in the college of engineering at Auburn University. Skjellum's current research group is split between cyber/Internet of Things and High Performance Computing and Exascale Storage. FA-MPI is Skjellum's second implementation of a resilient MPI; he, his students, and his then company, MPI Software Technology, previously designed and published MPI/FT, a fault-aware MPI based on MPI/Pro, a commercial MPI licensed from the mid-1990s through the mid-2000s. Skjellum has funding from DOE, NSF, and DOD. He is a senior member of ACM and IEEE, and an Associate Member of the American Academy of Forensic Science (AAFS), Digital & Multimedia Sciences Division. He remains active in the MPI Forum and is also co-chair of the OMG High Performance Embedded Working Group.