
[Colloquium] SCR: The Scalable Checkpoint/Restart Library

January 26, 2012

Watch Colloquium: M4V file (661 MB)

  • Date: Thursday, January 26, 2012 
  • Time: 11:00 am — 12:15 pm 
  • Place: Mechanical Engineering 218

Kathryn Mohror
Lawrence Livermore National Lab

Applications running on high-performance computing systems can encounter mean times between failures on the order of hours or days. Commonly, applications tolerate failures by periodically saving their state to checkpoint files on reliable storage, typically a parallel file system. Writing these checkpoints can be expensive at large scale, taking tens of minutes to complete. To address this problem, we developed the Scalable Checkpoint/Restart library (SCR). SCR is a multi-level checkpointing library; it checkpoints to storage on the compute nodes in addition to the parallel file system. Through experiments and modeling, we show that multi-level checkpointing benefits existing systems, and we find that the benefits increase on larger systems. In particular, we developed low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of our system failures. Our approach improves machine efficiency by up to 35%, while reducing the load on the parallel file system by a factor of two.
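The multi-level idea in the abstract can be illustrated with a toy sketch. This is not the SCR API: the class `TwoLevelCheckpointer`, its `flush_every` parameter, and the two directories standing in for node-local storage and the parallel file system are all hypothetical names chosen for illustration. It assumes a single process and a state object small enough to pickle, and it only shows the pattern of checkpointing frequently to cheap storage and occasionally to reliable storage.

```python
import os
import pickle
import tempfile


class TwoLevelCheckpointer:
    """Toy two-level checkpointer (hypothetical; not the SCR API)."""

    def __init__(self, local_dir, pfs_dir, flush_every=4):
        self.local_dir = local_dir      # stands in for fast node-local storage
        self.pfs_dir = pfs_dir          # stands in for the parallel file system
        self.flush_every = flush_every  # every Nth checkpoint also goes to pfs_dir
        self.count = 0

    def _write(self, directory, state):
        # Write under a temporary name, then rename, so a crash mid-write
        # never leaves a truncated file under the final name.
        path = os.path.join(directory, "ckpt.pkl")
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, path)

    def save(self, state):
        # Every checkpoint goes to the cheap level; only some reach the slow one.
        self.count += 1
        self._write(self.local_dir, state)
        if self.count % self.flush_every == 0:
            self._write(self.pfs_dir, state)

    def restore(self, local_lost=False):
        # Prefer the recent local copy; fall back to the older copy on the
        # slow level when the local storage is assumed to be gone.
        dirs = [self.pfs_dir] if local_lost else [self.local_dir, self.pfs_dir]
        for d in dirs:
            path = os.path.join(d, "ckpt.pkl")
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
        return None


def demo():
    # Ten checkpoints, flushing every fourth one to the slow level.
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as pfs:
        ckpt = TwoLevelCheckpointer(local, pfs, flush_every=4)
        for step in range(1, 11):
            ckpt.save({"step": step})
        recent = ckpt.restore()["step"]
        fallback = ckpt.restore(local_lost=True)["step"]
        return recent, fallback


if __name__ == "__main__":
    print(demo())  # (10, 8)
```

In this sketch, a failure that leaves local storage intact restarts from step 10, while a failure that destroys it falls back to step 8, the last checkpoint flushed to the slow level. The trade-off is that most checkpoints never touch the slow level, at the price of losing a little more work in the less common severe failure.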

 

Bio: Kathryn Mohror is a Postdoctoral Research Staff Member at the Center for Applied Scientific Computing (CASC) at Lawrence Livermore National Laboratory. Kathryn's research on high-end computing systems is currently focused on scalable fault-tolerant computing and performance measurement and analysis. Her other research interests include scalable automated performance analysis and tuning, parallel file systems, and parallel programming paradigms. Kathryn received her Ph.D. in Computer Science in 2010, an M.S. in Computer Science in 2004, and a B.S. in Chemistry in 1999 from Portland State University in Portland, OR.