Fault Tolerance and Robustness

Robust-first Computing

Efficiency costs robustness. For the safety of society and to let us build really big computers, we should put robustness first, ahead even of strict correctness and maximum efficiency. Robust-first computing embodies this principle across the entire computational stack.
UNM Investigators: David Ackley and Lance Williams

Robust Communication and Computation

Secure and robust multiparty computation and communication in networks with adversarial nodes are important to large-scale systems. This work addresses resource-efficient and cost-competitive algorithms in these contexts.
UNM Investigator: Jared Saia
Collaborators: Drexel U., U. of Michigan, U. of Victoria

Fault-tolerance for HPC Systems

To address the challenges of running applications on next-generation, large-scale, error-prone systems, we use modeling, simulation, and real frameworks to understand the impact of different resilience mechanisms on application performance.
UNM Investigator: Dorian Arnold
Collaborators: Sandia National Labs