REPEAL: Resilience vs. Performance in linear Algebra
|R&D projects, public funding|
|2016 - 2019|
|Research Group Theory and Applications of Algorithms|
Algorithmic fault tolerance is an important open problem for complex numerical algorithms. Substantial advances will have high impact in a wide spectrum of applications – ranging from sensor networks over P2P networks to high performance computing (HPC), since future HPC systems are expected to exhibit much higher fault rates than current systems do. It is an open question how much resilience can be achieved at the algorithmic level and how this influences sustained performance.
The REPEAL project (REsilience vs. PErformance in linear ALgebra) investigates resilient/fault-tolerant parallel algorithms for a range of numerical linear algebra problems. The objectives are to design algorithms which provably produce accurate results (within the limitations of floating-point arithmetic) in the presence of faults, and to gain a better understanding of the resilience-performance trade-off.
We address the following questions: How can existing approaches be improved to handle more general temporal and spatial fault distributions? Which numerical accuracy and sustained performance do resilient algorithms achieve in real computations? Which resilience improvements can be achieved by combining deterministic with randomized approaches? What is the "price" of resilience, i.e., which slow-down has to be expected compared to non-resilient high performance algorithms?
- Wilfried Gansterer
- Markus Levonyak
- Kastor Felsner
- Carlos Andres Pachajoa Mejia
- Elias Wimmer
- Jesper Larsson Träff