A high-level C++ approach to manage local errors, asynchrony and faults in an MPI application

Engwer, Christian; Altenbernd, Mirco; Dreier, Nils-Arne; Göddeke, Dominik

Research article in edited proceedings (conference) | Peer reviewed

Abstract

C++ advocates exceptions as the preferred way to handle unexpected behaviour of an implementation in the code. This does not integrate well with the error handling of MPI, where a failure almost always results in program termination. In particular, a local C++ exception can currently lead to a deadlock due to unfinished communication requests on remote hosts. At the same time, future MPI implementations are expected to include an API to continue computations even after a hard fault (node loss), i.e. the worst possible unexpected behaviour. In this paper we present an approach that adds extended exception propagation support to C++ MPI programs. Our technique allows local exceptions to be propagated to remote hosts to avoid deadlocks, and MPI failures on remote hosts to be mapped to local exceptions. Use cases of particular interest are asynchronous 'local failure local recovery' resilience approaches. Our prototype implementation uses MPI-3.0 features only. In addition, we present a dedicated implementation that integrates seamlessly with MPI-ULFM, i.e. the most prominent proposal for extending MPI towards fault tolerance. Our implementation is available at https://gitlab.dune-project.org/christi/test-mpi-exceptions.
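The core idea of the abstract, turning a local exception into a notification that the waiting peer can rethrow, can be illustrated with a small sketch. This is not the authors' implementation and it uses no MPI calls: the in-memory `Channel` stands in for a point-to-point link between two ranks, and `Message`, `RemoteError`, `compute_and_send` and `receive` are hypothetical names chosen for illustration only.

```cpp
#include <deque>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical message type: carries either a payload or an error notification.
struct Message {
    bool failed;         // true if the sender's computation threw
    int payload;         // valid only when failed == false
    std::string reason;  // what() of the sender's exception otherwise
};

// In-memory stand-in for a link between two ranks.
// Assumption: a real program would use MPI communication here.
using Channel = std::deque<Message>;

// Raised on the receiving side when the peer reported a failure.
struct RemoteError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Sender side: always post a message, even if the local computation throws,
// so that the peer is never left waiting for data that will not arrive.
void compute_and_send(Channel& link, const std::function<int()>& compute) {
    try {
        link.push_back(Message{false, compute(), ""});
    } catch (const std::exception& e) {
        link.push_back(Message{true, 0, e.what()});
        throw;  // local handlers still see the original exception
    }
}

// Receiver side: map a remote failure notification to a local exception.
int receive(Channel& link) {
    if (link.empty()) {
        // In real MPI a blocking receive would hang here; the simulation
        // reports this "would deadlock" situation as a logic error instead.
        throw std::logic_error("no message posted");
    }
    Message msg = link.front();
    link.pop_front();
    if (msg.failed) {
        throw RemoteError("peer failed: " + msg.reason);
    }
    return msg.payload;
}
```

In this sketch, a sender whose computation throws still posts a failure message before rethrowing, so a later `receive` on the same channel raises `RemoteError` rather than waiting indefinitely. Hard faults such as node loss are outside what this simulation can show.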

Publication details

Status: Published
Publication year: 2018
Language of publication: English
Conference: 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing, Cambridge, United Kingdom
DOI: 10.1109/PDP2018.2018.00117
Link to full text: https://arxiv.org/abs/1804.04481v2
Keywords: C++; ULFM; Exceptions; Fault-Tolerance

Authors from the University of Münster

Dreier, Nils-Arne
Professorship for Applications of Partial Differential Equations (Prof. Engwer)