Performance, Portability, and Productivity for Deep Learning Applications on Multi- and Many-Core Architectures (PPP-DL)

Basic data for this project

Type of project: Individual project
Duration at the University of Münster: 01/03/2022 - 28/02/2025 | 1st Funding period

Description

Deep Learning (DL) is currently the most popular machine-learning method and solves a great variety of real-world problems in academia and industry. The success of DL applications critically depends on the quality of the software that implements DL algorithms for modern parallel architectures such as multi-core CPUs, Graphics Processing Units (GPUs), and Field-Programmable Gate Arrays (FPGAs). State-of-the-art DL frameworks like TensorFlow and PyTorch rely for high performance on general-purpose libraries provided by vendors such as Intel or NVIDIA, which causes major weaknesses in three fundamental respects:

i) suboptimal performance: many DL-specific optimizations are not applicable because the libraries are designed for general-purpose usage;
ii) lack of both functional and performance portability: the libraries are designed and optimized only for the architectures of particular vendors;
iii) restricted user productivity: the libraries are limited to a fixed set of pre-implemented algorithms (e.g., matrix multiplication and convolutions), and integrating high-performance libraries into DL frameworks is cumbersome.

This project will develop a novel, holistic approach to automatic code generation and optimization for DL applications targeting modern parallel architectures. Its overall goal is to address, in one combined approach, three major research challenges in the area of high-performance computing for DL: Performance, Portability, and Productivity (PPP). We plan to achieve this goal through the following new contributions:

1) a new algebraic formalism and a formalism-based Domain-Specific Language (DSL) for conveniently expressing and implementing established and emerging DL applications at a high level of abstraction, thereby contributing to programmer productivity;
2) a uniform low-level programming model for DL applications that enables functional portability by being straightforwardly lowerable to executable code in state-of-practice parallel programming approaches such as OpenMP, CUDA, and OpenCL;
3) a code generation mechanism for our DSL that achieves high, portable performance across various architectures and input/output characteristics by automatically generating auto-tunable code in our low-level programming model;
4) a systematic process, based on the emerging MLIR framework, for integrating our code generation mechanism into modern DL frameworks;
5) a new auto-tuning system that fully automatically optimizes our generated code via combined numerical search techniques;
6) a new analytical cost model that predicts, for different architectures, the run time of DL applications expressed in our DSL, in order to accelerate the auto-tuning process.

We will experimentally compare our approach in terms of all three criteria (performance, portability, and productivity) to state-of-the-art approaches for a broad range of DL applications, parallel architectures, and real-world DL data sets.
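To make the division of labor among contributions 1), 3), and 5) more concrete, the following minimal, self-contained Python sketch illustrates the general idea of separating a high-level, architecture-agnostic specification from a parameterized low-level implementation that is then tuned by numerical search. All names here (matmul_highlevel, matmul_tiled, autotune) and the exhaustive tile-size search are illustrative assumptions for this sketch only; they are not the project's actual DSL, code generator, or auto-tuning system.

# Illustrative sketch only: a toy version of "specify once, generate a tunable
# implementation, then auto-tune". Not the project's real DSL or auto-tuner.
import random
import time

def matmul_highlevel(A, B):
    """High-level, architecture-agnostic specification of C = A * B."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(A, B, tile_i, tile_j):
    """One point in the implementation space: a tiled loop nest whose
    tile sizes are left open as tuning parameters."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile_i):
        for jj in range(0, m, tile_j):
            for i in range(ii, min(ii + tile_i, n)):
                for j in range(jj, min(jj + tile_j, m)):
                    acc = 0.0
                    for p in range(k):
                        acc += A[i][p] * B[p][j]
                    C[i][j] = acc
    return C

def autotune(A, B, candidate_tiles):
    """Toy auto-tuner: exhaustive search over a small parameter space;
    a real auto-tuner would use combined numerical search techniques."""
    best = None
    for ti, tj in candidate_tiles:
        start = time.perf_counter()
        matmul_tiled(A, B, ti, tj)
        elapsed = time.perf_counter() - start
        if best is None or elapsed < best[0]:
            best = (elapsed, ti, tj)
    return best

if __name__ == "__main__":
    n = 64
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    reference = matmul_highlevel(A, B)               # specification
    best_time, ti, tj = autotune(A, B, [(8, 8), (16, 16), (32, 32)])
    tuned = matmul_tiled(A, B, ti, tj)               # tuned implementation
    assert all(abs(tuned[i][j] - reference[i][j]) < 1e-9
               for i in range(n) for j in range(n))
    print(f"best tile sizes: {ti}x{tj} ({best_time:.4f}s)")

The sketch loosely mirrors the project's intended structure: the specification corresponds to a program written in the high-level DSL, the tiled loop nest stands in for auto-tunable code generated in the low-level programming model, and the search loop stands in for the auto-tuning system, which the project additionally plans to accelerate with an analytical cost model.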

Keywords: Deep Learning; Computer Science
DFG-Gepris-ID: https://gepris.dfg.de/gepris/projekt/470527619
Funding identifier: GO 756/8-1 | DFG project number: 470527619
Funder / funding scheme
  • DFG - Individual Grants Programme

Project management at the University of Münster

Gorlatch, Sergei
Professorship of Practical Computer Science (Prof. Gorlatch)

Applicants from the University of Münster

Gorlatch, Sergei
Professorship of Practical Computer Science (Prof. Gorlatch)

Project partners outside the University of Münster

  • Technische Universität Berlin (TU Berlin), Germany
  • The University of Edinburgh, United Kingdom
  • NVIDIA, Canada
  • Google Research, France