On the Safe Deployment of Matrix Multiplication in Massively Parallel Safety-Related Systems

Javier Fernández; Jon Perez-Cerrolaza; Irune Agirre; Alejandro J. Calderon; Jaume Abella; Francisco J. Cazorla

Deep learning technology has enabled the development of increasingly complex safety-related autonomous systems using high-performance computers, such as graphics processing units (GPUs), which provide the high computing performance required to execute parallel computing algorithms, such as matrix–matrix multiplications (a central computing element of deep learning software libraries). However, the safety certification of parallel computing software algorithms and GPU-based safety-related systems remains a challenge, for example, achieving the required fault tolerance and diagnostic coverage for random hardware errors. This paper contributes a safe matrix–matrix multiplication software implementation for GPUs with detection capabilities for random hardware errors (permanent and transient) that can be used with different architectural patterns for fault tolerance, and which serves as a foundation for the implementation of safe deep learning libraries for GPUs. The proposed contribution is complementary to, and can be combined with, other techniques, such as algorithm-based fault tolerance. In particular, (i) we provide the high-performance matrix multiplication CUTLASS library with a catalog of diagnostic mechanisms to detect random hardware errors down to the arithmetic operation level; and (ii) we measure the performance impact incurred by the adoption of these mechanisms and their achievable diagnostic coverage with a set of representative matrix dimensions. To that end, we implement these algebraic operations targeting CUDA cores with single-instruction, multiple-thread (SIMT) math instructions in an NVIDIA Xavier NX GPU.
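The abstract mentions algorithm-based fault tolerance (ABFT) as a technique that can be combined with the proposed diagnostics. As a rough illustration of that general idea only (not the paper's CUTLASS mechanisms, which operate at the arithmetic operation level on the GPU), the sketch below checks a matrix product against row and column checksums in plain Python. The function names `matmul` and `checksum_mismatches` and the tolerance `tol` are hypothetical choices made for this example.

```python
def matmul(a, b):
    """Plain triple-loop product C = A * B for lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def checksum_mismatches(a, b, c, tol=1e-9):
    """Return (bad_rows, bad_cols): rows and columns of c whose sums
    disagree with the checksums predicted from a and b alone.

    tol is an assumed absolute tolerance; real floating-point data
    would need a tolerance scaled to the magnitude of the operands.
    """
    n, k, m = len(a), len(b), len(b[0])

    # Row checksums: C * 1 must equal A * (B * 1).
    b_row_sums = [sum(row) for row in b]
    expected_row = [sum(a[i][p] * b_row_sums[p] for p in range(k))
                    for i in range(n)]

    # Column checksums: 1^T * C must equal (1^T * A) * B.
    a_col_sums = [sum(a[i][p] for i in range(n)) for p in range(k)]
    expected_col = [sum(a_col_sums[p] * b[p][j] for p in range(k))
                    for j in range(m)]

    bad_rows = [i for i in range(n)
                if abs(sum(c[i]) - expected_row[i]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c[i][j] for i in range(n)) - expected_col[j]) > tol]
    return bad_rows, bad_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)
clean = checksum_mismatches(a, b, c)

# Emulate a single corrupted output element (a stand-in for a hardware error).
c[1][0] += 4
faulty = checksum_mismatches(a, b, c)
```

A single corrupted element shows up as one mismatched row and one mismatched column, whose intersection locates it. Checksums of this kind cost far less than recomputing the product, but they only observe the final result, which is one reason they are complementary to diagnostics applied at the level of individual arithmetic operations.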

AGROVOC Keywords

safety

Bibliographic information

Publisher

Multidisciplinary Digital Publishing Institute

Pagination

p.-

Other Subjects

Cnn; Fault detection; Reliability; Gpu; Matrix multiplication

Language

English

Note

Source Identifier: oai:mdpi.com:1999-4907/13/4/590/; . setSpec: Article;

Type

Journal Article

In AGRIS since: 2022-05-15

Format: AGRIS AP

Data Provider

This bibliographic record has been provided by Multidisciplinary Digital Publishing Institute

Links

DOI https://www.mdpi.com/2076-3417/12/8/3779/pdf

AGRIS - International System for Agricultural Science and Technology
