Conference paper, Year: 2024

Multi-level Analysis of GPU Utilization in ML Training Workloads

Paul Delestrac
Debjyoti Bhattacharjee
Simei Yang
Diksha Moolchandani
Francky Catthoor
David Novo

Abstract

Training time has become a critical bottleneck due to the recent proliferation of large-parameter ML models. GPUs continue to be the prevailing architecture for training ML models. However, the complex execution flow of ML frameworks makes it difficult to understand GPU computing resource utilization. Our main goal is to provide a better understanding of how efficiently ML training workloads use the computing resources of modern GPUs. To this end, we first describe an ideal reference execution of a GPU-accelerated ML training loop and identify relevant metrics that can be measured using existing profiling tools. Second, we produce a coherent integration of the traces obtained from each profiling tool. Third, we leverage the metrics within our integrated trace to analyze the impact of different software optimizations (e.g., mixed precision, various ML frameworks, and execution modes) on the throughput and the associated utilization at multiple levels of hardware abstraction (i.e., whole GPU, SM subpartitions, issue slots, and tensor cores). In our results on two modern GPUs, we present seven takeaways and show that although close to 100% utilization is generally achieved at the GPU level, average utilization of the issue slots and tensor cores always remains below 50% and 5.2%, respectively.
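To illustrate the kind of workload and framework-level trace the abstract refers to, here is a minimal sketch (not the authors' code): one mixed-precision PyTorch training step wrapped in torch.profiler, which produces a Chrome/Kineto trace that could then be correlated by timestamp with lower-level GPU counters (SM subpartition activity, issue-slot utilization, tensor-core usage) collected separately with a device profiler such as NVIDIA Nsight Compute. The model, batch shapes, and output file name are illustrative assumptions.

```python
# Sketch: profile one mixed-precision training step with torch.profiler.
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

device = "cuda"
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()      # loss scaling for mixed precision
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(256, 1024, device=device)           # synthetic batch
targets = torch.randint(0, 10, (256,), device=device)    # synthetic labels

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():       # FP16 compute, eligible for tensor cores
        loss = criterion(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

# Framework-level view of kernel and operator time; device-level counters
# would come from a separate profiling pass and be merged with this trace.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
prof.export_chrome_trace("train_step_trace.json")
```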
Main file
Delestrac 2024 Multilevel.pdf (908.04 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04523554, version 1 (30-03-2024)

Identifiers

  • HAL Id: hal-04523554, version 1

Cite

Paul Delestrac, Debjyoti Bhattacharjee, Simei Yang, Diksha Moolchandani, Francky Catthoor, et al. Multi-level Analysis of GPU Utilization in ML Training Workloads. 2024 Design, Automation & Test in Europe Conference (DATE 2024), Mar 2024, Valencia, Spain. ⟨hal-04523554⟩
