Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI

Read our paper, "Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation," on arXiv.org.


Synopsis


In this work, minds.ai, together with the MVAPICH team at Ohio State University led by Prof. DK Panda, carried out an extensive benchmarking and testing study of distributed TensorFlow training methods. The goal was to determine which method performs best on various high-performance computing infrastructures. The systems tested ranged from university clusters to the Piz Daint supercomputer, which comprises over 5000 GPU-powered compute nodes. The results showed that, depending on the workload, the choice of distribution method and communication library matters significantly. The work also identified a number of bottlenecks in MVAPICH, for which improvements were proposed that are now part of the stable MVAPICH distribution.
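The distributed training methods benchmarked in this work are data-parallel: each worker computes gradients on its own mini-batch, and the gradients are then averaged across all workers with an allreduce (the operation CUDA-aware MPI accelerates by exchanging GPU buffers directly). As a rough illustration of that synchronization step, here is a minimal sketch in pure NumPy; it simulates the allreduce averaging across workers in a single process, with no actual MPI or TensorFlow involved, and all function names are our own.

```python
import numpy as np

def allreduce_average(worker_grads):
    """Simulate the allreduce used in data-parallel DNN training.

    Each entry of `worker_grads` stands in for one worker's locally
    computed gradient. A real setup would call e.g. MPI_Allreduce on
    GPU buffers; here we just sum and divide to show the semantics.
    """
    return np.sum(worker_grads, axis=0) / len(worker_grads)

# Two simulated workers, each with a gradient for the same parameters.
grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
avg = allreduce_average(grads)  # every worker ends up with [2.0, 3.0]
```

After the allreduce, every worker applies the same averaged gradient, so the model replicas stay in sync; the benchmarks in the paper compare how efficiently different libraries implement this exchange at scale.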

Get in touch
contact@minds.ai
 
U.S. 
101 Cooper St. 
Santa Cruz, CA 95060
 
India
Minds Artificial Intelligence Technologies Pvt. Ltd.
1st Floor, Anugraha, 174, 19th Main Rd,
Sector 4, HSR Layout,
Bengaluru, Karnataka, 560102
 
Europe
Amsterdam, the Netherlands