Using a Heterogeneous Multi-GPU Cluster to Support Exploration at Scale

David Kaeli, Yanzhi Wang, Xue Lin, and Devesh Tiwari

ECE Professor David Kaeli and Assistant Professors Yanzhi Wang, Xue Lin, and Devesh Tiwari were awarded a $570K NSF grant for the “Acquisition of a Heterogeneous Multi-GPU Cluster to Support Exploration at Scale.”


Abstract Source: NSF

This project aims to acquire a heterogeneous multi-GPU cluster built from state-of-the-art GPU devices, interconnected with emerging NVLink and HDR networks, equipped with network-attached non-volatile memory (NVM) storage for GPU caching, and connected through a smart HDR InfiniBand switch. The cluster will enable, accelerate, and support the exploration of applications at scale from domains that include:

  • Distributed deep neural networks for retinopathy,
  • Wireless network forensics,
  • Adversarial machine learning,
  • Computational social science,
  • Mathematical optimization and big data analytics,
  • Coastal engineering modeling, and
  • Multi-GPU systems (including NVMe technology to support caching in the GPU network and a smart network switch that can offload collective operations).

These features will enable computational scientists to exploit GPU parallelism in new ways by programming the smart network switch and caching selectively to hide memory and interconnect latency.

Currently, graphics processing units (GPUs) provide high computational throughput by launching a large number of threads and overlapping compute and memory operations. Combined with low-overhead thread swapping, this allows GPUs to hide long memory operations. But underlying system architectures have not kept up as the size and complexity of GPU applications grow. Compared with multi-CPU systems, multi-GPU solutions have weaker architectural support, which makes them less programmer-friendly and less scalable. Current GPU systems treat GPUs as discrete devices, with limited support for a truly shared-memory programming model. Since multi-GPU interconnect bandwidth has become a limiting factor for scaling multi-GPU systems, it is necessary to explore new network topologies, smarter network elements, and enhanced software layers for caching and prefetching that meet the needs of tomorrow’s demanding data applications.

Related Faculty: David Kaeli, Yanzhi Wang, Xue "Shelley" Lin, Devesh Tiwari

Related Departments: Electrical & Computer Engineering