Increasing GPU Capacity with Composable Hardware

The demand for high-performance computing (HPC) continues to grow, and users are always looking for ways to enhance their systems’ capabilities. While standard servers typically offer up to four GPU slots, the dream of many HPC users is to have more. However, simply adding more GPUs to a single server may result in stranded hardware and decreased efficiency.

Traditionally, each server held fewer resources, so a cluster could be divided at a finer granularity and resources could be matched closely to jobs. As servers have become more powerful, with larger memory and multiple GPUs, sharing those resources has become more complex. A server with four GPUs may be reserved exclusively for GPU jobs and be unavailable to other tasks, leaving memory and CPU cores stranded that could otherwise be put to use. While packing more hardware into a single server reduces overall cost, it may not be optimal for HPC workloads in the long run.
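The arithmetic behind stranded hardware can be sketched in a few lines of Python. The node size and job shape below are hypothetical figures chosen only for illustration, and the `Resources` class and `stranded` function are names invented for this sketch, not part of any scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpus: int
    mem_gb: int
    gpus: int


def stranded(node: Resources, jobs: list[Resources]) -> Resources:
    """Return what is left unused on `node` after `jobs` are placed on it."""
    used_cpus = sum(j.cpus for j in jobs)
    used_mem = sum(j.mem_gb for j in jobs)
    used_gpus = sum(j.gpus for j in jobs)
    if used_cpus > node.cpus or used_mem > node.mem_gb or used_gpus > node.gpus:
        raise ValueError("jobs do not fit on this node")
    return Resources(node.cpus - used_cpus, node.mem_gb - used_mem, node.gpus - used_gpus)


# Hypothetical four-GPU server and one GPU job that claims all four GPUs.
node = Resources(cpus=64, mem_gb=512, gpus=4)
gpu_job = Resources(cpus=8, mem_gb=64, gpus=4)

left = stranded(node, [gpu_job])
print(left)  # Resources(cpus=56, mem_gb=448, gpus=0)
```

In this made-up case the GPUs are fully booked, yet 56 cores and 448 GB of memory go unused if the node is reserved for GPU work only.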

To address this issue, the Compute Express Link™ (CXL™) standard was established. CXL is an industry-supported Cache-Coherent Interconnect for Processors, Memory Expansion, and Accelerators. It enables memory coherency between the CPU memory space and attached devices, allowing for resource sharing and improved performance.

GigaIO, one of the companies working in this area, offers a single-node supercomputer, the SuperNODE, that supports up to 32 GPUs. It is built on GigaIO’s FabreX™ technology, a PCIe network that creates a dynamic memory fabric so that resources can be allocated in a composable manner. Unlike other systems, GigaIO’s solution makes all 32 GPUs fully usable and addressable by a single host system, without partitioning them across server nodes.

GigaIO’s SuperNODE, powered by 32 AMD Instinct MI210 accelerators, scales better than the traditional approach of spreading GPUs across multiple nodes. Because all GPUs sit behind one host, it eliminates the need for complex MPI communication between nodes and achieves nearly linear scaling for GPU-intensive workloads. Benchmarks with Hashcat and ResNet-50 show that the SuperNODE’s performance scales well as the number of GPUs increases.
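"Nearly linear scaling" can be made concrete with the usual scaling-efficiency ratio: measured throughput on n GPUs divided by n times the single-GPU throughput. The following is a minimal sketch; the function name and the throughput figures are hypothetical and are not taken from GigaIO's benchmark results.

```python
def scaling_efficiency(throughput_1: float, throughput_n: float, n: int) -> float:
    """Fraction of ideal linear scaling achieved on n GPUs (1.0 means perfectly linear)."""
    if n < 1 or throughput_1 <= 0:
        raise ValueError("need n >= 1 and a positive single-GPU throughput")
    return throughput_n / (n * throughput_1)


# Hypothetical numbers: 100 items/s on 1 GPU, 3000 items/s on 32 GPUs.
eff = scaling_efficiency(100.0, 3000.0, 32)
print(eff)  # 0.9375
```

A value close to 1.0 means adding GPUs keeps paying off; a value that drops as n grows indicates that communication or host overhead is starting to dominate.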

The SuperNODE has also been used successfully for computational fluid dynamics (CFD) simulations. Dr. Moritz Lehmann ran a large-scale simulation of the Concorde on 32 GPUs on the SuperNODE, completing it in 33 hours. The result showcases the SuperNODE’s ability to handle computationally demanding tasks.

GigaIO’s SuperNODE is a hardware-agnostic solution that supports various accelerator technologies, such as GPUs and FPGAs. It simplifies deployment in large-scale GPU environments and works out of the box with popular frameworks such as TensorFlow and PyTorch.

In conclusion, GigaIO’s SuperNODE, built on the FabreX PCIe memory fabric, offers a way to increase GPU capacity in HPC systems beyond what a standard server allows. It addresses the issue of stranded hardware and enables efficient resource sharing, ultimately improving performance and reducing overall system costs.
