Why Investing In Hardware Is Vital For Advancing Generative AI
Marc Bolitho is the CEO of Recogni, developer of AI-based inference processing solutions for Gen AI.
A July 2023 report from AI researcher Konstantin Pilz suggests that of the 10,000 to 30,000 data centers currently operating worldwide, only between 335 and 1,325 have the capacity to train and host large language models (LLMs). Although such facilities are still few in number, generative AI models and the data centers that house them accounted for nearly 2% of global energy demand in 2022, an amount the International Energy Agency expects to double by 2026.
AI isn’t just outpacing our capacity to power it; the generative technology is also straining the hardware stack that runs it. From logic and processors to memory and interconnect bandwidth, the performance of AI’s base technology layer isn’t keeping up with the computing requirements of large-scale models, and without additional investment in hardware, specifically memory capacity and interconnect bandwidth, efforts to scale generative AI could stall.
“We believe the world needs more AI infrastructure—fab capacity, energy, data centers, etc.—than people are currently planning to build,” Sam Altman, CEO of OpenAI, wrote in a February Twitter post. “Building massive-scale AI infrastructure, and a resilient supply chain, is crucial to economic competitiveness.”
The Current Limitations Of Compute And Connection
The generative AI applications and features that end users interact with are built on several distinct layers of technology, with compute hardware (the processors, memory and interconnects) as the base layer supporting the software running atop it. Today, the GPU serves as the core of the compute hardware stack.
GPUs, originally developed for image processing and graphics, are the main processors used for the parallel computing tasks AI requires, and their performance is typically quoted in peak FLOPS (floating-point operations per second). That peak denotes the chip’s theoretical maximum; effective throughput reflects its real-world performance, which is degraded by practical constraints imposed by its architecture, memory and interconnect performance, and even the type of software running on it.
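To make that distinction concrete, here is a minimal sketch in Python; both figures are illustrative assumptions for the example, not vendor specifications.

# Illustrative sketch: peak vs. effective throughput.
# Both numbers below are assumptions for the example, not vendor specs.
PEAK_TFLOPS = 312.0   # advertised theoretical maximum for a data-center GPU
utilization = 0.35    # assumed fraction the workload actually achieves

effective_tflops = PEAK_TFLOPS * utilization
print(f"Effective throughput: {effective_tflops:.0f} of {PEAK_TFLOPS:.0f} peak TFLOPS")

In practice, the gap between the two numbers is where architecture, memory and interconnect constraints show up.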
Although GPU performance is improving incrementally, that growth is insufficient to keep up with the ever-increasing size of LLMs. As a 2022 study notes, model growth accelerated drastically between 2018 and 2022, increasing by five orders of magnitude in four years. OpenAI’s GPT-3 model, for example, was released in 2020 with 175 billion parameters. By comparison, GPT-4, released in March 2023, is reported to use roughly 1.7 trillion parameters. Additionally, the current generation of GPUs is optimized for AI training operations, not the ongoing inference operations that are expected to account for a significant portion of the industry’s compute capacity as more models come online and use cases multiply.
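As a rough back-of-the-envelope illustration (assuming 16-bit weights and ignoring activations, key-value caches and optimizer state), the weights alone already dwarf the memory of any single accelerator:

# Back-of-envelope sketch: memory needed just to hold model weights
# at 16-bit precision (2 bytes per parameter). Illustrative only.
BYTES_PER_PARAM = 2

for name, params in [("175B-parameter model", 175e9),
                     ("1.7T-parameter model", 1.7e12)]:
    weight_gb = params * BYTES_PER_PARAM / 1e9
    print(f"{name}: ~{weight_gb:,.0f} GB of weights alone")

At those sizes, a model cannot fit on a single device, which is one reason both training and inference rely on large arrays of processors.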
Memory and interconnect bandwidth are lagging even further behind. Between the A100 and H100 chip generations, for example, peak FLOPS increased by six times, whereas memory bandwidth grew by just 1.65 times. This imbalance in hardware growth leads to the memory wall problem, wherein the processor performs its calculations faster than data can be transferred in and out of memory, resulting in lower system efficiency, increased latency, wasted computing cycles and higher power requirements.
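A simple way to see the problem is the compute-to-bandwidth ratio: the number of floating-point operations a workload must perform per byte it moves in order to keep the processor busy. The sketch below applies the generation-over-generation multiples above to illustrative baseline figures; the absolute numbers are assumptions, but the trend is the point.

# Minimal sketch of the memory-wall arithmetic (illustrative figures).
# A workload stays compute-bound only if it performs at least
# peak_flops / memory_bandwidth operations per byte it moves.

def flops_per_byte_needed(peak_tflops, bandwidth_tb_per_s):
    # Ratio of peak compute to memory bandwidth (the roofline "ridge point").
    return peak_tflops / bandwidth_tb_per_s

old_gen = flops_per_byte_needed(peak_tflops=312.0, bandwidth_tb_per_s=2.0)
new_gen = flops_per_byte_needed(peak_tflops=312.0 * 6, bandwidth_tb_per_s=2.0 * 1.65)

print(f"Older generation: ~{old_gen:.0f} FLOPs needed per byte moved")
print(f"Newer generation: ~{new_gen:.0f} FLOPs needed per byte moved")

Under these assumptions, the bar for keeping the processor fed rises roughly 3.6 times in a single generation, which is precisely the gap that high-bandwidth memory and smarter data movement aim to close.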
That issue is compounded in generative AI applications where multiple processors effectively function as a single unified processor to perform their parallel computing tasks. In my experience, today’s LLMs can require more than 100 arrayed processors for inference operations and potentially more than 10,000 GPUs working in concert for model training. The added strain on the interconnects will only worsen as the number of parameters employed by LLMs continues to balloon.
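To get a feel for why, consider a simplified data-parallel training step in which every GPU must exchange its gradients with the others. The sketch below uses the standard ring all-reduce approximation; the parameter count, gradient precision and GPU count are illustrative assumptions.

# Rough sketch of per-step interconnect traffic in data-parallel training,
# using the ring all-reduce approximation: ~2 * (n - 1) / n * gradient_bytes
# per GPU per step. All figures are illustrative assumptions.

def allreduce_gb_per_gpu(num_params, bytes_per_grad=2, num_gpus=10_000):
    gradient_bytes = num_params * bytes_per_grad
    return 2 * (num_gpus - 1) / num_gpus * gradient_bytes / 1e9

traffic = allreduce_gb_per_gpu(num_params=175e9)
print(f"~{traffic:,.0f} GB crosses the interconnect per GPU on every training step")

Real deployments mix tensor, pipeline and data parallelism, but the basic point holds: communication volume scales with model size, so every additional parameter leans harder on the interconnect.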
Increasing investment in emerging technologies such as highly efficient AI accelerators, high-bandwidth memory and any-to-any connectivity can help shrink that performance gap. More, and more aggressive, investment in these hardware components will be critical as the number of AI applications and users continues to grow in the coming years and demand for inference operations increases.
Investing In Our Generative AI Future
Given the resource-intensive nature of generative AI operations, any increase in efficiency will have an outsized impact on sustainability efforts, especially as data centers steadily grow into the multi-megawatt range. These sustainability issues will likely be exacerbated as more generative AI operations transition from training to inference. Although the upfront costs to build and train some of today’s largest LLMs are already staggering, the ongoing expense of running those models could prove even greater as demand for inference operations accelerates.
The final number of data centers that will need to be built or retrofitted to fully support future generative AI efforts remains an open question; those efforts will be limited by the capital available to scale out data centers as well as by the operational costs, such as power and cooling, they will require. Increasing AI compute efficiency by delivering more processing at lower power will allow us to build out tomorrow’s generative AI systems more effectively, more densely and at a lower cost. It will also allow us to take full advantage of generative AI’s growth across many applications and use cases.