MLPerf results for AMD Instinct MI300X accelerators
The AMD Instinct MI300X GPUs, powered by the latest version of the open-source ROCm software stack, delivered strong results in the MLPerf Inference v4.1 round.
This achievement underscores the capabilities of AMD's inference platform. AMD's submission focused on the LLaMA2-70B model, known for its versatility and high performance, and highlighted the strong generative AI inference capabilities of the AMD Instinct MI300X accelerators, positioning them as a competitive alternative to the NVIDIA H100.
The role of MLPerf in the AI landscape
As large language models (LLMs) become more complex and expansive, the need for efficient and cost-effective inference and training solutions grows increasingly urgent. High-performance LLMs depend on robust parallel computing infrastructure and a well-tuned software ecosystem. MLPerf, the leading benchmarking suite developed by MLCommons, a consortium of which AMD is a founding member, plays a critical role in this space. It provides a suite of open-source AI benchmarks, including benchmarks for generative AI and LLMs, with rigorous, peer-reviewed metrics. These benchmarks help organisations evaluate the effectiveness of their AI hardware and software. Achieving strong results in MLPerf Inference v4.1 is a significant milestone for AMD, reflecting its commitment to transparency and to delivering reliable, standardised data that supports enterprise decision-making.
A closer look at the LLaMA2-70B model benchmark
AMD's first submission to MLPerf used the LLaMA2-70B model, a widely deployed LLM that is well suited to real-world applications such as natural language processing and large-scale inference. The benchmark involved a Q&A scenario using 24,576 samples from the OpenOrca dataset, each with up to 1,024 input tokens and up to 1,024 output tokens. Performance was evaluated in two key scenarios:
- Offline Scenario: Focused on batch processing of input questions to maximise throughput, measured in tokens per second.
- Server Scenario: Simulated real-time query processing with strict latency constraints (TTFT* ≤ 2s, TPOT* ≤ 200ms), testing the hardware’s ability to handle low-latency tasks efficiently.
(*TTFT – Time to First Token, *TPOT – Time per Output Token)
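To make the server-scenario constraints concrete, the sketch below shows how TTFT and TPOT can be computed for a single request and checked against the 2s / 200ms limits. This is an illustrative helper only, not the MLPerf LoadGen code, and the actual benchmark enforces these limits across the whole run rather than per request.

```python
# Illustrative check of the server-scenario latency metrics (not MLPerf LoadGen code):
# TTFT is the delay until the first output token, TPOT is the average gap
# between subsequent output tokens.
from typing import List

TTFT_LIMIT_S = 2.0    # Time to First Token constraint
TPOT_LIMIT_S = 0.200  # Time per Output Token constraint


def meets_latency_targets(request_start: float, token_times: List[float]) -> bool:
    """token_times are absolute timestamps (seconds) of each generated token."""
    if not token_times:
        return False
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft <= TTFT_LIMIT_S and tpot <= TPOT_LIMIT_S


# Example: first token after 1.4 s, then one token every 150 ms.
start = 0.0
times = [1.4 + 0.15 * i for i in range(64)]
print(meets_latency_targets(start, times))  # True
```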
Performance of AMD Instinct MI300X in MLPerf
In its initial MLPerf round, the AMD Instinct MI300X showed strong performance using the Supermicro AS-8125GS-TNMR2 system, with three entries for the LLaMA2-70B model, complemented by a fourth entry from Dell described later in this section. These results are particularly noteworthy because they offer a direct comparison with other AI accelerators, backed by peer-reviewed validation, reproducibility, and a focus on industry-relevant applications.
Synergy between CPU and GPU for AI workloads:
- Submission ID 4.1-0002: Featuring a configuration of 8x AMD Instinct MI300X accelerators paired with 2x AMD EPYC 9374F (Genoa) CPUs in the Available category.
This setup demonstrated the powerful combination of AMD Instinct MI300X GPU accelerators and 4th Gen AMD EPYC "Genoa" CPUs in handling AI workloads. Performance came within 2-3% of the NVIDIA DGX H100, equipped with 4th Gen Intel Xeon CPUs, in both the server and offline scenarios at FP8 precision, highlighting the competitiveness of AMD's solution in demanding AI environments. (See Figure 1)
Figure 1 - Showcasing performance of CPU-GPU combination for AI workload
Preview of performance with next-generation CPUs:
- Submission ID 4.1-0070: Configuration included 8x AMD Instinct MI300X GPUs paired with 2x AMD EPYC "Turin" CPUs, categorised under the Preview category.
This submission showcased the performance benefits of the upcoming 5th Gen AMD EPYC "Turin" CPUs when used alongside AMD Instinct MI300X GPU accelerators. The results indicated a slight performance advantage over the NVIDIA DGX H100 equipped with Intel Xeon CPUs in the server scenario while maintaining similar performance in the offline scenario at FP8 precision (refer to Figure 1 for details).
Single GPU efficiency:
- Submission ID 4.1-0001: Setup involved 1x AMD Instinct MI300X accelerator with 2x 4th Gen AMD EPYC 9374F (Genoa) CPUs, classified under the Available category.
This configuration underscored the efficiency afforded by the AMD Instinct MI300X's 192GB of memory, which enabled a single GPU to run the entire LLaMA2-70B model at FP8 precision. Doing so eliminated the network overhead that typically comes with distributing the model across multiple GPUs, highlighting the benefits of running certain workloads on a single GPU.
Figure 2 - Single GPU Running the Entire Llama 2 70B Model
The AMD Instinct MI300X, built on AMD's CDNA 3 architecture, features 192GB of HBM3 memory and achieves a peak memory bandwidth of 5.3TB/s. This significant memory capacity allows the AMD Instinct MI300X to efficiently handle and run large models, such as the 70 billion parameter LLaMA2-70B, on a single GPU. With the ROCm software stack, the scaling efficiency from a single AMD Instinct MI300X (TP1) to an 8x configuration (8x TP1) is nearly linear, as demonstrated in Figure 2. This illustrates the MI300X's ability to manage the largest MLPerf inference model to date effectively.
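A rough back-of-the-envelope estimate shows why this works; the sketch below uses illustrative round numbers rather than measured figures. At FP8, the 70-billion-parameter model needs roughly 70GB for weights, leaving well over 100GB of the 192GB HBM3 for the KV cache and activations.

```python
# Back-of-the-envelope memory estimate for LLaMA2-70B on one MI300X.
# Illustrative only; real footprints also include activations, buffers and
# framework overhead.
PARAMS = 70e9               # model parameters
BYTES_PER_PARAM_FP8 = 1     # FP8 stores one byte per weight
HBM_CAPACITY_GB = 192       # MI300X HBM3 capacity

weights_gb = PARAMS * BYTES_PER_PARAM_FP8 / 1e9
headroom_gb = HBM_CAPACITY_GB - weights_gb
print(f"FP8 weights: ~{weights_gb:.0f} GB, headroom for KV cache: ~{headroom_gb:.0f} GB")
# FP8 weights: ~70 GB, headroom for KV cache: ~122 GB
```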
Notable results with Dell server design using AMD Instinct MI300X accelerators
- Submission ID 4.1-0022: Configuration included 8x AMD Instinct MI300X accelerators with 2x Intel Xeon Platinum 8460Y+ CPUs in the Available category.
In addition to AMD's own submissions, Dell validated the platform-level performance of AMD Instinct accelerators by submitting results for an 8x AMD Instinct MI300X setup on its PowerEdge XE9680 server. This submission, focused on the LLaMA2-70B model, underscores the strength of the partnership between AMD and Dell and the robust performance of their combined ecosystem, positioning it as a strong option for both data centre and edge inference deployments.
Performance highlights
The strong competitive performance of the AMD Instinct MI300X accelerators is a result of their high compute throughput, substantial memory capacity with fast bandwidth, and the optimised ROCm software stack, which together enable efficient handling of large AI models like LLaMA2-70B. Several key factors contribute to this performance:
- Large GPU memory size: The AMD Instinct MI300X offers the largest GPU memory available, allowing the entire LLaMA2-70B model, along with the KV cache, to fit within a single GPU. This eliminates the need to split the model across multiple GPUs, avoiding network overhead and maximising inference throughput. In the offline scenario, the configuration used a ‘max_num_seqs’ value of 2048 to maximise throughput; in the server scenario, a value of 768 was used to meet the latency targets. Both are significantly higher than the vLLM default of 256 (see the configuration sketch after this list). vLLM's support for paged attention enables efficient KV cache management and, together with the large memory of the AMD Instinct MI300X accelerators, avoids memory fragmentation issues.
- FP8 support: The AMD Instinct MI300X accelerator provides hardware support for the FP8 numerical format, which has been extended across the entire inference software stack. Using Quark, the LLaMA2-70B model weights were quantised to FP8 while maintaining the 99.9% accuracy required by MLPerf rules (a conceptual sketch of FP8 scaling follows this list). Additional FP8 support was integrated into vLLM, the hipBLASLt library was upgraded, and an FP8 KV cache was implemented, all of which significantly enhanced performance.
- Software optimisations: Extensive profiling and optimisation efforts were undertaken, including the use of AMD Composable Kernel (CK) based prefill attention, FP8 decode paged attention, and fused kernels such as residual-add RMSNorm and SwiGLU with FP8 output scaling (a reference sketch of the unfused computation follows this list). The scheduler was also improved for faster decode scheduling and more efficient prefill batching, optimising both the offline and server use cases.
- CPU optimisation: While the bulk of AI workload processing occurs on GPUs, CPU performance remains crucial. CPUs with lower core counts and high boost frequencies, such as the EPYC 9374F with 32 cores and boost clocks of up to 4.3 GHz, provided optimal performance, particularly in the server scenario. Tests with the upcoming "Turin" generation of EPYC CPUs indicated performance gains over 4th Gen EPYC and were included as a Preview submission.
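As noted in the first bullet, batching behaviour is controlled through vLLM's max_num_seqs parameter. The following is a minimal sketch of how those settings could be passed to vLLM's offline API; the model identifier and sampling values are placeholders, and the exact configuration used in the submission is not reproduced here.

```python
# Minimal sketch of the scheduler-related settings discussed above, using
# vLLM's offline API. Model path and sampling values are placeholders; this
# is not the actual submission configuration.
from vllm import LLM, SamplingParams

# Offline scenario: large concurrent batch to maximise throughput.
offline_engine = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder model identifier
    tensor_parallel_size=1,                   # whole model on a single MI300X
    max_num_seqs=2048,                        # vs. the vLLM default of 256
)

# Server scenario: smaller concurrent batch to respect the TTFT/TPOT targets.
server_engine_kwargs = dict(max_num_seqs=768)

sampling = SamplingParams(max_tokens=1024)
outputs = offline_engine.generate(["What is MLPerf?"], sampling)
```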
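The second bullet mentions quantising the weights to FP8 with Quark. The snippet below is not Quark's API; it is a generic NumPy illustration of per-tensor FP8 (E4M3-style) scaling, where a single scale maps the largest weight onto the format's maximum representable value.

```python
# Conceptual per-tensor FP8 scaling, shown with NumPy for illustration only.
# This is NOT the Quark API; it sketches the idea of mapping weights into the
# OCP FP8 E4M3 range (max finite value 448) with a single per-tensor scale.
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the OCP FP8 E4M3 format

def quantize_per_tensor_fp8(weights: np.ndarray):
    scale = np.max(np.abs(weights)) / E4M3_MAX
    # Simulate FP8 by scaling and clipping; real FP8 also reduces mantissa precision.
    quantized = np.clip(weights / scale, -E4M3_MAX, E4M3_MAX)
    return quantized, scale

def dequantize(quantized: np.ndarray, scale: float) -> np.ndarray:
    return quantized * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_per_tensor_fp8(w)
w_hat = dequantize(q, s)
print("max abs round-trip error:", np.max(np.abs(w - w_hat)))
# Effectively zero here, since only the scaling step is simulated.
```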
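The third bullet refers to fused kernels such as residual-add RMSNorm. As a point of reference, the unfused computation that such a kernel replaces looks roughly like the NumPy sketch below; the fused kernel produces the same result in a single pass over memory, avoiding an extra round trip to HBM for the intermediate sum.

```python
# Reference (unfused) residual-add + RMSNorm in NumPy, for illustration only.
import numpy as np

def residual_add_rms_norm(x, residual, weight, eps=1e-6):
    hidden = x + residual                                        # residual add
    rms = np.sqrt(np.mean(hidden**2, axis=-1, keepdims=True) + eps)
    return (hidden / rms) * weight                               # RMSNorm with learned scale

x = np.random.randn(8, 8192).astype(np.float32)
res = np.random.randn(8, 8192).astype(np.float32)
w = np.ones(8192, dtype=np.float32)
out = residual_add_rms_norm(x, res, w)
print(out.shape)  # (8, 8192)
```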
Setting a precedent for handling large models
The successful MLPerf results with the LLaMA2-70B model validate the performance of AMD Instinct MI300X GPU accelerators and set a strong precedent for even larger models such as LLaMA 3.1. AMD provided Day 0 support for Meta's new LLaMA 3.1 model, featuring 405 billion parameters, on AMD Instinct MI300X accelerators. Thanks to the industry-leading memory capacity of the AMD Instinct MI300X platform, a single server powered by eight AMD Instinct MI300X GPUs can hold the entire 405-billion-parameter LLaMA 3.1 model in the FP16 datatype (see Figure 3). This reduces the number of servers required and lowers costs, positioning AMD Instinct MI300X accelerators as a compelling solution for powering the largest open models available today.
Figure 3 – LLaMA 3.1 (405B) Estimated Memory Requirements vs Available GPU Memory
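The estimate behind Figure 3 can be checked with simple arithmetic; the sketch below uses illustrative round numbers and ignores the KV cache and runtime overhead. At two bytes per parameter, the 405-billion-parameter model needs roughly 810GB of weights, comfortably below the roughly 1.5TB of aggregate HBM3 on an eight-GPU MI300X platform.

```python
# Rough check of the Figure 3 estimate; illustrative only, ignoring the KV
# cache, activations and framework overhead.
PARAMS = 405e9              # LLaMA 3.1 405B parameters
BYTES_PER_PARAM_FP16 = 2    # FP16 stores two bytes per weight
GPUS = 8
HBM_PER_GPU_GB = 192        # MI300X HBM3 capacity

weights_gb = PARAMS * BYTES_PER_PARAM_FP16 / 1e9
total_hbm_gb = GPUS * HBM_PER_GPU_GB
print(f"FP16 weights: ~{weights_gb:.0f} GB, aggregate HBM: {total_hbm_gb} GB")
# FP16 weights: ~810 GB, aggregate HBM: 1536 GB
```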