Submitters also showed improved inference results for Bert-Large, which is of particular interest since it is closest in nature to large language models (LLMs) like ChatGPT.
Looking forward to ChatGPT
The biggest trend in AI inference today is at-scale inference of LLMs, such as ChatGPT. While GPT-class models are not included in the current MLPerf benchmark suite, David Kanter, executive director of MLCommons, said that LLMs will be coming to the next round of training benchmarks (due next quarter) and potentially coming to inference rounds in Q3 ’23 or Q1 ’24. Other forms of generative AI, including image generation, are also on the MLPerf roadmap.
In the meantime, Kanter said, Bert-Large scores are the most relevant for companies curious about performance on transformer workloads. While Bert-Large is approximately 500× smaller than ChatGPT (340 million parameters compared with 175 billion), the computational aspects of ChatGPT are most similar to Bert.
“Just because something does great on Bert doesn’t mean you can use that same system on ChatGPT,” he said. “You’d need a lot more memory, or clever ways to handle memory, because the model is much bigger.”
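To put the size gap and Kanter's memory point in rough numbers, the short Python sketch below works through the arithmetic using the two parameter counts quoted above. The 2-bytes-per-parameter figure (FP16 weights) is an assumption made purely for illustration. The estimate covers weights only, ignoring activations and any other runtime memory.

```python
# Parameter counts as quoted in the article.
BERT_LARGE_PARAMS = 340e6
LLM_PARAMS = 175e9

# Assumption for illustration: weights stored in FP16, i.e. 2 bytes each.
BYTES_PER_PARAM = 2


def weight_footprint_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the memory needed to hold the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


size_ratio = LLM_PARAMS / BERT_LARGE_PARAMS
bert_gb = weight_footprint_gb(BERT_LARGE_PARAMS)
llm_gb = weight_footprint_gb(LLM_PARAMS)

print(f"Parameter ratio: {size_ratio:.0f}x")
print(f"Bert-Large weights: {bert_gb:.2f} GB")
print(f"175B-parameter model weights: {llm_gb:.0f} GB")
```

Under that assumption, the ratio works out to roughly 515×. The weights grow from about 0.68 GB to about 350 GB, which is why a system that runs Bert-Large comfortably may still need far more memory, or a way to split the model across devices, for the larger model.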
“As the number of layers scales up, the number of parameters scales up and the sequence length scales up, but from an architecture perspective, the matrix multiplication, the layer normalization, the softmax—this base structure in Bert-Large is for most purposes just scaled up for GPT-3,” added Michael Goin, product engineering lead at Neural Magic. “All the submitters who have done optimizations for Bert-Large, these optimizations will transfer to LLM workloads. It’s a matter of scale.”
“Another way to think about it is [GPT-3 and -4] will generate hundreds or thousands of smaller models that are distilled down from these very large models,” said Jordan Plawner, senior director of AI products at Intel. “You can’t run GPT-4 on the edge—you’d need to distill it down to something smaller, more efficient, maybe more targeted, purpose-built for a specific use case, and create inference models at lower precision for those specific use cases… running Bert-Large could be a proxy for [whether] we can run these smaller models as well.”
Top of the table for Bert-Large results was Nvidia with its DGX-H100-SXM system (eight H100 chips), in server and offline categories for both accuracy levels, except for Bert-99.9% in the offline scenario where a Dell server with eight H100s squeezed in a victory (this Dell PowerEdge XE9680 server uses a single-socket Intel Xeon Platinum 8470 host CPU compared to Nvidia’s 8480C, though both CPUs are built on Sapphire Rapids architecture).
In the data center open division, where submitters are allowed to tweak the models, several submitters entered Bert results, including Moffett, whose efficiency is based on its sparse algorithms combined with 32× hardware support for extreme sparsity in its chips; Neural Magic, which is using sparsification algorithms to increase performance on CPUs; and Deci, whose use of neural architecture search techniques to make hardware-aware models helps improve throughput on CPUs and GPUs.
Neural Magic sparsified Bert weights by two orders of magnitude to 10 MB and improved performance 1000× on the same CPU hardware. Deci showed results offering a higher throughput for its version of Bert on one Nvidia H100 card compared to the original Bert on eight Nvidia A100s.
Edge inference division
In the edge inference divisions, Nvidia’s AGX Orin was beaten in ResNet power efficiency in the single and multi-stream scenarios by startup SiMa. Nvidia AGX Orin’s mJ/frame for single stream was 1.45× SiMa’s score (lower is better), and SiMa’s latency was also 27% faster. For multi stream, the difference was 1.39× with latency 22% faster.
SiMa’s offline performance for ResNet was 136 frames per second per Watt (fps/W), compared to Nvidia’s 152 fps/W for AGX Orin.
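The single-stream and multi-stream comparisons are quoted in mJ/frame (lower is better), while the offline figures are in fps/W (higher is better). The two units are reciprocals of one another, since one frame per second per Watt is one frame per joule. The Python sketch below converts the offline figures into mJ/frame so they can be read on the same scale. The function name is made up for illustration, and the only inputs are the two offline numbers quoted above.

```python
def fps_per_watt_to_mj_per_frame(fps_per_watt: float) -> float:
    """Convert frames/s/W (equivalently, frames per joule) to millijoules per frame."""
    return 1000.0 / fps_per_watt


# Offline ResNet efficiency figures as quoted in the article.
SIMA_OFFLINE_FPS_PER_WATT = 136.0
ORIN_OFFLINE_FPS_PER_WATT = 152.0

sima_mj = fps_per_watt_to_mj_per_frame(SIMA_OFFLINE_FPS_PER_WATT)
orin_mj = fps_per_watt_to_mj_per_frame(ORIN_OFFLINE_FPS_PER_WATT)

# Ratio > 1 means AGX Orin is the more efficient of the two in this scenario.
orin_offline_advantage = ORIN_OFFLINE_FPS_PER_WATT / SIMA_OFFLINE_FPS_PER_WATT

print(f"SiMa offline: {sima_mj:.2f} mJ/frame")
print(f"AGX Orin offline: {orin_mj:.2f} mJ/frame")
print(f"AGX Orin offline advantage: {orin_offline_advantage:.2f}x")
```

On these numbers, the offline scenario comes out at roughly 7.35 mJ/frame for SiMa against 6.58 mJ/frame for AGX Orin. That is about a 1.12× advantage for Nvidia, the reverse of the single-stream and multi-stream results.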
SiMa CEO Krishna Rangasayee told EE Times that the company is proud of its chip design, which today in 16-nm technology is one to two nodes behind competitors, so there is potential to move to more advanced nodes in the future.
“We’re also very proud of the fact that while most of these companies have hundreds of engineers polishing benchmarks and hand-tuning them for performance, our entire result was push-button, with no human being involved,” Rangasayee said.
“It’s not just about performance, power, all the cool things that the technology can do, the right person with the right number of PhDs can do that for any product,” said Gopal Hegde, VP of engineering and operations at SiMa. “However, making it easy to use is key for product adoption, that’s one of the main reasons [AI] is not taking off at the edge, because it’s hard to use.”
SiMa did not submit results for its vision-focused chip on any other workloads.
Nvidia has improved its AGX Orin scores significantly compared to the previous round thanks to software improvements. Performance per Watt figures increased 24-63%. Nvidia also showed off the replacement for the (smaller) Xavier NX, the Orin NX, whose results improved on the previous generation by up to 3.2×.
Data center inference
In the data center, Nvidia’s latest-gen H100 GPU is the one to beat. H100 performance scores improved as much as 54% compared to the last round (most improved was object detection with RetinaNet), due to software optimizations. With its transformer engine, H100 excels at Bert, offering roughly 4.5× the performance of previous-gen A100 technology for Bert 99.9%, and a 12% improvement on H100 scores from the last round.
The brand new L4 GPU, announced a few weeks ago at GTC 2023, debuted this time around. The L4 is a next-gen replacement for the T4, and is designed as a general-purpose accelerator that can handle graphics, video and AI workloads. It uses 4th generation tensor cores on the Ada Lovelace architecture, which includes FP8 capability.
MLPerf inference results showed the L4 offers 3× the performance of the T4, in the same single-slot PCIe format. Results also indicated that dedicated AI accelerator GPUs, such as the A100 and H100, offer roughly 2-3× and 3-7.5× the AI inference performance of the L4, respectively.
This round also included scores from server makers for the brand new Nvidia L40, which is optimized for AI image generation.
A couple of things stood out in Nvidia’s scores this time around.
Scalability of the DGX-H100 from one to eight GPUs in the system was close to 8× for most workloads, with the outliers being RNN-T in server mode (6.7×) and RetinaNet in offline mode (8.23×). Dave Salvator, director of AI, benchmarking and cloud at Nvidia, attributed the RetinaNet scale factor to “run-to-run variation.”
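One way to read these scale factors is as scaling efficiency: the measured speedup divided by the number of GPUs, with 100% meaning perfectly linear scaling. The Python sketch below applies that definition to the two outliers quoted above. The function and variable names are illustrative, not taken from any MLPerf tooling.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Return measured speedup as a fraction of ideal linear scaling."""
    return speedup / num_gpus


NUM_GPUS = 8

# Eight-GPU speedups over a single GPU, as quoted in the article.
speedups = {
    "RNN-T (server)": 6.7,
    "RetinaNet (offline)": 8.23,
}

efficiencies = {
    name: scaling_efficiency(speedup, NUM_GPUS) for name, speedup in speedups.items()
}

for name, eff in efficiencies.items():
    print(f"{name}: {eff:.1%} of linear scaling")
```

By this measure, RNN-T in server mode lands at about 84% of linear scaling. RetinaNet in offline mode comes out slightly above 100%, the kind of apparently super-linear result that Salvator put down to run-to-run variation.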
A100 scores for DGX-A100-SXM systems for identical hardware using Nvidia’s Triton Inference Server software were very close to scores without Triton for most workloads. However, using Triton software appeared to significantly degrade performance in the server scenario for ResNet and DLRM (ResNet server performance dropped a third with Triton, DLRM server scores dropped to less than half). Salvator said Nvidia is aware of the differences, adding that “sometimes we have to make choices about where to spend our engineering time, so there are some areas that remain areas where we need to do additional optimization to bring that performance closer to what we see in the offline scenario.”
Taiwanese startup Neuchips showed off Nvidia-beating recommendation (DLRM) power scores. Neuchips’ first chip, RecAccel 3000, is specially designed to accelerate recommendation workloads. On DLRM 99.9%, eight Neuchips RecAccel chips achieved double the performance per Watt of eight Nvidia A100s in server mode, and 1.67× the performance per Watt of eight Nvidia H100s in server mode.
“By going to application-specific chip design, we are able to provide the best efficiency in terms of QPS/Watt, and of course this would lead to lower TCO because of power saving, and the chip is smaller,” Neuchips CEO Youn-Long Lin told EE Times.
Eight Neuchips RecAccel chips achieved similar overall performance and very similar performance per Watt numbers for server and offline mode; Nvidia GPUs perform 1.5-1.75× better with a similar power budget in offline mode than they do in server mode. Thus, while Neuchips beat A100 in offline mode, Nvidia H100 remains the winner for the offline scenario in terms of performance per Watt. While Nvidia has previously said Grace Hopper is its ideal solution for recommendation inference, it has not yet submitted benchmark scores for Grace Hopper.
Lin said that for recommendation workloads, server mode is a more realistic scenario than offline mode.
“[There have been questions about] the assumption of whether it’s realistic to put all the [recommendation] queries in the system beforehand, before the timer is started,” he said. “I think server mode is a more realistic scenario. [In server mode], requests come in from the host over the network, and then we listen to the query within the time constraint. If you process a batch of images, then [offline mode] makes sense, but for recommendation, server mode is more realistic.”
Compared to Intel Sapphire Rapids CPUs, Neuchips’ DLRM performance scores achieved almost double the throughput on a per-chip basis. Intel did not submit power scores for Sapphire Rapids, but its dual-node system likely has a much higher power budget than Neuchips’ single card.
Neuchips’ RecAccel chip is available now on PCIe cards, with a dual M.2 module sampling at the end of this quarter, Lin said.
Data center scores
Qualcomm showed power efficiency results beating Nvidia’s H100 for image classification (ResNet) and object detection (RetinaNet). Specifically, eight Qualcomm CloudAI100s (each limited to 75W TDP) beat eight Nvidia H100s (PCIe), with queries per second per Watt working out 1.5-2.1× higher.
On NLP (Bert), eight Qualcomm CloudAI100s beat eight previous-gen Nvidia A100s (PCIe) on queries per second per Watt (narrowly, in some cases), but were no match for H100.
Qualcomm did not submit data center performance or power scores for its CloudAI100 on medical imaging (3DUNet), speech to text (RNN-T) or recommendation (DLRM) benchmarks.
Qualcomm scores are supported by consultancy Krai, which has released a dedicated version of its KILT (Krai inference library template) codebase under a permissive open-source license. While KILT supports Qualcomm’s CloudAI100 out of the box, it can be customized for other accelerators, said Krai CEO Anton Lokhmotov. This library has allowed the company to improve server scenario performance for certain hardware by 20%, bringing server scores close to offline scores, he added.
VMware showed off results based on virtualization of Nvidia H100 GPUs. Uday Kurkure, staff engineer at VMware, said that virtualized GPUs were able to achieve 94-105% of bare metal performance.
“The notable thing is that out of 128 logical CPU cores, we only used 16,” Kurkure said. “The remaining 112 cores will be available for other workloads, without impacting the performance of machine learning. That is the power of virtualization.”
Intel demonstrated data center benchmarks across every workload for its 4th gen Xeon Scalable CPUs (Sapphire Rapids). Sapphire Rapids has AI features built-in, including specialized additions to its instruction set called advanced matrix extensions (AMX). Scores show up to 5× inference performance versus 3rd gen hardware (without AMX) in previous rounds.
Compared to preview results Intel submitted for Sapphire Rapids in the last round, performance improved 1.2× for server mode and 1.4× for offline mode.
“Last time we just got AMX enabled, this time we’re tuning it and improving the software,” said Jordan Plawner, senior director of AI products at Intel. “We see improved performance on all models, between 1.2-1.4× in a matter of months… we expect to get up to 2× in the current generation with software alone.”
New for this round is a Network division, designed to represent a more realistic scenario for the data center.
“In MLPerf inference, in the regular mode, the data all starts in memory,” MLCommons’ Kanter said. “When you’re actually doing inference, it has to come from somewhere, whether that’s from storage, from the network, or from an external PCIe device. The one thing that’s common to those is they all pass through memory, that’s why [existing benchmarks] have all the data starting in host memory, typically—we specifically excluded the networking component because it’s more complex and it’s more things to optimize. With [the new Network division], we’re adding that back in, so queries are delivered over the network interface.”
Submitters Nvidia, HPE and Qualcomm used the Network division to show off results very close to their own results in the data center closed division.