AMD-Based “Frontier” Supercomputer Held Back by Multiple Hardware Failures

Building a supercomputer is always demanding, but bringing up the industry’s first exascale-class system is especially difficult and involves a great deal of hardware and software development. This appears to be the case with the Frontier supercomputer at Oak Ridge National Laboratory (ORNL), which reportedly can hardly go a day without experiencing hardware failures.

Built around AMD’s 64-core EPYC Trento CPUs, Instinct MI250X compute GPUs, and HPE’s Slingshot interconnect, ORNL’s Frontier is the first system in the industry that can achieve a peak performance of up to 1.685 FP64 ExaFLOPS while drawing 21 MW of power. HPE built the system on its Cray EX architecture, which was designed for scale-out applications, particularly exceptionally fast supercomputers.
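To put those two figures in perspective, the implied energy efficiency can be worked out directly. This is only a back-of-the-envelope sketch that takes the article’s peak performance and power numbers at face value; the function name is illustrative, and sustained efficiency on real workloads would differ.

```python
# Figures quoted in the article above.
PEAK_FP64_FLOPS = 1.685e18  # 1.685 FP64 ExaFLOPS (peak)
POWER_WATTS = 21e6          # 21 MW


def gflops_per_watt(flops: float, watts: float) -> float:
    """Return energy efficiency in GFLOPS per watt."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return flops / watts / 1e9


efficiency = gflops_per_watt(PEAK_FP64_FLOPS, POWER_WATTS)
print(f"{efficiency:.1f} GFLOPS/W")  # prints "80.2 GFLOPS/W"
```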

Although Frontier’s hardware components have been delivered and the machine has remarkable potential on paper, hardware issues seem to be preventing it from going online and becoming available to researchers who need performance of about 1 FP64 ExaFLOPS.

Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), commented on the situation:

“We are working through issues in hardware and making sure that we understand (what they are). You are going to have failures at this scale. Mean time between failure on a system this size is hours, it’s not days.”

Rumors of hardware malfunctions in Frontier have circulated for some time. According to a separate InsideHPC article, several sources claimed that the Slingshot interconnect caused issues for the system. Other reports said that AMD’s Instinct MI250X compute GPUs were not fully dependable this year. It’s important to keep in mind that the X version, which has more stream processors and higher clock speeds, is available only to a limited number of customers.

Mr. Whitt acknowledged that the computer has several hardware problems, but he did not indicate that the system had any specific problems with Instinct or Slingshot.

“A lot of challenges are focused around those [GPUs], but that’s not the majority of the challenges that we are seeing. It is a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point we have a lot of concern over the AMD products.”

The Frontier supercomputer at Oak Ridge National Laboratory is by no means the only one to combine AMD’s EPYC CPUs, Slingshot interconnects, and the Cray EX architecture from HPE. For instance, the LUMI supercomputer in Finland, officially recognized as the third-most powerful supercomputer in the world, has a peak performance of 550 PetaFLOPS using similar components. Frontier’s sheer size, with a total of 60 million parts, may be what makes frequent failures likely.
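The link between part count and failure rate can be illustrated with simple arithmetic. The sketch below assumes that parts fail independently at a constant rate and that any single failure counts as a system failure, in which case system MTBF is roughly the per-part MTBF divided by the number of parts. The per-part MTBF values are hypothetical and are not figures from ORNL, HPE, or AMD; only the 60 million part count comes from the article.

```python
HOURS_PER_YEAR = 8766   # 365.25 days * 24 hours
N_PARTS = 60_000_000    # part count quoted in the article


def system_mtbf_hours(part_mtbf_hours: float, n_parts: int) -> float:
    """System MTBF for n identical, independent parts in series.

    With constant failure rates, the rates add up, so the system
    MTBF is the per-part MTBF divided by the number of parts.
    """
    if part_mtbf_hours <= 0 or n_parts <= 0:
        raise ValueError("inputs must be positive")
    return part_mtbf_hours / n_parts


# Hypothetical per-part MTBF values, chosen only for illustration.
for years in (1_000, 10_000, 100_000):
    mtbf = system_mtbf_hours(years * HOURS_PER_YEAR, N_PARTS)
    print(f"part MTBF {years:>7,} years -> system MTBF {mtbf:.2f} hours")
```

Under these assumptions, even a part that fails on average once every 10,000 years gives a system MTBF of about 1.5 hours, which is consistent with Whitt’s remark that failures at this scale arrive in hours rather than days.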

Because Frontier has not yet been formally deployed, it remains unclear whether it will be made available to researchers beginning in 2023, as planned under a schedule that originally called for it to come online in 2022.

Muhammad Zuhair
Passionate about technology and gaming content, Zuhair focuses on analysing information and then presenting it to the audience.