Radeon Open Compute “ROCm” Stack v3.1 Released With RAS For Vega 7nm, SLURM Support For Better Resource Management, But Navi Still Missing

A new version of the Radeon Open Compute (“ROCm”) stack is now available for download. ROCm v3.1 brings quite a few features, but strangely, support for AMD Navi (GFX10) GPUs is still missing.

ROCm, AMD's open-source platform for GPU-accelerated computing, is now on version 3.1. The latest update to the modular platform, which allows hardware vendors to build drivers that support the ROCm framework, includes some much-anticipated features such as RAS support for 7nm Vega and SLURM support for AMD GPUs. However, for reasons yet unknown, ROCm still doesn’t have complete support for the next-generation AMD Navi architecture.

What’s New In Radeon ROCm v3.1:

The most visible change in Radeon ROCm v3.1 is the installation directory structure. A fresh installation of the ROCm toolkit now places the packages in the /opt/rocm-<version> folder. Previously, ROCm toolkit packages were installed in the /opt/rocm folder.
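Scripts that hard-code /opt/rocm may need to account for the versioned folder. The following is a minimal Python sketch of one way to locate an installation, assuming version folders are named like rocm-3.1.0 (the exact naming is an assumption, as are the helper names find_rocm_installs and resolve_rocm_root, which are hypothetical and not part of ROCm):

```python
import re
from pathlib import Path

# Matches folder names such as "rocm-3.1.0". The exact version format
# used by the installer is an assumption made for this sketch.
_VERSIONED = re.compile(r"^rocm-(\d+(?:\.\d+)*)$")


def find_rocm_installs(base="/opt"):
    """Return (version_tuple, path) pairs for rocm-<version> folders, newest first."""
    found = []
    base_path = Path(base)
    if not base_path.is_dir():
        return found
    for entry in base_path.iterdir():
        match = _VERSIONED.match(entry.name)
        if match and entry.is_dir():
            version = tuple(int(part) for part in match.group(1).split("."))
            found.append((version, entry))
    # Compare numerically so that 3.10 sorts above 3.9.
    return sorted(found, key=lambda item: item[0], reverse=True)


def resolve_rocm_root(base="/opt"):
    """Prefer the newest versioned folder, then fall back to the older /opt/rocm layout."""
    installs = find_rocm_installs(base)
    if installs:
        return installs[0][1]
    legacy = Path(base) / "rocm"
    return legacy if legacy.is_dir() else None


print(resolve_rocm_root())
```

On a machine without ROCm installed, the call simply returns None.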

The new version of ROCm has enhanced Reliability, Availability, and Serviceability (RAS) support for Vega 7nm GPUs. This 7nm Vega work is presumably still being refined ahead of the Vega-based “Arcturus” compute accelerator expected this year. The support includes:

  • UMC RAS – HBM ECC (uncorrectable error injection), page retirement, RAS recovery via GPU (BACO) reset
  • GFX RAS – GFX, MMHUB ECC (uncorrectable error injection), RAS recovery via GPU (BACO) reset
  • PCIE RAS – PCIE_BIF ECC (uncorrectable error injection), RAS recovery via GPU (BACO) reset

Radeon ROCm v3.1 also gets SLURM support for AMD GPUs. SLURM, or Simple Linux Utility for Resource Management, is one of the most widely used cluster management and job scheduling systems for Linux clusters. It is favored for being open-source, fault-tolerant, and highly scalable.

SLURM can now work with AMD GPUs directly. The latest version of SLURM, 20.02.0, includes AMD plugins that enable it to detect and configure AMD GPUs automatically. It also collects and reports the energy consumption of the GPUs. The SLURM support is a useful addition given the increasing number of supercomputing deployments using Radeon GPUs and other large AMD GPU clusters.
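Once a cluster exposes its GPUs to SLURM, jobs typically request them through sbatch's --gres option. Below is a minimal Python sketch of assembling such a submission command. It assumes the administrator has configured the GPUs as a generic resource named "gpu"; the helper build_sbatch_command, the job name, and the script name train.sh are hypothetical and used only for illustration:

```python
import shlex


def build_sbatch_command(script, gpus, job_name="rocm-job"):
    """Build an sbatch argument list that requests `gpus` GPUs per node via --gres."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return [
        "sbatch",
        f"--job-name={job_name}",
        # Assumes the cluster names its GPU generic resource "gpu".
        f"--gres=gpu:{gpus}",
        script,
    ]


command = build_sbatch_command("train.sh", gpus=2)
# Print the command instead of running it, so the sketch works without a cluster.
print(shlex.join(command))
```

On a real cluster, the list could be passed to subprocess.run to submit the job.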

Despite these additions, there are still no signs of GFX10/Navi support in ROCm. The GitHub page for ROCm has been updated to reflect all the changes, installation notes, and known issues.

Alap Naik Desai
A B.Tech Plastics (UDCT) and a Windows enthusiast. Optimizing the OS, exploring software, and finding and deploying solutions to strange and weird issues are Alap's main interests.