ZAYA1 AI Model Using AMD GPUs Achieves Significant Training Milestone
Zyphra, AMD, and IBM collaborated for a year to test whether AMD’s GPUs and platform could support large-scale AI model training. Their efforts resulted in the creation of ZAYA1, a major Mixture-of-Experts (MoE) foundation model built entirely on AMD GPUs and networking. This achievement demonstrates that the AI market does not have to rely solely on NVIDIA to scale AI workloads.
The ZAYA1 AI model using AMD GPUs was trained on AMD’s Instinct MI300X chips, Pensando networking hardware, and ROCm software. All of this ran on IBM Cloud’s infrastructure. What stands out is the conventional nature of the setup. Instead of relying on experimental hardware or unusual configurations, Zyphra constructed the system similarly to a typical enterprise cluster—simply without NVIDIA components.
Zyphra reports that ZAYA1 performs on par with, and in some areas surpasses, well-known open models in tasks involving reasoning, mathematics, and coding. For businesses facing supply shortages or rising GPU costs, this development offers a rare alternative that does not compromise on performance or capability.
How Zyphra Used AMD GPUs to Optimize AI Training Costs and Performance
When planning AI training budgets, most organizations prioritize memory capacity, communication speed, and consistent iteration times over raw theoretical throughput. The MI300X GPU’s 192GB of high-bandwidth memory provides engineers with ample headroom. This allows early training runs to proceed without immediately resorting to complex parallelism, simplifying projects that are typically fragile and time-consuming to tune.
Each training node was built with eight MI300X GPUs connected via Infinity Fabric, with each GPU paired with its own Pollara network card. A separate network manages dataset reads and checkpointing. This straightforward design reduces switch costs and helps maintain steady iteration times by keeping wiring and network layout simple.
ZAYA1-base activates 760 million parameters out of a total 8.3 billion and was trained on 12 trillion tokens through three stages. Its architecture incorporates compressed attention, a refined routing system to direct tokens to the appropriate experts, and lighter residual scaling to stabilize deeper layers.
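The MoE pattern described above, where each token activates only a small fraction of the total parameters, comes down to a router that scores experts and keeps the top few. The sketch below illustrates the general technique; the expert count and top-k value are illustrative placeholders, not ZAYA1's actual configuration.

```python
import math

def route_token(logits, k=2):
    """Pick the top-k experts for one token from router logits.

    Returns (expert_indices, normalized_weights). The expert count
    and k here are illustrative, not ZAYA1's real values.
    """
    # Rank experts by router score and keep the k highest.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over only the selected experts, so their weights sum to 1.
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    return top, weights

# A token whose router favors experts 3 and 0 out of 8:
experts, weights = route_token([2.0, -1.0, 0.5, 3.0, 0.0, -2.0, 1.0, 0.2], k=2)
# experts == [3, 0]; weights sum to 1.0
```

Only the selected experts' feed-forward blocks run for that token, which is why an 8.3B-parameter model can activate just 760M parameters per forward pass.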
The model employs a mix of Muon and AdamW optimizers. To make Muon efficient on AMD hardware, Zyphra fused kernels and minimized unnecessary memory traffic, preventing the optimizer from dominating each iteration. Batch sizes were increased gradually over training, a schedule that leans heavily on storage pipelines able to deliver tokens quickly.
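Mixing Muon with AdamW typically means partitioning parameters between the two optimizers. A common convention, used here purely as an illustration (the report does not publish Zyphra's exact split), is to give matrix-shaped weights to Muon and keep embeddings, norms, and biases on AdamW:

```python
def split_param_groups(named_params):
    """Partition parameters between two optimizers.

    Follows a common Muon convention: 2-D weight matrices (except
    embeddings) go to Muon; everything else stays with AdamW. This
    rule of thumb is illustrative, not necessarily Zyphra's split.
    """
    muon, adamw = [], []
    for name, shape in named_params:
        if len(shape) == 2 and "embed" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = [("embed.weight", (50000, 1024)),
          ("layer0.attn.q.weight", (1024, 1024)),
          ("layer0.norm.weight", (1024,)),
          ("layer0.mlp.bias", (4096,))]
muon_names, adamw_names = split_param_groups(params)
# muon_names == ["layer0.attn.q.weight"]
```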
This approach resulted in an AI model trained on AMD hardware that competes with larger peers such as Qwen3-4B, Gemma3-12B, Llama-3-8B, and OLMoE. One advantage of the MoE structure is that only a small portion of the model runs at once, which helps manage inference memory and reduces serving costs.
For example, a bank could train a domain-specific model for investigations without requiring complex parallelism early in the process. The MI300X’s memory capacity provides engineers with room to iterate, while ZAYA1’s compressed attention reduces prefill time during evaluation.
Technical Adaptations and Practical Benefits of ZAYA1 AI Model Using AMD GPUs
Zyphra openly acknowledged the challenges of moving a mature NVIDIA-based workflow onto AMD’s ROCm platform. Rather than blindly porting components, the team carefully measured AMD hardware behavior and adjusted model dimensions, GEMM patterns, and microbatch sizes to fit the MI300X’s preferred compute ranges.
Infinity Fabric performs best when all eight GPUs in a node participate in collective operations, and Pollara achieves peak throughput with larger message sizes, so Zyphra sized fusion buffers accordingly. Long-context training, ranging from 4,000 to 32,000 tokens, used ring attention for sharded sequences and tree attention during decoding to avoid bottlenecks.
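Sizing fusion buffers for a NIC's preferred message size amounts to greedily packing gradient tensors into buckets until each bucket approaches a target byte count. The sketch below shows the general idea; the target size is a tunable placeholder, not a published Pollara figure.

```python
def bucket_tensors(sizes, bucket_bytes):
    """Greedily pack tensor sizes (in bytes) into fusion buckets so each
    all-reduce message approaches a target size the fabric handles well.

    bucket_bytes is illustrative; real values are tuned per network.
    """
    buckets, current, filled = [], [], 0
    for s in sizes:
        # Flush the current bucket if adding this tensor would overfill it.
        if current and filled + s > bucket_bytes:
            buckets.append(current)
            current, filled = [], 0
        current.append(s)
        filled += s
    if current:
        buckets.append(current)
    return buckets

# Mixed tensor sizes packed toward a 100-unit message target:
# bucket_tensors([60, 50, 30, 90, 10, 40], 100)
# -> [[60], [50, 30], [90, 10], [40]]
```

Fewer, larger messages keep the NIC in its high-throughput regime instead of paying per-message overhead on many small all-reduces.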
Storage was also optimized pragmatically. Smaller models demand high IOPS, while larger models require sustained bandwidth. Zyphra bundled dataset shards to reduce scattered reads and increased per-node page caches to speed up checkpoint recovery, which is crucial during long training runs where rewinds are common.
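Bundling shards to avoid scattered reads can be as simple as concatenating small shard payloads into one blob with an offset index, so training reads become large and sequential. The format below is illustrative only, not Zyphra's on-disk layout.

```python
import io

def bundle(shards):
    """Concatenate small shard payloads into one blob plus an offset
    index, turning many scattered small reads into large sequential
    ones. Illustrative format, not a real data-loader layout.
    """
    blob = io.BytesIO()
    index = {}
    for name, payload in shards:
        # Record where this shard starts and how long it is.
        index[name] = (blob.tell(), len(payload))
        blob.write(payload)
    return blob.getvalue(), index

def read_shard(blob, index, name):
    """Fetch one shard back out of the bundle by name."""
    offset, length = index[name]
    return blob[offset:offset + length]

blob, index = bundle([("shard_a", b"hello"), ("shard_b", b"world!")])
# read_shard(blob, index, "shard_a") == b"hello"
```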
To maintain cluster stability, Zyphra’s Aegis service monitors logs and system metrics, detects failures such as NIC glitches or ECC errors, and automatically applies corrective actions. The team also extended RCCL timeouts to prevent brief network interruptions from terminating entire jobs.
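A log-triage service of this kind boils down to matching failure signatures against remediation rules. Aegis's actual detectors and actions are not public, so the rules and action names below are hypothetical placeholders that only sketch the pattern.

```python
import re

# Hypothetical rules; Aegis's real detectors and remediations
# are not published.
RULES = [
    (re.compile(r"ECC.*uncorrectable", re.I), "cordon_node"),
    (re.compile(r"NIC.*link (flap|down)", re.I), "reset_nic"),
]

def triage(log_line):
    """Return the remediation action for a log line, or None if the
    line matches no known failure signature."""
    for pattern, action in RULES:
        if pattern.search(log_line):
            return action
    return None

# triage("GPU3: ECC uncorrectable error detected") == "cordon_node"
# triage("iteration 1200 loss 2.31") is None
```

The point is that corrective action fires automatically from the log stream, protecting GPU hours rather than merely recording that a failure happened.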
Checkpointing is distributed across all GPUs instead of funneling through a single bottleneck. This approach delivers checkpoint saves more than ten times faster than naïve methods, improving uptime and reducing operator workload.
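Distributing a checkpoint means every rank writes its own shard in parallel rather than gathering state to a single writer. The sketch below emulates ranks with threads to show the shape of the idea; the file naming and payload format are illustrative, not Zyphra's actual checkpoint scheme.

```python
import concurrent.futures
import pathlib
import tempfile

def save_sharded(state_shards, outdir):
    """Write each rank's state shard concurrently instead of funneling
    everything through one writer. A sketch of the idea only; the
    format is not Zyphra's actual checkpoint layout.
    """
    outdir = pathlib.Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)

    def write_one(rank_payload):
        rank, payload = rank_payload
        path = outdir / f"shard_{rank:04d}.bin"
        path.write_bytes(payload)
        return path

    # Threads stand in for ranks; in a real cluster each GPU's host
    # process writes its own shard to shared or local storage.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        return list(pool.map(write_one, enumerate(state_shards)))

paths = save_sharded([b"rank0 state", b"rank1 state"], tempfile.mkdtemp())
```

Because writes scale with the number of ranks instead of serializing through one node, save time drops sharply, which is the source of the reported speedup over naïve gather-to-one checkpointing.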
The milestone achieved by the ZAYA1 AI model using AMD GPUs highlights a clear distinction between NVIDIA’s ecosystem and AMD’s alternatives, such as NVLink versus Infinity Fabric, NCCL versus RCCL, and cuBLASLt versus hipBLASLt. The report argues that AMD’s software and hardware stack is now mature enough for serious large-scale model development.
This does not imply that enterprises should abandon existing NVIDIA clusters. Instead, a more practical approach is to retain NVIDIA for production workloads while leveraging AMD for training stages that benefit from the MI300X’s memory capacity and ROCm’s openness. This strategy diversifies supplier risk and increases total training capacity without major disruptions.
The key takeaways include treating model architecture as adjustable rather than fixed, designing networks around actual collective operations used during training, building fault tolerance that protects GPU hours instead of merely logging failures, and modernizing checkpointing to avoid disrupting training flow.
These insights come from the combined experience of Zyphra, AMD, and IBM in training a large MoE AI model on AMD GPUs. For organizations seeking to expand AI capacity without relying solely on one vendor, this work offers a valuable blueprint.
