A newly published 14-page technical paper from the DeepSeek-V3 team, co-authored by DeepSeek CEO Wenfeng Liang, explores “Scaling Challenges and Reflections on Hardware for AI Architectures.” The paper builds on the team’s initial technical report and examines the complex relationship between large language model (LLM) development, training, and the hardware infrastructure that supports it. Moving beyond the architectural details of DeepSeek-V3 itself, it investigates how hardware-aware model co-design can effectively overcome current hardware limitations, ultimately enabling cost-efficient training and inference at large scale.

The rapid growth of LLMs has revealed significant bottlenecks in existing hardware architectures, especially in areas such as memory capacity, computational efficiency, and interconnect bandwidth. DeepSeek-V3, trained on a cluster of 2048 NVIDIA H800 GPUs, serves as a strong example of how combining model design with hardware considerations can address these challenges. The research focuses on the interaction between hardware architecture and model design to achieve economical large-scale training and inference. Its goal is to provide practical insights for scaling LLMs efficiently without sacrificing performance or accessibility.

The paper highlights several key areas of focus. First, it analyzes how hardware features—such as FP8 low-precision computation and network properties related to scaling up or out—influence architectural decisions within DeepSeek-V3. Second, it investigates the interdependencies between hardware capabilities and model innovation, showing how the evolving demands of LLMs drive the need for next-generation hardware. Third, it draws practical lessons from DeepSeek-V3 to guide future co-design of hardware and model architectures aimed at scalable and cost-effective AI systems.

DeepSeek-V3 incorporates several important architectural innovations, including the DeepSeekMoE architecture and Multi-head Latent Attention (MLA), which directly address core scaling challenges like memory efficiency, cost-effectiveness, and inference speed. The paper’s Figure 1 illustrates these designs.

Regarding memory efficiency, MLA and KV cache optimization play a crucial role. The memory demanded by LLMs grows far faster than the capacity of high-speed memory such as HBM. While multi-node parallelism helps, optimizing memory usage at the source remains essential. DeepSeek tackles this with MLA, which uses projection matrices to compress the key-value (KV) representations of all attention heads into a smaller latent vector trained jointly with the model. During inference, only this compressed latent vector needs to be cached, greatly reducing memory consumption compared with storing full KV caches for every head.
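To make the mechanism concrete, here is a minimal NumPy sketch of MLA-style KV compression. The dimensions, weight names, and random initialization below are illustrative assumptions for readability, not the model’s actual configuration:

```python
import numpy as np

# Sketch: a down-projection compresses a token's hidden state into a small
# latent vector, which is the only thing cached; per-head keys and values
# are re-expanded from that latent at attention time.
d_model, d_latent, n_heads, d_head = 1024, 128, 8, 64
rng = np.random.default_rng(0)

W_down = rng.normal(size=(d_model, d_latent)) * 0.02          # trained jointly with the model
W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02  # key up-projection
W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02  # value up-projection

h = rng.normal(size=(d_model,))                 # hidden state for one token
latent = h @ W_down                             # (128,): the only cached tensor
k = (latent @ W_up_k).reshape(n_heads, d_head)  # reconstructed per-head keys
v = (latent @ W_up_v).reshape(n_heads, d_head)  # reconstructed per-head values

# Cache footprint per token: 128 floats instead of 2 * 8 * 64 = 1024 floats.
print(latent.size, 2 * n_heads * d_head)  # 128 1024
```

The key design point is that the up-projections can be applied after reading the cache, so memory scales with the latent dimension rather than with the number of heads.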

In addition to MLA, DeepSeek discusses other techniques for reducing KV cache size that could inspire future memory-efficient attention mechanisms. These include shared KV, where multiple attention heads share a single set of key-value pairs; window KV, which limits the context window for KV caching; and quantization compression, which reduces the precision of stored KV values.

Table 1 in the paper compares the per-token KV cache memory footprint of DeepSeek-V3, Qwen-2.5 72B, and LLaMA-3.1 405B. DeepSeek-V3 requires only 70 KB per token, a significant reduction compared to LLaMA-3.1 405B’s 516 KB and Qwen-2.5 72B’s 327 KB.
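These figures can be roughly reproduced from the models’ published configurations. The sketch below assumes BF16 (2-byte) cache entries, MLA caching one 512-dimensional latent plus a 64-dimensional decoupled RoPE key per layer for DeepSeek-V3, and grouped-query attention with 8 KV heads of dimension 128 for the two dense models; these configuration numbers are assumptions drawn from the respective model reports, not from this paper:

```python
# Back-of-the-envelope per-token KV cache sizes, assuming BF16 (2-byte) entries.

def mla_kv_bytes(layers, latent_dim, rope_dim, dtype_bytes=2):
    # MLA caches one compressed latent vector plus a decoupled RoPE key per layer.
    return layers * (latent_dim + rope_dim) * dtype_bytes

def gqa_kv_bytes(layers, kv_heads, head_dim, dtype_bytes=2):
    # Grouped-query attention caches full K and V for each KV head per layer.
    return layers * kv_heads * head_dim * 2 * dtype_bytes

deepseek_v3 = mla_kv_bytes(layers=61, latent_dim=512, rope_dim=64)
qwen_72b    = gqa_kv_bytes(layers=80, kv_heads=8, head_dim=128)
llama_405b  = gqa_kv_bytes(layers=126, kv_heads=8, head_dim=128)

print(deepseek_v3 // 1000, qwen_72b // 1000, llama_405b // 1000)  # 70 327 516
```

Under these assumptions the arithmetic lands on the same 70 KB, 327 KB, and 516 KB per-token figures reported in Table 1.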

For cost-effectiveness, DeepSeek developed DeepSeekMoE, an advanced Mixture-of-Experts (MoE) architecture shown in Figure 1 (bottom right). MoE models offer two main advantages. First, they reduce training compute by selectively activating only a subset of expert parameters per token, allowing a large total parameter count while keeping computational demands manageable. DeepSeek-V3 has 671 billion parameters, nearly three times its predecessor V2’s 236 billion, but activates only 37 billion parameters per token. In contrast, dense models like Qwen-2.5 72B and LLaMA-3.1 405B activate all parameters for every token. Table 2 shows that DeepSeek-V3 achieves comparable or better performance than these dense models at roughly an order of magnitude lower computational cost: about 250 GFLOPs per token versus 394 GFLOPs for the 72B dense model and 2448 GFLOPs for the 405B dense model.
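The selective-activation idea can be sketched with a generic top-k router. This is a toy illustration, not DeepSeek’s actual gating function, and all sizes here are made up for readability:

```python
import numpy as np

# Toy top-k MoE layer: the router scores every expert, but only the top-k
# experts actually run, so per-token compute scales with k, not with the
# total expert count.
n_experts, d, k = 16, 32, 2
rng = np.random.default_rng(0)
experts = [rng.normal(size=(d, d)) * 0.05 for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts)) * 0.05

def moe_forward(x):
    scores = x @ router
    topk = np.argsort(scores)[-k:]                      # indices of selected experts
    weights = np.exp(scores[topk]) / np.exp(scores[topk]).sum()
    # Only k of the n_experts weight matrices are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, topk))

y = moe_forward(rng.normal(size=(d,)))
print(y.shape)  # (32,)
```

Total parameters grow with `n_experts`, but per-token FLOPs grow only with `k`, which is the mechanism behind the activated-parameter gap described above.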

Second, MoE models offer advantages for personal use and local deployment. The selective activation of parameters means much lower memory and compute requirements during single-request inference. For example, DeepSeek-V2 (236B parameters) activates only 21 billion parameters during inference, enabling near or above 20 tokens per second on AI SoC-equipped personal computers. This performance surpasses similarly sized dense models on comparable hardware and opens the door for personalized LLM agents running locally.

DeepSeek also prioritizes inference speed, focusing on both system-level maximum throughput and single-request latency. To maximize throughput, the model uses a dual micro-batch overlapping architecture that hides communication latency behind computation from the start. It separates the computation of MLA and MoE into distinct stages. While one micro-batch performs part of the MLA or MoE computation, the other concurrently executes the corresponding dispatch (all-to-all) communication. During the second micro-batch’s computation, the first micro-batch handles the combine communication step. This pipelined approach allows all-to-all communication to overlap seamlessly with continuous computation, ensuring full GPU utilization.
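The overlap pattern can be visualized with a toy timeline. This is a schematic only; the stage names and the fixed one-step offset are simplifications of the paper’s actual pipeline:

```python
# Two micro-batches march through the same four stages one step out of phase,
# so at every step one is computing while the other communicates.
stages = ["MLA compute", "dispatch all-to-all", "MoE compute", "combine all-to-all"]

def overlapped_schedule(n_steps):
    timeline = []
    for t in range(n_steps):
        a = stages[t % len(stages)]          # micro-batch A's current stage
        b = stages[(t + 1) % len(stages)]    # micro-batch B runs one stage ahead
        timeline.append((a, b))
    return timeline

for a, b in overlapped_schedule(4):
    print(f"A: {a:20s} | B: {b}")
```

Because compute and communication stages alternate, the one-step phase offset guarantees that at every time step exactly one micro-batch is communicating while the other keeps the GPUs busy.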

In production, DeepSeek employs a prefill and decode separation architecture. It assigns large-batch prefill requests and latency-sensitive decode requests to different-sized expert-parallel groups, maximizing system throughput under real-world serving conditions. The paper also emphasizes the importance of test-time scaling for reasoning models and highlights the critical role of high token output speed in reinforcement learning workflows and reducing user-perceived latency during long inference sequences. Optimizing inference speed through hardware-software co-innovation is therefore essential for efficient reasoning models.

On the topic of low-precision design, DeepSeek pioneers the use of FP8 mixed-precision training for a large-scale MoE model. While quantization methods like GPTQ and AWQ have reduced memory needs mainly for inference, DeepSeek-V3 is the first publicly known large model to leverage FP8 for training. This milestone was achieved through close collaboration between infrastructure and algorithm teams and extensive experimentation. Using FP8 significantly lowers computational costs while maintaining model quality, making large-scale training more feasible. Figure 1 shows the FP8 precision used in forward and backward passes during training.

DeepSeek also applies low-precision compression to network communication within DeepSeek-V3. During Expert Parallelism (EP), tokens are dispatched using fine-grained FP8 quantization, cutting communication volume roughly in half compared with BF16 and substantially reducing communication time. Beyond standard floating-point formats, DeepSeek also experimented with a novel data type called LogFMT-nBit (Logarithmic Floating-Point Formats).
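The article does not detail LogFMT’s exact encoding, but the idea of a logarithmic format can be sketched as follows: store a sign plus an n-bit code that places magnitudes on a uniform grid in log space. This is an illustrative reconstruction under that assumption, not DeepSeek’s actual format:

```python
import numpy as np

def log_quantize(x, n_bits=8, eps=1e-8):
    # Keep the sign separately; quantize the log-magnitude on a uniform grid.
    sign = np.sign(x)
    logmag = np.log(np.abs(x) + eps)
    lo, hi = logmag.min(), logmag.max()
    levels = 2 ** (n_bits - 1) - 1          # one bit reserved for the sign
    code = np.round((logmag - lo) / (hi - lo) * levels).astype(np.int32)
    return sign, code, (lo, hi)

def log_dequantize(sign, code, bounds, n_bits=8):
    lo, hi = bounds
    levels = 2 ** (n_bits - 1) - 1
    return sign * np.exp(lo + code / levels * (hi - lo))

x = np.random.default_rng(0).normal(size=1024).astype(np.float32)
sign, code, bounds = log_quantize(x)
x_hat = log_dequantize(sign, code, bounds)
# Log-domain spacing gives roughly constant *relative* error across magnitudes,
# which suits the wide dynamic range of activations being communicated.
```

The appeal of a log-domain grid over a linear one is that small and large values get similar relative precision, at the cost of a cheap exp/log transform on either side of the wire.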

Regarding hardware constraints, DeepSeek currently trains on NVIDIA H800 SXM GPUs. The H800 is based on the same Hopper architecture as the H100 but, due to regulatory requirements, has reduced FP64 compute performance and reduced NVLink bandwidth (400 GB/s, down from 900 GB/s on the H100). This reduction in intra-node (scale-up) bandwidth presents challenges for high-performance workloads. To compensate, each node is equipped with eight 400G InfiniBand CX7 network interface cards (NICs) to boost inter-node (scale-out) capabilities.

To address these limitations, DeepSeek-V3 incorporates hardware-aware parallelization strategies. These include avoiding Tensor Parallelism (TP), enhancing Pipeline Parallelism (PP), and accelerating Expert Parallelism (EP). The paper provides detailed explanations of these strategies.

A key feature of model co-design is “node-aware routing” for TopK expert selection in the MoE architecture. Because effective intra-node bandwidth (NVLink, about 160 GB/s) is roughly four times higher than effective inter-node bandwidth (InfiniBand, about 40 GB/s per NIC), DeepSeek designed the routing to exploit the faster intra-node links. The 256 routed experts (4 per GPU in an 8-node, 64-GPU setup) are grouped into 8 groups of 32 experts, with each group residing on a single node. The routing algorithm ensures each token is sent to at most 4 nodes, reducing InfiniBand communication bottlenecks and improving effective bandwidth during training. Tokens destined for multiple experts on the same node are transmitted over InfiniBand once and then forwarded between GPUs via NVLink, minimizing redundant InfiniBand traffic.
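A simplified version of this node-limited selection can be sketched as follows. The grouping numbers match the description above, but ranking nodes by their single best expert score is a simplification of the paper’s actual group-scoring rule:

```python
import numpy as np

# Node-aware TopK routing sketch: first restrict the token to its best
# MAX_NODES nodes, then pick the global top-k experts from only those nodes.
N_EXPERTS, N_NODES, TOP_K, MAX_NODES = 256, 8, 8, 4
EXPERTS_PER_NODE = N_EXPERTS // N_NODES  # 32 experts co-located per node

def node_aware_topk(scores):
    per_node = scores.reshape(N_NODES, EXPERTS_PER_NODE)
    node_score = per_node.max(axis=1)                  # score each node by its best expert
    keep_nodes = np.argsort(node_score)[-MAX_NODES:]   # restrict to the best 4 nodes
    masked = np.full(N_EXPERTS, -np.inf)
    for n in keep_nodes:
        s, e = n * EXPERTS_PER_NODE, (n + 1) * EXPERTS_PER_NODE
        masked[s:e] = scores[s:e]                      # only kept nodes stay selectable
    chosen = np.argsort(masked)[-TOP_K:]
    return chosen, {int(i) // EXPERTS_PER_NODE for i in chosen}

scores = np.random.default_rng(0).normal(size=N_EXPERTS)
experts, nodes = node_aware_topk(scores)
print(sorted(nodes))  # at most 4 distinct node ids
```

By construction every chosen expert lives on one of at most four nodes, which is exactly the property that bounds a token’s InfiniBand fan-out.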

Looking ahead, while node-aware routing reduces bandwidth demands, the bandwidth gap between NVLink and InfiniBand complicates the implementation of communication-intensive kernels. Currently, GPU Streaming Multiprocessors (SMs) handle both network message processing and data forwarding via NVLink, consuming significant compute resources. DeepSeek advocates integrating intra-node (scale-up) and inter-node (scale-out) communication into a unified framework.

They propose dedicated co-processors for network traffic management and seamless forwarding between the NVLink and InfiniBand domains. This could reduce software complexity and maximize bandwidth utilization. Hardware support for dynamic traffic deduplication could further optimize strategies like DeepSeek-V3’s node-aware routing. The paper also explores emerging interconnect protocols such as Ultra Ethernet Consortium (UEC) and Ultra Accelerator Link (UALink), noting the Unified Bus (UB) as a recent approach to converging scale-up and scale-out. It details programming-framework methods for this convergence, including unified network adapters, dedicated communication co-processors, flexible forwarding and broadcast/reduce mechanisms, and hardware synchronization primitives.

Another hardware limitation is the lack of flexibility in dynamically allocating bandwidth between different traffic types on NVLink and PCIe. For example, transferring KV cache data from CPU memory to GPUs during inference can saturate PCIe bandwidth, causing contention with inter-GPU Expert Parallelism communication via InfiniBand. This contention can degrade performance and cause latency spikes. DeepSeek suggests solutions such as dynamic NVLink/PCIe traffic prioritization, I/O chiplet integration, and CPU-GPU interconnect improvements within the scale-up domain.

For large-scale network design, DeepSeek-V3 training uses a Multi-Plane Fat-Tree (MPFT) scale-out network. Each node, with 8 GPUs and 8 InfiniBand NICs, assigns each GPU-NIC pair to a different network plane. Each node also has a 400 Gbps Ethernet RoCE NIC connected to a separate storage network plane for accessing the 3FS distributed file system. The scale-out network uses 64-port 400G InfiniBand switches, theoretically supporting up to 16,384 GPUs while maintaining the cost and latency benefits of a two-layer network. However, due to policy and regulatory constraints, the actual deployment involved over two thousand GPUs.

The deployed MPFT network did not fully realize its intended architecture because of current limitations in the InfiniBand ConnectX-7 hardware. Ideally, each NIC would have multiple physical ports, each connected to a separate network plane but presented as a single logical interface via port bonding. This would allow a single Queue Pair (QP) to send and receive messages across all ports, similar to packet spraying. Native out-of-order placement support within the NIC would be necessary to ensure message consistency and correct ordering, as packets from the same QP might take different paths and arrive out of order. InfiniBand ConnectX-8 supports four planes natively, and future NICs with full multi-plane capabilities will greatly enhance scalability for large AI clusters. Multi-plane architectures offer advantages in fault isolation, robustness, load balancing, and scalability for large systems.

DeepSeek highlights several benefits of MPFT, including its composition as a subset of Multi-Rail Fat-Tree (MRFT), which allows seamless integration of existing NVIDIA and NCCL optimizations for MRFT networks. MPFT is cost-effective, provides traffic isolation, reduces latency, and improves robustness. Performance analysis comparing MPFT and MRFT (Figures 5 and 6, Table 4) shows that all-to-all performance of multi-plane networks is very similar to single-plane multi-rail networks. When training the V3 model on 2048 GPUs, MPFT and MRFT performance was nearly identical.

Low-latency networking is critical for DeepSeek’s model inference, where large-scale Expert Parallelism relies heavily on all-to-all communication sensitive to both bandwidth and latency. Even microsecond-level network latency can significantly affect system performance.

The paper analyzes the latency characteristics of InfiniBand (IB) and RoCE (Table 5), noting IB’s consistently lower latency, which makes it preferable for latency-sensitive workloads like distributed training and inference. While RoCE offers a potentially cost-effective alternative, its current latency and scalability limitations prevent it from fully meeting the demands of large-scale AI systems. DeepSeek proposes improvements for RoCE, including dedicated low-latency RoCE switches, optimized routing policies, and enhanced traffic isolation and congestion control.

To further reduce network communication latency, DeepSeek uses InfiniBand GPUDirect Async (IBGDA). Traditionally, network communication involves CPU proxy threads, adding overhead. IBGDA allows GPUs to directly populate Work Request content and write to RDMA doorbell MMIO addresses, eliminating latency from GPU-CPU communication. Managing the entire control plane within the GPU avoids CPU bottlenecks, especially when sending many small packets, as GPU parallel threads can distribute the workload. DeepSeek’s DeepEP and other works have shown significant performance gains using IBGDA, and DeepSeek advocates broad support for such features across accelerator devices.

The paper concludes with a discussion of future hardware architecture design directions based on identified limitations and proposed solutions. These include addressing robustness challenges like hardware failures and silent data corruption through advanced error detection and correction for continuous AI infrastructure. It also highlights the need to overcome CPU bottlenecks and interconnect limitations by optimizing CPU-accelerator collaboration and breaking traditional interface constraints like PCIe for high-speed intra-node communication.

Further directions include developing intelligent networks for AI with low latency and adaptive routing, resolving data consistency and ordering challenges in memory semantic communication through hardware-level guarantees, offloading computation and compression into the network to unlock bandwidth potential, and innovating memory-centric architectures to address the memory bandwidth crisis caused by exponential model scaling. Technologies like DRAM stacking and wafer-scale integration are explored.

The paper provides detailed insights and recommendations in each area, emphasizing the importance of a holistic co-design approach between hardware and software to enable continued progress and accessibility of large-scale AI.

In summary, this technical report offers valuable insights into the challenges and solutions encountered during DeepSeek-V3’s development and training. By carefully analyzing the interplay between model architecture and hardware limitations, DeepSeek presents a compelling vision for future AI infrastructure. It underscores the critical role of hardware-aware co-design in achieving cost-efficient, scalable large language models. The paper’s detailed exploration of techniques such as MLA, DeepSeekMoE, FP8 training, LogFMT, and the MPFT network, along with forward-looking hardware development recommendations, makes a significant contribution to large-scale AI research and engineering.

By Futurete

My name is Go Ka, and I’m the founder and editor of Future Technology X, a news platform focused on AI, cybersecurity, advanced computing, and future digital technologies. I track how artificial intelligence, software, and modern devices change industries and everyday life, and I turn complex tech topics into clear, accurate explanations for readers around the world.