
DeepSeek V4 Explained: Cost Reduction, Efficiency Enhancement, and Agent Optimization

CSC Financial Co., Ltd. · Apr 27 12:35

The preview version of DeepSeek-V4 has been released, with its parameter count doubling relative to the previous generation. Its performance rivals global closed-source models and reaches the state-of-the-art (SOTA) level among open-source models. Computational cost continues to fall, marking the advent of a high-cost-performance era for million-token-context models. DeepSeek-V4 innovates and upgrades core components such as the hybrid attention mechanism, mHC, and Muon, and showcases numerous highlights including compute-to-communication ratio optimization, heterogeneous KV Cache, and FP4 quantization-aware innovations. Domestic computing power and domestic models continue to co-evolve and integrate deeply, heralding a golden age for domestic computing power. DeepSeek-V4 adheres to an open-source strategy, significantly reducing costs while further enhancing capabilities such as context length and Agent functionality, offering comprehensive benefits for complex application scenarios.

▍DeepSeek-V4 Preview: Parameter count doubles compared to the previous generation, introducing a high-cost-performance million-context model.

At noon on April 24, DeepSeek released its new-generation model V4-Preview, comprising two base models, DeepSeek-V4-Pro and DeepSeek-V4-Flash, both supporting a 1M-token context window. The two serve different purposes: 1) V4-Pro is positioned as a high-performance expert model with 1.6 trillion total parameters and 49 billion activated parameters, more than double DeepSeek V3.2; 2) V4-Flash is positioned as a fast, high-cost-performance model with 284 billion total parameters and 13 billion activated parameters. On pricing, the official rate for DeepSeek-V4-Pro on April 24 was ¥12/MTokens for input and ¥24/MTokens for output; after a discount announced on April 25, it dropped to ¥3/MTokens for input and ¥6/MTokens for output, highly competitive globally. According to DeepSeek's official WeChat account, Pro model service throughput is currently very limited, and DeepSeek expects the price of Pro to drop significantly once Ascend 950 super nodes reach mass production in the second half of the year.

▍Model Performance: Evaluation and practical usage rival global closed-source models, achieving SOTA among open-source models.

The official paper compares closed-source and open-source models across dimensions such as reasoning, long context, and Agentic Coding. In knowledge-based tasks, DeepSeek-V4-Pro-Max outperforms open-source models and narrows the gap with closed-source models. In reasoning tasks, DeepSeek-V4-Pro-Max surpasses GPT-5.2 and Gemini-3.0-Pro but slightly trails GPT-5.4 and Gemini-3.1-Pro, while DeepSeek-V4-Flash-Max performs on par with GPT-5.2 and Gemini-3.0-Pro. In Agent tasks, DeepSeek-V4-Pro-Max matches leading open-source models but slightly lags cutting-edge closed-source models; internal evaluations show it outperforms Claude Sonnet 4.5 and approaches Opus 4.5. In practical industry tests, its long-context capability has been praised for usability and stability, with notable improvements in programming ability: it ranks third among open-source models on Arena.ai's coding arena.

▍Model Innovations: Upgraded innovations in core areas such as hybrid attention mechanisms, mHC, and Muon.

1) Innovative adoption of a CSA+HCA hybrid attention architecture reduces the computational cost and cache usage of self-attention layers. DeepSeek V4 Preview continues the cost-reduction, efficiency-oriented design of prior models' self-attention layers: it alternates compressed sparse attention (CSA) and heavily compressed attention (HCA) structures within the Attention layer, compressing the KV Cache of multiple tokens into a single KV entry. This lets the model maintain understanding of ultra-long contexts while minimizing computational cost and cache usage. According to DeepSeek's official paper, in a 1-million-token context scenario, DeepSeek-V4-Pro requires only 27% of the per-token inference FLOPs and 10% of the KV Cache of DeepSeek-V3.2, while DeepSeek-V4-Flash further reduces these to 10% of per-token inference FLOPs and 7% of KV Cache.
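The exact CSA/HCA formulations are not public beyond the description above, so the Python sketch below only illustrates the generic idea they rely on: pooling the KV entries of a block of tokens into a single cache entry, shrinking the cache roughly by the block size. The function name compress_kv, the mean-pooling rule, and the block size of 8 are hypothetical choices for illustration, not DeepSeek's actual design.

```python
# Illustrative sketch only: compress every block of `block` tokens' KV
# entries into one cache entry, reducing cache footprint by ~`block`x.
import numpy as np

def compress_kv(k: np.ndarray, v: np.ndarray, block: int = 8):
    """k, v: [seq_len, n_kv_heads, head_dim]. Mean-pool each block of
    tokens into a single compressed KV entry (hypothetical scheme)."""
    n_blocks = k.shape[0] // block  # drop any ragged tail for brevity
    ck = k[: n_blocks * block].reshape(n_blocks, block, *k.shape[1:]).mean(axis=1)
    cv = v[: n_blocks * block].reshape(n_blocks, block, *v.shape[1:]).mean(axis=1)
    return ck, cv  # [n_blocks, n_kv_heads, head_dim]

k = np.random.randn(1024, 8, 128).astype(np.float32)
v = np.random.randn(1024, 8, 128).astype(np.float32)
ck, cv = compress_kv(k, v)
print(ck.shape)  # (128, 8, 128): cache footprint reduced 8x
```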

2) mHC updates the residual-connection paradigm, and post-training adopts an online hybrid distillation strategy building on the mechanisms introduced in V3.2. Classic Hyper-Connections (HC) can suffer from vanishing or exploding gradients as model depth increases, limiting parameter scaling. DeepSeek V4 introduces manifold-constrained hyper-connections (mHC), which preserve multi-path information transfer between layers while restricting how much each layer can amplify or attenuate information, improving training stability for deeper architectures and longer contexts. Post-training in DeepSeek V4 builds on the V3.2 framework by incorporating online hybrid distillation (OPD): multiple domain-expert models trained for mathematics, coding, Agent functionality, and instruction following are distilled into a unified student model. We believe DeepSeek V4's algorithmic innovations in training mechanisms further enhance the stability of ultra-high-parameter, ultra-long-context model training.
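As rough intuition for how a manifold constraint can stabilize hyper-connections, the sketch below keeps several parallel residual streams and mixes them each layer with a matrix Sinkhorn-normalized toward doubly stochastic form, so no layer can systematically blow up or wash out the streams. This constraint is an assumed stand-in for illustration; the actual manifold and parameterization used by mHC may differ.

```python
# Hedged sketch of constrained stream mixing; not mHC's published design.
import torch

def sinkhorn(logits: torch.Tensor, iters: int = 5) -> torch.Tensor:
    """Push an n x n positive matrix toward doubly stochastic form."""
    m = logits.exp()
    for _ in range(iters):
        m = m / m.sum(dim=1, keepdim=True)  # normalize rows
        m = m / m.sum(dim=0, keepdim=True)  # normalize columns
    return m

class ConstrainedMix(torch.nn.Module):
    def __init__(self, n_streams: int, d_model: int):
        super().__init__()
        self.mix_logits = torch.nn.Parameter(torch.zeros(n_streams, n_streams))
        self.block = torch.nn.Linear(d_model, d_model)  # stand-in for attn/FFN

    def forward(self, streams: torch.Tensor) -> torch.Tensor:
        # streams: [n_streams, batch, d_model]
        mix = sinkhorn(self.mix_logits)       # rows/cols sum to ~1: bounded gain
        mixed = torch.einsum("ij,jbd->ibd", mix, streams)
        return mixed + self.block(mixed)      # residual update per stream

x = torch.randn(4, 2, 64)                     # 4 streams, batch 2, d_model 64
layer = ConstrainedMix(n_streams=4, d_model=64)
print(layer(x).shape)                         # torch.Size([4, 2, 64])
```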

▍Computational Optimization: Numerous innovative highlights including compute-to-communication ratio, heterogeneous KV Cache, and FP4 quantization-aware innovations.

1) An optimal ratio exists between computation and communication, enabling targeted optimization of domestic computing power. The compute-to-communication ratio theory proposed by DeepSeek V4 represents a significant breakthrough in MoE large-model system optimization, challenging the industry's entrenched belief that "MoE efficiency relies entirely on extremely high bandwidth." DeepSeek V4 designs a fine-grained wave-scheduling expert-parallelism scheme that fully overlaps communication with computation, delivering up to a 1.96x performance improvement in testing. From the experimental results and a theoretical derivation of the new EP parallelism scheme, DeepSeek determined the optimal compute-to-communication ratio. The team highlights that the core bottleneck of MoE expert parallelism is not absolute bandwidth but whether the compute-to-bandwidth ratio meets a balanced threshold. Through quantitative analysis, DeepSeek identifies the golden balance point for the MoE architecture: 6144 FLOPs/Byte, meaning every 1 GB/s of interconnect bandwidth can fully support the communication demands of 6.1 TFLOP/s of compute. Once bandwidth meets this threshold, stacking more of it only takes chip area away from compute, with diminishing marginal returns. We believe this theory provides foundational support for the rise of domestic hardware, with domestic computing chips and super nodes poised to benefit.
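The 6144 FLOPs/Byte figure is easy to sanity-check: 6144 FLOPs per byte times 1 GB/s (10^9 bytes/s) is about 6.1 x 10^12 FLOP/s, i.e. 6.1 TFLOP/s per 1 GB/s of interconnect, matching the text. The helper below applies the threshold to hypothetical chip figures; the TFLOP/s and GB/s numbers are placeholders, not real product specs.

```python
# Apply the reported MoE balance point to a chip's compute/bandwidth specs.
GOLDEN_RATIO = 6144  # FLOPs per byte of interconnect bandwidth

def check_balance(tflops: float, bandwidth_gbs: float) -> None:
    ratio = (tflops * 1e12) / (bandwidth_gbs * 1e9)  # achieved FLOPs/Byte
    verdict = ("bandwidth sufficient, compute fully utilized"
               if ratio <= GOLDEN_RATIO
               else "communication-bound: bandwidth is the bottleneck")
    print(f"{tflops} TFLOP/s over {bandwidth_gbs} GB/s "
          f"-> {ratio:.0f} FLOPs/Byte: {verdict}")

check_balance(tflops=1000, bandwidth_gbs=200)  # 5000 FLOPs/Byte: sufficient
check_balance(tflops=1000, bandwidth_gbs=100)  # 10000 FLOPs/Byte: bottlenecked
```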

2) Innovative optimization of the KV Cache increases the importance of SSDs and potentially benefits edge deployment. DeepSeek V4 applies two heterogeneous types of compression to the KV Cache, an innovative engineering breakthrough. The KV footprint of V4-Pro's million-token context window is only 10% of V3.2's, and V4-Flash's just 7% of the previous generation's, making it the first open-source frontier model trained with part of the KV Cache offloaded to SSD. Leveraging this heterogeneous hierarchical mechanism, the model fully relocates finalized history blocks to disk, efficiently decoupling hot and cold data. For the hot SWA-window data, the paper proposes three strategies that flexibly balance write pressure against recomputation cost across different scenarios. We believe that in the cloud, V4's approach raises the importance of SSDs in data centers by compressing shared prefixes once and skipping repeated prefills; at the edge, it effectively reduces the deployment cost and threshold for edge models. For example, edge models with a few billion to several tens of billions of parameters typically have weights of only a few to just over ten GB under Q4 quantization, yet the KV Cache of a dense model with a 1M context can be several times larger than its weights.
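The closing claim is straightforward to verify with a back-of-the-envelope calculation. The model shape below (a generic ~8B dense decoder with grouped-query attention and an FP16 cache) is a hypothetical example, not DeepSeek's architecture; under these assumptions an uncompressed 1M-token KV Cache is roughly 30x the size of the Q4-quantized weights.

```python
# Back-of-the-envelope check of the edge-deployment claim above.
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    # 2 tensors (K and V) per layer, cached for every token, FP16 by default
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

params_b = 8                    # hypothetical 8B dense model
weights_q4_gb = params_b * 0.5  # ~0.5 bytes/param under Q4 quantization
cache_gb = kv_cache_gb(layers=32, kv_heads=8, head_dim=128, seq_len=1_000_000)

print(f"Q4 weights : ~{weights_q4_gb:.0f} GB")  # ~4 GB
print(f"1M-token KV: ~{cache_gb:.0f} GB")       # ~131 GB, ~30x the weights
```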

▍Domestic Computing Power: Domestic computing power and domestic models continue to converge.

On the day the DeepSeek model was released, domestic chip vendors announced day-zero compatibility. We believe the development of domestic models will further boost the growth of domestic computing power, with domestic computing power and models continuing to converge.

1) V4 strengthens the certainty of domestic computing power. Previously, the market was concerned about the limited application scenarios for domestic AI chips. The simultaneous adaptation of V4 indicates that domestic chips are entering the mainstream open-source large model ecosystem.

2) V4 changes the demand structure for domestic computing power. The focus is no longer solely on training cards but increasingly on inference cards, super nodes, interconnects, liquid cooling, and software stacks. The core of future orders will be determined not by 'who has the highest single-card computing power' but by 'who can run large models like DeepSeek at the lowest cost with stability.'

3) V4 raises the commercial ceiling for domestic computing power. When 1M context, Agent, and Coding reach a low-cost and usable stage, enterprise-level AI applications will transition from pilots to large-scale deployments, shifting the demand for domestic computing power from policy-driven to business-driven.

▍Application Impact: DeepSeek V4 continues its open-source strategy, significantly reducing input-output costs while further enhancing capabilities such as context length and Agent functionality, benefiting the implementation of complex application scenarios.

DeepSeek lowers the application threshold through cost-effective inference, paving the way for new business models. Building on this, software companies with deep industry know-how that are tightly integrated into enterprise recordkeeping, transaction, and payment workflows, as well as specialized software companies that hold private-data barriers in vertical niche scenarios or operate under strong industry regulation requiring deliverable results, are expected to benefit greatly from AI-enabled value growth.

▍Risk Factors:

Core AI technology development and application expansion falling short of expectations, insufficient reduction in computing power costs, improper use of AI causing severe social impacts, data security risks, information security risks, and intensified industry competition.

▍Investment Strategy: We recommend focusing on the following three investment themes.

1) AI Infrastructure: DeepSeek is deeply compatible with domestic computing power, and domestic computing power and domestic models continue to converge.

2) AI Applications: The model continues its open-source strategy, with input and output costs significantly reduced, and further enhancements in capabilities such as context length and Agent, benefiting complex application scenarios and companies with competitive advantages.

3) Model Manufacturers: DeepSeek's new-generation model is expected to work alongside other domestic models to drive China's AI advancement globally. Meanwhile, model training costs fall further, and cheaper tokens are driving an overall increase in global large-model API usage.

Note: This article is excerpted from the 'Computer Industry Intelligence Leadership (AI SOTA) Series Report 9 - DeepSeek V4 Detailed Analysis: Cost Reduction and Efficiency Improvement, Exploring Agent Opportunities' report published by CITIC Securities Research Department on April 26, 2026; Analysts: Yang Zechuan S1010517080002, Sun Jingyao S1010524080014, Ding Qi S1010519120003, Zhu Jueqi S1010525030005, Pan Ruchen S1010520110001, Xu Zhengyuan S1010525100005, Ma Qingliu S1010522090001, Han Linxuan S1010525120004.

Editor/joryn


