
DeepSeek V4 Slashes Inference Costs with New Architecture

DeepSeek V4 introduces hybrid attention mechanisms and 4-bit precision to reduce KV cache memory usage by up to 13x, significantly lowering inference costs.

Author: AUG Bot


Chinese AI firm DeepSeek claims 10x memory efficiency improvement

Chinese AI developer DeepSeek has launched DeepSeek V4, a new mixture-of-experts (MoE) model that promises to drastically reduce the resource requirements for frontier-level AI. The release, announced on April 24, 2026, introduces architectural breakthroughs that the company claims allow for high-performance inference at a fraction of the cost of Western competitors.

Key details

DeepSeek V4 arrives in two versions: a 284-billion parameter "Flash" model with 13 billion active parameters, and a massive 1.6-trillion parameter "Pro" model with 49 billion active parameters. The architecture utilizes a novel hybrid attention mechanism—combining Compressed Sparse Attention and Heavy Compressed Attention—to reduce KV cache memory usage by 9.5x to 13.7x compared to previous versions when handling 1-million token contexts.
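The scale of the claimed KV cache savings is easy to put in concrete terms with back-of-the-envelope arithmetic. The sketch below estimates the cache footprint at a 1-million-token context and applies the upper end of the reported 9.5x to 13.7x reduction; the layer count, head count, and head dimension are hypothetical placeholders, not DeepSeek's published configuration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Estimate KV cache size: one K and one V tensor per layer."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

# Hypothetical model shape for illustration only.
baseline = kv_cache_bytes(
    tokens=1_000_000,   # the 1M-token context cited in the article
    layers=60,
    kv_heads=8,
    head_dim=128,
    bytes_per_elem=2,   # FP16/BF16 cache entries
)
compressed = baseline / 13.7  # upper end of the claimed reduction

print(f"baseline:   {baseline / 2**30:.1f} GiB")
print(f"compressed: {compressed / 2**30:.1f} GiB")
```

Even with these illustrative numbers, an uncompressed cache runs to hundreds of gibibytes at million-token contexts, which is why a double-digit compression ratio translates directly into cheaper serving.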

The model also leans heavily on lower-precision data types to save resources. DeepSeek V4 uses a mix of FP8 and FP4 precision, with FP4 effectively halving the memory required for model weights compared with the industry-standard FP8. On the pricing front, DeepSeek is offering API access to the V4-Pro model at $1.74 per million input tokens, significantly undercutting the $5.00 charged by competitors for similar frontier models such as GPT-5.5.
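The weight-memory claim follows directly from the bit widths: storage scales linearly with bits per parameter, so dropping from 8-bit to 4-bit weights halves the footprint. A minimal sketch, using the article's 1.6-trillion total parameter count for V4-Pro:

```python
def weight_memory_gib(params, bits):
    """Memory needed to store `params` weights at `bits` bits each."""
    return params * bits / 8 / 2**30

total_params = 1.6e12  # V4-Pro total parameter count from the article

fp8 = weight_memory_gib(total_params, 8)
fp4 = weight_memory_gib(total_params, 4)
print(f"FP8 weights: {fp8:.0f} GiB")
print(f"FP4 weights: {fp4:.0f} GiB")  # exactly half the FP8 figure
```

Note this counts raw weight storage only; activations, the KV cache, and any higher-precision accumulators add to the real serving footprint.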

Why this matters

The release demonstrates a shift in AI development toward "extreme efficiency" as a way to circumvent hardware constraints and rising compute costs. By achieving high performance with fewer active parameters and heavily compressed memory, DeepSeek is lowering the barrier for large-scale AI deployment while reducing the energy footprint of each inference request.

Context

This launch follows DeepSeek's trend of "compute-efficient" training. While previous models like V3 were optimized for Nvidia's Hopper architecture, V4 has been validated to run on both Nvidia GPUs and Huawei's Ascend NPUs. This diversification of hardware support suggests an effort to maintain scaling despite geopolitical restrictions on advanced semiconductor exports.

What happens next

The deployment of 4-bit (FP4) precision in a frontier-level model is likely to prompt other developers to adopt similar quantization techniques to manage the soaring costs of AI infrastructure. Industry analysts will be watching to see if the claimed performance holds up in real-world applications outside of synthetic benchmarks.


Source: The Register. Published on AI Usage Global. Author: AUG Bot.
