DeepSeek V4 Slashes Inference Costs with New Architecture

DeepSeek V4 pairs a hybrid attention mechanism, which the company says cuts KV cache memory usage by up to 13.7x, with 4-bit precision that shrinks model weights, lowering inference costs.

Author

AUG Bot

Chinese AI firm DeepSeek claims a roughly 10x reduction in KV cache memory

Chinese AI developer DeepSeek has launched DeepSeek V4, a new mixture-of-experts (MoE) model that promises to drastically reduce the resource requirements for frontier-level AI. The release, announced on April 24, 2026, introduces architectural breakthroughs that the company claims allow for high-performance inference at a fraction of the cost of Western competitors.

Key details

DeepSeek V4 arrives in two versions: a 284-billion-parameter "Flash" model with 13 billion active parameters, and a 1.6-trillion-parameter "Pro" model with 49 billion active parameters. The architecture uses a novel hybrid attention mechanism (combining Compressed Sparse Attention and Heavy Compressed Attention) that, according to DeepSeek, reduces KV cache memory usage by 9.5x to 13.7x compared to previous versions when handling 1-million-token contexts.
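As a rough illustration of the figures above, the Python sketch below computes the share of parameters that are active per token for each model, and what a 9.5x to 13.7x reduction would mean for a KV cache of a given size. The 100 GB baseline is a hypothetical value chosen only for illustration and does not come from DeepSeek; the function names are likewise invented for this example.

```python
# Parameter counts quoted in the article, in billions.
MODELS = {
    "Flash": {"total_b": 284, "active_b": 13},
    "Pro": {"total_b": 1600, "active_b": 49},
}


def active_fraction(total_b, active_b):
    """Share of a mixture-of-experts model's parameters used per token."""
    return active_b / total_b


def reduced_kv_cache_gb(baseline_gb, reduction_factor):
    """KV cache size after applying a claimed reduction factor."""
    return baseline_gb / reduction_factor


for name, params in MODELS.items():
    print(f"{name}: {active_fraction(**params):.1%} of parameters active")

# Hypothetical baseline: assume the older design needed 100 GB of KV cache
# at a 1-million-token context. This number is an assumption, not a
# published measurement.
BASELINE_GB = 100.0
for factor in (9.5, 13.7):
    size = reduced_kv_cache_gb(BASELINE_GB, factor)
    print(f"{factor}x reduction: {size:.1f} GB")
```

Both models activate fewer than 5% of their parameters per token, which is where much of the claimed compute saving comes from.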

The model also leans heavily on lower-precision data types to save resources. DeepSeek V4 uses a mixture of FP8 and FP4 precision; because FP4 stores each weight in 4 bits rather than 8, it roughly halves the memory required for those weights compared to industry-standard FP8. On the pricing front, DeepSeek is offering API access to the V4-Pro model at $1.74 per million input tokens, about 65% below the $5.00 per million input tokens charged by competitors for similar frontier models like GPT-5.5.
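The arithmetic behind those two comparisons can be sketched in a few lines of Python. This is a simplified estimate, not DeepSeek's own accounting: it assumes every weight of the 1.6-trillion-parameter Pro model is stored in a single format and ignores scaling factors and other overhead, whereas the article says the model mixes FP8 and FP4. The function names are invented for this example.

```python
def weight_memory_gb(params, bits_per_weight):
    """Decimal gigabytes needed to store `params` weights at a given width."""
    return params * bits_per_weight / 8 / 1e9


def price_discount(price, reference_price):
    """Fractional saving of `price` relative to `reference_price`."""
    return 1 - price / reference_price


PRO_PARAMS = 1.6e12  # total parameters of the "Pro" model, per the article

# Simplifying assumption: all weights in one format, no per-block scales.
fp8_gb = weight_memory_gb(PRO_PARAMS, 8)
fp4_gb = weight_memory_gb(PRO_PARAMS, 4)
print(f"FP8 weights: {fp8_gb:.0f} GB")
print(f"FP4 weights: {fp4_gb:.0f} GB")

# Prices per million input tokens, as quoted in the article.
discount = price_discount(1.74, 5.00)
print(f"V4-Pro input price is {discount:.1%} below the $5.00 reference")
```

In practice the saving on weights would fall somewhere below a full halving, since only part of the model is held in FP4 and quantized formats usually carry some extra metadata.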

Why this matters

The release demonstrates a shift in AI development toward "extreme efficiency" as a way to circumvent hardware constraints and rising compute costs. By achieving high performance with fewer active parameters and heavily compressed memory, DeepSeek is lowering the barrier for large-scale AI deployment while reducing the energy footprint of each inference request.

Context

This launch continues DeepSeek's trend of "compute-efficient" training. While previous models like V3 were optimized for Nvidia's Hopper architecture, V4 has been validated to run on both Nvidia GPUs and Huawei's Ascend NPUs. This broader hardware support suggests an effort to keep scaling despite geopolitical restrictions on advanced semiconductor exports.

What happens next

The deployment of 4-bit (FP4) precision in a frontier-level model is likely to prompt other developers to adopt similar quantization techniques to manage the soaring costs of AI infrastructure. Industry analysts will be watching to see if the claimed performance holds up in real-world applications outside of synthetic benchmarks.


Source: The Register. Published on AI Usage Global; author: AUG Bot.
