DeepSeek DSpark Boosts AI Inference Speed by Up to 85%

Researchers from Peking University and DeepSeek have released DSpark, an open-source speculative decoding framework capable of accelerating large language model inference by 60 to 85 percent per user in live production environments. The release marks DeepSeek's first significant technical contribution since the company closed a $7 billion funding round, arriving alongside the Chinese AI lab's preparations to officially launch its V4 model family in mid-July.

The timing is deliberate. DSpark is already fully deployed across DeepSeek's online services, and the V4 launch will introduce a new peak-and-off-peak API pricing mechanism designed to manage demand at scale. Together, the two developments signal that DeepSeek is building its next infrastructure layer around efficiency — squeezing more performance from existing hardware rather than simply scaling compute.

DeepSeek founder Liang Wenfeng co-authored the accompanying research paper, titled "DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation," underscoring how central the work is to the lab.

How DSpark's Speculative Decoding Architecture Works

Draft and Verify: The Core Mechanism

Speculative decoding splits text generation into two distinct roles. A small, fast draft model proposes a batch of candidate tokens. The full target model then verifies that batch in a single forward pass, accepting drafted tokens up to the first one it disagrees with and substituting its own token at that point. Several sequential decoding steps collapse into one parallel verification, cutting latency without touching model weights or degrading output quality.
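The draft-and-verify loop can be illustrated with a minimal sketch. It assumes greedy decoding and uses two hypothetical toy functions, `draft_model` and `target_model`, in place of real networks; it is not DSpark's code, and the verification loop stands in for what a real system does in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical toy "models": each maps a token prefix to the next token (greedy).
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-and-verify step; return the tokens to append to prefix."""
    # 1. Draft k candidate tokens autoregressively with the cheap model.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. Verify. A real system scores all k positions in one batched forward
    #    pass of the target model; this loop stands in for that pass.
    accepted: List[int] = []
    for tok in drafted:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft matched, so the same pass also yields one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def target_model(ctx: List[int]) -> int:
    # Toy stand-in for the large model.
    return (len(ctx) * 3) % 7


def draft_model(ctx: List[int]) -> int:
    # Toy stand-in for the drafter: deliberately wrong at position 3.
    return 0 if len(ctx) % 4 == 3 else (len(ctx) * 3) % 7


print(speculative_step([], draft_model, target_model, k=4))  # [0, 3, 6, 2]
```

In this toy run the first three drafted tokens are accepted and the fourth is replaced by the target's own choice, so the output matches what the target model alone would have produced in four sequential steps.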

DSpark improves on earlier speculative decoding approaches with two core additions that make the system more practical for production use.

Grafted Speculative Head — No Separate Draft Model Required

Rather than training an entirely separate draft model from scratch, DSpark grafts a lightweight speculative head directly onto the existing model checkpoint. The underlying model's output quality is preserved exactly, because the speculative head is additive rather than a replacement: the original weights are untouched and still verify every drafted token. This removes a major engineering burden that has historically made speculative decoding difficult to deploy at scale, since operators no longer need to train, maintain, and version-match a parallel draft model alongside their primary one.

Confidence Scoring and the Hardware-Aware Scheduler

The second addition is a confidence-scoring system that assigns each drafted token a probability of surviving verification. A hardware-aware scheduler then uses those scores to dynamically adjust how many tokens get checked, based on current GPU load.

When server traffic is light, the scheduler allows longer runs of speculative tokens to be verified — capturing more potential speedup. When traffic is heavy and GPU resources are under pressure, the system proactively discards low-confidence tokens before they consume compute on verifications likely to fail. The result is an adaptive inference loop that self-tunes to real-time conditions rather than operating at a fixed, potentially wasteful, speculation depth.
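One way to picture the scheduling decision is the sketch below. The article does not describe DSpark's actual policy, so everything here is an assumption made for illustration: the function name `tokens_to_verify`, the linear load-to-threshold mapping, the default thresholds, and the treatment of per-token confidences as independent probabilities that multiply along the draft.

```python
from typing import List


def tokens_to_verify(confidences: List[float], gpu_load: float,
                     min_threshold: float = 0.1,
                     max_threshold: float = 0.8) -> int:
    """Return how many leading draft tokens to send for verification.

    confidences[i] is the estimated probability that draft token i is
    accepted, given that all earlier draft tokens were accepted.
    gpu_load is a utilization estimate in [0, 1].
    """
    if not 0.0 <= gpu_load <= 1.0:
        raise ValueError("gpu_load must be in [0, 1]")

    # Assumed policy: the bar for verifying a token rises linearly with load.
    threshold = min_threshold + gpu_load * (max_threshold - min_threshold)

    survive = 1.0  # probability that the whole prefix so far is accepted
    keep = 0
    for c in confidences:
        survive *= c
        if survive < threshold:
            break  # this token and everything after it are dropped unverified
        keep += 1
    return keep


scores = [0.9, 0.8, 0.6, 0.3]
for load in (0.0, 0.5, 1.0):
    print(load, tokens_to_verify(scores, load))
```

With these hypothetical scores, an idle GPU verifies all four drafted tokens, a half-loaded one verifies two, and a saturated one verifies only the first.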

Performance Results in Live Production and Offline Benchmarks

Production Gains on DeepSeek V4-Flash and V4-Pro

In DeepSeek's online production environment — handling real user traffic under live load — DSpark delivered 60 to 85 percent faster single-user generation on V4-Flash, and 57 to 78 percent faster on V4-Pro, compared to DeepSeek's prior MTP-1 baseline.

Under certain latency-favorable conditions, throughput gains reached as high as 661 percent on V4-Flash and 406 percent on V4-Pro. These are not synthetic benchmark figures; they reflect performance measured in a production system serving actual requests.

Offline Benchmark Comparisons

On offline evaluations, DSpark increased accepted token length, the key metric for how many speculative tokens survive each verification pass, by 26 to 31 percent over Eagle3 and 16 to 18 percent over DFlash. Both Eagle3 and DFlash are recent state-of-the-art baselines in speculative decoding research, so these margins are gains over strong competitors rather than over outdated methods.
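For readers unfamiliar with the metric, the sketch below shows one plausible way to compute accepted length and a relative improvement from per-step logs. The article does not spell out the exact definition used in the paper, and the step counts here are invented purely for illustration; they are not benchmark data.

```python
from statistics import mean
from typing import Sequence


def mean_accepted_length(accepted_per_step: Sequence[int]) -> float:
    """Average number of tokens accepted per verification pass."""
    return mean(accepted_per_step)


def relative_gain_percent(new: float, baseline: float) -> float:
    """Percentage improvement of new over baseline."""
    return (new - baseline) / baseline * 100.0


# Hypothetical logs: tokens accepted in each of four verification passes.
baseline_steps = [2, 3, 2, 3]
improved_steps = [3, 3, 3, 4]

base = mean_accepted_length(baseline_steps)
new = mean_accepted_length(improved_steps)
print(f"{base:.2f} -> {new:.2f} tokens per pass, "
      f"{relative_gain_percent(new, base):.0f}% gain")
```

A higher mean accepted length means each expensive pass of the target model yields more output tokens, which is why the metric tracks end-to-end speedup.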

Critically, DSpark reduces wasted GPU compute from invalid token verifications while maintaining output quality that is identical to the base model. There is no quality-speed trade-off here — the speedup is purely architectural.

Model-Agnostic Compatibility and the DeepSpec Release

Works Across Model Families

DSpark is not locked to DeepSeek's own checkpoints. The team demonstrated compatibility with Alibaba's Qwen3 and Google's Gemma model families, confirming that the framework can be applied to third-party architectures without modification to the underlying weights. This model-agnostic design widens potential adoption beyond DeepSeek's own ecosystem.

DeepSpec: Open-Source Training Infrastructure

Alongside DSpark, the team open-sourced DeepSpec — a full-stack codebase for training and evaluating speculative decoding drafters. DeepSpec gives researchers and engineers the tooling to build, test, and benchmark their own speculative decoding implementations. Both DSpark and DeepSpec are released under an MIT license on GitHub, making them freely available for commercial and research use.

DeepSeek V4 and the New API Pricing Model

The DSpark release arrives as DeepSeek prepares to officially launch V4 in mid-July. The peak-and-off-peak API pricing mechanism that debuts with V4 is a tiered approach that adjusts costs based on when requests are made. With DSpark already deployed across all online services, DeepSeek enters the V4 era with a more efficient inference stack than it operated before, positioning the lab to handle higher request volumes at lower per-token compute cost.