DeepSeek V4 update introduces DSpark, boosting inference speed by 80%

DeepSeek V4 has received an update that introduces the speculative decoding framework DSpark and open-sources DeepSpec, the full-stack speculative decoding framework behind this release. DeepSeek-V4-Pro-DSpark adds a speculative decoding module to the existing DeepSeek-V4-Pro; the update focuses on engineering deployment rather than on the model’s intrinsic capabilities. DSpark has been deployed on real-world online traffic for DeepSeek-V4 (both Flash and Pro variants), where it significantly accelerates inference.

The core motivation behind DSpark is to address latency and throughput bottlenecks in production environments, especially under high concurrency. It combines high-throughput 'parallel generation' with adaptive 'load-aware verification.' On the generation side, it introduces a semi-autoregressive architecture that retains the throughput advantage of parallel draft models while adding a lightweight sequential module to model dependencies between tokens within a block. On the verification side, it uses hardware-aware, confidence-scheduled verification: a confidence head estimates the survival probability of each token, and the optimal verification length is set dynamically for each request.

For deployment, DSpark’s scheduler uses an asynchronous mechanism compatible with Zero-Overhead Scheduling (ZOS) and continuous CUDA graph replay, and it relies on historical predictions to determine the dynamic truncation length.
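
The idea of confidence-scheduled verification can be illustrated with a minimal Python sketch. It is not DSpark’s actual implementation: the function names, the `load_threshold` parameter, and the rule of cutting the draft where the survival probability falls below a load-dependent threshold are all assumptions made for illustration. The sketch treats each confidence value as the chance that a draft token is accepted given that all earlier tokens in the block were accepted, so a token's survival probability is the running product of those values.

```python
from itertools import accumulate
from operator import mul


def survival_probabilities(confidences):
    """Running product of per-token confidences.

    Entry i is the estimated probability that draft tokens 0..i are all
    accepted (an illustrative reading of "survival probability").
    """
    return list(accumulate(confidences, mul))


def choose_verify_length(confidences, load_threshold, min_len=1):
    """Pick how many draft tokens to send for verification.

    load_threshold is a hypothetical knob: a busier system would use a
    higher value, so fewer low-confidence tokens are verified.
    """
    length = 0
    for survival in survival_probabilities(confidences):
        # Survival never increases along the block, so stop at the first
        # token that falls below the threshold.
        if survival < load_threshold:
            break
        length += 1
    # Verify at least min_len tokens, but never more than were drafted.
    return min(max(min_len, length), len(confidences))


if __name__ == "__main__":
    draft_confidences = [0.95, 0.9, 0.8, 0.5, 0.4]  # made-up values
    print(choose_verify_length(draft_confidences, load_threshold=0.3))
    print(choose_verify_length(draft_confidences, load_threshold=0.6))
```

With these made-up confidences, a lightly loaded system (threshold 0.3) would verify four draft tokens, while a heavily loaded one (threshold 0.6) would verify three. A production scheduler would presumably derive the cutoff from hardware cost and batch load rather than from a fixed threshold.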