Nvidia says software cuts DeepSeek V4 token costs fivefold

Nvidia announced that its inference software has cut token costs for the DeepSeek V4 model on the Blackwell platform by as much as fivefold, following about a month of software improvements. The announcement focuses on the economics of running artificial intelligence models at scale, arguing that cost per token is becoming more important than raw chip specifications as companies move from pilot projects to live deployments.

Nvidia said its software stack spans model serving, runtime scheduling, kernels, communication libraries and hardware-level optimisation. On Blackwell systems, those layers can combine techniques such as disaggregated serving, large expert parallelism over NVLink, NVFP4 precision and multi-token prediction to increase throughput by as much as 20 times compared with a baseline setup. According to Nvidia, the software layer determines whether workloads use compute resources efficiently or add cost through idle capacity, poor scheduling or inefficient communication.

Customer examples include Baseten, which used TensorRT-LLM to serve DeepSeek V4 Pro and delivered up to 50% more tokens per second; Cognition, which uses the Dynamo inference framework to manage the graphics processors it runs inference on; Deep Infra, which serves open-source frontier models on Blackwell; and Together AI, which uses TensorRT-LLM on Blackwell for Cursor's real-time coding service.

Nvidia framed the improvement within the open-source software ecosystem built around CUDA, noting that PyTorch, which first launched with native CUDA support, has evolved alongside Nvidia's chip architectures.
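
To illustrate why throughput gains translate into lower cost per token, the relationship can be sketched with a simplified calculation. This is a minimal sketch, not Nvidia's methodology: it assumes cost per token is simply the hourly cost of the hardware divided by the tokens it produces in an hour, and the function name, the hourly price and the throughput figures are all hypothetical values chosen for the arithmetic.

```python
def cost_per_million_tokens(gpu_hourly_cost: float, tokens_per_second: float) -> float:
    """Simplified model: hourly hardware cost spread over tokens produced per hour."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical figures for illustration only; none of them come from Nvidia.
HOURLY_COST = 4.0              # assumed price of the hardware per hour
BASELINE_THROUGHPUT = 1000.0   # assumed tokens per second before software tuning
OPTIMISED_THROUGHPUT = 5000.0  # assumed tokens per second after software tuning

baseline = cost_per_million_tokens(HOURLY_COST, BASELINE_THROUGHPUT)
optimised = cost_per_million_tokens(HOURLY_COST, OPTIMISED_THROUGHPUT)
ratio = baseline / optimised

print(f"baseline:  {baseline:.4f} per million tokens")
print(f"optimised: {optimised:.4f} per million tokens")
print(f"cost reduction factor: {ratio:.1f}x")
```

Under this simplified model, hardware cost is fixed, so any software change that raises tokens per second lowers cost per token by the same factor. Real deployments also carry costs this sketch ignores, such as idle capacity, networking and power.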