How we keep GPUs reliable across Databricks AI

At Databricks AI, training workloads run at massive scale every week, and at that scale failures occur continuously across hardware, fabric, and software. This post, the first in a series, covers the failure modes we encounter running GPUs, the diverse workloads that surface them, and the multi-stage health check system that catches them. The focus is on training, the most demanding workload class.

Most GPU failures at scale fall into three categories: crashed jobs, silent slowdowns, and numerical corruption.

Crashed jobs often surface as the same symptom, the NCCL watchdog timeout message. The timeout itself reveals almost nothing about the root cause, so finding it requires tracing across the hardware, fabric, filesystem, and software layers.

Silent slowdowns involve a degraded GPU that keeps training: the logs look normal and the loss keeps trending down, but throughput is bottlenecked on the slowest GPU, which wastes compute and money. These slowdowns come from hardware running in a degraded state. The signals include DCGM throttle reasons such as HW_SLOWDOWN or HW_THERMAL_SLOWDOWN for thermal issues, and link health for interconnects.
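
To make the silent-slowdown case concrete, here is a minimal sketch, not our production health check. It assumes you already collect a per-rank compute time (measured before the gradient all-reduce, since synchronized step times look identical on every rank) and a raw throttle-reason bitmask per GPU. The function names, the 15% tolerance, and the input format are hypothetical. The two bit values mirror the NVML/DCGM clocks-throttle-reason constants as we understand them, so check them against the headers of your installed version.

```python
from statistics import median

# Assumed bit values mirroring the NVML/DCGM clocks-throttle-reason bitmask.
# Verify against the headers of your installed driver and DCGM version.
THROTTLE_BITS = {
    "HW_SLOWDOWN": 0x08,
    "HW_THERMAL_SLOWDOWN": 0x40,
}


def decode_throttle_reasons(mask: int) -> list[str]:
    """Return the names of the known throttle bits that are set in `mask`."""
    return [name for name, bit in THROTTLE_BITS.items() if mask & bit]


def find_stragglers(compute_seconds: dict[int, float],
                    tolerance: float = 1.15) -> list[int]:
    """Return ranks whose compute time exceeds `tolerance` x the median."""
    typical = median(compute_seconds.values())
    return sorted(rank for rank, t in compute_seconds.items()
                  if t > tolerance * typical)


def effective_step_seconds(compute_seconds: dict[int, float]) -> float:
    """In synchronous training every rank waits for the slowest one."""
    return max(compute_seconds.values())


if __name__ == "__main__":
    # Hypothetical measurements: rank 2 is degraded but still training.
    times = {0: 1.00, 1: 1.02, 2: 1.41, 3: 0.99}
    print("stragglers:", find_stragglers(times))
    print("step time set by slowest rank:", effective_step_seconds(times))
    print("throttle reasons:", decode_throttle_reasons(0x48))
```

Comparing against the median rather than the mean keeps one very slow GPU from dragging the baseline up and hiding itself. A flagged rank is only a hint: the throttle reasons and link health on that GPU are what point to a cause.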