Why the NVIDIA RTX 5090 Virtualization Bug Requires a System Reboot


Attention NVIDIA fans: NVIDIA’s latest flagship GPUs – the GeForce RTX 5090 and RTX PRO 6000 – have a serious virtualization bug that renders them completely unresponsive, requiring a full OS reboot to recover. This NVIDIA RTX 5090 bug is a real concern for developers, particularly those running cloud platforms or AI workloads that demand high GPU uptime.

Virtualization Issues Take Down NVIDIA’s Flagship Blackwell GPUs

CloudRift, a GPU cloud platform for developers, was the first to report the problem on the new NVIDIA RTX 5090 and RTX PRO 6000. Virtual machines using these GPUs stopped responding after only a couple of days. This specific NVIDIA RTX 5090 bug locks the GPU so that it cannot be accessed again without rebooting the host system, which is a disaster for large-scale VM hosts.

Interestingly, the issue does not affect older or other NVIDIA models, such as the RTX 4090, the Hopper H100, or the Blackwell B200. The NVIDIA RTX 5090 bug, and the matching issue on the RTX PRO 6000, appears to be confined to these two new Blackwell-based cards.

What Causes the NVIDIA RTX 5090 Issue?

The NVIDIA RTX 5090 bug appears when the GPU is passed through to a virtual machine using VFIO (Virtual Function I/O) device passthrough. The GPU stops responding after a Function Level Reset (FLR), the reset issued when the VM releases the device. This causes a kernel soft lockup as the two sides become deadlocked, with the host waiting on the guest while the guest waits on the host.
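
To make the failure mode concrete, here is a minimal sketch of the reset path involved, not a diagnostic tool from CloudRift or NVIDIA. It assumes a Linux host with the GPU bound to vfio-pci, must run as root, and uses the kernel's sysfs reset interface; the device address is a placeholder borrowed from the dmesg excerpt later in this article.

#!/usr/bin/env python3
# Minimal sketch: trigger a PCI function reset via sysfs and check
# whether the device comes back afterwards. Run as root.
import time
from pathlib import Path

BDF = "0000:26:00.0"  # placeholder address, taken from the log excerpt below
DEV = Path("/sys/bus/pci/devices") / BDF

def config_alive(dev: Path) -> bool:
    # A device that has dropped off the bus reads back all 0xFF from
    # PCI config space, so its vendor ID becomes 0xFFFF.
    return (dev / "config").read_bytes()[:2] != b"\xff\xff"

# Writing "1" to the sysfs 'reset' attribute asks the kernel to reset
# the function; for devices that support it, this performs an FLR, the
# same operation vfio-pci runs when a VM releases the device.
(DEV / "reset").write_text("1")

# Poll config space with growing delays; on the affected Blackwell cards
# the device reportedly never comes back, and the kernel keeps logging
# "not ready ... after FLR" until it gives up.
for delay in (1, 2, 4, 8, 16):
    time.sleep(delay)
    if config_alive(DEV):
        print("device recovered from the reset")
        break
else:
    print("device still dead after the reset; only a host reboot recovers it")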

But for companies such as CloudRift that host hundreds of guest machines, rebooting the entire host system, which is the only known way to recover, means substantial downtime and an even bigger operational headache.

Broader Implications of the RTX 5090 Virtualization Flaw

The NVIDIA RTX 5090 bug is not limited to CloudRift. As The Verge notes, a Proxmox user ran into the same behavior, with the entire host crashing while trying to shut down a Windows guest. This means the problem is not tied to a single platform and appears intrinsic to the virtualization behavior of the NVIDIA RTX 5090 and RTX PRO 6000.

Update: NVIDIA has reportedly confirmed that it can reproduce the NVIDIA RTX 5090 bug internally. Although no patch has been announced yet, the company is said to be working on a permanent fix.

CloudRift’s $1,000 Bug Bounty

CloudRift is offering a $1,000 bounty to any developer or researcher who can work around or fix the NVIDIA RTX 5090 virtualization bug. The bounty underscores how urgent the problem is for organizations whose operations depend on AI and GPU workloads running without interruption.

With the NVIDIA RTX 5090 bug hitting mission-critical environments, NVIDIA will presumably roll out a fix soon. In the meantime, enterprises that depend on these GPUs should brace for possible disruption.

We’ve tested several machines with the RTX 5090 and RTX PRO 6000 based on AMD EPYC Rome and Milan platforms. All exhibit similar issues: the GPU gets stuck, and VM creation fails with the following error:

libvirt:  error : internal error: Unknown PCI header type '127' for device '0000:26:00.0'
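
That “Unknown PCI header type '127'” is the signature of dead config space: a device that has dropped off the bus reads back all ones, and 127 (0x7F) is what remains of the 0xFF header-type byte once the multifunction bit is masked off. As a hedged illustration (again, not a CloudRift or NVIDIA tool), this sketch scans every device bound to vfio-pci for that symptom; it assumes a standard Linux sysfs layout and root privileges:

#!/usr/bin/env python3
# Sketch: scan all vfio-pci-bound devices for the "header type 127"
# symptom, i.e. PCI config space reading back all ones. Run as root.
from pathlib import Path

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    driver = dev / "driver"
    if not driver.exists() or driver.resolve().name != "vfio-pci":
        continue
    cfg = (dev / "config").read_bytes()
    header_type = cfg[0x0E] & 0x7F  # mask off the multifunction bit
    if header_type == 0x7F:        # 127: the value libvirt complains about
        print(f"{dev.name}: config space is all ones; GPU is off the bus")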

The relevant errors from dmesg:

[572205.636684] pcieport 0000:40:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
[572206.637508] pcieport 0000:40:01.1: retraining failed
[572207.639663] pcieport 0000:40:01.1: Data Link Layer Link Active not set in 1000 msec
[572208.876663] vfio-pci 0000:26:00.0: not ready 1023ms after FLR; waiting
[572209.964657] vfio-pci 0000:26:00.0: not ready 2047ms after FLR; waiting
[572212.076705] vfio-pci 0000:26:00.0: not ready 4095ms after FLR; waiting
[572216.364619] vfio-pci 0000:26:00.0: not ready 8191ms after FLR; waiting
[572225.068466] vfio-pci 0000:26:00.0: not ready 16383ms after FLR; waiting
[572241.964374] vfio-pci 0000:26:00.0: not ready 32767ms after FLR; waiting
[572275.244028] vfio-pci 0000:26:00.0: not ready 65535ms after FLR; giving up
[572302.229867] watchdog: BUG: soft lockup - CPU#246 stuck for 26s! [worker:1274725]
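
Because these kernel messages follow a fixed pattern, a host operator could watch for the final “giving up” line and drain the machine automatically before more VMs land on the dead GPU. A rough sketch, assuming util-linux’s dmesg --follow is available and using a plain print as a stand-in for real alerting:

#!/usr/bin/env python3
# Rough sketch: stream kernel messages and flag GPUs that the kernel
# has given up on after a failed FLR. The alert action is a stand-in.
import re
import subprocess

STUCK = re.compile(r"vfio-pci (\S+): not ready \d+ms after FLR; giving up")

with subprocess.Popen(["dmesg", "--follow"], stdout=subprocess.PIPE,
                      text=True) as proc:
    for line in proc.stdout:
        match = STUCK.search(line)
        if match:
            # At this point the device is unrecoverable without a reboot:
            # stop scheduling VMs on this host and page an operator.
            print(f"GPU {match.group(1)} stuck after FLR; drain host and reboot")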

FAQ

So, what is the NVIDIA RTX 5090 bug?

The NVIDIA RTX 5090 bug is a virtualization bug in which the GPU freezes when used inside a VM and needs a full system restart to recover.

What GPUs have this virtualization bug?

At the moment, it affects only the Blackwell-based NVIDIA RTX 5090 and RTX PRO 6000. The RTX 4090, Hopper H100, and B200 GPUs remain unaffected.

Has NVIDIA acknowledged the issue?

NVIDIA has confirmed the RTX 5090 bug and says it is working on a fix, but there is no formal patch yet.

What solutions are available now?

The only way to recover is to reboot the entire host system. CloudRift is offering a $1,000 reward for a workaround or a fix.
