NVIDIA K2 on VMware vSphere 6 crash
Today I was working for a customer that is having issues in their VDI environment. Since Twitter won't let me write books longer than 140 characters, I thought I'd make a post about it. So, a short post about an issue that is bugging a customer and me. So far, the issue revolves around NVIDIA and VMware.
The environment is a VMware vSphere 6 environment with a couple of hosts, where some are designated for normal VDI work and some for advanced or even power VDI GPU users. It is a VMware Horizon 7 VDI desktop environment where several desktops are fitted with an NVIDIA GRID K2 K240Q profile. The desktops are non-persistent, and VMware UEM is used to manage the user environment.
The users are engineers doing design work, so they need both the power and the desktop.
The issue is that, for a couple of weeks now, users have been experiencing disconnects followed by an inability to reconnect to the desktop. The disconnect is not some random event: it happens to all virtual machines on the host at once. Some might say the host has lost its connection, but it hasn't. Others might argue it's just a loss of network and you could reconnect once that is back up, but that's also not true.
Users can't reconnect after that, so a whole host of VDI desktops is running fine but users are unable to connect. When I say the desktops are running fine, I mean we can browse to the desktop, search files, execute remote commands, anything. What the desktops lack is video, which has somehow crashed. We have a couple of hosts running, and not all of them have an NVIDIA GPU, since not all users are engineers. The host without the GPU is running fine, no crashes (fingers crossed) so far. Also, when the issue occurs on one of the GPU hosts (never on all of them at once), the normal VDI host keeps doing just fine.
There aren't many errors or much info to play with; we found this in the vCenter event tab: first a lot of disconnect messages showing that something is disconnected from the machine, then a message that some plugin couldn't be loaded. It refers to the NVIDIA K240Q.
SSH'ing into the host and running nvidia-smi, one GPU is at 99% while the rest are doing nothing. Looking at this screen alone, it doesn't seem related to NVIDIA.
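For reference, this is the sort of check we ran on the host. A minimal sketch, assuming an SSH session to the ESXi host and the nvidia-smi bundled with the GRID host driver; the query flags below exist in standard nvidia-smi builds, but exact fields available can vary by driver version:

```shell
# Default overview: per-GPU utilization, memory use, and the
# processes (VMs) attached to each GPU
nvidia-smi

# Narrower CSV view, handy for spotting one GPU pinned at 99%
# while the others sit idle
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
           --format=csv
```

Catching which VM sits on the pinned GPU when the crash happens would be the next step, but so far the 99% GPU has not lined up with anything conclusive.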
We haven't found a solution yet. We're in contact with NVIDIA (thanks Rachel), VMware, and a number of people in the community offering good thoughts. So far we know our drivers are not the latest, so we will update them ASAP, but it seems something else has to be the issue: it can't just be a driver that suddenly decides "I'm done with it" after functioning fine for months.
Our drivers, by the way, are 362.56_grid_win10_64bit_english for Windows 10 and NVIDIA-vGPU-kepler-VMware_ESXi_6.0_Host_Driver_361.45.09-1OEM.600.0.0.2494585.vib for the host. The environment is not changing that rapidly; they are a solid customer with a small IT department that wants a solid solution. This is the quest we are on — any thoughts, bring them on.
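When we do update the host driver, the procedure will be along these lines. A sketch, not the exact commands we'll run: the datastore path and the new VIB filename are placeholders (only the current 361.45.09 VIB is named in this post), and the host must be in maintenance mode with the VDI desktops moved off first:

```shell
# With the host in maintenance mode, update the NVIDIA vGPU host VIB
# (path and new version are placeholders, not the actual new release)
esxcli software vib update -v /vmfs/volumes/datastore1/NVIDIA-vGPU-kepler-VMware_ESXi_6.0_Host_Driver_<new_version>.vib

# Reboot the host, then confirm the new driver version is installed
esxcli software vib list | grep -i nvidia

# And confirm the NVIDIA kernel module actually loaded
vmkload_mod -l | grep -i nvidia
```

The guest-side 362.56 GRID driver in the Windows 10 base image would need to be updated to the matching release at the same time, since vGPU requires host and guest driver versions from the same branch.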