NVIDIA K2 on VMware vSphere 6 crash



Today I was working for a customer who is having issues in their VDI environment. Since Twitter won't let me write anything longer than 140 characters, I decided to write a short post about an issue that is bugging both the customer and me. So far the issue revolves around NVIDIA and VMware.

The environment

The environment is a VMware vSphere 6 environment with a couple of hosts, some designated for normal VDI work and some for advanced or even power VDI GPU users. On top of that runs VMware Horizon 7, with several desktops fitted with an NVIDIA K2 K240Q profile. The desktops are non-persistent and VMware UEM is used to manage the user environment.

The users are engineers doing design work, so they need the power and the desktop.

The issue

The issue is that for the past couple of weeks users have been experiencing disconnects, followed by an inability to reconnect to the desktop. The disconnect is not random: it hits all virtual machines on the host at once. Some might say the host has lost its connection, but it hasn't. Others might argue that it's just a loss of network and you could reconnect once that is up again, but that's not true either.

Users can't reconnect after that, so a whole host of VDI desktops is running fine but users are not able to connect. When I say the desktops are running fine, I mean we can browse to the desktop, search files, execute remote commands, anything. What the desktop is lacking is video, which has crashed somehow. We have a couple of hosts running and not all of them have an NVIDIA GPU, because not all users are engineers. The host without the GPU is running fine, with no crashes (fingers crossed) so far. Also, when the issue occurs on one of the GPU hosts (never on all of them at once), the normal VDI host keeps doing just fine.

Errors

There are not many errors or much info to play with. We found this one on the vCenter Events tab: first a lot of disconnected messages showing that something is disconnected from the machine, then a message that some plugin couldn't be loaded. It refers to the NVIDIA K240Q.

If we look at the details of that one, we see that it tries to load but can't.

SSH'ing into the host and running nvidia-smi shows one GPU at 99% while the rest are doing nothing. Looking at that output, the issue does not seem to be related to NVIDIA.
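The nvidia-smi check above could be scripted, so that it is easy to repeat on every GPU host and compare a healthy host with a failing one. What follows is only a minimal sketch in Python, not something we ran at the customer. It assumes that the nvidia-smi build on the host supports the `--query-gpu` and `--format=csv` options (check with `nvidia-smi --help-query-gpu`). The names `parse_utilization`, `busy_gpus` and `timed_run` are made up for this example. The timing part is there because a slow answer from nvidia-smi turned out to be a symptom as well.

```python
import subprocess
import time

# Assumption: these query fields and format options exist in the nvidia-smi
# build on the host. Verify with `nvidia-smi --help-query-gpu` first.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Turn 'index, utilization' CSV lines into a {gpu_index: percent} dict.

    A utilization value that is not a plain number (for example '[N/A]')
    is stored as None instead of raising an error.
    """
    result = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, util = [field.strip() for field in line.split(",", 1)]
        result[int(index)] = int(util) if util.isdigit() else None
    return result


def busy_gpus(utilization, threshold=95):
    """Return the sorted indexes of GPUs at or above the threshold."""
    return sorted(
        index
        for index, percent in utilization.items()
        if percent is not None and percent >= threshold
    )


def timed_run(cmd, timeout=30):
    """Run a command and return (stdout, seconds it took to answer)."""
    start = time.monotonic()
    done = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout, check=True
    )
    return done.stdout, time.monotonic() - start
```

On a machine where nvidia-smi is on the PATH, `timed_run(QUERY_CMD)` returns the raw output plus the number of seconds the command needed, and `busy_gpus(parse_utilization(output))` lists the GPUs that are pegged. Where the script runs (on the host itself, or against output captured over SSH) is left open here.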


Resolution

I haven't found one yet. We are in contact with NVIDIA (thanks Rachel), VMware and a number of people in the community who are sharing good thoughts. So far we know our drivers are not the latest, so we will update them ASAP. Still, it seems that something else has to be the issue; it can't just be a driver that suddenly decides it is done after functioning fine for months.

Our drivers, by the way, are 362.56_grid_win10_64bit_english for Windows 10 and NVIDIA-vGPU-kepler-VMware_ESXi_6.0_Host_Driver_361.45.09-1OEM.600.0.0.2494585.vib for the host. The environment is not changing that rapidly; they are a solid customer with a small IT department that wants a solid solution. This is the quest we are on. Any thoughts? Bring them on.


3 Responses

  1. vikrant says:

    Thank you so much for sharing the error. We were facing the same issue randomly, but when we checked, the issue was with host connectivity. Did you check the host connectivity?
    Anyway, your problem has been resolved. You showed me another solution for this issue, which is to update the driver. Next time this error comes up I will try this solution as well.
    Thanks for sharing.

    • Rob Beekmans says:

      Hi,

      Our issue is not resolved yet. We have updated the driver, but we also deployed a pool without the GPU to avoid the crash.
      The issue is still under investigation.
      When the issue occurred the host was functioning and reachable. The GPU was accessible and I could run nvidia-smi on it.
      The only odd thing was that the output of every command came back really slowly, taking seconds to appear.

      On a host that functioned fine it was instant.
      I noted that the K2 still thought the machines were connected. I tried to reset it, but that didn't work.
      Only a reboot (cold boot) of the host worked.

  1. October 6, 2016

    […] Read the entire article here, NVIDIA K2 on VMware vSphere 6 crash […]
