NVIDIA K2 – Not populating GPU profiles
Some days you expect an easy day: build a new environment and have some fun. Then you end up searching for hours for the cause of a single issue. I had one of those days this week, because my NVIDIA K2 vGPU profiles didn’t populate when I expected them to. Let me take you along on the trip I took to solve it. The solution was almost too simple to be true, but by the time I realised that, the day was over. Stumbling around in the dark, I would call it. I didn’t install the hosts myself, and until that day everything looked fine. I had never checked the driver because I got no errors.
The environment we are running is a vSphere 6 environment with a couple of APEX hosts and a couple of K2 hosts for “special” users. On top of that runs VMware Horizon 7 with Microsoft Windows 10 and RES ONE Workspace for UEM. Basic stuff, nothing we haven’t done before.
First let’s talk about the symptom that triggered the day of searching. I opened the properties of a newly created golden image and added a shared PCI device to enable vGPU for that virtual machine. When you add a shared PCI device you will see NVIDIA GRID vGPU listed, as it is the only option there. Below it is a drop-down box that lets you select the vGPU profile. The profile you select determines the number of displays and the resolution, and with that the number of users you can fit on a host.
What we noticed is that the GPU Profile list didn’t populate: no profiles were available for selection. So we dived in and did some troubleshooting, checking whether the driver was installed and making sure ESXi understood it had a GPU on board. nvidia-smi gives good details about that. We couldn’t see anything out of the ordinary there (the clue was there, but we missed it).
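For reference, this first check can be wrapped in a small helper. This is only a sketch: gpu_visible is a name I made up for illustration, and it assumes nvidia-smi is on the PATH of the ESXi shell once an NVIDIA host driver is installed. Keep in mind that a working nvidia-smi only tells you that some NVIDIA driver is loaded, not that it is the vGPU one.

```shell
#!/bin/sh
# gpu_visible: run nvidia-smi if it exists, otherwise say so.
# The optional first argument overrides the command name (hypothetical hook,
# handy for trying the function on a machine without an NVIDIA driver).
gpu_visible() {
    smi="${1:-nvidia-smi}"
    if ! command -v "$smi" >/dev/null 2>&1; then
        echo "$smi not found: is an NVIDIA host driver installed?"
        return 1
    fi
    # Prints the detected GPUs, driver version and current utilisation.
    "$smi"
}
```

On the host you would simply call gpu_visible without arguments.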
Next we looked at the host. We connected to it through a browser and checked the Graphics settings. Nothing out of the ordinary there either, it seemed. The host was seeing an NVIDIA K2 card and its GPUs were set to shared. My first thought had been that they might be in passthrough mode instead of shared, but that idea was dropped soon enough.
So we moved on and checked the security profile to see if the X.Org service was running. It was not, but even starting it didn’t help us. Still no GPU profiles in the list. Let’s move on.
As I mentioned before, if you set the NVIDIA K2 to a different mode (passthrough) the GPUs are still available, but only to one or a couple of virtual machines. So we checked in the console whether that was the case. Nothing was listed there at all, so we had to look further.
We went to the hardware view and checked whether the NVIDIA K2 cards were listed and, once more, whether they were in passthrough mode. You can never check enough, because it is easy to overlook something. Everything looked fine, and still the issue was not solved.
Slowly we were closing in on the issue when we checked the software on the system. Which driver had been installed there? The command esxcli software vib list shows all VIBs, including drivers, on the system. It is a long list, but search for NVIDIA and you will find the driver soon enough.
In the listing the driver name is NVIDIA-Kepler-VMware_ESXi_6.0_Host_Driver. It looks good, but look again, because it is the wrong driver. I saw the driver and overlooked a small detail, and that small detail cost me the best part of the day. The driver name should also include vGPU, otherwise you have the wrong driver. If you are looking for a driver, here is the link to the latest one (link).
A quicker way to find the driver is to filter the list with the command esxcli software vib list | grep NVIDIA
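Since the difference is so easy to overlook by eye, you could let the shell do the looking. The sketch below is my own illustration, not an NVIDIA or VMware tool: check_nvidia_vib is a made-up name, and it relies only on the rule from this post that a vGPU-capable host driver has vGPU in its VIB name.

```shell
#!/bin/sh
# check_nvidia_vib: read `esxcli software vib list` output on stdin and
# flag every NVIDIA VIB as OK (name contains "vGPU") or WRONG (it does not).
check_nvidia_vib() {
    found=0
    # The VIB name is the first column; the rest of the line is ignored.
    while read -r name _rest; do
        case "$name" in
            NVIDIA*vGPU*) echo "OK: $name"; found=1 ;;
            NVIDIA*)      echo "WRONG (no vGPU in name): $name"; found=1 ;;
        esac
    done
    [ "$found" -eq 1 ] || echo "No NVIDIA VIB found"
}

# On the host:
#   esxcli software vib list | check_nvidia_vib
```

With the driver that was on my hosts, this would have printed the WRONG line straight away.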
So now we knew the driver was wrong and had to be replaced. Let’s proceed with that.
First we have to remove the driver from the host, so put the host in maintenance mode before you continue.
Next we connect to the host with PuTTY to run the commands that uninstall the driver. Open PuTTY, enter the hostname of the host and log on with the root account.
First we need to stop the xorg service. Execute the command /etc/init.d/xorg stop
When that is done, we unload the NVIDIA VMkernel module with the command vmkload_mod -u nvidia
Next up is the removal of the driver itself, and for that you need the name of the VIB that is installed. If you go back to the vib list command I mentioned earlier, the name is the first column of the output.
To remove the VIB/driver you execute the command esxcli software vib remove -n NVIDIA-kepler-VMware_ESXi_6.0_Host_driver (use the name exactly as it is listed on your host)
At the end you get a status overview of what was done. You don’t need to reboot yet. Let us first put the right driver on the host and reboot after that.
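The removal steps above can be put together in one small script. Treat it as a sketch: run, DRY_RUN and remove_nvidia_driver are helper names I invented, and entering maintenance mode with esxcli system maintenanceMode set is my assumption here, since you can just as well do that from the client. By default the script only prints the commands, so you can read them before anything touches the host.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# remove_nvidia_driver VIB_NAME
# VIB_NAME must be the name exactly as shown by `esxcli software vib list`.
# The steps run in order; there is no error handling in this sketch.
remove_nvidia_driver() {
    vib_name="$1"
    run esxcli system maintenanceMode set --enable true  # assumption: done via CLI
    run /etc/init.d/xorg stop                            # stop the xorg service
    run vmkload_mod -u nvidia                            # unload the VMkernel module
    run esxcli software vib remove -n "$vib_name"        # remove the wrong driver
}
```

Call it with the VIB name from your own listing, check the printed commands, and only then set DRY_RUN=0 to run it for real.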
With the old driver gone we can now add the driver we actually need. Download the correct driver using the link I showed earlier and upload the file to a datastore or to the host itself. In my case I uploaded the file to the /tmp folder on the host.
To install the driver you execute the command esxcli software vib install -v /tmp/NVIDIA-vGPUkepler-VMware_ESXi_6.0_Host_driver_361.45.09-10EM.600.0.0.24945585.vib (copy the name from the file, and give the full path because esxcli does not accept a relative one)
At the end it shows you the status and says a reboot is not needed, but I disagree there. So after the next step we do a reboot to get it all working. The last step before that is to start the xorg service again with the command /etc/init.d/xorg start
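To avoid repeating my mistake, the install step could be guarded by a check on the file name. Again this is only a sketch under my own assumptions: install_vgpu_vib is a made-up helper, the vGPU-in-the-name rule comes from this post, and ESXCLI is a hypothetical override so you can try the function without touching a host (for example ESXCLI="echo esxcli").

```shell
#!/bin/sh
# install_vgpu_vib /full/path/to/driver.vib
# Refuses relative paths and file names without "vGPU" in them,
# then hands the file to esxcli.
install_vgpu_vib() {
    vib_path="$1"
    case "$vib_path" in
        /*) ;;   # absolute path, fine
        *)  echo "error: give the full path to the .vib file" >&2
            return 2 ;;
    esac
    # ${vib_path##*/} strips the directory part and leaves the file name.
    case "${vib_path##*/}" in
        *vGPU*.vib) ;;
        *)  echo "error: no vGPU in the file name, probably the wrong driver" >&2
            return 3 ;;
    esac
    # Left unquoted on purpose so an override like "echo esxcli" splits into words.
    ${ESXCLI:-esxcli} software vib install -v "$vib_path"
}
```

After a successful install, start the xorg service and reboot the host as described above.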
Exit maintenance mode
After the reboot we take the host out of maintenance mode so that it operates normally again.
Virtual machine properties
Open the virtual machine properties, select Shared PCI device and click ADD. The NVIDIA GRID vGPU is added as a device.
Now we come to the part that started this whole journey. We want to select a profile, and when we look at the drop-down box this time we see that the profiles are there to choose from. The driver is working and we can build our Microsoft Windows 10 golden image.
If we look at Device Manager in Microsoft Windows 10 we now see the GRID profile listed. A benchmark test showed that we get between 25 and 66 frames per second without any tuning. These tests are far heavier than anything the employees will do here, so I’m sure this will work out just fine.
Once we found that the driver was the issue, it wasn’t that hard to solve. I hope this blog helps you fix your issue if you run into the same thing. They told me to check the driver, and even though I did, I missed it. I wonder if more coffee would have helped there 🙂