Troubleshooting tools for VMware Horizon (View)
This week started out interesting: a customer for whom we are building a new environment experienced a complete disaster when their current environment went down. As I was first on site, I heard people saying they couldn’t use “the environment”, so I rushed up to see what was wrong and start troubleshooting. When I logged on, less than 10% of the desktops were still available and 90% were in a provisioning (missing) or deleting (missing) state.
The environment was in a state where no one knew which information about the virtual machines (linked clones) was correct. Composer had some info, vCenter had some info and the connection servers’ ADAM database(s) had some info. Unfortunately no one trusted the source that actually had the correct info, vCenter; if they had, the issue would have been resolved fast. Now I needed to troubleshoot and fix stuff.
So we had a classic case of no one trusting anyone, and I needed to get 250 desktops up and running ASAP. Together with VMware we went through the environment and first tried to get a picture of what was happening. In total it took about 10 hours to get everything up and running, so I thought it would be good to share what we did and what we used.
It will be a short blog but I hope you will get the info you need if something like this happens to you.
Tools we used
We used the following tools and commands:
- ADSI Edit – read about how here – link –
- ViewDBCHK – found in the bin folder of the connection server from version 6 up.
- Notepad++ – Read the composer and connection server log files
- RepAdmin – check on replication status and remove orphan replication partners
- View Administrator console – remove virtual machines
I think that covers what I used; you need a combination of things. Let me talk you through the most important ones.
We saw pretty soon that a replication issue was the source of the problem: Composer was getting mixed messages from the connection servers, and because not all information was consistent it freaked out. We started by looking in the log files, but soon after we went to some of the connection servers and used repadmin to see what replication was doing.
The following command shows the status: repadmin.exe /showrepl localhost:389 DC=vdi,DC=vmware,DC=int
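If you have several connection servers to check, it can help to generate the same status command for each of them. The snippet below is only a minimal sketch: the server names cs01 and cs02 are hypothetical placeholders, and it assumes every server uses the same naming context and port as the command above. It prints the commands rather than running them.

```python
# Sketch: build the repadmin status command from the article for several
# connection servers. Nothing is executed here; the commands are only printed.

def showrepl_command(host, naming_context="DC=vdi,DC=vmware,DC=int", port=389):
    """Return the argument list for: repadmin.exe /showrepl host:port naming_context."""
    return ["repadmin.exe", "/showrepl", f"{host}:{port}", naming_context]

# Hypothetical connection server names; replace them with your own.
servers = ["localhost", "cs01", "cs02"]

for server in servers:
    # To actually run a command, pass the list to subprocess.run on a
    # machine where repadmin.exe is available.
    print(" ".join(showrepl_command(server)))
```

Comparing the output per server makes it easier to spot the one that is out of sync.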
We noted that one of the servers was having a bad time and was out of sync; it was showing an access denied error on replication. To fix that we removed the software: the VMware View Connection Server and the AD LDS instance VMwareVDMDS. Also make sure you remove the role in Server Manager. Once the software is gone you need to remove the server from the ADAM database, otherwise the other servers still think it is there.
With the command vdmadmin.exe -S -r -s connection_server_name you remove the server from the ADAM database. You need to run this command from one of the other connection servers. The name you enter is not the FQDN but the hostname.
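Because the command wants the hostname and not the FQDN, it is easy to paste the wrong thing. As a small sketch, the helper below strips the domain part before building the command; the name cs02.example.int is a made-up example, and the helper assumes you pass a DNS name, not an IP address.

```python
# Sketch: build the vdmadmin removal command with the short hostname.
# Nothing is executed here; the command is only printed.

def short_hostname(name):
    """Return the part before the first dot, e.g. 'cs02' for 'cs02.example.int'."""
    return name.strip().split(".", 1)[0]

def remove_server_command(name):
    """Return the argument list for: vdmadmin.exe -S -r -s <hostname>."""
    return ["vdmadmin.exe", "-S", "-r", "-s", short_hostname(name)]

# Hypothetical FQDN of the connection server that was uninstalled.
print(" ".join(remove_server_command("cs02.example.int")))
# prints: vdmadmin.exe -S -r -s cs02
```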
After this you reinstall the Connection Server software (we did) and run the first command again to check the status. That was part 1: the environment was getting the right information from all servers again. Next we needed to get those 250 desktops available again, because the information was still wrong on both sides.
From here on you work with ADSI Edit and VIEWDBCHK to troubleshoot.
There is a tool to help you troubleshoot; it is not that new, but new enough. If you run version 5.3 or newer it is available on the connection server. It is one of VMware’s hidden gems that you can’t do without in these cases. Before we had this tool we needed to work with ADSI Edit, and that was no fun (actually it was, but that is my sick mind).
So ViewDBChk it is. You have a couple of options to run it, but there are two you need to remember to debug and fix.
The command ViewDBChk --scanMachines will scan your environment and all your desktop pools to see if there are desktops with errors. It will only report, so it is nice to run and see what it gives. We had 139 desktops with errors and 110 that were in such a state that even ViewDBChk had no idea what to do with them 🙂
After you run it you are asked if you want to disable provisioning of a pool. If you answer “y” or “n” it will just end; you need to answer “yes” or “no”.
The next command, ViewDBChk --scanMachines --limit 10 --force, will help you actually clean up the mess. The limit of 10 means it will delete only 10 machines and then stop; you can put a higher number there, but I think it is better to do things in small batches. With --force it will not ask for confirmation before deleting desktops in an error state.
There are more options, look for them here – link –
You need to run this multiple times, because it will find more desktops with errors every time. We ran the tool perhaps 20 times before it reported no more desktops with errors.
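The "run it again until it finds nothing" routine can be sketched as a small loop. This is only an illustration under assumptions: the flags are written with double hyphens, and run_scan is a hypothetical function you would supply yourself that runs the command and returns how many desktops with errors it found. The article does not show the tool's output format, so parsing it is deliberately left out. The fake runner below only simulates a pool with 25 bad desktops.

```python
# Sketch: repeat the ViewDBChk cleanup in small batches until a pass finds nothing.

def scan_command(limit=10, force=True):
    """Return the argument list for the cleanup command described in the article."""
    cmd = ["ViewDBChk", "--scanMachines", "--limit", str(limit)]
    if force:
        cmd.append("--force")
    return cmd

def cleanup_in_batches(run_scan, limit=10, max_runs=30):
    """Call run_scan(cmd) until it returns 0 or max_runs is reached.

    run_scan is a caller-supplied (hypothetical) function that executes the
    command and returns the number of desktops with errors found in that pass.
    Returns the list of counts, one per run.
    """
    counts = []
    for _ in range(max_runs):
        found = run_scan(scan_command(limit))
        counts.append(found)
        if found == 0:
            break
    return counts

# Fake runner for demonstration only: pretends 25 desktops are in an error state.
remaining = [25]

def fake_run(cmd):
    limit = int(cmd[cmd.index("--limit") + 1])
    found = min(remaining[0], limit)
    remaining[0] -= found
    return found

print(cleanup_in_batches(fake_run))
# prints: [10, 10, 5, 0]
```

The max_runs cap is there so that a desktop the tool cannot clean up does not keep the loop going forever.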
The last tool we used is ADSI Edit. Earlier I wrote that you don’t need ADSI Edit anymore and that ViewDBChk will help you. Well, that’s partly true: it will fix 95% of your desktops by deleting them, but the other 5% will not be detected because they are in an unknown state. For those desktops we still fall back on ADSI Edit.
Connect to the environment by entering dc=vdi,dc=vmware,dc=int as the distinguished name (DN) or naming context, and enter localhost:389 as the server name. Next, create a search with the following filter: (&(objectClass=pae-VM)(pae-displayname=VirtualMachineName))
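If you have a list of desktops to look up, you can generate the search filter per machine. The sketch below builds the filter shown above and escapes the characters that have a special meaning in LDAP filters (RFC 4515). The desktop name VDI-W10-042 is a made-up example.

```python
# Sketch: build the ADSI Edit search filter for one desktop name.

def escape_filter_value(value):
    """Escape characters that are special in LDAP search filters (RFC 4515)."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\x00": r"\00",
    }
    # Replace character by character so an escaped backslash is never escaped twice.
    return "".join(replacements.get(ch, ch) for ch in value)

def vm_filter(display_name):
    """Return the filter that matches a pae-VM object by its display name."""
    return f"(&(objectClass=pae-VM)(pae-displayname={escape_filter_value(display_name)}))"

# Hypothetical desktop name.
print(vm_filter("VDI-W10-042"))
# prints: (&(objectClass=pae-VM)(pae-displayname=VDI-W10-042))
```

For normal desktop names the escaping changes nothing, but it keeps the filter valid if a name contains brackets or an asterisk.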
After all these steps the environment started fixing itself: the bad desktops that were in a missing state had been deleted and new ones were created. Those new ones did conflict with the ones already in vCenter, but by this time VMware knew which desktop was the right one, so the vCenter desktop was deleted and recreated and the environment slowly came back to life. I hope this article helps you when you encounter something like this; troubleshooting is cool with the right tools.