Performance counters gone bad… Citrix XenApp
The Citrix Independent Management Architecture (IMA) and Windows performance counters go hand in hand, and Citrix depends entirely on those counters working correctly. If the counters are not functioning correctly, you’re in trouble. This blog will guide you through the steps to recover from that.
How it all started
A customer called me in to deploy an image that a system manager had prepared; he was away on a training and could not deploy the image for testing himself. No sooner said than done: I deployed the image to several servers, published a Test Desktop application for a couple of users, and off we went.
Testing came to a sudden stop after a few minutes, when no connection was possible to any of the servers. Here my quest started: what had gone wrong? I hadn’t made the image, so my knowledge of what was in it was limited to what I could see on the surface.
When something like this occurs out of the blue, you need to start analysing. I live for analysis; I love to analyse stuff and really do a deep dive. At first it looked like a basic problem, but it soon turned into something much bigger.
First I checked all the common Citrix components to see if I had misconfigured something. The event logs on the Citrix XenApp servers were spotless and showed nothing that brought me closer to the cause. The AppCenter configuration showed no issue either. So I turned to the Citrix Web Interface to find out why I couldn’t start the published desktop. The event logs there gave me plenty of reason to worry.
The Citrix Web Interface reported that the Citrix XenApp servers were too busy to handle the request. As a consultant who has worked with Citrix since Winframe 1.7, that is the point where you get worried. So I turned to the Citrix XenApp servers I had just deployed.
The two new servers reported a load of 10000.
With Citrix a load of 10000 means full load; a load of 20000 points to a license server issue.
When a server starts it will be at 20000 for a moment, then change to 10000 until the server is ready. The load will drop slowly after that.
In this case the load never dropped below 10000, no matter how long I waited.
Load 10000 is a long-standing issue with Citrix, mostly related to WMI. There are several knowledge base articles that give clues, and many more that point you in the wrong direction. The problem is that at first you just don’t know where it’s coming from.
One other thing I noticed was that the logon times at the server were long, and I mean really long. After the logon had finished, which could take minutes, it would still be a while before the IMA service started and the server was operational at all. This was clearly a result of the underlying issue, but at this point I didn’t know why. It seemed like the CPU was busy a lot of the time doing nothing.
I turned to QueryDS to determine why the server was so damn busy…
QueryDS is found on the DVD under the support folder. I copied it locally and ran the following command:
C:\Temp> QueryDS /Table:LMS_ServerLoadTable
The output of this command is a bit cryptic but with a bit of explanation it still gives some insight.
The load reported by this command was 2710 HEX which is 10000 decimal. So the load reported by Qfarm was correct.
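The conversion above is easy to check with a few lines of Python. This is only a sketch: the function name describe_load is made up for illustration, and the status texts simply restate the load values described in this post (10000 for full load, 20000 for a license server issue). Any other value is labelled as not covered here.

```python
def describe_load(hex_load):
    """Convert a hex load value as shown by QueryDS to decimal and label it."""
    load = int(hex_load, 16)  # the load is shown in hexadecimal
    if load == 20000:
        status = "license server issue"
    elif load == 10000:
        status = "full load"
    elif 0 <= load < 10000:
        status = "below full load"
    else:
        status = "not covered in this post"
    return load, status


print(describe_load("2710"))
```

For the value in this case, describe_load("2710") returns 10000 together with the label "full load", which matches what Qfarm showed.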
The RuleLoad reported by this command was 1:64;d:0;6:0;3:0;.
This looks even more cryptic, but it isn’t once you know what it means.
The list below explains the numbers and characters in this RuleLoad.
a: Application user load
b: Server User load
d: Load Throttling
1: CPU Utilization
2: Context switches
3: Memory Usage
4: Page Faults
6: Page Swaps
7: Disk Data I/O
8: Disk operations
9: IP Range
So the only interesting one is 1:64, because all the others are 0, and 0 is what I wanted to see in the load.
64 HEX is 100 decimal, and 1 in this list means CPU Utilization. Combining the two means that something keeps the CPU busy all the time, and the server therefore reports full load to the IMA service.
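Decoding a RuleLoad string by hand gets old quickly, so here is a small Python sketch that does it. The rule names come straight from the list in this post. The function name parse_rule_load is my own, and I am assuming that every value in the string is hexadecimal, just like the 64 for CPU Utilization.

```python
# Rule identifiers as listed in this post.
RULE_NAMES = {
    "a": "Application user load",
    "b": "Server user load",
    "d": "Load throttling",
    "1": "CPU utilization",
    "2": "Context switches",
    "3": "Memory usage",
    "4": "Page faults",
    "6": "Page swaps",
    "7": "Disk data I/O",
    "8": "Disk operations",
    "9": "IP range",
}


def parse_rule_load(rule_load):
    """Turn a string like '1:64;d:0;' into {rule name: decimal value}."""
    result = {}
    for entry in rule_load.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after the trailing ';'
        rule_id, _, hex_value = entry.partition(":")
        name = RULE_NAMES.get(rule_id.lower(), "Unknown rule " + rule_id)
        result[name] = int(hex_value, 16)  # assumption: values are hex
    return result


for name, value in parse_rule_load("1:64;d:0;6:0;3:0;").items():
    print(name, value)
```

Fed with the RuleLoad from this case, it reports CPU utilization at 100 and the other three rules at 0.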
Looking at Task Manager I couldn’t find the same result, so what kept the CPU this busy? I turned to the Performance Monitor.
Starting the Performance Monitor was a good move; it brought me closer to the cause of the issue.
It seemed that something had removed or corrupted the performance counters.
According to CTX129350 you can repair this with the command LodCTR /R.
So off I went and ran the command, and it reported back that it had recovered the counters.
If you want to read more about LodCTR, Microsoft documents it in the Windows command reference.
So the sky was clear and I was hopeful this simple fix would do it…
Not so fast. Looking at Qfarm /load, the numbers didn’t drop, not even after a reboot. So back to the server to figure out what was wrong.
I turned to the Event Viewer to see if anything was logged there and found that some counters had issues. Rebuilding the performance counters hadn’t gone as planned; something had gone wrong somewhere.
The events here took me out of my comfort zone and into the unknown. I had to repair a couple of counters manually with the LodCTR tool, which I had never used before…
In short, LodCTR registers performance counters, and its counterpart UnlodCTR unregisters them.
I had two counters that had issues:
- ASP.NET service
- MAV Client Perfmon Provider
So the way to go was to first remove the performance counter and then add it again.
To remove it you use UNLODCTR instead of LODCTR.
The command UNLODCTR "MAV Client Perfmon Provider" removes the counter… in my case it reported that the counter wasn’t there at all.
To add the counter again, open a command prompt, browse to the C:\Windows\INF\MAV Client Perfmon Provider\009 folder and run the command LODCTR 529da********.INI
You won’t get a message saying it finished correctly; it just does.
The second counter I removed was the ASP.NET counter. This one is a bit more tricky: which ASP.NET, you say? I had four folders named ASP.NET with a version number behind it, so I just went through them.
Then I browsed to the folder ASP.NET_4.0.30319\000 and entered the command LODCTR ASPNET_PERF.INI to add the counter again.
I did the same for the other ASP.NET folder with the command LODCTR ASPNET2_PERF.INI, and the counters were back on track.
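For anyone who has to repeat the remove-then-add routine for several counters, the sequence can be written down as a small Python helper. Treat it as a sketch: the function name reload_counter_commands, the provider name and the .INI file name in the example are all placeholders of mine, not values from this case. It only builds and prints the two commands; it does not run them.

```python
import subprocess


def reload_counter_commands(provider_name, ini_file):
    """Return the two commands that remove and re-add one counter provider."""
    return [
        ["unlodctr", provider_name],  # unregister the provider's counters
        ["lodctr", ini_file],         # register them again from the .INI file
    ]


if __name__ == "__main__":
    # Placeholder names for illustration only.
    for command in reload_counter_commands("Example Provider", "example_perf.ini"):
        # list2cmdline adds the quotes a name with spaces needs on Windows
        print(subprocess.list2cmdline(command))
```

The printed lines can be pasted into an elevated command prompt, started from the folder that holds the .INI file.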
During this process I noticed that one ASP.NET folder in the INF folder disappeared. I’m not sure why, or whether it’s something to worry about. I will take a deeper look at why that happened later today.
Having done all this, I checked the event logs and saw that there were no more errors about the counters. So before rebooting the server I quickly checked the load on the servers, and it was at 0. Thank goodness for that.
So a reboot followed, to really load the counters correctly. After the reboot the load was back at 10000… but I was too eager, for as I wrote earlier it stays there until the server is ready. After a few moments of biting my nails it dropped, and I could conclude the issue was resolved.
I put the disk in Standard mode with the cache set to device RAM at 16384, and deployed it to two test servers. Within minutes users could work with the test desktop as planned.
Debugging is a hell of a job, and it is important to create an overview before you go for a deep dive. I tend to draw a quick flowchart to have all components in place before I start. This was an interesting case that I had never encountered before. It was good practice for me, and hopefully my blog helps you if you ever run into the same issue.