Performance counters gone bad… Citrix XenApp



Citrix Independent Management Architecture (IMA) and Windows performance counters go hand in hand: Citrix depends entirely on the counters working correctly. If the counters are not functioning correctly, you’re in trouble. This blog will guide you through the steps to recover from this.

How it all started

I got called by a customer asking me to deploy an image a system manager had prepared; he was at a training and unable to deploy the image for testing himself. No sooner said than done: I deployed the image to several servers, published a Test Desktop application for a couple of users and off we went.
The testing came to a sudden stop after a few minutes when no connection was possible to any of the servers. Here my quest started: what had gone wrong? I hadn’t made the image, so my knowledge of what was in it was limited to what I could see on the surface.

Analysis

When something like this occurs out of the blue you need to start analysing. I live for analysis; I love to analyse stuff and really do a deep dive. At first it looked like just a basic problem, but it soon turned into something much bigger.
First I checked all the common Citrix components to see if I had misconfigured something. The event logs on the Citrix XenApp servers showed nothing that pointed to an issue; they were spotless. The AppCenter configuration didn’t show any issue either. So I turned to the Citrix Web Interface to find out why I couldn’t start the published desktop. The event logs there gave me lots of reasons to get worried.
The Citrix Web Interface reported that the Citrix XenApp servers were too busy to handle the request. Even as a consultant who has worked with Citrix since WinFrame 1.7, you get worried at that point. So I turned to the Citrix XenApp servers I had just deployed.
The two new servers reported a load of 10000.
With Citrix a load of 10000 means a full load; a load of 20000 points to a license server issue.
When a server starts it will be at 20000 for a moment, then change to 10000 until the server is ready. The load will drop slowly after that.
In this case the load never dropped below 10000, no matter how long I waited.
Load 10000 is a long-standing issue with Citrix, mostly related to WMI. There are several knowledge base articles that give clues and even more that point you in the wrong direction. The problem is that at first you just don’t know where it’s coming from.
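To keep the load values described above straight, here is a minimal Python sketch that maps a reported load to its meaning. The function name `describe_load` and the constants are my own, hypothetical names for illustration; only the values 10000 and 20000 come from this post, and anything below 10000 is simply assumed to be a normal load.

```python
# Load values as described in this post. Everything else is an assumption.
FULL_LOAD = 10000      # server reports full load
LICENSE_ISSUE = 20000  # license server issue (also seen briefly at startup)


def describe_load(load: int) -> str:
    """Return a short description for a load value reported by a server."""
    if load == LICENSE_ISSUE:
        return "license server issue (or server just started)"
    if load == FULL_LOAD:
        return "full load"
    if 0 <= load < FULL_LOAD:
        # Assumption: any value below full load is a normal, usable load.
        return "normal load"
    return "unknown value"


if __name__ == "__main__":
    for value in (0, 4200, 10000, 20000):
        print(value, "->", describe_load(value))
```

A server that keeps returning "full load" long after startup, like mine did, is the case this post is about.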

Logon

One other thing I noticed was that the logon times at the server were long, and I mean really long. After the logon was finished, which could take minutes, it would still be a while before the IMA service started and the server was operational at all. This was clearly the result of the underlying issue, but at this point I didn’t know why. It seemed like the CPU was busy a lot with doing nothing.

QueryDS

I turned to QueryDS to determine why the server was so damn busy…
QueryDS is found on the DVD under the support folder. I copied it locally and ran the following command:
C:\Temp> QueryDS /Table:LMS_ServerLoadTable
The output of this command is a bit cryptic, but with a bit of explanation it still gives some insight.
The load reported by this command was 2710 HEX, which is 10000 decimal. So the load reported by Qfarm was correct.
The RuleLoad reported by this command was 1:64;d:0;6:0;3:0;.
This is even more cryptic, but not that much once you know what it means.
Below is a list that explains the numbers and characters of this RuleLoad.
a: Application user load
b: Server User load
d: Load Throttling
1: CPU Utilization
2: Context switches
3: Memory Usage
4: Page Faults
5: Scheduling
6: Page Swaps
7: Disk Data I/O
8: Disk operations
9: IP Range

So the only interesting one is 1:64, for all the others are 0, and 0 is what I wanted to achieve in the load.
64 HEX is 100 decimal, and 1 in this list means CPU Utilization. Combining these two means that the CPU is kept busy all the time by something and therefore reports full load to the IMA service.
Looking at the Task Manager I couldn’t find the same results, so what kept the CPU this busy? I turned to the Performance monitor.
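The decoding of the RuleLoad string can also be done with a few lines of Python. This is just a sketch: the function name `parse_rule_load` is hypothetical, the rule names are copied from the list above, and I assume every value in the string is hexadecimal, as it was for the load itself.

```python
from typing import Dict

# Rule identifiers as listed in this post.
RULE_NAMES = {
    "a": "Application user load",
    "b": "Server User load",
    "d": "Load Throttling",
    "1": "CPU Utilization",
    "2": "Context switches",
    "3": "Memory Usage",
    "4": "Page Faults",
    "5": "Scheduling",
    "6": "Page Swaps",
    "7": "Disk Data I/O",
    "8": "Disk operations",
    "9": "IP Range",
}


def parse_rule_load(rule_load: str) -> Dict[str, int]:
    """Turn a RuleLoad string like '1:64;d:0;' into {rule name: decimal value}."""
    result = {}
    for entry in rule_load.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after the trailing ';'
        rule_id, hex_value = entry.split(":", 1)
        name = RULE_NAMES.get(rule_id, "Unknown rule " + rule_id)
        result[name] = int(hex_value, 16)  # assumption: values are hexadecimal
    return result


if __name__ == "__main__":
    print(int("2710", 16))  # the total load from QueryDS, in decimal
    for name, value in parse_rule_load("1:64;d:0;6:0;3:0;").items():
        print(name, "=", value)
```

Fed with the RuleLoad from my server, only CPU Utilization comes back with a non-zero value.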

Performance monitor

Starting the Performance monitor was a good thing to do; it brought me closer to the cause of the issue.
It seemed that something had removed or corrupted the performance counters.
According to CTX129350 you can repair this with the command LodCTR /R.
So off I went and ran the command, and it reported back that it had recovered the counters.
So the sky was clear and I was hopeful this simple fix would do it…
Not so fast: looking at Qfarm /load, the numbers didn’t drop, not even after a reboot. So back to the server to figure out what was wrong.

EventViewer

I turned to the Event Viewer to find out if something was there and found that some counters had issues. The reinstallation of the counters hadn’t gone as planned; something went wrong somewhere.
The events here took me out of my comfort zone and into the unknown. I had to repair a couple of counters manually with the LodCTR tool. Never before had I used this tool…

LODCTR

As mentioned before, the tool is used to register or unregister performance counters.
I had two counters that had issues:
  • ASP.NET service
  • MAV Client Perfmon Provider
So the way to go was to first remove the performance counter and then add it again.
To remove it you use UNLODCTR instead of LODCTR.
Using the command UNLODCTR “MAV Client Perfmon Provider” removes the counter… in my case it reported that the counter wasn’t there at all.
To add the counter again you open the command prompt, browse to the C:\Windows\INF\MAV Client Perfmon Provider\009 folder and run the command LODCTR 529da********.INI
You won’t get any message saying it finished correctly, it just does.
The second counter I removed was the ASP.NET counter. This one is a bit more tricky, for which ASP.NET do you mean? I had four folders named ASP.NET with a version number behind it. So I just went like this:
UNLODCTR ASP.NET_2.0.50727
UNLODCTR ASP.NET_64_2.0.50727
UNLODCTR ASP.NET
Then I browsed to the folder ASP.NET_4.0.30319000 and entered the command LODCTR ASPNET_PERF.INI to add the counter again.
I did the same for the other ASP.NET folder, entering the command LODCTR ASPNET2_PERF.INI, and the counters were back on track.
During this process I noticed that one ASP.NET folder in the INF folder disappeared. I’m not sure why or whether it’s something to worry about. I will take a deeper look at why that happened later on today.
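The remove-then-add sequence I used can be sketched in Python like this. It is a rough sketch, not a tested repair tool: the function names `build_repair_commands` and `run_commands` are hypothetical, the pairing of service names with an INI file in the example is only an illustration, and by default it just prints the commands instead of running them.

```python
import subprocess
from typing import List, Optional


def build_repair_commands(service_names: List[str], ini_file: str) -> List[List[str]]:
    """Return one unlodctr command per service name, followed by one lodctr command."""
    commands = [["unlodctr", name] for name in service_names]
    commands.append(["lodctr", ini_file])
    return commands


def run_commands(commands: List[List[str]], cwd: Optional[str] = None,
                 dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them in the folder given by cwd."""
    for command in commands:
        if dry_run:
            # list2cmdline adds quotes around arguments that contain spaces
            print(subprocess.list2cmdline(command))
        else:
            # unlodctr may report that a counter is not there; carry on regardless
            subprocess.run(command, cwd=cwd, check=False)


if __name__ == "__main__":
    # Values taken from this post; cwd would be the folder that holds the INI file.
    run_commands(build_repair_commands(["ASP.NET"], "ASPNET_PERF.INI"))
```

Run it with the default dry run first and check the printed commands before letting it touch a server.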

Test

Having done all this, I checked the event logs and noticed that there were no more errors about the counters. So before rebooting the server I quickly checked the load on the servers and it was at 0. Thank goodness for that.
Then a reboot to really load the counters correctly. After the reboot it was back at 10000, but I was too eager, for I wrote earlier myself that it will stay there until the server is ready. After a few moments of biting my nails it dropped and I could conclude the issue was resolved.
I put the disk in Standard mode, with the cache in device RAM at 16384, and deployed it to two test servers. Within minutes users could work with the test desktop as planned.

Conclusion

Debugging is a hell of a job, and it is so important to create an overview before you go for a deep dive. I tend to draw a quick flowchart to have all components in place before I dive in. This was an interesting case that I had never encountered before. It was good practice for me, and hopefully my blog helps you whenever you have the same issue.

4 Responses

  1. Anonymous says:

    Nice blog!!
    Do you have any idea how these counters get corrupted? I have the same issue in a XenDesktop 5 / PVS environment. Citrix support also came up with the "LODCTR /R" fix, but I want to know what caused the corruption of these counters.

    Regards,
    Rene

    • I have no idea so far. The server is deployed with a RES Software Automation Manager runbook. I'm going to take a look at the jobs there, but that will take some time. The next time the vDisk is changed, this runbook will be used again and I'm anxiously waiting for that moment to see if we can reproduce this issue.

      Thanks for the comment.

    • The same here, I also use RES AM to build the vDisk. I am going to check the change log from the previous build. Keep you posted.

      Rene

  2. We found out that the performance counters were corrupted in our Windows 7 base image (on XenServer). Luckily there is no need to debug the runbook in RES AM…
    We created a new Windows 7 template and ran the RES AM runbook. The issue we had was that the XenDesktop VDA agent crashed because of the corrupt performance counters. With the new image this problem is solved.

    Rene
