Sizing matters, let’s talk size for a moment
At VMworld in Las Vegas we had a meeting on Wednesday afternoon with all of the VMware EUC Champions, product managers, vendors and so on. While we talked with NVIDIA we got into a discussion about sizing our environments. A very interesting discussion unfolded in which we all agreed that sizing is more than just reading white papers and taking those numbers as the truth for any environment. A couple of weeks later we had some issues at customers with GPU-enabled VDI environments where the desktops lost their GPU all of a sudden. The discussion about sizing NVIDIA cards went on internally and on Twitter. This blog is not about NVIDIA; I wrote another blog about our issue at that customer – link –. This blog is about sizing in general. Sizing matters and we have to understand that.
Sizing for the numbers – real life example
For years there were white papers that talked about the scalability of servers and how many users/desktops would fit on a server. That was seen from a density perspective and not from a user experience perspective. The idea behind those papers was to show how many would ideally fit. To get to this number, use cases were designed and user load was fired at the machine through some “random” pattern. The numbers we saw there were amazing: 120, 150 or even 200 users/desktops per host.
Real life example
We had a discussion around this before at a customer; they had hired a company to do this kind of test. Management (that’s what management does) knew for sure that the number coming from the tool would show how to size. We as IT guys were already working on building the environment. Our first batch of key users had been working on the environment for about a week or two before the automated tests started. So we got live data before the automated test would give any data. This was years back, so we got 80 users on a VMware vSphere host with 4 Citrix XenApp servers. Those 80 users were happy, things worked smoothly. Then we ran those automated tests and the number they showed was 40 or something around there. Management was in a state: 40 is too low, we need more hardware, where did we go wrong?
Of course you let them run around for a bit and then slowly show them the console. Hey pal, we got 80 users working happily on the server. We don’t need no tool to see if this is a good number to work with. Of course we had calculated this on paper before and we estimated between 80 and 100 users on a host. So was the automated test wrong? It was not, but the automated test worked off a batch of commands with some automated users, based on a scenario they had thought of in advance of the project. It seemed that the scenario was too heavy for the regular users working on the environment. They used the apps differently or in a less resource-demanding way. We went live with 80 users per host on average.
That was then; we moved forward, as did the technology.
Size your environment for maximum not your hardware
When you design an environment you want to measure or calculate it so it can handle the load. There are several tools on the market that can be used to size your environment: the VMware/SysTrack assessment tooling is one of them, Liquidware Labs Stratusphere FIT is another. After running these tools you get a report that shows the average and the maximum for several resources.
Sizing the environment
We have discussions at customers about this: should your sizing depend on the maximum or the average? A lot of people seem to think that sizing your environment for the expected average load is going to work out just fine. I think I know where this thought is coming from, but I also strongly think it is wrong. When designing environments you think about how many users will be active at any time during the day; that is what you build for. If you have only one datacenter you need 100% capacity. It might happen that 100% of the users come to work only one day a year. Or perhaps you accept having less performance during that one day, when resources need to be distributed among more users than normal.
If you design for two datacenters you could decide to distribute the load over the two datacenters as 70/70, where 70% of the users can work in the main datacenter. If something goes wrong you still have capacity for 70% of your users in the secondary datacenter. On that one day when 100% of the users are working, they will work in both datacenters. So our question for the customer is: what percentage of your users will need to work when you have a disaster? 70%, by the way, is a good average number to size your environment on, as people are sick, on holiday etc. On average only 70% are at work at any one time. Sometimes we design 50/50, but that will only work if you can ship in more hardware when a disaster happens. The customer will need more than 50% online after that.
The same goes for IOPS: although less of an issue these days, you need to make sure they are there when requested. If you size on average and the rest come in to work, user experience goes down fast. Size for maximums and make sure user experience is good.
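To make the 70/70 versus 50/50 trade-off concrete, here is a small sketch of the arithmetic. The function name, the 1000-user figure and the assumption that every datacenter gets the same number of seats are mine, purely for illustration; your own numbers will differ.

```python
import math


def datacenter_plan(total_users, share_percent, datacenters=2):
    """Seats per datacenter, and what is left when one datacenter is lost.

    share_percent is the share of all users each datacenter can host,
    e.g. 70 for a 70/70 design. Illustrative helper, not a vendor formula.
    """
    seats_per_dc = math.ceil(total_users * share_percent / 100)
    total_seats = seats_per_dc * datacenters
    seats_after_failure = seats_per_dc * (datacenters - 1)
    return {
        "seats_per_dc": seats_per_dc,
        # Can everybody work on the one day that 100% shows up?
        "covers_full_attendance": total_seats >= total_users,
        # Share of users that can still work after losing one datacenter.
        "percent_served_after_failure": min(
            100.0, 100 * seats_after_failure / total_users
        ),
    }


# Hypothetical customer with 1000 users.
plan_70 = datacenter_plan(1000, 70)
plan_50 = datacenter_plan(1000, 50)

for name, plan in (("70/70", plan_70), ("50/50", plan_50)):
    print(name, plan)
```

Both designs cover the day that everybody shows up, but after a disaster the 50/50 design leaves only half of the users working until extra hardware arrives.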
Sizing the hardware
So from that perspective, sizing for the average and taking the hit when more users come to work is understandable. Sizing the hardware is a different thing. You don’t want to size on average, as that guarantees that anything above average will hit a resource shortage. An average exists because x values are higher and y values are lower. Those x higher values will need those resources. If things worked out nicely you would be fine with the average numbers, but only if the highs and lows balance each other out. If there are more users with high resource needs than users with low ones, it goes wrong. So sizing your hardware for the maximum is better, but there is more.
Sizing your hardware, like CPU, memory etc., to 100% is again not something we should ever do. When designing environments, sizing CPU, memory etc. to 70% is better than filling them all the way to the top. Nothing, and I can say that again, nothing that is filled to the top will perform fine for a long time. If, let’s take NVIDIA, a vendor says you can get 64 users on a card, that is not a message that you should. It might work, and perhaps even work fine for a while, but you’re pushing the card to the edge. If 64 is the maximum, you could be slowly overloading the card. The same goes for any CPU or vCPU. That’s also why monitoring tools like eG Innovations, Goliath or ControlUp have metrics to report CPU usage over 70, 80 and 90%: systems need some headroom to keep on functioning.
So, long story short: give your hardware some room to breathe. That vendors say you can get that many resources out of a piece of hardware, or that many users on a piece of hardware, doesn’t mean you should. Use common sense and design like that. To close off, let’s create a little list to keep at hand:
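A quick sketch of why the average is a trap. The hourly demand figures below are made up for illustration (they do not come from any assessment report), and the helper name is hypothetical.

```python
from statistics import mean

# Hypothetical aggregate CPU demand (GHz) on one host, sampled once per hour.
hourly_demand_ghz = [20, 25, 40, 55, 60, 58, 45, 30, 27]


def hours_short(capacity_ghz, samples):
    """Count the samples where demand exceeds the available capacity."""
    return sum(1 for demand in samples if demand > capacity_ghz)


average_demand = mean(hourly_demand_ghz)
maximum_demand = max(hourly_demand_ghz)

# Size the host on the average: every hour above the average runs short.
short_on_average = hours_short(average_demand, hourly_demand_ghz)
# Size the host on the maximum: no measured hour runs short.
short_on_maximum = hours_short(maximum_demand, hourly_demand_ghz)

print(f"average {average_demand} GHz -> {short_on_average} hours short")
print(f"maximum {maximum_demand} GHz -> {short_on_maximum} hours short")
```

With these sample numbers the host sized on the average is short on resources during four of the nine measured hours, which are exactly the busy hours when users notice.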
- Environments with one datacenter: Design for 100% of total users
- Environments with more datacenters: Design for 70/70 or 50/50 depending on your setup and customer needs
- Hardware: Size for maximums
- Hardware: Size hardware at 70% capacity to leave some headroom
- Automated tests: Are only as good as the scenario you have written, real life is better
Some thoughts. 70% for a GPU would make it 45 instead of 64 users; is that too low, and does it make the business case useless? With CPU, 70% is reasonable and you won’t have any argument there. With GPU and the NVIDIA prices I’m sure you will get into a discussion. I think someone should do an endurance test with the cards to see where the magic number is, and perhaps 70% is the best …
What I wanted to say with this article is that sizing matters and that sizing is not as simple as it looks. We need to think and use common sense instead of looking at marketing posters. Sizing goes hand in hand with user experience, and no user was ever happy with 149 other users on the same host; there are only so many resources to spare. If you care about user experience you care about sizing.
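The 45 versus 64 figure is simply the vendor maximum multiplied by the 70% target. A minimal sketch of that calculation follows; the function name is mine, and rounding to the nearest whole user is an assumption chosen to match the number above (rounding down would be the more cautious choice and gives 44).

```python
def users_with_headroom(vendor_max_users, target_percent=70):
    """Users to plan for when hardware is filled to target_percent only.

    Rounds to the nearest whole user; illustrative helper, not a vendor rule.
    """
    return round(vendor_max_users * target_percent / 100)


gpu_users = users_with_headroom(64)  # 64 users per card is the vendor maximum
print(f"Plan for {gpu_users} users per card instead of 64")
```

The same helper works for any other vendor maximum, for example users per host, as long as you treat the 70% as a starting point to validate with real users and not as a magic number.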
Have a great day, the best day ever 😉