I recently worked with a customer who suffered an IMA service outage in their XenDesktop 4 farm. The troubleshooting process taught me many things about XenDesktop, and I wanted to share our findings for the benefit of other XenDesktop 4 customers.
Customer Environment
This customer has a medium-sized XenDesktop deployment. The customer's environment is:
- VMware vSphere 4 hosting the DDCs and VDAs
- DDCs running Windows Server 2003, all 32-bit
- XenDesktop 4 SP1; the customer had deployed hotfixes released up to June 2011
- Approximately 1,500 Windows XP SP3 virtual desktops, running a slightly older version of the VDA
- Static virtual desktops (not pooled)
- The DDC roles are split according to our best practices, which I have blogged about in the past (a registry sketch of the MaxWorkers setting follows this list):
  - A dedicated farm master server (MaxWorkers = 0, highest election preference)
  - A dedicated backup farm master server, which is also the primary XML broker (MaxWorkers = 0, second election preference)
  - 4 "brokering" DDCs (MaxWorkers not set, default election preference)
Service Outage
Several weeks ago, the customer suffered an IMA service failure on the farm master server. This disrupted the operation of the farm and prevented new desktop sessions from being established.
The roles held by the farm master should have failed over to the backup farm master server. However, this did not occur. Naturally, the customer's efforts were focused on restoring service, so no diagnostic information was captured at the time. To understand why the farm master role did not fail over, we used CDFControl to capture traces across a number of DDCs in the farm. These traces gave us a line-by-line description of each IMA event and message, and a good understanding of the farm's behaviour at the time the IMA service failure occurred.
When we began examining the farm master server, we could find no clues as to why the service failure had occurred. The customer, of course, wanted to know the cause in order to avoid a recurrence of the problem.
We set up a process so that, if the problem occurred again, the customer could capture enough diagnostic information to find the root cause. The process was:
- Log onto the farm master server and start a CDF trace using all modules.
- Take notes, screenshots or videos of any behaviour you can see; for example, duplicated vCenter commands, other errors, etc.
- Stop the CDF trace after 5 minutes.
- Export the System and Application event logs from the DDC.
- Save the CDF trace, event logs, screenshots and any other supporting evidence off the DDC for later upload to Citrix for analysis.
- Using the vSphere Client, suspend the farm master server.
- Use Veeam FastSCP, WinSCP or the vSphere Client (or your tool of choice) to extract the VMSS file from the VMFS datastore (a PowerCLI sketch follows this list). Zip the file for later upload to Citrix for analysis.
- Power the VM back on and allow it to resume from suspend.
- Perform a clean restart of the farm master server, using the operating system's restart control (not a forced shutdown).
- Of course, if the virtual machine does not restart, then you will need to force a reboot.
- Upload the VMSS file, the CDF trace, event logs and any other information using our FTP service.
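As an illustration of the suspend / extract / resume steps, here is a hedged VMware PowerCLI sketch. The vCenter name, VM name, datastore and file paths below are placeholders I have invented for this example; the customer used the vSphere Client and FastSCP/WinSCP to do the same job.

    # Placeholder names: vcenter01, XD-FarmMaster and Datastore1 are examples only.
    Connect-VIServer -Server 'vcenter01'

    # Suspend the farm master so its memory state is written to a .vmss file on the datastore.
    $vm = Get-VM -Name 'XD-FarmMaster'
    Suspend-VM -VM $vm -Confirm:$false

    # Map the datastore as a drive and copy the .vmss file off for later upload to Citrix.
    New-PSDrive -Name ds -PSProvider VimDatastore -Root '\' -Datastore (Get-Datastore -Name 'Datastore1') | Out-Null
    Copy-DatastoreItem -Item 'ds:\XD-FarmMaster\*.vmss' -Destination 'C:\Evidence\'

    # Resume the VM from suspend.
    Start-VM -VM $vm -Confirm:$false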
While the customer set up this process, we continued to look for clues as to why the service had failed in the first place.
When we discussed the specific steps the customer had taken to restore service, it emerged that they had stopped the IMA service and used the command "dsmaint recreatelhc" to recreate the local host cache, on the ever-reliable assumption that a corrupt LHC could have caused the failure. In this case it did not restore service; the customer had to restart the farm master server.
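For completeness, a minimal sketch of that stop / recreate / start sequence is shown below, run from a PowerShell prompt on the DDC with administrative rights. The IMA service name ('IMAService') is my assumption; check the actual service name on your build before relying on this.

    # Stop the IMA service before touching the local host cache (service name assumed to be IMAService).
    Stop-Service -Name 'IMAService' -Force

    # Recreate the local host cache; the previous LHC is preserved with a .bak extension.
    & dsmaint recreatelhc

    # Start the IMA service again so the DDC rejoins the farm.
    Start-Service -Name 'IMAService'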
Of course, when this is done, the previous LHC is saved as a .bak file. We looked at the .bak file and saw that it was 2GB!
2GB Local Host Cache
What could explain such a large local host cache? One of our Escalation Engineers suggested checking for any use of scripts in the environment.
A quick call with the customer determined that they ran scripts against the environment in the early morning and throughout the day. The customer observed the LHC file while the scripts executed and could even see the file size increasing.
A limitation of the JET database format (commonly known as the Access database format, from its .mdb file extension) is a maximum size of 2GB. There is a very good article on the JET database format on Wikipedia here: http://en.wikipedia.org/wiki/Microsoft_Jet_Database_Engine
We had in fact identified a specific problem with an earlier version of the SDK that caused rapid growth of the LHC when scripting; see here for the latest SDK, which includes the fix: http://support.citrix.com/article/CTX127167. In this case, my customer was already using the latest version of the SDK in their environment.
Tracing and Measuring the Problem
To prove how much data was being written to the LHC, the Escalation Engineer investigating the problem asked the customer to capture a CDF trace while the scripts were executing. We limited the CDFControl capture to only the data being written to the local host cache.
The traces were analysed, and we found that the scripts were writing data to every row in the local host cache that referenced a virtual desktop. Doing some simple maths, the Escalation Engineer estimated that the LHC would grow by about 56MB per day due to the customer's scripts; at that rate, a freshly recreated LHC would reach the 2GB JET limit in roughly 37 days (2,048MB ÷ 56MB per day).
And of course, we must remember that XenDesktop administrator changes and other general farm activity would also increase the size of the LHC.
Conclusion
We provided our findings to the customer and advised that our SDK cmdlets and the LHC were operating as intended; there was no specific bug in XenDesktop causing this behaviour.
The customer has implemented regular maintenance to recreate the local host cache, and so reset its size, while they consider other options.
We suggested they review the design and implementation of their scripts. In particular, we recommended automated monitoring of the size of the local host cache, so that if it approaches the 2GB limit an alert is triggered to the support team.
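As a rough illustration of that recommendation, here is a minimal PowerShell sketch of such a check. The LHC path (imalhc.mdb under the Independent Management Architecture folder) and the 1.5GB warning threshold are assumptions I have made for this example; in practice you would feed the alert into whatever monitoring platform the support team already uses.

    # Assumed path to the local host cache on a 32-bit XenDesktop 4 DDC.
    $lhcPath  = 'C:\Program Files\Citrix\Independent Management Architecture\imalhc.mdb'
    $limitMB  = 2048    # JET database hard limit
    $warnAtMB = 1536    # example threshold: warn at 75% of the limit

    $sizeMB = (Get-Item $lhcPath).Length / 1MB
    if ($sizeMB -ge $warnAtMB) {
        # Placeholder alert: an Application event log entry the monitoring tool can pick up.
        if (-not [System.Diagnostics.EventLog]::SourceExists('LHC Monitor')) {
            New-EventLog -LogName Application -Source 'LHC Monitor'
        }
        Write-EventLog -LogName Application -Source 'LHC Monitor' -EntryType Warning -EventId 1001 `
            -Message ("Local host cache is {0:N0}MB of the {1}MB JET limit." -f $sizeMB, $limitMB)
    }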