I found out the hard way that there are some things that should remain untouched when it comes to Citrix MCS and Nutanix AHV. I'm hoping this post helps others who may have come across similar issues.
TL;DR: Do not delete XDSNAP snapshots on AHV if you are using Citrix MCS for your Virtual Apps and Desktops with the hosting on Nutanix.
I recently deployed a Nutanix cluster to utilise Citrix Cloud with Nutanix AHV (AOS 5.8.2) as the hosting. This solution uses Citrix MCS with Citrix App Layering for the purpose of image management.
The Virtual Apps and Desktops portion of this cluster is utilising Citrix Cloud in a hybrid setup - Citrix Delivery Controllers are on Citrix Cloud whereas the VDAs and StoreFront servers are on-prem. All on-prem servers are Windows Server 2016 based VMs.
Once the catalogs, delivery groups and VDAs were all set and tested, I decided to do some house cleaning in AHV because I had noticed a lot of seemingly orphaned snapshots after building Machine Catalogs. So I logged into one of the CVMs and took a look at the snapshots that were on the cluster.
I noticed a lot of snapshots with names that started with XDSNAPxxxxxpreparationxxx. I had also noticed that when I create a Machine Catalog, a preparation VM is created, booted and then deleted, so I figured the process just doesn't remove the leftover snapshot. Being the clean freak that I am, I decided to remove these seemingly orphaned and unused snapshots.
So off I went:
acli snapshot.delete XDSNAP*
All went well, and the overall running of the cluster and VMs was unchanged. Happy with my house cleaning, I carried on with the project and started towards a Pilot phase.
Here is where things started to get a little weird.
Just before going into pilot I wanted to test image updates and, more specifically, scheduled restarts from Citrix Studio (Citrix Cloud). So I created a new image in Citrix App Layering, published it to the cluster and then updated the Machine Catalog with the latest image, ready for the VDAs to pick it up during the scheduled restart window that I had configured on the Delivery Groups. This restart was scheduled for 1:30am. I finished up for the day and went home expecting that by the time I got back to work the next day the VDAs would be updated with the new image.
When I got to work the next day, however, all the VDAs were powered off. I figured that there must have been some sort of glitch, so I tried to power on the VDAs from the Citrix Studio console. Nothing. I tried again. Still nothing. Thinking that it might be an issue with that VM, I tried to power on another VDA from Citrix Studio in another machine catalog. Nada. Manually powering on the VM from the Nutanix side worked, and the power state was reflected in Citrix Studio, but the VM did not get the updated image because the power-on command needs to come from Citrix Studio. Given that the VM was now on, I decided to see if the shutdown command worked from Studio - it did.
So now I was extremely confused. The shutdown command worked but the start command did not. How was this possible?
I started troubleshooting.
- Refreshed my Citrix Cloud session
- Deleted the VDAs and recreated
- Checked that the HostedMachineID in Citrix Cloud matched the VM ID in Nutanix
- Found the PowerShell command to start the VDA from Citrix Cloud - the command was accepted but still nothing happened.
At this point I figured there was an issue between the Citrix Cloud Connector (the on-prem VM where the Nutanix AHV plugin was installed) and the Nutanix API used to start the VM. So I grabbed the Citrix CDFTrace tool, ran it on the Citrix Cloud Connector VM, started a trace and tried to power on the VM. There was nothing in the logs from the CDFTrace tool to show that a command to start the VM was ever received. There were, however, logs showing that the shutdown command was sent to Nutanix and that it worked.
There was definitely something going on with the communication from Citrix Cloud to Nutanix. By now I had a case open with both Nutanix and Citrix, neither of which could tell me what was going on or why this was happening. I was confused.
The solution to this problem was super simple. So simple it is almost comedic.
Given that this issue had been going on for far too long, and I knew this had worked in the past, I decided to go back to square one. I deleted the VMs and also deleted the Machine Catalogs. I decided to leave the Delivery Groups in place as I would be re-creating the Machine Catalogs anyway.
I created a catalog, watched the preparation VM start up and shut down, and then the catalog was created successfully. I added the VMs created with that new catalog to an existing Delivery Group and, what do you know, the VMs started. I wasn't entirely surprised because I knew this had worked in the past.
Once the VMs had started and the VDAs were registered, I did some testing. I shut down the VM from Citrix Studio and it worked instantly. The real test was powering it back on with Citrix Studio. I crossed my fingers, clicked the start button, waited a few seconds and up came the VM! It had worked! 🙃 I waited for it to register and repeated the process. It worked a second time.
Now that one of the catalogs was working I set about re-creating the remaining catalogs and performed the same tests. Worked. Every. Time. 🤗
The one thing I did differently this time? I didn't delete the XDSNAP snapshot that is created when the Machine Catalog is created.
Multiple days have passed now and the VMs are restarting as they should according to their schedules without any issue.
The XDSNAP snapshot is the master vDisk of the Machine Catalog. When you remove it, everything will be OK until a reboot. The master vDisk is the source of all reads at first; later, data is moved locally for performance and scalability, which is why everything keeps working while the VMs are still running. Remove it and the initial reads will fail. The only point at which these snapshots can be deleted is when you have pushed an updated image and all VDAs have rebooted and are now running off the new image.
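If you do want to script a snapshot clean-up, one way to avoid repeating my mistake is to filter the XDSNAP snapshots out before anything gets deleted. The following is only a rough Python sketch of that idea, not a Nutanix or Citrix tool: the function name split_snapshots and the sample snapshot names are made up for illustration, and it assumes you have already obtained the list of snapshot names from your own tooling. It never talks to the cluster and never deletes anything.

```python
from typing import Iterable, List, Tuple

# Prefix that Citrix MCS gives its snapshots on AHV, as seen in this post.
PROTECTED_PREFIX = "XDSNAP"


def split_snapshots(
    names: Iterable[str], prefix: str = PROTECTED_PREFIX
) -> Tuple[List[str], List[str]]:
    """Split snapshot names into (protected, candidates).

    Names that start with the prefix (compared case-insensitively) are
    treated as protected; everything else is only a candidate for manual
    review. Nothing is deleted here.
    """
    protected: List[str] = []
    candidates: List[str] = []
    for name in names:
        if name.upper().startswith(prefix.upper()):
            protected.append(name)
        else:
            candidates.append(name)
    return protected, candidates


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    sample = [
        "XDSNAP_catalog1_preparation_01",
        "pre-upgrade-manual",
        "XDSNAP_catalog2_preparation_02",
    ]
    protected, candidates = split_snapshots(sample)
    print("Leave alone:", protected)
    print("Review by hand:", candidates)
```

Even the "review by hand" list should be checked against what is actually in use before deleting anything. The point of the sketch is simply that XDSNAP snapshots never end up in a bulk delete.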
Bottom line is - Don't delete your snapshots too quickly.
Thank you to Kees Baggerman (Nutanix) for the explanation of how the XDSNAP is used.