I consider myself fairly cautious when implementing a new server-level application, especially at the level where ESX runs, given how much depends on it being stable. Something escaped my attention, or perhaps it wasn’t talked about much in the docs I read. Either way it’s too late now, and I’m not about to dig through all of those PDFs to see if I missed it.
The ESX 3.0 server was set up just over a year ago. In my previous trials with ESX and VMware’s other virtualization products, snapshots had been fantastic, and the ability to revert to them in various scenarios, such as when a recent OS or application patch went sour, saved people a lot of time. Unfortunately I forgot the saying “if it looks too good to be true, then it probably is”, and such is the case, it turns out, with the snapshot.
Over the past couple of weeks I’d noticed that the free space in the datastore was decreasing at a faster rate than usual. My curiosity piqued, I poked around and, with a recent influx of data, attributed it to that, but I didn’t tally precise numbers to see if they balanced. It being a busy time with a lot to do, I took it for what it was and moved on. A couple of days later I woke up to one of the virtual machines no longer being accessible and, checking the datastore again, saw that it had dropped over 30 GB during those couple of days, leaving only 2 MB free. I was surprised, confused, and aggravated that a machine went down. I’m sure many admins have experienced this at one point or another. You can tell whether files come from a snapshot by browsing to the virtual machine’s folder in the service console, or by right-clicking the datastore and browsing it, and looking for files with the word “delta” in their names. I no longer have the exact error message that was displayed at that point, but I was given the option to “Retry” or “Abort”. I clicked Retry and was then faced with:
There is no more space for the redo log of ComputerName-000002.vmdk. You may be able to continue this session by freeing disk space on the relevant partition, and clicking Retry. Otherwise click Abort to terminate this session.
I proceeded to run various commands in both the service console and the virtual client to work the problem. One of the threads in the VMware Communities that came up more than once in my searches is http://communities.vmware.com/message/510545#510545. I tried various suggestions from it, and I think some additional ones as well. I first tried to remove the snapshot that was listed in the snapshot manager (by clicking Delete, not Delete All), and after several hours of processing it removed the snapshot from the GUI, but the vmdk files were certainly still there. After that I tried:
vmware-cmd <cfg> hassnapshot
hassnapshot() =
Yes, that’s a blank. For whatever reason it wasn’t detecting that any snapshot existed, even though there were numerous delta files in the virtual machine’s directory. I then created a snapshot in the snapshot manager and deleted it, this time with Delete All. Still no luck; they were all still there. I continued by removing the VM from the inventory (not from the drive! Be careful there, there’s a big difference) and re-adding it. No dice. With the outlook becoming gloomier, I tried creating the snapshot in the service console with:
vmware-cmd <cfg> createsnapshot <name of snapshot>
and got back the error:
VMControl error -11: No such virtual machine
I checked the path, then checked it again; it was correct. I searched around Google for a while too and didn’t find anything helpful about the message. I suspect it may be a somewhat generic message that could mean several things.
In the end I resorted to deleting a recently built VM to clear up enough space to boot the broken one (thankfully it had not been configured yet, so not much time was lost there), and then moving the data off the broken VM so that it can be removed completely and a fresh VM built. This particular server was used as storage for the network and to hold backups, so I am thankful there isn’t much configuration that needs to occur once it’s rebuilt. I thought that while the data was transferring, all 75 GB or so, I would write this article up. There sure is a lesson learned here: regardless of how much you may trust a piece of software to work right, it can always turn on you. This goes for the Mac users out there too.
On a side note, the <cfg> placeholder above is a common abbreviation used in VMware’s documentation; it stands for the full path and file name of the vmx file. For example, in this scenario mine is similar to:
/vmfs/volumes/storage1/vmname/vmname.vmx
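The check for leftover snapshot files mentioned earlier (looking for “delta” in the file names) could be done from the service console with something like the sketch below. It is only an illustration: list_delta_files is a name I made up, it relies on nothing beyond the standard find and ls tools, and it assumes you pass it the folder that holds the vmx file.

```shell
#!/bin/sh
# Sketch: list snapshot delta files in a VM folder, with their sizes.
# list_delta_files is a hypothetical helper name, not a VMware command.
# The argument is assumed to be the folder that holds the .vmx file.
list_delta_files() {
    vm_dir="$1"
    # Snapshot data files carry "delta" in their names,
    # for example vmname-000002-delta.vmdk.
    find "$vm_dir" -type f -name '*delta*' -exec ls -lh {} \;
}

# Example call, using the sample path from this article:
# list_delta_files /vmfs/volumes/storage1/vmname
```

If this prints anything while the snapshot manager shows no snapshots, the delta files are probably leftovers like the ones described above, and their sizes show how much datastore space they are holding.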