First post. Hope I'm doing this right.
I upgraded from VMware Workstation Pro v12 to v15 about 6 months ago. Shortly after doing so I started noticing an issue where, during my normal workflow (fire up a VM, run some tests, revert the VM to a snapshot, fire up the VM again to run more tests), the VM would suddenly no longer be able to boot. Moreover, once in the bad state, reverting to snapshots also failed. Until today I'd had trouble tracking down the chain of events that causes the behavior. I'm usually engrossed in my work (testing and validating bugs and bugfixes), so I'm not paying close enough attention to VMware itself to write up what I would consider an adequate bug report -- and I understand how frustrating it is to try to track down a badly defined bug with missing or fuzzy repro steps.
I used v12 with this flow for YEARS and never ran into this issue, and since upgrading I run into it at least a couple of times a month, so it would appear to be a regression somewhere between v12 and v15.5. In case it may be a factor, both my host and guest OS (in this case) are Windows 10 x64, but I've also experienced the issue with various other Windows VMs. For what it's worth, I have a colleague with the same workflow as mine, and he has experienced the same intermittent issue. He's only ever had Workstation v15.5 and is running on completely different hardware, so the issue doesn't appear to have anything to do with my having upgraded Workstation previously or with my specific system configuration.
This has mostly just been a nuisance because I can usually restore from a backup, but today I made some lengthy changes to my test environment and had not yet backed them up. Then the issue happened.
Rather than spending hours restoring from a backup and repeating my work, I decided to try to recover. The total size of the VM hadn't changed (it was in the ballpark of 48GB before and after the snapshot restore), so it was unlikely that any data had actually been lost. Checking the VM's folder, I confirmed that the -000004.vmdk file did in fact not exist. Stranger still, there *shouldn't* be a 4th vmdk diff file at all, since the VM only has 3 snapshots. I compared the vmx file against those of a couple of other VMs that also had snapshots and realized that the only suspect line in my broken VM's vmx was: scsi0:0.fileName = "PAR8500 VMWare Converted-000004.vmdk". I changed the 4 to a 3, saved, relaunched the GUI, and my VM was able to boot and move between snapshots again. I'd figured out the breaking point in the failure and how to recover from it.
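For anyone hitting the same wall, the whole recovery was a one-character edit to the VM's .vmx file (made with the VM powered off and Workstation closed). To be clear, the "After" value below is just what worked in my case; the right index is whatever the highest-numbered diff file actually present in your VM's folder is:

```text
Before (references a diff file that does not exist on disk):
  scsi0:0.fileName = "PAR8500 VMWare Converted-000004.vmdk"

After (points at the highest-numbered diff file actually in the VM folder):
  scsi0:0.fileName = "PAR8500 VMWare Converted-000003.vmdk"
```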
Steps leading up to the error:
- VM Running
- Restore Offline Snapshot
- Immediately after the UI says the snapshot has been restored, change the RAM allotted to the VM (I'd mistakenly saved my offline snapshot with too little RAM).
- Attempt to boot VM -> Error
I'm pretty sure that step 3 in that flow is what triggers the issue, and timing is likely important. Recalling previous failures with hindsight, I know that changing other VM properties (not just RAM) can also cause the failure. I've tried to repro the issue and haven't yet been able to. My theory is that SOMETHING about restoring a snapshot and then immediately altering the VM properties causes the UI to write the wrong .vmdk value -- one higher than the file the machine should actually be referencing. To the best of my knowledge, the bad value is always n+1, but I can't confirm that.
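Until there's a proper fix, a quick sanity check after a snapshot restore could save someone's work. Here's a rough sketch of the recovery logic described above (check_vmdk_reference is my own hypothetical helper, not anything from VMware): it reads the scsi0:0.fileName entry and, if the referenced diff file is missing, suggests the n-1 file when that one exists, matching my n+1 theory.

```python
import re

def check_vmdk_reference(vmx_text, existing_files):
    """Check the vmdk referenced by scsi0:0.fileName in a .vmx.

    vmx_text: full text of the .vmx file.
    existing_files: set of filenames actually present in the VM folder.
    Returns (referenced_name, suggested_fix). suggested_fix is None when
    the reference is fine (or when no safe fix can be suggested).
    """
    m = re.search(r'scsi0:0\.fileName\s*=\s*"([^"]+)"', vmx_text)
    if not m:
        return None, None
    name = m.group(1)
    if name in existing_files:
        return name, None  # reference points at a real file; nothing to do
    # Reference is dangling: try decrementing a trailing -NNNNNN diff index,
    # since in my failures the vmx always pointed one index too high (n+1).
    parts = re.match(r'(.*-)(\d{6})(\.vmdk)$', name)
    if parts:
        candidate = f"{parts.group(1)}{int(parts.group(2)) - 1:06d}{parts.group(3)}"
        if candidate in existing_files:
            return name, candidate
    return name, None
```

Running it against my broken vmx line with only the -000003 diff on disk correctly flags the dangling -000004 reference and proposes -000003. I wouldn't auto-apply the fix, though; it just prints a suggestion so you can make the one-character vmx edit yourself.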
I hope this can be resolved in the not-too-distant future -- not for me, since I now have a workaround, but for anyone else experiencing the issue who doesn't know what is going on and might be losing valuable work.
Cheers!