We've been running ESXi 6.0 build 3825889 for about six months without any issues. Out of nowhere, one of our production Windows Server 2008 R2 VMs powered off this morning. When I powered it back up, it was clear the shutdown had been ungraceful, but there was no sign of a blue screen or a minidump file inside the guest. I then checked vSphere and found the following events for the VM (redacted info in brackets):
Description | Type | Date/Time |
---|---|---|
Error message from [Host IP]: We will respond on the basis of your support entitlement. | error | 2/10/2017 7:35:34 AM |
Virtual machine on [Host IP] is powered off | info | 2/10/2017 7:35:41 AM |
Alarm 'Virtual machine CPU usage' changed from Green to Gray | info | 2/10/2017 7:35:42 AM |
Alarm 'Virtual machine memory usage' changed from Green to Gray | info | 2/10/2017 7:35:42 AM |
Task: Power On virtual machine (when I powered it back on) | info | 2/10/2017 8:03:17 AM |
On checking the vmware.log file for the VM, I discovered the following:
2017-02-10T14:35:27.351Z| vcpu-0| W110: MONITOR PANIC: vcpu-2:VMM fault 14: src=MONITOR rip=0x200000001 regs=0xfffffffffc607d20
2017-02-10T14:35:27.351Z| vcpu-0| I120: Core dump with build build-3825889
2017-02-10T14:35:27.351Z| vcpu-3| I120: Exiting vcpu-3
2017-02-10T14:35:27.351Z| vcpu-1| I120: Exiting vcpu-1
2017-02-10T14:35:27.351Z| vcpu-0| W110: Writing monitor corefile "/vmfs/volumes/574e0774-f40bbd18-1006-246e96162c88/APPS/vmmcores.gz"
2017-02-10T14:35:27.351Z| vcpu-2| I120: Exiting vcpu-2
2017-02-10T14:35:27.361Z| vcpu-0| W110: Dumping core for vcpu-0
2017-02-10T14:35:27.361Z| vcpu-0| I120: CoreDump: dumping core with superuser privileges
2017-02-10T14:35:27.361Z| vcpu-0| I120: VMK Stack for vcpu 0 is at 0x439162a13000
2017-02-10T14:35:27.361Z| vcpu-0| I120: Beginning monitor coredump
2017-02-10T14:35:27.784Z| vcpu-0| I120: End monitor coredump
2017-02-10T14:35:27.785Z| vcpu-0| W110: Dumping core for vcpu-1
2017-02-10T14:35:27.785Z| vcpu-0| I120: CoreDump: dumping core with superuser privileges
2017-02-10T14:35:27.785Z| vcpu-0| I120: VMK Stack for vcpu 1 is at 0x439162b13000
2017-02-10T14:35:27.785Z| vcpu-0| I120: Beginning monitor coredump
2017-02-10T14:35:28.203Z| vcpu-0| I120: End monitor coredump
2017-02-10T14:35:28.204Z| vcpu-0| W110: Dumping core for vcpu-2
2017-02-10T14:35:28.204Z| vcpu-0| I120: CoreDump: dumping core with superuser privileges
2017-02-10T14:35:28.204Z| vcpu-0| I120: VMK Stack for vcpu 2 is at 0x439162b93000
2017-02-10T14:35:28.204Z| vcpu-0| I120: Beginning monitor coredump
2017-02-10T14:35:28.629Z| vcpu-0| I120: End monitor coredump
2017-02-10T14:35:28.629Z| vcpu-0| W110: Dumping core for vcpu-3
2017-02-10T14:35:28.629Z| vcpu-0| I120: CoreDump: dumping core with superuser privileges
2017-02-10T14:35:28.629Z| vcpu-0| I120: VMK Stack for vcpu 3 is at 0x439162c13000
2017-02-10T14:35:28.629Z| vcpu-0| I120: Beginning monitor coredump
2017-02-10T14:35:29.047Z| vcpu-0| I120: End monitor coredump
2017-02-10T14:35:34.860Z| vcpu-0| W110: A core file is available in "/vmfs/volumes/574e0774-f40bbd18-1006-246e96162c88/APPS/vmx-zdump.000"
2017-02-10T14:35:34.860Z| vcpu-0| I120: Msg_Post: Error
2017-02-10T14:35:34.860Z| vcpu-0| I120: [msg.log.error.unrecoverable] VMware ESX unrecoverable error: (vcpu-0)
2017-02-10T14:35:34.860Z| vcpu-0| I120+ vcpu-2:VMM fault 14: src=MONITOR rip=0x200000001 regs=0xfffffffffc607d20
2017-02-10T14:35:34.860Z| vcpu-0| I120: [msg.panic.haveLog] A log file is available in "/vmfs/volumes/574e0774-f40bbd18-1006-246e96162c88/APPS/vmware.log".
2017-02-10T14:35:34.860Z| vcpu-0| I120: [msg.panic.requestSupport.withoutLog] You can request support.
2017-02-10T14:35:34.860Z| vcpu-0| I120: [msg.panic.requestSupport.vmSupport.vmx86]
2017-02-10T14:35:34.860Z| vcpu-0| I120+ To collect data to submit to VMware technical support, run "vm-support".
2017-02-10T14:35:34.860Z| vcpu-0| I120: [msg.panic.response] We will respond on the basis of your support entitlement.
2017-02-10T14:35:34.860Z| vcpu-0| I120: ----------------------------------------
2017-02-10T14:35:34.861Z| vcpu-0| I120: Vigor_MessageRevoke: message 'msg.panic.response' (seq 2260) is revoked
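For anyone who wants to pull the key fields out of a log like this, here is a minimal Python sketch that parses the `MONITOR PANIC` line and converts its UTC timestamp to local time so it can be lined up with the vSphere events. The regex is written against the one line shown above, not against any documented log format, and the UTC-7 offset is an assumption inferred from the 7:35:34 AM event matching 14:35:34Z in the log. `parse_panic` and `LOCAL_TZ` are illustrative names.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed local offset (UTC-7), inferred from the event table vs. the log times.
LOCAL_TZ = timezone(timedelta(hours=-7))

# Pattern modeled on the single MONITOR PANIC line in the log above.
PANIC_RE = re.compile(
    r"^(?P<ts>\S+?)\|\s*(?P<thread>[\w-]+)\|\s*\w+:\s*MONITOR PANIC:\s*"
    r"(?P<vcpu>vcpu-\d+):(?P<kind>VMM(?:64)?) fault (?P<vector>\d+):\s*"
    r"src=(?P<src>\w+)\s+rip=(?P<rip>0x[0-9a-fA-F]+)\s+regs=(?P<regs>0x[0-9a-fA-F]+)"
)

def parse_panic(line):
    """Return the panic fields as a dict, or None if the line is not a panic line."""
    m = PANIC_RE.match(line)
    if m is None:
        return None
    utc = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return {
        "utc_time": utc,
        "local_time": utc.astimezone(LOCAL_TZ),
        "logging_thread": m["thread"],  # thread that wrote the log line
        "vcpu": m["vcpu"],              # vCPU named in the panic message
        "kind": m["kind"],              # "VMM" or "VMM64"
        "vector": int(m["vector"]),     # fault number, 14 here
        "src": m["src"],
        "rip": int(m["rip"], 16),
        "regs": int(m["regs"], 16),
    }

SAMPLE = ("2017-02-10T14:35:27.351Z| vcpu-0| W110: MONITOR PANIC: vcpu-2:VMM fault 14: "
          "src=MONITOR rip=0x200000001 regs=0xfffffffffc607d20")

info = parse_panic(SAMPLE)
print(info["vcpu"], info["kind"], info["vector"], info["local_time"].isoformat())
```

Running the same function over every line of vmware.log and keeping the non-`None` results would show whether the panic was logged more than once.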
That led me to the VMware KB article "Understanding VMM fault and VMM64 fault virtual machine monitor and executable failures" (1021174), which has this entry:
Type | Example error | Description |
---|---|---|
Exception 14 (Page Fault) | MONITOR PANIC: vcpu-0:VMM64 fault 14: src=MONITOR rip=0xfffffffffc2e99d3 regs=0xfffffffffc008e98 | Occurs when a program attempts to access a page mapped in the virtual address space, but it has not been successfully loaded into memory. |
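The fault number in the panic message lines up with the x86 exception vector numbering, where vector 14 is the page fault (#PF). As a small illustration, the lookup below maps a few of the architectural vectors to their names. It is a deliberately partial table based on the standard x86 vector assignments, not a copy of the KB's list, and `describe_fault` is a hypothetical helper name.

```python
# Partial table of x86 exception vectors (architectural numbering).
# Only a handful of common vectors are listed here on purpose.
X86_EXCEPTIONS = {
    0: "#DE Divide Error",
    6: "#UD Invalid Opcode",
    8: "#DF Double Fault",
    13: "#GP General Protection",
    14: "#PF Page Fault",
}

def describe_fault(vector):
    """Return a readable name for a fault number, or a fallback for unlisted ones."""
    return X86_EXCEPTIONS.get(vector, f"vector {vector} (not in this partial table)")

print(describe_fault(14))
```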
I ran vm-support and sent the resulting .tgz archive to Dell ProSupport (I figured I'd try them first, since we only have vSphere Essentials and would have to pay for an incident with VMware). Dell essentially reached the same conclusion I did (the VMware KB mentioned above) and tentatively recommended upgrading to a newer ESXi build, but they lacked the tools to analyze the coredump and dig deeper.

The VM has been running fine since I powered it back on this morning, but I'd still like to know what caused this. I thought I'd try my luck here before paying for an incident. I'm happy to provide the coredump files mentioned above if needed. Any help is greatly appreciated.
P.S. No settings had been changed on this VM recently (guest or vSphere), and the Event Viewer in the guest didn't shed any light on it, just the expected errors after booting up from an unexpected shutdown.
Edit: I find it odd that the VMM fault was seemingly 32-bit even though the VM is 64-bit.
Edit2: The hardware version of said VM is v10
Edit3: The host is a Dell PowerEdge R730
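On the 32-bit vs. 64-bit point in the first edit: this does not explain why the message says `VMM fault` rather than `VMM64 fault`, but a quick check shows that the `rip` and `regs` values logged in the panic are both wider than 32 bits. A minimal sketch (the helper names are illustrative):

```python
def bits_needed(value_hex):
    """Number of bits needed to represent a hex string as an unsigned integer."""
    return int(value_hex, 16).bit_length()

def fits_in_32_bits(value_hex):
    """True if the value can be held in a 32-bit register."""
    return bits_needed(value_hex) <= 32

# Values copied from the MONITOR PANIC line in vmware.log.
RIP = "0x200000001"
REGS = "0xfffffffffc607d20"

for name, value in (("rip", RIP), ("regs", REGS)):
    print(name, value, "bits:", bits_needed(value), "fits in 32:", fits_in_32_bits(value))
```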