Hi vSAN folks. I'm building a lab and I'm aware my hardware choice is not officially supported, but also aware that it's a popular choice even with some VMware engineers.
I'm wondering if anybody out there has a VMware vSAN environment running successfully on NUC10i7FNH hardware, or NUC10s in general? Or does anybody recognise the error?
I've set up a 3 node cluster for a small vSAN all-flash eval lab. Lab hardware overview:
- 3x NUC10i7FNH3
- 3x Samsung PM883 960GB 2.5" SATA3 Enterprise SSD/Solid State Drive (Capacity Tier)
- 3x WD Black 250GB SN750 NVMe SSD (Flash Tier)
I've built the cluster and everything looks great for a bit. Then the hosts start to mark their disk group as failed! The errors in the logs which seem relevant:
- This occurs at exactly the point when the disk group is marked as Unhealthy, on each host. I know this usually indicates a disk error/write failure/faulty media, but in this case all disks have been swapped and it happens on every NUC. The media itself isn't failing, yet for some reason it's returning an I/O error, causing the disk group to be marked as dead:
WARNING: LSOMCommon: IORETRYParentIODoneCB:2219: Throttled: split status I/O error
WARNING: PLOG: PLOGElevWriteMDCb:746: MD UUID 52b7d790-0e5d-a8b2-c290-8db105925979 write failed I/O error
- This error is repeated fairly frequently in the logs:
WARNING: NvmeScsi: 149: SCSI opcode 0x1a (0x453a411fe1c0) on path vmhba1:C0:T0:L0 to namespace t10.NVMe____WDS250G3X0C2D00SJG0______________________50E0DE448B441B00 failed with NVMe error status: 0x2
WARNING: translating to SCSI error H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
WARNING: NvmeScsi: 149: SCSI opcode 0x85 (0x453a40fbc680) on path vmhba1:C0:T0:L0 to namespace t10.NVMe____WDS250G3X0C2D00SJG0______________________50E0DE448B441B00 failed with NVMe error status:
WARNING: translating to SCSI error H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0
- Maybe not relevant, but during boot I can see:
nvme_pcie00580000:NVMEPCIEAdapterInit:446:workaround=0
WARNING: Invalid parameter: vmknvme_client_type -1, set to default value 1.
WARNING: Invalid parameter: vmknvme_io_queue_num 0, set to default value 1.
WARNING: NVMEPSA:2003 Failed to query initiator attributes: Not supported
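For what it's worth, the sense data in those NvmeScsi warnings can be decoded with standard SCSI/NVMe spec values (this is my own decode sketch, not VMware tooling, so treat it as an interpretation):

```python
# Decode the fields from the NvmeScsi warnings above.
# Lookup values come from the SCSI (SPC) and NVMe specs; only the codes
# that actually appear in my logs are included.

SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC = {
    0x24: "INVALID FIELD IN CDB",
    0x20: "INVALID COMMAND OPERATION CODE",
}
OPCODES = {0x1A: "MODE SENSE(6)", 0x85: "ATA PASS-THROUGH(16)"}
NVME_STATUS = {0x2: "Invalid Field in Command"}

def decode(opcode, sense_key, asc):
    """Return a human-readable summary of one NvmeScsi warning."""
    return (f"{OPCODES.get(opcode, hex(opcode))}: "
            f"{SENSE_KEYS.get(sense_key, hex(sense_key))} / "
            f"{ASC.get(asc, hex(asc))}")

# The two warnings from the log:
print(decode(0x1A, 0x5, 0x24))  # MODE SENSE(6): ILLEGAL REQUEST / INVALID FIELD IN CDB
print(decode(0x85, 0x5, 0x20))  # ATA PASS-THROUGH(16): ILLEGAL REQUEST / INVALID COMMAND OPERATION CODE
```

If I'm reading it right, the drive's NVMe-to-SCSI translation layer is rejecting commands it doesn't implement (MODE SENSE and ATA pass-through, the latter typically sent by SMART polling). That seems common with consumer NVMe under ESXi and may be a separate issue from the PLOG write failures.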
Here are some things I've tried, in an effort to narrow this down:
- This happens with both ESXi 6.7 Update 3, and ESXi 7. (I have built both environments from scratch to test this, no change. Latest updates are applied to both.)
- I suspected some incompatibility with the NVMe disk brand, so I replaced the NVMe disks I originally bought (Samsung 970 EVO Plus) with WD Black SN750s in all hosts. No change at all.
- I tested with a 4th host; same issue.
- I upgraded the SSD used for capacity tier to a disk that's on the HCL (Samsung) and the nature of the errors and events logged didn't change, still no joy.
- I've tried this on original NUC firmware v37 and the latest v39
- I've entirely disabled power management in the BIOS
- I've tried UEFI and Legacy boot
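To correlate the warnings across hosts I've been pulling them out of /var/log/vmkernel.log with a small script. Sharing it in case it helps someone else narrow the timeline (the patterns are taken from my log lines above; adjust for your environment):

```python
import re
from collections import Counter

# Patterns for the warnings that appear around the disk-group failure.
# These match the vmkernel.log lines quoted earlier in this post.
PATTERNS = {
    "lsom_ioretry":  re.compile(r"LSOMCommon: IORETRYParentIODoneCB.*I/O error"),
    "plog_md_write": re.compile(r"PLOG: PLOGElevWriteMDCb.*write failed I/O error"),
    "nvme_scsi":     re.compile(r"NvmeScsi: \d+: SCSI opcode (0x[0-9a-f]+)"),
}

def scan(lines):
    """Count each warning type in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

# On a host (path assumed, ESXi default):
#   counts = scan(open("/var/log/vmkernel.log"))
```

Comparing the counts and timestamps per host is how I confirmed the NvmeScsi warnings repeat constantly while the PLOG/LSOM errors only appear at the moment the disk group is marked Unhealthy.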
I have been looking at this for a week or two now, and it's causing me more grey hair. Does anybody have any ideas, or even better, a NUC10 environment where this works?