Channel: VMware Communities : All Content - All Communities

vSAN all-flash lab failures


Hi vSAN folks. I'm building a lab and I'm aware my hardware choice is not officially supported, but also aware that it's a popular choice even with some VMware engineers.

I'm wondering if anybody out there has a VMware vSAN environment running successfully on NUC10i7FNH hardware, or NUC10s in general? Or does anybody recognise the error?

I've set up a 3 node cluster for a small vSAN all-flash eval lab. Lab hardware overview:

  • 3x NUC10i7FNH3
  • 3x Samsung PM883 960GB 2.5" SATA3 Enterprise SSD/Solid State Drive (Capacity Tier)
  • 3x WD Black 250GB SN750 NVMe SSD (Flash Tier)

I've built the cluster and everything looks great for a while. Then the hosts start marking their disk groups as failed! Here are the errors in the logs that seem relevant:

  • This occurs at exactly the point when the disk group is marked as Unhealthy, on each host. I know this error usually indicates a disk error, write failure, or faulty media, but in this case all the disks have been swapped and it happens on every NUC. The media itself isn't failing; for some reason it's returning an I/O error, which causes the disk to be marked as dead:
    WARNING: LSOMCommon: IORETRYParentIODoneCB:2219: Throttled: split status I/O error
    WARNING: PLOG: PLOGElevWriteMDCb:746: MD UUID 52b7d790-0e5d-a8b2-c290-8db105925979 write failed I/O error

 

  • This error is repeated fairly frequently in the logs:
    WARNING: NvmeScsi: 149: SCSI opcode 0x1a (0x453a411fe1c0) on path vmhba1:C0:T0:L0 to namespace t10.NVMe____WDS250G3X0C2D00SJG0______________________50E0DE448B441B00 failed with NVMe error status: 0x2
    WARNING: translating to SCSI error H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0
    WARNING: NvmeScsi: 149: SCSI opcode 0x85 (0x453a40fbc680) on path vmhba1:C0:T0:L0 to namespace t10.NVMe____WDS250G3X0C2D00SJG0______________________50E0DE448B441B00 failed with NVMe error status:
    WARNING: translating to SCSI error H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0

 

  • Maybe not relevant, but during boot I can see:
    nvme_pcie00580000:NVMEPCIEAdapterInit:446:workaround=0
    WARNING: Invalid parameter: vmknvme_client_type -1, set to default value 1.
    WARNING: Invalid parameter: vmknvme_io_queue_num 0, set to default value 1.
    WARNING: NVMEPSA:2003 Failed to query initiator attributes: Not supported

 
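For anyone reading those sense bytes: the triples `0x5 0x24 0x0` and `0x5 0x20 0x0` are standard SCSI sense key / ASC / ASCQ values, and opcodes 0x1a and 0x85 are MODE SENSE(6) and ATA PASS-THROUGH(16). A quick lookup (an illustrative helper I wrote, not a VMware tool, with the tables trimmed to the codes seen here) shows both are Illegal Request responses from the NVMe-to-SCSI translation layer, not media errors:

```python
# Decode the "Valid sense data: KEY ASC ASCQ" triples from the vmkernel log.
# Illustrative only -- tables trimmed to the codes appearing in this post.

SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {
    (0x24, 0x0): "INVALID FIELD IN CDB",
    (0x20, 0x0): "INVALID COMMAND OPERATION CODE",
}
OPCODES = {0x1A: "MODE SENSE(6)", 0x85: "ATA PASS-THROUGH(16)"}

def decode(key, asc, ascq):
    """Return a human-readable string for a sense key / ASC / ASCQ triple."""
    return f"{SENSE_KEYS.get(key, hex(key))}: {ASC_ASCQ.get((asc, ascq), hex(asc))}"

# The two warnings quoted above:
print(decode(0x5, 0x24, 0x0))  # opcode 0x1a -> ILLEGAL REQUEST: INVALID FIELD IN CDB
print(decode(0x5, 0x20, 0x0))  # opcode 0x85 -> ILLEGAL REQUEST: INVALID COMMAND OPERATION CODE
```

So the drive is rejecting specific translated SCSI commands, which is common on consumer NVMe devices; whether that's what trips LSOM into failing the disk group is the open question.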

Here are some things I've tried, in an effort to narrow this down:

  • This happens with both ESXi 6.7 Update 3, and ESXi 7. (I have built both environments from scratch to test this, no change. Latest updates are applied to both.)
  • I suspected it was because of some incompatibility with the NVME disk brand, so I have replaced the NVME disks I originally bought (Samsung 970 EVO Plus) with WD Black SN750 in all hosts. No change at all.
  • I tested with a 4th host; the same issue occurred.
  • I upgraded the SSD used for capacity tier to a disk that's on the HCL (Samsung) and the nature of the errors and events logged didn't change, still no joy.
  • I've tried this on the original NUC firmware v37 and the latest v39.
  • I've entirely disabled power management in the BIOS
  • I've tried UEFI and Legacy boot
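One more thing that helped me while testing: tallying the PLOG write failures per disk UUID across a saved vmkernel.log confirms whether the I/O errors cluster on one device or hit every disk group. A small sketch (log line format assumed from the snippets quoted above):

```python
# Tally PLOG metadata write failures per disk UUID in a saved vmkernel.log,
# to see whether the I/O errors cluster on one device or hit all of them.
# Sketch only -- line format taken from the warnings quoted in this post.
import re
from collections import Counter

PLOG_RE = re.compile(r"PLOGElevWriteMDCb:\d+: MD UUID (\S+) write failed")

def count_failures(lines):
    """Return a Counter mapping disk UUID -> number of failed MD writes."""
    return Counter(m.group(1) for line in lines if (m := PLOG_RE.search(line)))

# Example using the line quoted earlier:
sample = [
    "WARNING: PLOG: PLOGElevWriteMDCb:746: MD UUID "
    "52b7d790-0e5d-a8b2-c290-8db105925979 write failed I/O error",
]
print(count_failures(sample))
```

In my case the failures show up on every host's disks, which is what makes me suspect the platform rather than any individual drive.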

I have been looking at this for a week or two now, and it's causing me more grey hair. Does anybody have any ideas, or even better, a NUC10 environment where this works?

