Hi,
I'm using a Samsung 1725b NVMe on ESXi 7.0 and wonder what people are using to
- monitor the health (TBW, errors, temperature)
- predict failures based on this (limited) data
For a normal SSD, I get a lot of information when using
# esxcli storage core device smart get -d ID
Parameter | Value | Threshold | Worst | Raw |
Health Status | OK | N/A | N/A | N/A |
Media Wearout Indicator | 99 | 5 | 99 | 172 |
Write Error Count | 100 | 10 | 100 | 0 |
Power-on Hours | 92 | 0 | 92 | 151 |
Power Cycle Count | 99 | 0 | 99 | 14 |
Reallocated Sector Count | 100 | 10 | 100 | 0 |
Drive Temperature | 69 | 0 | 63 | 31 |
Write Sectors TOT Count | 99 | 0 | 99 | 39 |
Read Sectors TOT Count | 99 | 0 | 99 | 40 |
Initial Bad Block Count | 100 | 10 | 100 | 0 |
Program Fail Count | 100 | 10 | 100 | 0 |
Erase Fail Count | 100 | 10 | 100 | 0 |
Uncorrectable Error Count | 100 | 0 | 100 | 0 |
Pending Sector Reallocation Count | 100 | 0 | 100 | 0 |
For the NVMe, I only get this:
Parameter | Value | Threshold | Worst | Raw |
Health Status | OK | N/A | N/A | N/A |
Power-on Hours | 1677 | N/A | N/A | N/A |
Power Cycle Count | 3 | N/A | N/A | N/A |
Reallocated Sector Count | 0 | 90 | N/A | N/A |
Drive Temperature | 36 | 79 | N/A | N/A |
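To make the monitoring part of the question concrete, here is a rough sketch of how the little that is reported could be scraped and checked. It assumes the pipe-delimited layout as pasted above (the real esxcli output may be aligned differently), and it assumes the Drive Temperature threshold is a maximum in the same unit as the value. The function names parse_smart_table and temperature_margin are made up for illustration.

```python
# Sample rows in the pipe-delimited layout pasted in this post (assumed layout).
SAMPLE = """\
Parameter | Value | Threshold | Worst | Raw |
Health Status | OK | N/A | N/A | N/A |
Power-on Hours | 1677 | N/A | N/A | N/A |
Power Cycle Count | 3 | N/A | N/A | N/A |
Reallocated Sector Count | 0 | 90 | N/A | N/A |
Drive Temperature | 36 | 79 | N/A | N/A |
"""


def parse_smart_table(text):
    """Turn pipe-delimited SMART rows into {parameter: {column: string}}."""
    rows = {}
    for line in text.splitlines():
        # Drop the trailing pipe, then split into the five expected columns.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 5 or cells[0] == "Parameter":
            continue  # skip the header and anything that is not a data row
        name, value, threshold, worst, raw = cells
        rows[name] = {
            "value": value,
            "threshold": threshold,
            "worst": worst,
            "raw": raw,
        }
    return rows


def temperature_margin(rows):
    """Return threshold minus current temperature, or None if not numeric.

    Assumption: the threshold is an upper limit in the same unit as the value.
    """
    entry = rows.get("Drive Temperature")
    if entry is None:
        return None
    try:
        return int(entry["threshold"]) - int(entry["value"])
    except ValueError:
        return None  # e.g. "N/A"


if __name__ == "__main__":
    smart = parse_smart_table(SAMPLE)
    print("Health:", smart["Health Status"]["value"])
    print("Temperature margin:", temperature_margin(smart))
```

This only covers what the NVMe reports here, so it says nothing about TBW or wear, which is exactly the gap I'm asking about.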
There have been some efforts to get smartctl up and running on ESXi, but all of them are unofficial:
https://www.virten.net/2016/05/determine-tbw-from-ssds-with-s-m-a-r-t-values-in-esxi-smartctl/
Thanks for any info.
-Mark