Hewlett Packard Enterprise (HPE) is once again warning its customers that certain Serial-Attached SCSI solid-state drives will fail after 40,000 hours of operation, unless a critical patch is applied.
The company made a similar announcement in November 2019, when a firmware defect caused drives to fail after 32,768 hours of operation.
Affected drives
The current issue affects drives in HPE server and storage products such as HPE ProLiant, Synergy, Apollo 4200, Synergy Storage Modules, D3000 Storage Enclosure, and StoreEasy 1000 Storage.
HPE Model Number | HPE SKU | HPE SKU DESCRIPTION | HPE Spare Part SKU | HPE Firmware Fix Date |
EK0800JVYPN | 846430-B21 | HPE 800GB 12G SAS WI-1 SFF SC SSD | 846622-001 | 3/20/2020 |
EO1600JVYPP | 846432-B21 | HPE 1.6TB 12G SAS WI-1 SFF SC SSD | 846623-001 | 3/20/2020 |
MK0800JVYPQ | 846434-B21 | HPE 800GB 12G SAS MU-1 SFF SC SSD | 846624-001 | 3/20/2020 |
MO1600JVYPR | 846436-B21 | HPE 1.6TB 12G SAS MU-1 SFF SC SSD | 846625-001 | 3/20/2020 |
The company says this is the complete list of affected SSDs that it sells. However, the issue is not unique to HPE and may be present in drives from other manufacturers.
If an SSD in these HPE products runs a firmware version older than HPD7, it will fail after being powered on for 40,000 hours. That translates to 4 years, 206 days, and 16 hours, about half a year short of the extended warranty available for some of the drives.
When the failure point is reached, neither the data nor the drive can be recovered. Data loss can be avoided only in environments where backups are in place.
HPE learned about the firmware bug from an SSD manufacturer and warns that SSDs installed and put into service at the same time are likely to fail almost simultaneously.
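The conversion from power-on hours to calendar time can be checked with a few lines of Python. This is a minimal sketch that assumes 365-day years (leap days ignored); the function name is ours, not HPE's.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap days are ignored


def split_power_on_hours(hours):
    """Break a power-on-hours count into (years, days, hours)."""
    total_days, rem_hours = divmod(hours, HOURS_PER_DAY)
    years, days = divmod(total_days, DAYS_PER_YEAR)
    return years, days, rem_hours


print(split_power_on_hours(40_000))  # (4, 206, 16)
```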
“Restoration of data from backup will be required in non-fault tolerance modes (e.g., RAID 0) and in fault tolerance RAID mode if more drives fail than what is supported by the fault tolerance RAID mode logical drive [e.g. RAID 5 logical drive with two failed SSDs]” - HPE advisory
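The rule in HPE's statement can be sketched as a simple comparison between the number of failed drives and the number a RAID level tolerates. The tolerance values below are the standard ones for these RAID levels; the table and function names are illustrative and not taken from the advisory.

```python
# Drive failures each RAID level can absorb without data loss.
FAULT_TOLERANCE = {"RAID 0": 0, "RAID 5": 1, "RAID 6": 2}


def restore_required(raid_level, failed_drives):
    """True when more drives have failed than the logical drive can tolerate."""
    return failed_drives > FAULT_TOLERANCE[raid_level]


print(restore_required("RAID 0", 1))  # True
print(restore_required("RAID 5", 1))  # False
print(restore_required("RAID 5", 2))  # True
```

Because drives put into service together reach 40,000 hours together, the multi-drive case is the likely one here, not the exception.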
The new firmware can be installed by using the online flash component for VMware ESXi, Windows, and Linux.
Last month, Dell EMC released new firmware to correct a bug causing nine SanDisk SSDs in its portfolio to fail "after approximately 40,000 hours of usage." Dell identified the following models as impacted:
- LT0200MO
- LT0400MO
- LT0800MO
- LT1600MO
- LT0200WM
- LT0400WM
- LT0800WM
- LT0800RO
- LT1600RO
The update corrects a check for logging the circular buffer index value. "Assert had a bad check to validate the value of circular buffer's index value. Instead of checking the max value as N, it checked for N-1," Dell's advisory explains.
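Dell has not published the firmware code, so the following is only an illustration of the kind of off-by-one check the advisory describes. The value of N, the function names, and the way the index advances are all hypothetical.

```python
N = 8  # hypothetical maximum valid index value


def check_index_buggy(index):
    # Bad check: treats N - 1 as the maximum, so a valid index of N is rejected.
    if not 0 <= index <= N - 1:
        raise AssertionError(f"index {index} out of range")


def check_index_fixed(index):
    # Corrected check: the maximum allowed value is N.
    if not 0 <= index <= N:
        raise AssertionError(f"index {index} out of range")


def first_failure(check, limit):
    """Return the first index in 0..limit that the check rejects, or None."""
    for index in range(limit + 1):
        try:
            check(index)
        except AssertionError:
            return index
    return None


print(first_failure(check_index_buggy, N))  # 8
print(first_failure(check_index_fixed, N))  # None
```

In this sketch, the buggy check works for every value until the index reaches N for the first time. That would be consistent with drives running normally for years before failing at a fixed point.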
Customers that were shipped one or more of the affected SSD models were informed about this "potentially critical issue" with the recommendation to apply the update immediately.
Not as bad as last time
There is some good news, though. Judging by HPE's shipping dates and the 40,000-hour limit, no affected SSD has yet failed because of this firmware bug.
HPE estimates that unpatched SSDs will begin to fail as early as October 2020. This gives plenty of time for admins to apply the corrected firmware.
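A rough way to see where an estimate like this comes from is to add 40,000 hours to the date a drive entered service. The sketch below assumes the drive has been powered on continuously since then. The in-service date is made up for illustration and is not an HPE shipping date.

```python
from datetime import datetime, timedelta

FAILURE_THRESHOLD = timedelta(hours=40_000)


def earliest_failure(in_service):
    """Projected failure time for a drive powered on continuously since in_service."""
    return in_service + FAILURE_THRESHOLD


# Hypothetical drive put into service on March 10, 2016.
print(earliest_failure(datetime(2016, 3, 10)))
```

Under those assumptions, a drive installed in early March 2016 would reach the limit at the start of October 2020.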
Back in November, reports of storage drive failures poured in on social media and forums, with users complaining about devices failing in bulk, minutes apart.
Finding out the uptime of an affected drive is possible with the Smart Storage Administrator (SSA) utility, which offers the power-on time for every drive installed on the system.
Alternatively, users can run scripts that can check if the firmware on their SSDs has the 40,000 power-on-hours failure issue. The scripts work for certain HPE SAS SSDs and are available for Linux, VMware and Windows.
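HPE's scripts are not reproduced here. As a rough illustration of the same idea, the sketch below flags drives from a hand-built inventory using the model numbers listed above and the HPD7 firmware cutoff. It assumes firmware versions have the form "HPD" followed by a number that can be compared numerically, which the advisory does not confirm. The inventory entries are invented.

```python
import re

AFFECTED_MODELS = {"EK0800JVYPN", "EO1600JVYPP", "MK0800JVYPQ", "MO1600JVYPR"}
FIXED_FIRMWARE = 7  # numeric part of HPD7 (assumed to be comparable as a number)
THRESHOLD_HOURS = 40_000


def firmware_number(version):
    """Return the number after 'HPD', or None if the string has another shape."""
    match = re.fullmatch(r"HPD(\d+)", version)
    return int(match.group(1)) if match else None


def needs_update(model, firmware):
    if model not in AFFECTED_MODELS:
        return False
    number = firmware_number(firmware)
    # Unparseable versions are flagged for manual review rather than assumed safe.
    return number is None or number < FIXED_FIRMWARE


# Invented inventory: (model, firmware version, power-on hours).
inventory = [
    ("EK0800JVYPN", "HPD5", 31_250),
    ("MO1600JVYPR", "HPD7", 28_000),
    ("XX0000SAMPLE", "HPD1", 39_000),  # made-up model, not on the list
]

for model, firmware, hours in inventory:
    if needs_update(model, firmware):
        remaining = max(THRESHOLD_HOURS - hours, 0)
        print(f"{model} ({firmware}): update needed, {remaining} hours left")
```

The power-on hours would come from a tool such as SSA. How to export them is outside the scope of this sketch.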
Update March 25, 09:05 EDT: Article updated with details about some SanDisk SSDs that could also fail after 40,000 hours of operation time.
h/t JohnC_21 (comment below)
Comments
JohnC_21 - 4 years ago
The company said in a bulletin that the “issue is not unique to HPE and potentially affects all customers that purchased these drives.” HPE has not identified the SSD maker and refused to do so, saying: “We’re not confirming manufacturers.”
However, a Dell EMC urgent firmware update issued last month also mentioned SSDs failing after 40,000 operating hours and specifically identified SanDisk SAS drives. The update included firmware version D417 as a fix.
The fault fixed by the Dell EMC firmware concerns an Assert function which had a bad check to validate the value of a circular buffer’s index value. Instead of checking the maximum value as N, it checked for N-1. The fix corrects the assert check to use the maximum value as N.
It seems likely that the HPE drives are SanDisk drives as well.
https://blocksandfiles.com/2020/03/24/hpe-enterprise-ssd-40k-hours-flaw/
ilaion - 4 years ago
Thanks JohnC_21.
the_moss_666 - 4 years ago
"...half a year shorter than the extended warranty". Is it really the bug, or HP just failed to convert their years of warranty to hours?
Remember, first time it's an accident, second time it's a pattern and third time it's the intent.
Melr1 - 4 years ago
Planned obsolescence at its best.
Notice the "bug" in the firmware code is not shown to the public.
Yup, this shows deliberate intent. I'm certain it is only a short while before someone wakes up and takes them to court - and wins.