Traditionally, one way to predict whether a drive will fail is to look for SMART trips. SMART (Self-Monitoring, Analysis and Reporting Technology) attributes are health indicators that each storage drive reports about itself, with normalized values ranging from 1 to 253.

Each drive manufacturer sets its own thresholds for healthy and unhealthy SMART values. When an attribute’s normalized value dips below its threshold, the attribute “trips.”
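
The threshold check described above can be sketched in a few lines of Python. This is only an illustration: the attribute names, values, and thresholds are made up and do not come from any real drive or manufacturer.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str       # hypothetical attribute name, for illustration only
    value: int      # normalized value, in the range 1 to 253
    threshold: int  # manufacturer-defined threshold (made up here)


def tripped(attr: SmartAttribute) -> bool:
    """Return True when the normalized value has dipped below its threshold."""
    return attr.value < attr.threshold


def smart_trips(attrs):
    """Return the names of all attributes that have tripped."""
    return [a.name for a in attrs if tripped(a)]


# Made-up readings for a single drive.
attrs = [
    SmartAttribute("reallocated_sectors", value=100, threshold=36),
    SmartAttribute("spin_retry_count", value=20, threshold=97),
]

print(smart_trips(attrs))  # ['spin_retry_count']
```

Each attribute is judged on its own, against a single fixed number, using only its latest reading.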

However, because SMART attributes and their thresholds have not been standardized across manufacturers, drives differ in which SMART attributes they report and in which thresholds count as unhealthy.

According to a past study (1), SMART trips catch only 3% to 10% of drive failures, so relying on them alone leaves most failures undetected.

A newer way to predict whether a drive will fail is to feed a range of drive health parameters into an AI algorithm, such as the one in DA Drive Analyzer.

Rather than applying a simple threshold to each health indicator, an AI can look for patterns in a drive’s health indicators over time. When a drive displays certain patterns, the AI flags the drive as unhealthy, or at risk of failure.
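
To make the difference concrete, here is a toy sketch of flagging a drive by its trend rather than by a threshold. It is not DA Drive Analyzer’s algorithm, which this article does not describe. The readings, the threshold, and the `max_decline_per_day` cutoff are all assumptions chosen for illustration, and a simple least-squares slope stands in for whatever patterns a real model would learn.

```python
def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def flag_by_trend(history, max_decline_per_day=-1.0):
    """Flag a drive whose indicator is falling steadily, even above threshold.

    max_decline_per_day is a hypothetical cutoff, not a vendor value.
    """
    return len(history) >= 2 and slope(history) <= max_decline_per_day


THRESHOLD = 36  # made-up SMART threshold for this indicator

# Made-up daily readings of one normalized health indicator.
steady = [100, 100, 99, 100, 100, 99, 100]
declining = [100, 97, 95, 92, 90, 87, 85]

for name, history in (("steady", steady), ("declining", declining)):
    below_threshold = history[-1] < THRESHOLD
    print(name, "tripped:", below_threshold, "trend flag:", flag_by_trend(history))
```

Neither drive trips the threshold, but the declining one is flagged by its trend. That is the kind of signal a single-threshold check cannot see.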

So how does the AI-based approach compare with SMART trips when it comes to detecting failures?

Let’s say a drive has failed when it experiences RAID deterioration and is subsequently removed by the user. Let’s also give each failure detection method 150 days to detect as many failed drives as it can.

In such a scenario, DA Drive Analyzer’s AI caught around 7 to 8 times as many failed drives as SMART trips did. In other words, the AI-based method was more sensitive to impending failures than SMART trips.

Both methods kept the false positive rate under 2%.
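
For readers who want to run this kind of comparison on their own fleet, the two metrics can be sketched as follows. The drive records below are fabricated toy data, not the results reported above, and the reading of the 150-day window (a flag counts only if it is raised within 150 days of the start of observation) is an assumption.

```python
WINDOW_DAYS = 150  # detection window from the scenario above


def evaluate(drives, window_days=WINDOW_DAYS):
    """Compute (detection rate, false positive rate) for one detection method.

    drives: list of (failed, first_flag_day) tuples, where failed is a bool
    and first_flag_day is the day the method first flagged the drive, or
    None if it never did. This record layout is hypothetical.
    """
    def flagged(drive):
        day = drive[1]
        return day is not None and day <= window_days

    failed = [d for d in drives if d[0]]
    healthy = [d for d in drives if not d[0]]

    # Share of failed drives that were flagged in time.
    detection_rate = sum(flagged(d) for d in failed) / len(failed)
    # Share of healthy drives that were flagged anyway.
    false_positive_rate = sum(flagged(d) for d in healthy) / len(healthy)
    return detection_rate, false_positive_rate


# Fabricated example: 4 failed drives and 5 healthy drives.
toy_drives = [
    (True, 30), (True, None), (True, 120), (True, 200),
    (False, None), (False, None), (False, None), (False, None), (False, 90),
]

detection_rate, false_positive_rate = evaluate(toy_drives)
print(detection_rate, false_positive_rate)  # 0.5 0.2
```

Running the same function on each method’s flags gives numbers that can be compared directly: a higher detection rate means more sensitivity, and the false positive rate shows what that sensitivity costs.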

Reference:

1. Basak, J., & Katz, R. H. (2017). Significance of Disk Failure Prediction in Datacenters. arXiv preprint arXiv:1707.01952.