Failure Trends in a Large Disk Drive Population
Appears in the Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07), February 2007
Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
Google Inc.
1600 Amphitheatre Pkwy
Mountain View, CA 94043
{edpin,wolf,luiz}@google.com
Abstract
It is estimated that over 90% of all new information produced
in the world is being stored on magnetic media, most of it on
hard disk drives. Despite their importance, there is relatively
little published work on the failure patterns of disk drives, and
the key factors that affect their lifetime. Most available data
are based either on extrapolation from accelerated aging experiments
or on relatively modest-sized field studies. Moreover,
larger population studies rarely have the infrastructure in place
to collect health signals from components in operation, which
is critical information for detailed failure analysis.
We present data collected from detailed observations of a
large disk drive population in a production Internet services deployment.
The population observed is many times larger than
that of previous studies. In addition to presenting failure statistics,
we analyze the correlation between failures and several
parameters generally believed to impact longevity.
Our analysis identifies several parameters from the drive’s
self-monitoring facility (SMART) that correlate highly with
failures. Despite this high correlation, we conclude that models
based on SMART parameters alone are unlikely to be useful
for predicting individual drive failures. Surprisingly, we found
that temperature and activity levels were much less correlated
with drive failures than previously reported.
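
The difference between a parameter that correlates strongly with failures and one that predicts individual failures can be made concrete with a small sketch. The drive counts below are invented purely for illustration and are not taken from this study, and `signal_quality` is a hypothetical helper, not part of any SMART tooling.

```python
# Hypothetical illustration only: none of these numbers come from the study.

def signal_quality(n_drives, n_failed, flagged_failed, flagged_healthy):
    """Summarize how well a binary health signal separates failed from healthy drives.

    n_drives        -- total drives observed
    n_failed        -- drives that failed during the observation window
    flagged_failed  -- failed drives that showed the signal beforehand
    flagged_healthy -- surviving drives that also showed the signal
    """
    n_healthy = n_drives - n_failed
    unflagged_failed = n_failed - flagged_failed
    unflagged_healthy = n_healthy - flagged_healthy

    # Failure rate among drives with the signal and among drives without it.
    rate_flagged = flagged_failed / (flagged_failed + flagged_healthy)
    rate_unflagged = unflagged_failed / (unflagged_failed + unflagged_healthy)

    return {
        # How many times more likely a flagged drive is to fail.
        "relative_risk": rate_flagged / rate_unflagged,
        # Fraction of all failures that the signal would have caught.
        "recall": flagged_failed / n_failed,
        # Fraction of flagged drives that actually failed.
        "precision": rate_flagged,
    }


# Assumed, made-up population: 100,000 drives, 5,000 of which failed.
stats = signal_quality(
    n_drives=100_000,
    n_failed=5_000,
    flagged_failed=2_000,
    flagged_healthy=1_000,
)
print(stats)
```

With these assumed counts, a flagged drive is many times more likely to fail than an unflagged one, yet the signal is absent from most of the drives that failed. A model built on such a signal would therefore miss the majority of individual failures despite the strong population-level correlation.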
1 Introduction
The tremendous advances in low-cost, high-capacity
magnetic disk drives have been among the key factors
helping establish a modern society that is deeply reliant
on information technology. High-volume, consumer-grade
disk drives have become such a successful product
that their deployments range from home computers
and appliances to large-scale server farms. In 2002, for
example, it was estimated that over 90% of all new information
produced was stored on magnetic media, most
of it on hard disk drives [12]. It is therefore critical
to improve our understanding of how robust these components
are and what main factors are associated with
failures. Such understanding can be particularly useful
for guiding the design of storage systems as well as devising
deployment and maintenance strategies.
Despite the importance of the subject, there are very
few published studies on failure characteristics of disk
drives. Most of the available information comes from
the disk manufacturers themselves [2]. Their data are
typically based on extrapolation from accelerated life
test data of small populations or from returned unit
databases. Accelerated life tests, although useful in providing
insight into how some environmental factors can
affect disk drive lifetime, have been known to be poor
predictors of actual failure rates as seen by customers
in the field [7]. Statistics from returned units are typically
based on much larger populations, but since there
is little or no visibility into the deployment characteristics,
the analysis lacks valuable insight into what actually
happened to the drive during operation. In addition,
since units are typically returned during the warranty period
(often three years or less), manufacturers’ databases
may not be as helpful for the study of long-term effects.
A few recent studies have shed some light on field
failure behavior of disk drives [6, 7, 9, 16, 17, 19, 20].
However, these studies either reported on relatively
modest populations or did not monitor the disks closely
enough during deployment to provide insights into the
factors that might be associated with failures.
Disk drives are generally very reliable but they are
also very complex components. This combination
means that although they fail rarely, when they do fail,
the possible causes of failure can be numerous. As a
result, detailed studies of very large populations are the
...
...