

Appears in the Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07), February 2007

Failure Trends in a Large Disk Drive Population

Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso

Google Inc.

1600 Amphitheatre Pkwy

Mountain View, CA 94043

{edpin,wolf,luiz}@google.com

Abstract

It is estimated that over 90% of all new information produced

in the world is being stored on magnetic media, most of it on

hard disk drives. Despite their importance, there is relatively

little published work on the failure patterns of disk drives, and

the key factors that affect their lifetime. Most available data

are based either on extrapolation from accelerated aging experiments

or on relatively modest-sized field studies. Moreover,

larger population studies rarely have the infrastructure in place

to collect health signals from components in operation, which

is critical information for detailed failure analysis.

We present data collected from detailed observations of a

large disk drive population in a production Internet services deployment.

The population observed is many times larger than

that of previous studies. In addition to presenting failure statistics,

we analyze the correlation between failures and several

parameters generally believed to impact longevity.

Our analysis identifies several parameters from the drive’s

self-monitoring facility (SMART) that correlate highly with

failures. Despite this high correlation, we conclude that models

based on SMART parameters alone are unlikely to be useful

for predicting individual drive failures. Surprisingly, we found

that temperature and activity levels were much less correlated

with drive failures than previously reported.

1 Introduction

The tremendous advances in low-cost, high-capacity

magnetic disk drives have been among the key factors

helping establish a modern society that is deeply reliant

on information technology. High-volume, consumer-grade

disk drives have become such a successful product

that their deployments range from home computers

and appliances to large-scale server farms. In 2002, for

example, it was estimated that over 90% of all new information

produced was stored on magnetic media, most

of it on hard disk drives [12]. It is therefore critical

to improve our understanding of how robust these components

are and what main factors are associated with

failures. Such understanding can be particularly useful

for guiding the design of storage systems as well as devising

deployment and maintenance strategies.

Despite the importance of the subject, there are very

few published studies on failure characteristics of disk

drives. Most of the available information comes from

the disk manufacturers themselves [2]. Their data are

typically based on extrapolation from accelerated life

test data of small populations or from returned unit

databases. Accelerated life tests, although useful in providing

insight into how some environmental factors can

affect disk drive lifetime, have been known to be poor

predictors of actual failure rates as seen by customers

in the field [7]. Statistics from returned units are typically

based on much larger populations, but since there

is little or no visibility into the deployment characteristics,

the analysis lacks valuable insight into what actually

happened to the drive during operation. In addition,

since units are typically returned during the warranty period

(often three years or less), manufacturers’ databases

may not be as helpful for the study of long-term effects.

A few recent studies have shed some light on field

failure behavior of disk drives [6, 7, 9, 16, 17, 19, 20].

However, these studies either reported on relatively

modest populations or did not monitor the disks closely

enough during deployment to provide insights into the

factors that might be associated with failures.

Disk drives are generally very reliable but they are

also very complex components. This combination

means that although they fail rarely, when they do fail,

the possible causes of failure can be numerous. As a

result, detailed studies of very large populations are the

...

...
