
A Unified Peer-To-Peer Database Framework

Essay by   •  October 29, 2010  •  9,491 Words (38 Pages)  •  2,124 Views


A Unified Peer-to-Peer Database Framework

and its Application for Scalable Service Discovery

Wolfgang Hoschek

CERN IT Division

European Organization for Nuclear Research

1211 Geneva 23, Switzerland

Wolfgang.Hoschek@cern.ch

Abstract

In a large distributed system spanning many administrative domains such as a DataGrid, it is often desirable to maintain and query dynamic and timely information about active participants such as services, resources and user communities. However, in such a database system, the set of information tuples in the universe is partitioned over one or more distributed nodes, for reasons including autonomy, scalability, availability, performance and security. It is not obvious how to enable powerful discovery query support and collective collaborative functionality that operate on the distributed system as a whole, rather than on a given part of it. Further, it is not obvious how to allow for search results that are fresh, allowing dynamic content. It appears that a Peer-to-Peer (P2P) database network may be well suited to support dynamic distributed database search, for example for service discovery. In this paper, we take the first steps towards unifying the fields of database management systems and P2P computing, which so far have received considerable, but separate, attention. We extend database concepts and practice to cover P2P search. Similarly, we extend P2P concepts and practice to support powerful general-purpose query languages such as XQuery and SQL. As a result, we devise the Unified Peer-to-Peer Database Framework (UPDF), which is unified in the sense that it allows specific applications to be expressed for a wide range of data types, node topologies, query languages, query response modes, neighbor selection policies, pipelining characteristics, timeout and other scope options.

1 Introduction

The next generation Large Hadron Collider (LHC) project at CERN, the European Organization for Nuclear Research, involves thousands of researchers and hundreds of institutions spread around the globe. A massive set of computing resources is necessary to support its data-intensive physics analysis applications, including thousands of network services, tens of thousands of CPUs, WAN Gigabit networking as well as Petabytes of disk and tape storage [1]. To make collaboration viable, it was decided to share in a global joint effort - the European DataGrid (EDG) [2, 3, 4, 5] - the data and locally available resources of all participating laboratories and university departments.

Grid technology attempts to support flexible, secure, coordinated information sharing among dynamic collections of individuals, institutions and resources. This includes data sharing but also includes access to computers, software and devices required by computation and data-rich collaborative problem solving [6]. These and other advances of distributed computing are necessary to increasingly make it possible to join loosely coupled people and resources from multiple organizations.

An enabling step towards increased Grid software execution flexibility is the (still immature and hence often hyped) web services vision [2, 7, 8] of distributed computing where programs are no longer configured with static information. Rather, the promise is that programs are made more flexible and powerful by querying Internet databases (registries) at runtime in order to discover information and network-attached third-party building blocks. Services can advertise themselves and related metadata via such databases, enabling the assembly of distributed higher-level components. For example, a data-intensive High Energy Physics analysis application sweeping over Terabytes of data looks for remote services that exhibit a suitable combination of characteristics, including network load, available disk quota, access rights, and perhaps Quality of Service and monetary cost.
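
The kind of discovery query described in this example can be illustrated with a minimal Python sketch. It is not taken from the paper: the ServiceTuple fields, the discover function, the example URLs and all threshold values are hypothetical, and the registry is assumed to be a plain in-memory list rather than a remote Internet database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTuple:
    """One advertised service description (all field names are hypothetical)."""
    link: str                  # where the service can be reached
    network_load: float        # fraction of capacity in use, 0.0 to 1.0
    free_disk_gb: int          # disk quota still available to the caller
    allowed_groups: frozenset  # user communities holding access rights
    cost_per_gb: float         # monetary cost in arbitrary units


def discover(registry, group, min_disk_gb, max_load, max_cost):
    """Return the services that satisfy every constraint, least loaded first."""
    matches = [
        s for s in registry
        if group in s.allowed_groups
        and s.free_disk_gb >= min_disk_gb
        and s.network_load <= max_load
        and s.cost_per_gb <= max_cost
    ]
    return sorted(matches, key=lambda s: s.network_load)


# Assumed registry content: three storage services advertising their metadata.
registry = [
    ServiceTuple("http://a.example.org/storage", 0.80, 500, frozenset({"cms"}), 0.02),
    ServiceTuple("http://b.example.org/storage", 0.20, 2000, frozenset({"cms", "atlas"}), 0.05),
    ServiceTuple("http://c.example.org/storage", 0.10, 50, frozenset({"atlas"}), 0.01),
]

found = discover(registry, group="cms", min_disk_gb=1000, max_load=0.5, max_cost=0.10)
print([s.link for s in found])
```

The sketch evaluates the query against a single registry. The difficulty raised in the text is that, in a large distributed system, the tuples are spread over many nodes, so no single list of this kind exists.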

More generally, in a distributed system, it is often desirable to maintain and query dynamic and timely information about active participants such as services, resources and user communities. As in a data integration system [9, 10, 11], the goal is to exploit several independent information sources as if they were a single source. However, in a large distributed database system spanning many administrative domains, the set of information tuples in the universe is partitioned over one or more distributed nodes, for reasons including autonomy, scalability, availability, performance and security. It is not obvious how to enable powerful discovery query support and collective collaborative functionality that operate on the distributed system as a whole, rather than on a given part of it. Further, it is not obvious how

...

...
