A Unified Peer-to-Peer Database Framework
and its Application for Scalable Service Discovery
Wolfgang Hoschek
CERN IT Division
European Organization for Nuclear Research
1211 Geneva 23, Switzerland
Wolfgang.Hoschek@cern.ch
Abstract
In a large distributed system spanning many administrative domains such as a DataGrid,
it is often desirable to maintain and query dynamic and timely information about
active participants such as services, resources and user communities. However, in such a
database system, the set of information tuples in the universe is partitioned over one or
more distributed nodes, for reasons including autonomy, scalability, availability, performance
and security. It is not obvious how to enable powerful discovery query support and
collective collaborative functionality that operate on the distributed system as a whole,
rather than on a given part of it. Further, it is not obvious how to obtain search
results that remain fresh when content is dynamic. It appears that a Peer-to-Peer (P2P)
database network may be well suited to support dynamic distributed database search, for
example for service discovery. In this paper, we take the first steps towards unifying the
fields of database management systems and P2P computing, which so far have received
considerable, but separate, attention. We extend database concepts and practice to cover
P2P search. Similarly, we extend P2P concepts and practice to support powerful general-purpose
query languages such as XQuery and SQL. As a result, we devise the Unified
Peer-to-Peer Database Framework (UPDF), which is unified in the sense that it allows
specific applications to be expressed for a wide range of data types, node topologies, query
languages, query response modes, neighbor selection policies, pipelining characteristics,
timeout and other scope options.
1 Introduction
The next generation Large Hadron Collider (LHC) project at CERN, the European Organization
for Nuclear Research, involves thousands of researchers and hundreds of institutions
spread around the globe. A massive set of computing resources is necessary to support its
data-intensive physics analysis applications, including thousands of network services, tens of
thousands of CPUs, WAN Gigabit networking as well as Petabytes of disk and tape storage
[1]. To make collaboration viable, it was decided to share in a global joint effort - the European
DataGrid (EDG) [2, 3, 4, 5] - the data and locally available resources of all participating
laboratories and university departments.
Grid technology attempts to support flexible, secure, coordinated information sharing
among dynamic collections of individuals, institutions and resources. This includes data
sharing but also includes access to computers, software and devices required by computation
and data-rich collaborative problem solving [6]. These and other advances in distributed
computing are needed to make it increasingly possible to join loosely coupled people and
resources from multiple organizations.
An enabling step towards increased Grid software execution flexibility is the (still immature
and hence often hyped) web services vision [2, 7, 8] of distributed computing where
programs are no longer configured with static information. Rather, the promise is that programs
are made more flexible and powerful by querying Internet databases (registries) at
runtime in order to discover information and network-attached third-party building blocks.
Services can advertise themselves and related metadata via such databases, enabling the assembly
of distributed higher-level components. For example, a data-intensive High Energy
Physics analysis application sweeping over Terabytes of data looks for remote services that
exhibit a suitable combination of characteristics, including network load, available disk quota,
access rights, and perhaps Quality of Service and monetary cost.
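The kind of discovery query described in this example can be illustrated with a minimal sketch. It assumes a single in-memory registry and a hypothetical tuple schema; the class ServiceTuple, the function discover, the field names, the group names and the example.org links are all illustrative assumptions and are not taken from the framework itself, which targets general-purpose query languages such as XQuery and SQL rather than hand-written filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTuple:
    """One service advertisement in a registry (hypothetical schema)."""
    link: str                  # where the service can be reached
    network_load: float        # fraction of capacity in use, 0.0 to 1.0
    free_disk_gb: float        # available disk quota in gigabytes
    allowed_groups: frozenset  # user communities holding access rights
    cost_per_hour: float       # monetary cost in an arbitrary unit


def discover(registry, group, min_disk_gb, max_load, max_cost):
    """Return the services matching all constraints, least loaded first."""
    matches = [
        s for s in registry
        if group in s.allowed_groups
        and s.free_disk_gb >= min_disk_gb
        and s.network_load <= max_load
        and s.cost_per_hour <= max_cost
    ]
    return sorted(matches, key=lambda s: s.network_load)


# Illustrative registry content; every value here is made up.
REGISTRY = [
    ServiceTuple("http://example.org/storage-a", 0.20, 500.0,
                 frozenset({"physics-analysis"}), 1.0),
    ServiceTuple("http://example.org/storage-b", 0.90, 2000.0,
                 frozenset({"physics-analysis", "calibration"}), 0.5),
    ServiceTuple("http://example.org/storage-c", 0.10, 50.0,
                 frozenset({"physics-analysis"}), 0.2),
    ServiceTuple("http://example.org/storage-d", 0.05, 800.0,
                 frozenset({"calibration"}), 0.3),
    ServiceTuple("http://example.org/storage-e", 0.35, 1200.0,
                 frozenset({"physics-analysis", "calibration"}), 2.0),
]

if __name__ == "__main__":
    for service in discover(REGISTRY, "physics-analysis",
                            min_disk_gb=100.0, max_load=0.5, max_cost=5.0):
        print(service.link, service.network_load)
```

The sketch deliberately evaluates the query against one local collection. When the tuples are partitioned over many nodes, the same predicate would have to be evaluated at each node and the partial results combined, which is the harder problem the remainder of the paper addresses.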
More generally, in a distributed system, it is often desirable to maintain and query dynamic
and timely information about active participants such as services, resources and user
communities. As in a data integration system [9, 10, 11], the goal is to exploit several independent
information sources as if they were a single source. However, in a large distributed
database system spanning many administrative domains, the set of information tuples in the
universe is partitioned over one or more distributed nodes, for reasons including autonomy,
scalability, availability, performance and security. It is not obvious how to enable powerful
discovery query support and collective collaborative functionality that operate on the distributed
system as a whole, rather than on a given part of it. Further, it is not obvious
how
...
...