Experience with a Persistent Object Manager for HEP data

Paper: 381
Session: C (talk)
Speaker: Ogg, Michael, University of Texas, Austin
Keywords: data bases, data management, object-oriented methods, file systems



Wei-Cheng Lai and Michael Ogg
University of Texas at Austin

Robert Grossman and Dave Northcutt
Magnify, Inc.

Most of today's running or planned HEP experiments have integrated data
samples ranging from 100 terabytes to tens of petabytes. To access and
process these enormous amounts of data, there are at least two requirements:
the data must be served and processed by multiple systems in parallel, and
the data server must filter the data in some way. There are many ways to
approach this problem, including object databases, object-relational
databases, attribute servers, and Persistent Object Managers (POMs). This
paper presents results and experience from using a particular POM,
PATTERN:Store.

We have populated a store with 50 GB of CLEO data. A key aspect of POM
performance is identifying the attributes whose "easy" access would result
in less data being transferred. Specifically, if one pictures event analysis
as a sequence of cuts, with each successive cut rejecting events, then the
data needed for the "outer" cuts (the most frequently accessed data) should
be much more accessible than the data needed for the "inner" cuts. To study
the performance, we have built two schemas: a "reasonable" object
representation, and one in which each attribute is put into a separate
store. In both cases, the actual object decomposition is transparent to the
user.
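
The cut-ordering argument above can be illustrated with a minimal sketch,
assuming an attribute-per-store layout modeled as one in-memory column per
attribute. The names EventColumns, CutResult, applyCuts, nTracks and
totalEnergy are hypothetical and are not taken from the CLEO schema or from
PATTERN:Store. The read counters show that the inner-cut attribute is only
touched for events that survive the outer cut.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical attribute-per-store layout: one contiguous column per
// attribute. Attribute names are illustrative, not the CLEO schema.
struct EventColumns {
    std::vector<int>    nTracks;      // read by the "outer" cut
    std::vector<double> totalEnergy;  // read by the "inner" cut
};

struct CutResult {
    std::vector<std::size_t> selected;  // indices of events passing both cuts
    std::size_t outerReads = 0;         // accesses to the outer-cut column
    std::size_t innerReads = 0;         // accesses to the inner-cut column
};

// Apply two successive cuts. The inner-cut column is read only for events
// that pass the outer cut, so it can live in slower storage.
inline CutResult applyCuts(const EventColumns& ev, int minTracks,
                           double minEnergy) {
    assert(ev.nTracks.size() == ev.totalEnergy.size());
    CutResult result;
    for (std::size_t i = 0; i < ev.nTracks.size(); ++i) {
        ++result.outerReads;
        if (ev.nTracks[i] < minTracks) continue;  // rejected by outer cut
        ++result.innerReads;
        if (ev.totalEnergy[i] >= minEnergy) result.selected.push_back(i);
    }
    return result;
}
```

In a real store the two columns would be separate physical collections, and
the ratio of innerReads to outerReads gives a rough idea of how much less
often the inner-cut data has to be fetched.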

Since the interface to a POM (or at least PATTERN:Store) conforms to the
ODMG-93 standard, which in general is not the interface we wish to expose to
the physicist, we have also developed an intermediate layer. This layer
encapsulates the details of the POM, but gives the user a familiar STL
interface. Although additional layers can harm performance, we find that
the performance degradation due to the STL layer is minimal.
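
As a rough illustration of what such an intermediate layer might look like,
the sketch below wraps a stand-in store behind an STL-style container with
begin() and end(), so that standard algorithms can be applied directly. This
is not the layer described in the paper: MockStore, EventCollection and the
Event fields are hypothetical, and MockStore is an in-memory substitute that
does not reflect the PATTERN:Store or ODMG-93 interfaces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

struct Event { int run; int nTracks; };  // illustrative fields only

// In-memory stand-in for the persistent store (hypothetical interface).
class MockStore {
public:
    explicit MockStore(std::vector<Event> events)
        : events_(std::move(events)) {}
    std::size_t count() const { return events_.size(); }
    const Event& fetch(std::size_t i) const { return events_.at(i); }
private:
    std::vector<Event> events_;
};

// STL-style read-only view: hides the store behind begin()/end().
// The store must outlive the view.
class EventCollection {
public:
    class const_iterator {
    public:
        using iterator_category = std::forward_iterator_tag;
        using value_type        = Event;
        using difference_type   = std::ptrdiff_t;
        using pointer           = const Event*;
        using reference         = const Event&;

        const_iterator() = default;
        const_iterator(const MockStore* s, std::size_t i)
            : store_(s), index_(i) {}
        reference operator*() const { return store_->fetch(index_); }
        pointer operator->() const { return &store_->fetch(index_); }
        const_iterator& operator++() { ++index_; return *this; }
        const_iterator operator++(int) {
            const_iterator old = *this;
            ++index_;
            return old;
        }
        bool operator==(const const_iterator& o) const {
            return store_ == o.store_ && index_ == o.index_;
        }
        bool operator!=(const const_iterator& o) const {
            return !(*this == o);
        }
    private:
        const MockStore* store_ = nullptr;
        std::size_t index_ = 0;
    };

    explicit EventCollection(const MockStore& s) : store_(&s) {}
    const_iterator begin() const { return const_iterator(store_, 0); }
    const_iterator end() const {
        return const_iterator(store_, store_->count());
    }
    std::size_t size() const { return store_->count(); }
private:
    const MockStore* store_;
};
```

With a view like this, a selection can be written as an ordinary call such
as std::count_if(events.begin(), events.end(), predicate), and the user code
never refers to the store directly.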

PATTERN:Store is designed to provide low-overhead, high-performance access
to large amounts of data by optimizing for data that is frequently read,
occasionally appended, but rarely updated. PATTERN:Store caches and
migrates physical collections of objects between disk and memory, between
disk and disk, and between disk and tertiary storage.

In this paper, we report on our experience managing and querying large data
sets with a POM and discuss issues affecting the scale-up of this approach.