Nile Fast-Track: Fault-Tolerant Distributed Processing for CLEO

Paper: 364
Session: F (talk)
Speaker: Ogg, Michael, University of Texas, Austin
Keywords: C++, object-oriented methods, parallelization, large systems


Title: Nile Fast-Track: Fault-Tolerant Distributed Processing for CLEO

Authors: Michael Athanas and Daniel Riley

Affiliation: University of Florida at Gainesville; Cornell University

Collaboration: Nile/CLEO



Nile is a multi-disciplinary project building a distributed computing
environment for high-energy physics (HEP). It will manage distributed
computing resources, making a large array of commodity computers appear to
the user as a seamless uniprocessor environment. Nile fast-track is an early
prototype that embodies many key design principles of the full Nile project,
providing a fault-tolerant analysis system for CLEO data that is compatible
with the pre-existing CLEO data format and analysis codes. The fast-track
implementation is limited to a local area network, in contrast to the full
Nile project, which targets distributed computing over a wide area network.

The fast-track system consists of several cooperating services, implemented
as distributed objects:

* The SiteManager is the main interface between Nile fast-track and its
clients. It is responsible for marshalling the computational and data
resources available at a site to most effectively execute HEP analysis
jobs.
* The DataLocationManager tracks the location and availability of all the
data in the system, including system data files, tapes, and user skims.
The DataLocationManager resolves data queries sent to it by the
SiteManager or users.
* The Provider is the representative of an individual processor and is
responsible for executing subjobs assigned to that processor.
* The LocalResourceReporter collects and monitors resource statistics for
an individual processor. The statistics are made available to the
SiteManager for scheduling and to the system operators for monitoring.
* The LocalDataLocationManager manages the data on a particular data
server, reporting to the DataLocationManager.
* The DataServer provides HEP analysis jobs access to the data.
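
The division of labor among these services can be illustrated with a small
C++ sketch. It is not the Nile fast-track code or its interfaces: the class
names follow the list above, but every member function, the ResourceReport
fields, the host and dataset names, and the "pick the least-loaded live
processor" policy are assumptions made only for illustration. The abstract
does not specify the actual scheduling policy or how the distributed objects
communicate, so plain in-process objects stand in for them here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical snapshot of one processor, as a LocalResourceReporter
// might publish it to the SiteManager.
struct ResourceReport {
    std::string host;
    double load;   // assumed metric: lower means more idle
    bool alive;    // false once the processor is considered failed
};

// Hypothetical stand-in for the DataLocationManager: maps a dataset
// name to the data servers known to hold it.
class DataLocationManager {
public:
    void registerData(const std::string& dataset, const std::string& server) {
        locations_[dataset].push_back(server);
    }
    // Resolves a data query; an empty result means "unknown dataset".
    std::vector<std::string> resolve(const std::string& dataset) const {
        std::map<std::string, std::vector<std::string> >::const_iterator it =
            locations_.find(dataset);
        if (it == locations_.end()) return std::vector<std::string>();
        return it->second;
    }
private:
    std::map<std::string, std::vector<std::string> > locations_;
};

// Hypothetical stand-in for the SiteManager: combines data location
// answers with resource reports to place a subjob on a processor.
class SiteManager {
public:
    explicit SiteManager(const DataLocationManager& dlm) : dlm_(dlm) {}

    // Stores the latest report for a host, replacing any older one.
    void updateReport(const ResourceReport& r) { reports_[r.host] = r; }

    // Returns the host chosen for a subjob on `dataset`, or an empty
    // string if the data is unknown or no live processor is available.
    // Assumed policy: least-loaded live processor wins.
    std::string assign(const std::string& dataset) const {
        if (dlm_.resolve(dataset).empty()) return std::string();
        const ResourceReport* best = 0;
        std::map<std::string, ResourceReport>::const_iterator it;
        for (it = reports_.begin(); it != reports_.end(); ++it) {
            const ResourceReport& r = it->second;
            if (!r.alive) continue;  // skip failed processors
            if (best == 0 || r.load < best->load) best = &r;
        }
        return best ? best->host : std::string();
    }
private:
    const DataLocationManager& dlm_;
    std::map<std::string, ResourceReport> reports_;
};

// Demonstration with two processors; node-a is idle but may have failed.
std::string demoAssign(bool nodeAFailed) {
    DataLocationManager dlm;
    dlm.registerData("skim-1", "dataserver-1");
    SiteManager site(dlm);
    ResourceReport a = {"node-a", 0.2, !nodeAFailed};
    ResourceReport b = {"node-b", 0.7, true};
    site.updateReport(a);
    site.updateReport(b);
    return site.assign("skim-1");
}
```

In this sketch, demoAssign(false) places the subjob on the idle node-a, while
demoAssign(true) falls back to node-b once node-a is marked as failed. This is
the simplest form of the fault tolerance the services are meant to provide; a
real system would also need to detect failures and reschedule running subjobs.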

The fast-track system was released to the CLEO collaboration in the fall of
1996. We report on this working architecture and discuss several aspects of
the system, including data-flow optimization for HEP analysis jobs,
scheduling of analysis jobs to maximize resource utilization, and the
performance of the fast-track system. We reflect on what we have learned and
how it has affected the full Nile architecture.