Nile's Distributed Computing Site Architecture

Paper: 360
Session: F (talk)
Speaker: Ogg, Michael, University of Texas, Austin
Keywords: CORBA, object-oriented methods, parallelization, large systems


Title: Nile's Distributed Computing Site Architecture

Authors: Michael Ogg, Fabio Previato, Aleta Ricciardi, Eric Rothfus

Affiliation: University of Texas at Austin

Collaboration: Nile/CLEO


At CHEP95, we introduced the Nile project, whose goal is to manage
distributed computing resources, making an arbitrarily large array of
commodity computers appear to the user as a seamless uniprocessor
environment. This talk will describe Nile's software architecture in detail,
focusing on its CORBA implementation and the issues we have faced in
building a fault-tolerant system with distributed objects. A working
prototype (called Nile Fast-Track) is described in other contributed papers.

In Nile's software architecture, (built specifically for CLEO, but not
limited to any particular environment), any collaborating institution is a
site. Each site controls some processing and data resources which could be
used locally, or globally (i.e. by any collaborator). A site provides an
interface to local users (and other sites) through which they submit,
control, and query jobs. Our architecture ensures that each site remains
available, and every job executes consistently to completion, despite the
failure of components.

Site functionality is implemented by a collection of communicating CORBA
objects. Objects critical to job control are replicated on a number of
different processors to achieve fault tolerance. Ensuring that all copies of
an object are in the same (or, more generally, a consistent) state at any
point in time is a non-trivial task. Nile is using Electra, an ORB built on
top of Isis, which is a special transport layer that ensures replica
consistency despite failures of some of the replicas. The following are the
objects that implement Nile's job processing functionality:

* The SiteManager is the main interface between Nile and all other
clients (users or other sites). It receives requests about jobs
(submission, control, status) and arranges for appropriate processing
by creating or contacting other objects.
* The JobDB (job database) stores the state of the system - jobs pending,
currently being processed, or completed.
* For each submitted job, the SiteManager creates a JobManager that is
responsible for managing and processing that job. The JobManager makes
an execution plan using information on available resources, then
creates a set of subjobs and arranges to execute them in parallel on
several machines. Finally, it collects and assembles the results, then
signals the SiteManager. If the execution plan involves resources at
another site, the JobManager will pass the relevant subjobs to the
appropriate SiteManagers.
* The ResourceDB and DataLocationManager store information regarding
available processors and data within the Nile system. This information
is used in determining execution plans.
* The SubJobProcessor creates a context and runs a given subjob,
controlling its execution. The SubJobProcessor object signals the
JobManager when a subjob has terminated.
* Each processor and data server is represented by a Monitor, which
transmits usage and availability information to the ResourceDB and the
DataLocationManager.

We will present performance results and experience, as well as experience
and lessons learned from building the system. A working demonstration will
be given in the poster session.

Fill in here the Title of the talk

Author name(s) come here
Institutional affiliation should be put here, together with address
This field is for the Collaboration name (if needed)

Abstract

Fill in here the abstract contents
(300-500 words summary which highlights
the scope and significance of the paper,
including a statement of the current status of the work).