The CP-PACS Parallel Computer - Its Development and Application for Lattice QCD -

Paper: 437
Session: F (talk)
Speaker: Ukawa, Akira, University of Tsukuba
Keywords: application programming, parallelization, simulation, massive parallel systems, special architectures

Akira Ukawa for the CP-PACS Project
Institute of Physics, University of Tsukuba
Tsukuba, Ibaraki 305, Japan


The CP-PACS is an MIMD massively parallel computer with a peak speed of
614GFLOPS, 128GBytes of main memory, and 529GBytes of distributed disk
storage. The computer has been developed at the Center for Computational
Physics, University of Tsukuba, to advance research in computational
physics, one of the main targets being numerical simulations of lattice QCD.
The development of the CP-PACS computer started in 1992, and it began
operation in March 1996 with a 1024 processor configuration. An upgrade
to the final 2048 processor configuration was completed in September 1996.

The CP-PACS consists of 2048 processing units (PU's) and 128 I/O units (IOU's)
connected together in an 8x17x16 array through the three-dimensional
Hyper-Crossbar network.
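The 8x17x16 array accommodates exactly the 2048 PU's plus 128 IOU's (8x17x16 = 2176 = 2048 + 128). As a small illustration, a linear unit index can be decoded into grid coordinates as below; the actual CP-PACS index ordering is an assumption here, not taken from the source.

```c
#include <stdio.h>

/* Illustrative only: decode a linear unit index (0..2175) into (x,y,z)
 * coordinates on the 8x17x16 Hyper-Crossbar array. The row-major
 * x-fastest ordering is an assumption for the sake of the example. */
#define NX 8
#define NY 17
#define NZ 16

typedef struct { int x, y, z; } Coord;

Coord decode(int id) {
    Coord c;
    c.x = id % NX;               /* position along the x crossbars */
    c.y = (id / NX) % NY;        /* position along the y crossbars */
    c.z = id / (NX * NY);        /* position along the z crossbars */
    return c;
}
```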

The processor of the CP-PACS is a PA-RISC-based custom CPU with a peak speed
of 300MFLOPS for 64bit data at a clock frequency of 150MHz. The processor
incorporates an architectural enhancement called PVP-SW (pseudo vector
processor based on slide window registers), developed to sustain high
floating-point performance for data sets exceeding the cache capacity. Each
processor has 64MBytes of main memory made of 4Mbit DRAM.
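The per-processor figures are consistent with the machine totals quoted above; a back-of-envelope check (using only the numbers stated in this abstract):

```c
/* Sanity check of the configuration arithmetic: 2048 processors,
 * each with a 300MFLOPS peak and 64MBytes of memory. */
static const int n_pu = 2048;
static const double peak_mflops_per_pu = 300.0;
static const double mem_mb_per_pu = 64.0;

/* 2048 x 300MFLOPS = 614.4GFLOPS (quoted as 614GFLOPS) */
double total_peak_gflops(void) { return n_pu * peak_mflops_per_pu / 1000.0; }

/* 2048 x 64MBytes = 128GBytes */
double total_mem_gb(void) { return n_pu * mem_mb_per_pu / 1024.0; }
```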

The Hyper-Crossbar network of the CP-PACS consists of a number of crossbar
switches in the x-, y- and z-directions connected together by an Exchanger
at each crossing point of the network. Each Exchanger has a PU or IOU
attached. This flexible architecture allows data transfers of any pattern
through at most three steps of crossbar switches. A special protocol called
RDMA (remote DMA) has been developed
to achieve low startup latency and high effective throughput. The peak
throughput is 300MByte/sec for each crossbar switch.
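The three-step bound follows from dimension-ordered routing: a message traverses at most one crossbar in each of the x, y, and z directions. A minimal sketch of this counting argument (the routing order and coordinate representation are assumptions for illustration):

```c
/* Dimension-ordered routing on the Hyper-Crossbar: a message crosses
 * at most one x, one y, and one z crossbar switch, so any pair of
 * units is connected in at most three steps. */
typedef struct { int x, y, z; } Coord;

int route_steps(Coord src, Coord dst) {
    int steps = 0;
    if (src.x != dst.x) steps++;  /* one x-crossbar traversal */
    if (src.y != dst.y) steps++;  /* one y-crossbar traversal */
    if (src.z != dst.z) steps++;  /* one z-crossbar traversal */
    return steps;                 /* always in 0..3 */
}
```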

The IOU's of the CP-PACS form an 8x16 array attached at the ends of
the y crossbar switches. A distributed system of RAID-5 disks,
employed for fault tolerance, is connected to the IOU's through SCSI-II buses.

The CP-PACS runs under the UNIX OSF/1 operating system, enhanced for
parallel processing. The programming languages are Fortran90 and C.
The compiler produces code that exploits the PVP-SW features. Message
passing via RDMA is performed by calling library routines within the program.
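The appeal of remote DMA is that data is deposited directly into the destination unit's memory without intermediate staging, which keeps the startup latency low. A toy model of such a put operation is sketched below; the function and type names are hypothetical, since the actual CP-PACS library interface is not described in this abstract.

```c
#include <string.h>

/* Toy model of an RDMA-style put: the payload is written directly into
 * the destination unit's memory region, with no staging buffer. All
 * names here are illustrative, not the real CP-PACS library API. */
typedef struct { double mem[64]; } Unit;

void rdma_put(Unit *dst, int dst_off, const double *src, int n) {
    /* direct write into remote memory; no intermediate copy */
    memcpy(&dst->mem[dst_off], src, n * sizeof(double));
}
```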

The most CPU-time-consuming part of lattice QCD calculations is the inversion
of the lattice Dirac operator. For the Wilson form of the operator commonly
used in lattice simulations, an optimized code for the red/black-preconditioned
MR algorithm, whose core part is programmed in assembly language to fully
exploit the PVP-SW feature, achieves 191MFLOPS on a single processor (64% of
peak). The sustained speed for a 48x48x48x84 lattice using 1024 PU's is
148MFLOPS, including the communication overhead of 23%. Other lattice QCD
programs, so far coded in Fortran, run at 100-130MFLOPS per processor.

Since the early summer of 1996 the CP-PACS has been running large-scale
numerical simulations for calculating light hadron masses in quenched lattice
QCD. Development and testing of codes for full QCD have also been pursued
since September 1996.

In this talk we present further details on (i) the hardware and software
characteristics of the computer, (ii) benchmark results for lattice QCD codes,
and (iii) physics projects currently underway.