BERKELEY, Calif. - High-performance computing experts from the U.S. Department of Energy’s (DOE) National Energy Research Scientific Computing (NERSC) Center and Cray Inc. have completed the initial operating system and message-passing functional scalability testing for the Red Storm supercomputer being developed by Cray and DOE’s Sandia National Laboratories. The successful effort paves the way for the next stage of testing: evaluating the I/O (input/output) performance of two potential Red Storm file systems. Red Storm is a massively parallel processing supercomputer developed for the National Nuclear Security Administration’s Advanced Simulation and Computing (ASCI) program. On Monday, Oct. 27, Cray announced that it also plans to sell systems based on Red Storm to other customers.
For the testing, the NERSC Center at DOE’s Lawrence Berkeley National Laboratory provided Cray researchers with access to a 174-processor Linux cluster with 87 dual-processor nodes. The cluster is named “Alvarez” in honor of Berkeley Lab scientist and Nobel Laureate Luis Alvarez.
“By running several Alvarez login processors as Red Storm login processors, and the Alvarez compute processors as Red Storm compute processors using our Linux-based IA32 versions of the compute-processor software, we could run layered on top of the existing Linux installed on the machine. Not having to reboot the nodes was convenient for both Cray and NERSC,” reported Cray Operating Systems Manager Gail Alverson at the completion of the testing last month. “The Red Storm Linux-based software allows simulation of multiple virtual processors per physical processor, and using this we ran simulations of up to 1,000 processors on the Alvarez machine.”
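As a rough analogy to this virtual-processor arrangement, a standard MPI installation can oversubscribe physical CPUs by launching more ranks than there are processors. The sketch below is illustrative only; it uses a generic mpirun launcher and a hypothetical program name, not Cray’s actual Red Storm tooling (such as the Yod launcher mentioned later in this article):

    /* vnode_check.c - trivial sanity check meant to be run with more
     * MPI ranks than physical processors, loosely analogous to Red
     * Storm's virtual-processor simulation. Illustration only. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, ok = 1, total = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Every "virtual processor" checks in; rank 0 verifies the count. */
        MPI_Reduce(&ok, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("%d of %d virtual processors responded\n", total, size);

        MPI_Finalize();
        return 0;
    }

Launched with, say, mpirun -np 1000 ./vnode_check on a 174-processor cluster like Alvarez, each physical CPU would host several ranks, much as the Red Storm simulation software hosted multiple virtual processors per physical processor.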
Under a $90 million multiyear contract announced last year, Cray will collaborate with Sandia National Laboratories to develop and deliver Red Storm, which is expected to become operational in 2004. Cray will deliver a system with theoretical peak performance of 40 trillion calculations per second (teraflop/s).
Red Storm will be located at Sandia, a multiprogram laboratory managed by Lockheed Martin Corp. for the DOE's National Nuclear Security Administration. This new system is expected to be at least seven times more powerful than Sandia's current ASCI Red supercomputer when executing actual defense problems. ASCI Red was the first supercomputer delivered under the ASCI program.
One of the motivations for obtaining the Alvarez cluster was to assess various technologies for future high-performance computing systems. “The NERSC Center has long been a leader in testing and deploying leading-edge systems, and this collaborative effort with Cray is an extension of our efforts to provide the DOE scientific community with systems to advance scientific research,” said Bill Kramer, general manager of the NERSC Center.
Significant advances
Using Alvarez about one day a week for two months, the Cray team reported making advances in two major areas: system software and administration, and system scalability.
In the area of system configuration and administration, the Alvarez runs were some of the first made with all of the Red Storm software in place: Yod, PCT, PBS, MySQL, RCA, and CPA. “Consequently, it was on this platform that we developed a set of scripts for system configuration and easier job launch, as well as worked through a set of installation issues,” Alverson noted. “The development done on Alvarez was transferred back to Cray’s internal systems and continues to be used by the system test and integration group.”
In the area of system scalability, advances were made on a number of fronts:
• Launch time experiments provided initial data on launch times for programs with small and large executable sizes. While absolute times will differ on the real Red Storm hardware, the shape of the curve was informative and exposed potential areas for future tuning.
• Experiments with simple MPI programs and HPL (High Performance Linpack) showed software functionality scaling to 1,000 processors (a minimal sketch of such a test appears after this list). “This was a significant result for us,” Alverson reported.
• Experiments with real applications, such as CTH and ITS, allowed the team to exercise the codes and the system with smaller numbers of virtual processors.
• The team found and fixed approximately 20 system bugs, from the Portals layer up through MPI and system launch.
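For readers curious what a “simple MPI program” of this kind looks like, the following is a minimal, hypothetical sketch combining a launch-style timing measurement with a functional correctness check. It uses only standard MPI calls and is not Cray’s actual test harness:

    /* launch_scale.c - times how long it takes every rank to reach the
     * first global barrier (a rough stand-in for job-startup latency)
     * and runs a simple functional check across all ranks.
     * Illustrative sketch, not the actual Red Storm test code. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Time from the end of MPI_Init until all ranks synchronize. */
        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        /* Functional check: the sum of ranks has a known closed form. */
        long mine = rank, sum = 0;
        MPI_Reduce(&mine, &sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            long expect = (long)size * (size - 1) / 2;
            printf("ranks=%d startup-barrier=%.3fs reduce=%s\n",
                   size, t1 - t0, sum == expect ? "OK" : "FAILED");
        }
        MPI_Finalize();
        return 0;
    }

Run at increasing rank counts, a test like this produces the kind of launch-time curve and pass/fail scaling data the team describes, without depending on any application code.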
Future directions
While Cray has reached the end of its use of the Alvarez system for initial OS and MPI functional scalability testing, the I/O team is now ready to take over. Cray’s I/O team is starting to work with NERSC to use the GUPFS (Global Unified Parallel File System) testbed platform, both on its own and joined with Alvarez, for scalability testing of two potential Red Storm file systems, PVFS and Lustre. The team will also collaborate with NERSC staff who are working on global file systems that will eventually be deployed within the NERSC Center.
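While the article does not describe the test plans themselves, scalability testing of parallel file systems typically starts with simple aggregate-bandwidth probes. The sketch below uses the standard MPI-IO interface, which both PVFS and Lustre support; the mount point and transfer size are hypothetical placeholders:

    /* io_probe.c - each rank writes its own contiguous block of a shared
     * file and rank 0 reports aggregate bandwidth: the sort of simple
     * scaling probe used to compare parallel file systems such as PVFS
     * and Lustre. The path and block size are assumed placeholders. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define BLOCK (4 * 1024 * 1024)   /* 4 MB per rank, an assumed size */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char *buf = malloc(BLOCK);
        memset(buf, rank & 0xff, BLOCK);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/pvfs/testfile",  /* hypothetical mount */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        /* Each rank writes at a disjoint offset, so ranks never contend
         * for the same byte range. */
        MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK,
                          MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Barrier(MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d ranks wrote %.0f MB in %.2fs (%.1f MB/s aggregate)\n",
                   size, (double)size * BLOCK / 1e6, t1 - t0,
                   (double)size * BLOCK / 1e6 / (t1 - t0));

        free(buf);
        MPI_Finalize();
        return 0;
    }

Repeating such a probe at growing rank counts, and on each candidate file system, yields the comparative scaling curves that this kind of evaluation is designed to produce.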