APPLICATIONS
Supercomputing Success Story
The challenge
The workhorse of modern-day High Performance Computing (HPC) is the Linux cluster. Almost every area of science and engineering depends on the power and performance offered by Linux clusters. Today's top computational problems require performance in the TFLOP/s (10^12 floating point operations per second) range, and the delivery of PFLOP/s (10^15) cluster systems will soon be a reality as well.

One of the leading-edge practitioners of Linux HPC computing is Lawrence Livermore National Laboratory (LLNL) in Livermore, California. One of LLNL's primary missions for the National Nuclear Security Administration (NNSA) is ensuring the safety, security, and reliability of the nation's nuclear deterrent through a program called stockpile stewardship. A cornerstone of this effort is NNSA's Advanced Simulation and Computing (ASC) Program, which provides the integrating simulation and modeling capabilities and technologies needed to combine new and old experimental data, past nuclear test data, and past design and engineering experience into a powerful tool for assessing and certifying nuclear weapons and their components. In addition, LLNL computing supports research in many other areas, including molecular dynamics, turbulence, petascale atomistic simulations with quantum accuracy, simulation of protein membranes, nanotechnology, ultrahigh-resolution global climate models, fundamental materials research, and laser-plasma interactions for the National Ignition Facility (NIF).

Currently LLNL provides 495 TFLOP/s of computing power, spread over eighteen x86 Linux clusters, to its user base. Of this total, 480 TFLOP/s come from the eleven systems supplied by Appro International. These clusters are available for both classified and unclassified work, depending on the project. The clusters fall into two types: capability clusters, designed to handle unusually large computing jobs, and capacity clusters, designed to handle a large number of different computing jobs at the same time.
The solution

The centerpiece of the Tri-Lab procurement was the Scalable Unit (SU): a standard building block of 144 quad-socket nodes (16 cores per node) that can be combined into clusters of various sizes (a short arithmetic check of these figures follows the list):

- 1 SU = 144 nodes / 2,304 processor cores
- 2 SUs = 288 nodes / 4,608 processor cores
- 4 SUs = 576 nodes / 9,216 processor cores
- 6 SUs = 864 nodes / 13,824 processor cores
- 8 SUs = 1,152 nodes / 18,432 processor cores
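As a quick illustration (not part of the original case study), the node and core counts above follow directly from two figures given in the text: 144 nodes per SU and 16 cores per node. A minimal Python sketch that reproduces the table:

```python
# Back-of-the-envelope check of the Scalable Unit (SU) sizes listed above.
# Assumptions taken from the case study: 144 nodes per SU and 16 cores per
# node (quad-socket boards with Quad-Core AMD Opteron processors).

NODES_PER_SU = 144
CORES_PER_NODE = 4 * 4  # four sockets x four cores each

def su_configuration(num_sus: int) -> tuple[int, int]:
    """Return (nodes, cores) for a cluster built from num_sus Scalable Units."""
    nodes = num_sus * NODES_PER_SU
    return nodes, nodes * CORES_PER_NODE

for n in (1, 2, 4, 6, 8):
    nodes, cores = su_configuration(n)
    print(f"{n} SU{'s' if n > 1 else ''} = {nodes:,} nodes / {cores:,} processor cores")
```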
Part of the solution was a careful specification that placed clear boundaries between the various aspects of the cluster solution (e.g., storage area networking and parallel file systems) and required only the delivery of a large quantity of 144-node SUs and second-level switches from the cluster provider. The Tri-Lab procurement consisted of 21 Scalable Units, for an aggregate performance of 438 TFLOP/s, divided into eight clusters plus two optional clusters:

- Lawrence Livermore: Juno (8 SUs), Hype (1 SU), and Eos (2 SUs)
- Los Alamos: Lobo (2 SUs) and Hurricane (2 SUs)
- Sandia: Unity (2 SUs), Whitney (2 SUs), and Glory (2 SUs)

As part of the Tri-Lab procurement, LLNL also had the option to purchase two additional clusters, Hera (6 SUs) and Nyx (2 SUs).

In the fall of 2007, Appro International was selected under the TLCC procurement as the solution provider to deliver the SUs to the three labs. Appro was chosen based on how well it addressed the requirements specified in the Request for Proposal, its proven HPC track record, system cost, and project management skills. In addition, Appro was noted for working well with component suppliers and solving problems as they arose.

In terms of hardware, the Appro 1143H, a 1U quad-socket server based on Quad-Core AMD Opteron processors, was specified instead of blade packaging. While blade-based servers add some convenience and redundancy, the most economical choice was still rack-mount 1U servers. In terms of processors, the Tri-Lab SU employed Quad-Core AMD Opteron ("Barcelona") processors connected by 4x DDR InfiniBand. Each SU was designed for high compute density and therefore used quad-socket motherboards with four Quad-Core AMD Opteron Socket F processors running at 2.2 GHz (Model 8354), for a total of 16 cores per node. Each node was equipped with sixteen 2 GB DDR2-667 DIMMs for 32 GB of memory and a 4x DDR InfiniBand ConnectX Host Channel Adapter (HCA). To achieve a balanced computational system, each node was required to meet a target bytes-per-FLOP ratio for both the memory interface and the interconnect. The Opteron nodes achieved roughly 20 GB/s of memory bandwidth per node, which fit nicely within the specification.
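For context (this arithmetic is not in the original case study), the balance point can be estimated from the figures quoted above, together with one assumption not stated in the article: a peak of 4 double-precision FLOPs per core per clock for the Barcelona core.

```python
# Rough balance estimate for a Tri-Lab SU node, using figures from the text
# (16 cores at 2.2 GHz, ~20 GB/s memory bandwidth per node) plus one
# assumption not stated in the article: 4 double-precision FLOPs per core
# per clock (128-bit SSE add + multiply) for the Barcelona microarchitecture.

CORES_PER_NODE = 16
CLOCK_HZ = 2.2e9
FLOPS_PER_CORE_PER_CLOCK = 4      # assumption, see note above
MEM_BW_BYTES_PER_S = 20e9         # ~20 GB/s per node, as quoted
NODES_PER_SU = 144

peak_node_flops = CORES_PER_NODE * CLOCK_HZ * FLOPS_PER_CORE_PER_CLOCK
bytes_per_flop = MEM_BW_BYTES_PER_S / peak_node_flops
peak_su_flops = peak_node_flops * NODES_PER_SU

print(f"Peak per node : {peak_node_flops / 1e9:.1f} GFLOP/s")   # ~140.8 GFLOP/s
print(f"Bytes per FLOP: {bytes_per_flop:.2f}")                  # ~0.14 B/FLOP
print(f"Peak per SU   : {peak_su_flops / 1e12:.1f} TFLOP/s")    # ~20.3 TFLOP/s
```

Under these assumptions a single SU peaks at roughly 20 TFLOP/s, which is on the same order as the 438 TFLOP/s aggregate quoted for the full 21-SU procurement.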
The results

Measuring success in HPC at LLNL is not so much a matter of uptime and peak performance as of how much faster the science and engineering gets done. Immediately after installation of the first SUs, the NIF team requested time to refine the optics needed for the NIF lasers; a pressing purchasing decision had to be completed as soon as possible. The NIF team was given 1.5 months of time on one of the LLNL TLCC clusters to complete the calculations and make the right decisions. The NIF is slated to fire all 192 lasers in 2010. This historic event will be the culmination of many person-years of work, backed by many years of CPU time made possible by the Tri-Lab cluster acquisition.

According to Mark Seager, "The users are very excited about the TLCC Linux clusters. Appro and AMD have been great partners to work with. They have exceeded our expectations in terms of working with us to resolve problems as they came up and to get these systems fielded quickly." Seager adds, "It is amazing how much leverage we are getting out of the standard configuration at all three labs. Because we work on the same problems when we are trying to field these clusters, each lab brings its own unique approach to the problem and we tend to solve these problems more quickly."

From a performance standpoint, the new system is not only a success but also an improvement over the previous clusters at Lawrence Livermore, which had only dual-core processors. With twice the number of cores and twice the memory capability, says Seager, "we're seeing a performance boost of anywhere from 1.3 to 1.8 times compared to the previous system. The users are very excited about getting this kind of capability." One group of users from the LLNL National Ignition Facility (NIF) focuses on both high-energy-density physics research and new kinds of energy sources utilizing photon science. NIF requested three months of dedicated time on the clusters but was assigned half that because of the pressing demand for computing resources from other LLNL programs. "Still, they were able to do the research they needed on both the ignition research and the optics," says Seager.

Standing behind this success is Appro International. Its ability to deliver high-quality hardware and work closely with component vendors proved to be a success factor in this project. Appro maintained good lines of communication with all vendors and customers, and it adapted well to other aspects of the procurement. Some modifications to the standard SU design were required because not all sites had the same power and cooling capabilities; the solution was to reduce the density of systems in the SU while still maintaining the SU concept.

The summary

Overall, the Tri-Lab project has been a huge success, and its challenge has proved beneficial to everyone involved. Mark Seager summarizes his thoughts on the entire project: "Can we put in all the information we need to put in, and when we do, what kind of science comes out in the results? Ultimately, that's the most important attribute of the Tri-Lab challenge, and by that metric, the new clusters have been a big success." The Tri-Lab SU concept reduced costs across the board: reductions were noted in procurement costs and time frames as well as in ongoing maintenance costs. A key component of the success was the project management and problem solving that Appro International brought to the table. Based on the experience of the Tri-Lab project, the future success of grand-challenge computing depends on teamwork, planning, and amortizing costs across multiple government labs. As LLNL has shown, Linux HPC clusters are igniting our nation's future.

"To fulfill time-urgent national security missions, LLNL needs rapidly deployable high performance computing systems. This requires partners who understand HPC ecosystems and who possess the strong technical and management skills necessary to work with a large number of component vendors. Fielding 21 Scalable Units (3,744 compute nodes with an aggregate performance of 438 TFLOP/s) at three Tri-Lab sites (LLNL, LANL, SNL) was a daunting project challenge that, in this case, was very well executed."

Mark Seager, Head of Advanced Computing Systems, Lawrence Livermore National Laboratory