World’s top supercomputing sites select Appro
By Chris O’Neal -- Appro today announced the historic award of supercomputing clusters for the Tri-Laboratory Linux Capacity Cluster 2007 (TLCC07) project, to be delivered to three National Nuclear Security Administration (NNSA) weapons laboratories: Lawrence Livermore National Laboratory (LLNL), Los Alamos National Laboratory and Sandia National Laboratories. The demand for capacity computing at the Labs and elsewhere, as represented by this contract, underscores how integral HPC has become to scientific endeavor, whether for national security, basic research, medicine or other new technology. One of the beauties of this story is that anybody can afford today’s supercomputing by using this innovative strategy. The approach makes supercomputing more accessible, scalable and affordable to companies of all sizes, budgets and needs, and it is likely to help people get started in supercomputing, nucleating new science, engineering and development. To learn more, Supercomputing Online interviewed Mark Seager, Advanced Computing Technology lead for LLNL, and John Lee, Vice President of the Advanced Technology Solutions Group at Appro International.

SC ONLINE: How did the NNSA's Advanced Simulation and Computing (ASC) Program’s Tri-Laboratory Linux Capacity Cluster 2007 (TLCC07) project come about?

SEAGER: TLCC07 actually has quite a long history, so I’ll give you the medium-length version without the expletives. The ASC Program started in 1995 and was primarily focused on shifting the stockpile stewardship program from an underground-nuclear-test-centric basis, with simulation as an interpolative tool, to a simulation-based program with above-ground and sub-critical nuclear tests. Over those ten years, we had to start from scratch on a new class of 3D, full-physics, full-system codes. We had to ramp up from about 50 gigaFLOP/s to 100 teraFLOP/s in ten years or less. We built buildings and built up the other simulation environment components: visualization, networking and archive. So the focus was really on the high end and on the capability systems.

It turned out that in the early to mid years, most of the platforms ended up being used more in capacity mode than in capability mode. We did use them to demonstrate the capability initially, but the day-to-day work was primarily done at smaller scale to meet the programmatic objectives. With the advent of Purple and Blue Gene, the program wanted to focus more of those resources on capability-type runs that use more than half of the compute resources and RAM for weeks to months at a time. If you do that, you then have a tremendous gap in the aggregate computing capacity at the Laboratories, and there’s a desperate need for capacity computing to backfill that in a very cost-effective way and provide the free energy necessary to use those platforms as capability platforms.

So in the late 2004 to early 2005 time frame, there was a huge demand for more capacity cycles. Along with Los Alamos and Sandia, we proposed to the NNSA program office headquarters that if we built Linux clusters in a more intelligent fashion and combined the aggregate purchasing power of the three Laboratories over multiple years, we would be able to dramatically reduce the cost of the procurement, build, deployment and integration of these machines. That was the genesis of the idea. We started out in 2005 with a procurement activity that included all three Laboratories.
We got to the point where we had an RFP ready to go and were about to hit the button to turn it on, go live and get bids from industry, when the FY06 budget for ASC was cut. That cut was taken primarily out of the capacity platform budget. So Livermore went ahead as a single Laboratory with that procurement, changed it from a Tri-Lab to a single-lab RFP, and called it Peloton. In 2006, Appro was selected as the vendor for that. Appro delivered 20 scalable units (SUs) in four separate Linux clusters at Livermore for multiple programmatic elements.

We ended up deploying 10 scalable units in 2 clusters for Multi-programmatic and Institutional Computing (M&IC). M&IC is an institutionally funded program to provide large-scale simulation capability to the entire Lab. These 2 M&IC clusters were deployed as one 2 SU cluster for capacity computing and another 8 SU cluster for capability computing. With each of these Peloton SUs having a peak of just over 5 teraFLOP/s, the 8 SU cluster was a 40 teraFLOP/s resource that doubled what was available to M&IC in a single cluster in capability mode. M&IC held a competition at the Laboratory in which projects from around the Lab submitted proposals for Grand Challenge computing, and so far we have delivered millions of hours of computer time on that machine to these Grand Challenge projects. The 8 SU M&IC capability cluster is called Atlas. On the classified side, we fielded 2 clusters: one of 4 scalable units at about 20 teraFLOP/s, and another, just completed, of 6 scalable units at 30 teraFLOP/s.

One of the amazing things we noticed in this activity was that as we deployed these clusters one after the other, the combination of great project management by John Lee and the team at Synnex, Voltaire, Supermicro and other suppliers let us dramatically reduce the time it took to go from starting the cluster build to having hardware in production here at Livermore. In particular, we went from having 4 SUs on the floor on a Thursday to bringing in 2 more SUs for that final cluster, and by Saturday we had them all wired up, burned in and running Linpack. By Monday, we were running a synthetic workload acceptance test. By Friday, we had the machine accepted and were starting to migrate it to the classified network for active service. That was record speed for integration and deployment to classified operation at Livermore.

The Tri-Labs went back to the ASC program office and said we really need more capacity computing. The budget in FY07 held, so we went out again with another procurement, called Tri-Laboratory Linux Capacity Cluster FY07 (TLCC07), worked on it all year and got it out in the early part of the summer. We got the bids back from industry and, in a very short period of time, evaluated them, selected Appro, negotiated the contract, worked it through the DOE contract approval process and had it signed 2 weeks ahead of schedule. That procurement has, in aggregate, 21 scalable units, this time with quad-socket, quad-core AMD Opteron nodes. The peak of those scalable units is about 20 teraFLOP/s each, so 21 SUs is about 437 teraFLOP/s, and we will be fielding that as 8 different clusters at 4 separate sites: Lawrence Livermore National Laboratory, Los Alamos National Laboratory in 2 separate buildings, and Sandia in both New Mexico and California. There are options in the contract to deploy an additional 10 scalable units, which would bring the total up to 31 scalable units, in excess of 630 teraFLOP/s.
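As a quick back-of-the-envelope check of those figures (a sketch that uses only the numbers quoted above, not official contract specifications), the aggregate peak follows directly from the per-SU peak:

```python
# Back-of-the-envelope aggregate peak for TLCC07, using only figures
# quoted in the interview (the per-SU peak is approximate).
PEAK_PER_SU_TFLOPS = 437 / 21   # ~20.8 teraFLOP/s per scalable unit

def aggregate_peak(num_sus: int) -> float:
    """Total peak in teraFLOP/s for a given number of scalable units."""
    return num_sus * PEAK_PER_SU_TFLOPS

print(f"Base contract, 21 SUs: {aggregate_peak(21):.0f} teraFLOP/s")  # ~437
print(f"With options, 31 SUs:  {aggregate_peak(31):.0f} teraFLOP/s")  # ~645, in excess of 630
```

On those numbers, exercising all the options lands somewhat above the 630 teraFLOP/s figure, consistent with Seager's "in excess of" phrasing.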
We strongly believe that you need the right tool for the job. If your job is to screw in a screw, you need a screwdriver; if your job is to hammer in a nail, you need a hammer. If you try to hammer in a screw, you’re going to mess things up, and if you try to screw in a nail, it’s not going to work. So we try to deploy the appropriate tool for the appropriate job. Linux clusters are spectacularly successful at capacity computing and at reaching up into the capability regime. With Thunder, a number of years ago, we deployed a 23 teraFLOP/s, 1,024-node cluster based on quad-socket, single-core Itanium 2 nodes, and that was number 2 on the TOP500 list at the time. One could argue that Blue Gene is a Linux cluster, but it is really an integrated system from IBM; clearly that class of machine is a different design point than Linux clusters. Also, the Purple machine is an IBM pSeries cluster with a dual-plane Federation interconnect running AIX at about 93 teraFLOP/s. Those machines have made a huge contribution to the program at the high end as capability resources, but that’s a topic for another discussion.

Focusing on the way we do business today, the ASC program has been hugely successful at taking the 3D applications that were developed to do full-system, full-physics, high-resolution 3D calculations on Purple and scaling them down, either to 2D or to lower resolution, or modeling just the primary or just the secondary, and then running literally thousands of those simulations to do what we call ‘quantification of uncertainty’: modifying the design parameters to see if the device is still reliable and safe and all that kind of good stuff. So there is a vast amount of capacity computing required by the program. We have the algorithms and codes already; we just need more capacity to run them. So the whole emphasis for TLCC07, and Appro’s design point, is lowering the total cost of ownership (TCO). We believe that through this procurement we can reduce TCO by 30% to 50%. A 50% TCO reduction on 437 teraFLOP/s is a big deal.

SC ONLINE: Great, thanks Mark. That leads me to my next question. Lawrence Livermore National Laboratory, Los Alamos National Laboratory and Sandia National Laboratories are the top supercomputing sites in the world. Why did you choose Appro’s supercomputing clusters?

SEAGER: We had a very competitive procurement with 6 bids. Appro was selected by a Tri-Lab technical and business review team based on best value. The best-value components included the quality of the technical solution, the risk, the project plan, past performance and price. Overall, Appro clearly had the best price/performance. They also had a track record, having done Peloton, which was roughly the same number of SUs, the year before.

SC ONLINE: Please elaborate on what a "Scalable Unit" is. Why is this idea or concept so important for the Labs?

SEAGER: When we build large Linux clusters, there is infrastructure that has to scale up with the number of nodes you put in the cluster. As you add more nodes, you add more interconnect; you add more gateway nodes; you add more login nodes; and so forth. So what we tried to do was find a sweet spot: some number of nodes and some amount of infrastructure that we could replicate some number of times to build large Linux clusters out of. In addition, it allows us to build Linux clusters of various sizes.
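The 'quantification of uncertainty' runs Seager mentions are the main driver of this capacity demand: structurally, they form an embarrassingly parallel ensemble of many independent, scaled-down simulations with perturbed design parameters. The sketch below illustrates only that pattern; run_simulation, the parameter names and the perturbation ranges are hypothetical stand-ins, not the Labs' actual codes or inputs.

```python
import random
from concurrent.futures import ProcessPoolExecutor

def run_simulation(params: dict) -> dict:
    """Hypothetical placeholder for one scaled-down (2D / reduced-resolution) run."""
    # A real code would launch a physics simulation here; we just fake a metric.
    score = sum(params.values())
    return {"params": params, "metric": score}

def perturb(baseline: dict, spread: float = 0.05) -> dict:
    """Perturb each design parameter by up to +/- spread (e.g. 5%)."""
    return {k: v * (1.0 + random.uniform(-spread, spread)) for k, v in baseline.items()}

if __name__ == "__main__":
    baseline = {"param_a": 1.0, "param_b": 2.5, "param_c": 0.3}  # hypothetical design parameters
    ensemble = [perturb(baseline) for _ in range(1000)]          # "thousands" of capacity-scale runs

    # Each run is independent, so the ensemble maps naturally onto capacity clusters.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_simulation, ensemble))

    metrics = [r["metric"] for r in results]
    print(f"{len(metrics)} runs, metric range: {min(metrics):.3f} .. {max(metrics):.3f}")
```

In practice each member of such an ensemble would itself be a parallel job scheduled onto a capacity cluster, which is exactly the throughput the scalable-unit approach is meant to deliver cheaply.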
If you just build a large Linux cluster, that is one design; then you go build another large Linux cluster, and that is a different design. With that approach, each cluster is basically one of a kind. With the TLCC07 SU approach, we are moving to more of a production, assembly-line model: we get a highly replicated scalable unit and then use economies of scale to build a large number of them at much lower cost. From this large number of SUs we can then rapidly build and deploy many Linux clusters of various sizes. If you choose that scalable unit, with an InfiniBand interconnect, to be 144 nodes, it fits right into the natural scaling architecture of InfiniBand dual-stage fat-tree interconnects.

This represents a significantly different approach to Linux clusters. Up until now, people would run a procurement to buy a Linux cluster, wait a year or two, and then go out and buy another Linux cluster, which may or may not look like the previous one. Each cluster was one of a kind. Every large Linux cluster that Linux Networx, Appro, Verari, Sun, HP or IBM deployed really was a one-of-a-kind system, customized to the requirements of the customer. They were cheaper than the fully integrated systems because they were based on Linux and commodity hardware and had economies of scale at that level, but there really wasn’t a cookie-cutter approach to the cluster design and build itself. And the practice of buying one cluster at a time limited the ability to really take advantage of the economies of scale of buying a large number of scalable units. By aggregating all three Labs over a two-year procurement cycle, we decreased the number of procurements from six (one procurement per year per Lab) down to one, and we are able to buy 8 clusters of various sizes all at once with one mechanism.

You don’t get a TCO reduction by a factor of 2 without really thinking hard about the fundamentals and coming up with a good strategy. John and his team came up with a really dense solution. A 2 SU configuration gives us an amazing 40 teraFLOP/s in 9 racks. This still astonishes me, because if you compare it with Thunder, which 2 or 3 years ago was number 2 on the TOP500 list, this has a peak that is 77% higher than Thunder’s, it uses about 20%, or one-fifth, of Thunder’s power, and it uses about 6%, or roughly one-twentieth, of Thunder’s floor space. It’s just amazing.

SC ONLINE: Please describe the components of your total cost of ownership (TCO). What’s the key strategy behind such a significant reduction in total cost of ownership?

SEAGER: The components of TCO that we are addressing with TLCC07 are the cost of the hardware; building, testing and integrating the hardware and software; hardware and software maintenance; and applications support. The Scalable Unit concept is the key to the overall strategy to reduce TCO on the build side, on the deployment and integration side, and on the system and applications support side. It turns out that when you build Linux clusters you have to scale up the infrastructure as you scale up the size of the cluster. It also turns out, as mentioned earlier, that you don’t always need the biggest Linux cluster you can possibly build. For capacity computing, two half-size clusters would probably be a lot easier to deal with.
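Seager's point that a 144-node SU "fits right into the natural scaling architecture" of a dual-stage InfiniBand fat tree can be made concrete with a little switch arithmetic. The sketch below assumes 24-port switches with half of each leaf switch's ports going to nodes and half going to uplinks; the interview does not specify the switch radix, so treat these numbers as an illustration rather than the actual TLCC07 wiring.

```python
# Illustrative fat-tree arithmetic for a scalable-unit building block.
# Assumption (not from the interview): 24-port InfiniBand switches,
# with half of each leaf switch's ports down to nodes, half up to the
# second stage.
SWITCH_PORTS = 24
NODES_PER_LEAF = SWITCH_PORTS // 2   # 12 nodes per leaf switch
LEAVES_PER_SU = 12                   # chosen so the SU comes out to 144 nodes

nodes_per_su = NODES_PER_LEAF * LEAVES_PER_SU
print(f"Nodes per scalable unit: {nodes_per_su}")  # 144

# Clusters are then just multiples of the SU, e.g. the 2, 4 and 8 SU
# configurations mentioned in the interview.
peak_per_su_tflops = 437 / 21        # ~20.8 teraFLOP/s per TLCC07 SU
for sus in (2, 4, 8):
    print(f"{sus} SU cluster: {sus * nodes_per_su} nodes, "
          f"~{sus * peak_per_su_tflops:.0f} teraFLOP/s peak")
```

On those assumed numbers, the 8 SU configuration comes out to roughly 1,150 nodes and a bit over 160 teraFLOP/s, which lines up with the figures Lee gives later in the interview for the largest TLCC07 cluster.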
What we decided to do was find a scalable unit that was relatively small and could be aggregated up into much larger systems, allowing us to field clusters of multiple sizes from the same building blocks. By buying a large number of the same building blocks, John and the Appro team could get economies of scale and come down the cost curve and the learning curve.

One of the other TCO components is the software. All three Labs had been developing and supporting their own Linux cluster software stacks. Once the Tri-Lab procurement was under way and it looked like we were coming to a successful conclusion, headquarters said: now that we know what the hardware is going to be, and it is going to be the same everywhere, let’s have a common software environment everywhere. They held a competition among the three Labs, and a proposal from Livermore and Sandia was selected to provide what we’re calling TOSS, the Tri-Lab Operating System Stack. It will be delivered on these TLCC07 scalable units to all three Labs and all four sites. Most of the TOSS components come from Red Hat Enterprise Linux 5 Update 1, which leverages our very large DOE site license as well as an on-site Red Hat software engineer we have at Livermore, at no additional cost. So that is a very inexpensive way to get the same software environment to all the sites. Through the combination of common hardware and common software, we think we will be able to reduce total cost of ownership by 30% to 50%. That’s not just price; it’s the total cost of the floor space, the power, the time it takes to integrate these systems and put them into production, and the manpower required to support them, both on the system software side and on the application side. When we have a single target hardware and software environment at all three Labs, we can move the applications around and people can share results very easily.

SC ONLINE: How difficult was it to have the three Labs agree, for the first time and under a single contract, to purchase the same high performance computing systems for deployment at all the sites?

SEAGER: It was quite difficult. We had to see our way through the technical issues of defining a scalable unit and what the base technology elements were, as well as the site-specific security and production requirements. It involved quite a bit of discussion. We spent literally 6 months on the phone together, once a week for an hour or two, and exchanged emails with drafts of various documents and so forth. We were also charting new procurement ground; from a contracting perspective it was exceedingly complex for one Lab to contract for deliveries at all three Laboratories. In the end, we were able to find common ground and make progress because we were all responding to the huge pent-up demand for capacity cycles at all three Labs.

SC ONLINE: Thanks, Mark. I'm going to turn to John Lee now. Does Appro plan to bring this Scalable Unit concept to other markets? If yes, when?

LEE: Definitely. As Mark was saying, the scalable unit concept is a really innovative way to build and deploy Linux clusters. I like to use the analogy of Lego building blocks. I don’t know if you’ve been to Legoland in Carlsbad, California, but it’s amazing what you can put together with little Lego building blocks. This scalable unit concept is very similar: once you have a uniform scalable unit architecture, it’s a matter of interconnecting multiple units to scale the computing power of a cluster.
So we’re jumping all over this idea. We think it’s a genius idea, so we’re probably going to have to give Mark some royalties on it. We’re going to take it to the commercial market because we think it has a lot of play not just in the government space but also in the commercial space. Another thing to take into consideration is that the scalable unit architecture Mark has defined is just one of many potential scalable unit building blocks you can design. So one of the ideas we are playing with, in addition to the scalable unit architecture we are deploying at Lawrence Livermore and now at all three Labs, is to define a smaller scalable unit at the rack cabinet level. If a customer wants to deploy four scalable units, we deliver four cabinets; if they want to deploy 10 scalable units, we deliver 10 cabinets and interlink them to scale the computing power to what the customer wants, and we obviously populate all the software packages needed to make it run as a single entity, or at a capacity level as Mark was describing. That makes it easier for enterprise customers, because they typically like to say, I want 10 of gizmo A or 15 of gizmo B. By doing this, we are taking some of the complexity out of building Linux super clusters. We will be officially unveiling this new product at the SuperComputing 2007 show.

SC ONLINE: What does this win represent for Appro?

LEE: That’s a very good question. I think it’s a validation of everything Appro has been working towards for the past six years. We built our reputation in the marketplace primarily by supporting enterprise customers who had some reservations about putting Linux clusters into their production environments, and we’ve been very successful at that. Now we are also able to put together world-class clusters. The Atlas cluster that we put together for Livermore under the Peloton contract, for instance, is number 19 on the latest TOP500 list. Compare that to the same type of architecture we are putting together for TLCC07: the largest cluster will again be 8 SUs, but this time at over 160 teraFLOP/s. If you put that number on the current TOP500 list, it would be number 2 or 3. So it’s a validation that Appro is not only successful at supporting our enterprise customers, but can deploy world-class super clusters as well.

Supercomputing Online wishes to thank Mark Seager and John Lee for their time and insight. It would also like to thank Don Johnston and Maria McLaughlin for their assistance.