ENGINEERING
A New Generation of Supercomputers
by Kurt Riesselmann, Fermilab Public Affairs
Ever since Albert Einstein became the most famous physicist in the world, people have linked the word ‘theorist’ to three things: chalkboards, equations and bad hair. Well, it’s time to add an item to the list: supercomputers. Using paper and pencil, theorists have captured new ideas, revealed intricate mathematical relations and carried out page-long calculations. But times have changed. For many applications, physicists now prefer to attack their models and equations with the best computers available. The Fermilab theory group became one of the top players in the area of computational physics with the installation of a supercomputer called ACPMAPS in 1989.
“For a short period of time we had—by some measures—the fastest supercomputer in the world,” recalled Mark Fischler, who developed sophisticated software for the users of the 50-gigaflop supercomputer. “The rest of the world, of course, caught up quickly.”
Scientists developed ACPMAPS to carry out computations involving the strong force that binds quarks together. A theory called quantum chromodynamics provides the equations that describe—in principle—the evolution and properties of all quark systems. In reality, even today’s most powerful computers can handle only crude approximations of the full QCD theory using a four-dimensional lattice. A typical calculation approximates a volume as large as a proton by a grid of 24x24x24 points. The fourth dimension, which keeps track of the evolution in time, might be cut into 48 slices. For a thirteen-year-old supercomputer, such a 660,000-point lattice is tough to deal with even for simple approximations. Theorists, of course, would like to obtain more accurate results by using lattices with even more points and better approximations.
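(To put those numbers in perspective, here is a minimal Python sketch using only the lattice dimensions quoted above. The single-component field array is purely illustrative; it is not how ACPMAPS or any production lattice code actually stores quark fields.)

```python
import numpy as np

# Lattice dimensions quoted in the article: a proton-sized spatial volume
# of 24 x 24 x 24 points, with the time direction cut into 48 slices.
NX = NY = NZ = 24
NT = 48

sites = NX * NY * NZ * NT
print(f"Lattice sites: {sites:,}")        # 663,552 -- roughly 660,000

# Toy storage: one complex number per site. A real QCD quark field carries
# color and spin indices at every site, so the true memory cost is far larger.
field = np.zeros((NT, NZ, NY, NX), dtype=np.complex128)
print(f"Memory for this single component: {field.nbytes / 2**20:.1f} MiB")
```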
“If you had infinite computing power, you could do anything,” said physicist Massimo di Pierro, who develops and programs algorithms for lattice calculations. “But in practice, the full QCD algorithms take too long. So you need to approximate. We absolutely understand how to approximate QCD. You develop tricks and new techniques. That’s a necessity. Only if we had an infinite amount of computing power would the uncertainty in our results go to zero.”
Good lattice calculations are essential to extract information from some of the best particle experiments under way, including measurements on the mixing of quark systems.
“Without precise lattice calculations, measurements of the mixing of strange B and Bbar mesons here at Fermilab will not illuminate the tiny difference in the behavior of matter and antimatter,” said theorist Paul Mackenzie, a Fermilab expert on lattice calculations. Scientists rely on these results to learn more about the origin of all matter in the universe.
Since the fall of 2001, Fermilab has had the financial means to crank up its computing power for lattice QCD. Introducing the Scientific Discovery through Advanced Computing program, the Department of Energy awarded Fermilab almost two million dollars over the course of three years to develop a new supercomputer system. The SciDAC grant is part of a nationwide effort to provide teraflop computing power for lattice theory projects in nuclear and high energy physics, one of many SciDAC computing initiatives.
“Refinement of computing algorithms is important,” Mackenzie said. “But brute computing force plays a significant role, too. Both are important: they multiply each other.”
The Fermilab theory group has been a leader in developing the theoretical methods to make lattice calculations feasible and effective. With the grant money, Mackenzie and his colleagues can now improve on computing power as well. Collaborating with scientists from universities and national laboratories, Fermilab theorists have decided to build a computer cluster consisting of 512 PCs.
Order by catalogue
“Right now, we have a prototype of 80 nodes,” explained Don Holmgren, who is responsible for the assembly of the new machine. “This system is already more powerful than ACPMAPS. In the next six months, we will buy and integrate another 176 machines.”
In contrast to the old ACPMAPS machine, the new supercomputer will consist entirely of components available off the shelf. Each node will have a processor capable of more than one gigaflop, and all nodes will be able to communicate with each other.
“There are two opinions in the world on how to proceed: commodity-built or purpose-built,” explained Mackenzie, who leads the Fermilab project. “The ACPMAPS system is a purpose-built machine. It was built for Fermilab theorists and collaborators, and it is hard to upgrade. Our new machine is commodity-built. It will serve as a user facility and is easy to upgrade. That’s a great achievement of commodity hardware.”
Holmgren agreed.
“We bought the prototype cluster in 1999 for about three thousand dollars per node,” he said. “In June 2002 we’ll be buying components that are four times better while prices go down. Every year, we’ll buy the best PCs, eventually discarding the oldest ones. With this approach we’re riding the price-performance curve.”
The 80-node prototype, bought in 1999 for a quarter of a million dollars, is two to three times faster than ACPMAPS, which cost about three million dollars in 1989. The final cluster will be over ten times faster than the prototype.
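(A back-of-the-envelope check of that claim, using only the dollar figures and speed factors given above, with the prototype taken to be 2.5 times faster than ACPMAPS and inflation ignored.)

```python
# Speeds are expressed relative to ACPMAPS = 1.
acpmaps_cost, acpmaps_speed = 3_000_000, 1.0     # 1989 purpose-built machine
prototype_cost, prototype_speed = 250_000, 2.5   # 1999 80-node PC cluster

ratio = (prototype_speed / prototype_cost) / (acpmaps_speed / acpmaps_cost)
print(f"Price-performance improvement, 1989 to 1999: about {ratio:.0f}x")  # ~30x

# "Over ten times faster than the prototype" for the final cluster:
final_speed = prototype_speed * 10
print(f"Final cluster vs. ACPMAPS: roughly {final_speed:.0f}x the speed")
```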
Communication is the key
To create a teraflop machine (one million million floating point operations per second) for lattice theory, scientists need both high-speed processors and high-speed communication.
“For our machine, high-performance communication is more important than in experimental analysis,” Mackenzie pointed out.
The Computing Division and experimental groups at Fermilab have experience in building PC farms with hundreds of nodes. In these systems, however, there is less need for superfast communication among all nodes. Each node is capable of analyzing the data of a particle collision without exchanging data with parallel nodes. When the analysis of an event is complete, the node reports the results and requests the next set of data without input from the other nodes.
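(A minimal sketch of this embarrassingly parallel pattern. The event data and the analyze_event function are invented stand-ins, not Fermilab analysis code.)

```python
from multiprocessing import Pool

def analyze_event(event):
    # Stand-in for a full reconstruction: each event is analyzed
    # without any data from other events or other workers.
    return sum(event) / len(event)

if __name__ == "__main__":
    # Toy "collision events"; on a real farm each would come from disk or tape.
    events = [[float(i + j) for j in range(100)] for i in range(1000)]

    # Workers never talk to each other; each simply takes the next event.
    with Pool(processes=4) as pool:
        results = pool.map(analyze_event, events)
    print(f"Analyzed {len(results)} events independently")
```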
In lattice QCD computations, communication among different nodes is a crucial element of the calculation.
“We require communication among all nodes since they all need to share data on the quark fields,” explained Mackenzie. “Each node is responsible for a subset of lattice points, and each node constantly exchanges information with neighbors.”
For certain calculation procedures, like Fourier transforms, even the neighbor-to-neighbor communication is not sufficient. Instead, all nodes must share the values of their lattice points. Finding superfast communication hardware is the key to creating the best lattice supercomputers.
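(The contrast can be illustrated on a single machine. In the sketch below, a toy scalar field stands in for the quark fields; np.roll plays the role of the neighbor-to-neighbor exchange that would cross the network between nodes, while the Fourier transform needs the entire field at once.)

```python
import numpy as np

# Toy field on the full 48 x 24 x 24 x 24 lattice (real QCD fields also
# carry color and spin indices; only the communication pattern matters here).
rng = np.random.default_rng(0)
phi = rng.standard_normal((48, 24, 24, 24))

# Nearest-neighbor coupling: every site needs values one step away in each
# of the four directions. On a cluster, the shifted values at a sub-volume
# boundary must be fetched from the neighboring node (a "halo exchange").
laplacian = sum(
    np.roll(phi, +1, axis=mu) + np.roll(phi, -1, axis=mu) - 2.0 * phi
    for mu in range(4)
)

# A Fourier transform is different: every mode depends on every site, so all
# nodes must share their data with all others (global communication).
phi_k = np.fft.fftn(phi)
print(laplacian.shape, phi_k.shape)
```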
“If you spend all your time communicating, you’re not calculating,” Holmgren said. “But, all we care about is calculating. For our prototype system, we spent as much money on the communication switch as on the computers.”
The prototype system relies on a single powerful switch that can connect up to 128 nodes. With this switch, the cluster can simultaneously conduct 64 conversations between pairs of nodes without them slowing down each other.
Adding more and more PCs to the cluster creates challenges of its own.
“In three years we may have about one thousand nodes,” Holmgren said. “At that point it becomes kind of a daunting task. If you need to do something manually for each PC, it quickly adds up. Say it takes fifteen minutes to work on one machine. For all machines, that’s 250 man-hours – six weeks for a single person. Clearly, we need to find smart ways of maintaining these machines.”
The SciDAC program will help to develop the necessary infrastructure as its grants fund both hardware and software. In addition, the program encourages research institutions to collaborate and make joint proposals.
“The science advisory committee recommended that the lattice community coordinate the use of all computers,” Mackenzie explained. “Subsequently, DOE asked for a long-term plan for lattice calculations in the U.S. For the first time, we see a coordinated effort. The SciDAC program has unified the efforts of individual groups into a national plan.”
Due to the new lattice initiative, experimenters know that their data will be even more valuable.
-----
Supercomputing Online thanks our friends at FermiNews for allowing us to share this article with our readers.