Let the games begin. At a conference on computing frontiers held in Ischia, Italy, scientists reported on the high-performance potential of a new processor designed for the PlayStation 3.
To gauge Cell's potential, Berkeley Lab computer scientists measured the processor's performance on several scientific-application kernels, then compared it against that of other processor architectures. The results were presented at the ACM International Conference on Computing Frontiers, held in May 2006 in Ischia, Italy, in a paper by Samuel Williams, Leonid Oliker, Parry Husbands, Shoaib Kamil, and Katherine Yelick of the Future Technologies Group in Berkeley Lab's Computational Research Division, and by John Shalf of DOE's National Energy Research Scientific Computing Center (NERSC).
"Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency," the authors report. "We also conclude that Cell's heterogeneous multicore implementation is inherently better suited to the HPC" -- high-performance computing -- "environment than homogeneous commodity multicore processors."
Cell, designed by a partnership of Sony, Toshiba, and IBM (the STI in STI Cell), pairs a high-performance, software-controlled memory hierarchy with the considerable floating-point resources required by demanding numerical algorithms. Cell is a radical departure from conventional multiprocessor or multicore architectures. Instead of using identical cooperating processors, it uses a conventional high-performance PowerPC core that controls eight single-instruction, multiple-data cores called synergistic processing elements (SPEs), each of which contains a synergistic processing unit, a local memory, and a memory-flow controller.
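That division of labor shows up directly in how Cell is programmed: the PowerPC core loads and launches small programs on the SPEs, which do the numerical heavy lifting out of their local memories. The sketch below illustrates the model using the libspe2-style interface from IBM's Cell SDK; it is a minimal, assumed example (the embedded SPE program name is hypothetical), not code from the paper, and a real application would create eight contexts, one per SPE, each running in its own thread.

    #include <stdio.h>
    #include <libspe2.h>

    /* Handle to the embedded SPE binary; "spu_kernel" is a hypothetical name. */
    extern spe_program_handle_t spu_kernel;

    int main(void)
    {
        /* The PowerPC (PPE) side creates a context for one SPE... */
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        if (ctx == NULL) {
            perror("spe_context_create");
            return 1;
        }

        /* ...loads the SPE program into the SPE's local store... */
        if (spe_program_load(ctx, &spu_kernel) != 0) {
            perror("spe_program_load");
            return 1;
        }

        /* ...and runs it to completion (argp/envp unused in this sketch). */
        unsigned int entry = SPE_DEFAULT_ENTRY;
        if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
            perror("spe_context_run");

        spe_context_destroy(ctx);
        return 0;
    }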
In addition to departing from mainstream general-purpose processor designs, Cell is particularly compelling because the intended game market means it will be produced in high volume, making it cost-competitive with commodity central processing units. Moreover, the pace of clock-rate improvements in commodity microprocessors is slowing as chip power demands increase, and these worrisome trends have motivated the computational science community to consider alternatives like STI Cell.
Playing the science game
The authors examined the potential of using the STI Cell processor as a building block for future high-end parallel systems by investigating performance across several key scientific computing kernels: dense matrix multiplication, sparse matrix vector multiplication, stencil computations on regular grids, and one-dimensional and two-dimensional fast Fourier transforms.
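To give a flavor of the kernels in that list, here is a minimal, generic sparse matrix-vector multiply over the standard compressed sparse row (CSR) format, written in plain C. It is purely illustrative: the routine and its layout are our own, not the hand-tuned, SIMD- and DMA-aware implementation the authors actually benchmarked on Cell.

    /* Minimal CSR sparse matrix-vector multiply, y = A * x.
       A generic, illustrative kernel -- not the hand-optimized Cell code
       evaluated in the paper. */
    void spmv_csr(int nrows,
                  const int *rowptr,    /* nrows + 1 offsets into cols/vals */
                  const int *cols,      /* column index of each nonzero */
                  const double *vals,   /* value of each nonzero */
                  const double *x,      /* input vector */
                  double *y)            /* output vector */
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += vals[k] * x[cols[k]];
            y[i] = sum;
        }
    }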
The STI Cell processor can do science applications almost as readily as it plays games.
According to the authors, the current implementation of Cell is noted for its extremely high-performance single-precision (32-bit) floating-point resources. The majority of scientific applications, however, require double precision (64 bits). Although Cell's peak double-precision performance is still impressive compared to that of its commodity peers (eight SPEs running at 3.2 gigahertz yield 14.6 billion floating-point operations per second), the group showed how a design with modest hardware changes, which they named Cell+, could improve double-precision performance.
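For readers curious where that figure comes from, a back-of-the-envelope derivation based on the commonly cited SPE pipeline behavior (one two-way SIMD double-precision fused multiply-add issuing every seven cycles; this gloss is ours, not a calculation quoted in the paper) runs as follows:

    double precision: 8 SPEs × (2 lanes × 2 flops) / 7 cycles × 3.2 GHz ≈ 14.6 gigaflops
    single precision: 8 SPEs × (4 lanes × 2 flops) / 1 cycle  × 3.2 GHz = 204.8 gigaflops

The roughly fourteen-fold gap between the two is the disparity the Cell+ variant is designed to narrow.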
The authors developed a performance model for Cell and used it to make direct comparisons of Cell against the AMD Opteron, Intel Itanium 2, and Cray X1 architectures. The model was then used to guide implementations that were run on IBM's Full System Simulator, providing even more accurate performance estimates.
The authors argue that Cell's three-level memory architecture, which decouples main memory accesses from computation and is explicitly managed by the software, provides several advantages over mainstream cache-based architectures. First, performance is more predictable, because the load time from an SPE's local store is constant. Second, long block transfers from off-chip DRAM (dynamic random access memory) can achieve a much higher percentage of memory bandwidth than individual cache-line loads. Finally, for predictable memory-access patterns, communication and computation can effectively be overlapped by careful scheduling in software.
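The overlap of communication and computation is typically achieved by double buffering: while the SPE computes on one block held in its local store, the memory-flow controller streams in the next. The sketch below shows the idea in C; dma_get_async() and dma_wait() are hypothetical stand-ins for the memory-flow controller's DMA-get and tag-wait commands, and compute() stands in for whatever kernel runs on each block. It is an illustration of the general technique under those assumptions, not the scheduling code used in the paper.

    /* Double-buffered streaming of a large array from main memory through an
       SPE's local store, overlapping DMA transfers with computation. */
    #define CHUNK 4096   /* elements per block held in the local store */

    extern void dma_get_async(float *local, const float *remote,
                              int nelems, int tag);   /* start an asynchronous read */
    extern void dma_wait(int tag);                    /* block until that transfer completes */
    extern void compute(float *block, int nelems);    /* process one block in place */

    void stream_process(const float *main_mem, int nblocks)
    {
        static float buf[2][CHUNK];   /* two local-store buffers */
        int cur = 0;

        if (nblocks <= 0)
            return;

        dma_get_async(buf[cur], main_mem, CHUNK, cur);   /* prefetch block 0 */

        for (int b = 0; b < nblocks; b++) {
            int next = cur ^ 1;
            if (b + 1 < nblocks)   /* kick off the next transfer before computing */
                dma_get_async(buf[next], main_mem + (b + 1) * CHUNK, CHUNK, next);
            dma_wait(cur);                 /* make sure the current block has arrived */
            compute(buf[cur], CHUNK);      /* compute while the next block streams in */
            cur = next;
        }
    }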
"Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency," the authors state. While their current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking.
On average, Cell is eight times faster and at least eight times more power-efficient than current Opteron and Itanium processors, despite the fact that Cell's peak double-precision performance is fourteen times slower than its peak single-precision performance. If Cell were to include at least one fully usable pipelined double-precision floating-point unit, as proposed in the Cell+ implementation, these performance advantages would easily double.
This work was supported by the U.S. Department of Energy's Office of Science.