3 Questions: David Smith on InfiniBand
David Smith is Senior Product Manager, InfiniBand Products, at QLogic. For the past ten years he has focused on HPC server platforms, routers, storage systems and interconnects. Previously, he held product management positions at 3Par, Silicon Graphics, and 2Wire. He holds an MBA from San Jose State University. In an interview with SC Online, Mr. Smith shares his thoughts about InfiniBand as a cluster interconnect for supercomputing.
SC Online: What is QLogic doing to address these challenges differently from other InfiniBand vendors?
Smith: QLogic takes a system-level approach to InfiniBand – we design ASICs, adapters, and switches, and we have also invested heavily in system architecture, including host system interfaces, application messaging patterns, fabric and I/O virtualization, fabric routing, and signal integrity and modeling. Our software development spans scalable communication libraries, fabric management, installation services, and element management. With this broad perspective, we have looked at the key challenges and specifically addressed them with features that are unique to our products.
Scalability – To improve scalability, QLogic uses a host-based interfacing approach. This is one fundamental difference between QLogic HPC solutions and those from other vendors.
In the early days of HPC clusters, we were dealing with single-core Pentium-class systems. Back then, other vendors employed the strategy of offloading everything to the adapters, which was a good idea at the time. But with multi-core processors and larger clusters, the load on those on-board processors becomes too great and they actually become a bottleneck. Some of our tests show that the bottleneck begins with as few as four or five cores in a single node.
QLogic uses a host-based processing approach, so we specifically developed our host software and host silicon to excel at MPI workloads. As a result, the performance of our QDR InfiniBand HCAs actually increases as the number of cores in a cluster scales upward. We outperform competitive HCAs by more than 22% on a 256-node cluster running real-world applications, so we deliver better performance on small clusters and even better performance on large clusters. (This is based on SPEC MPI2007 results submitted in July 2009.)
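To make that workload pattern concrete, here is a minimal sketch of the kind of MPI message-rate microbenchmark this argument refers to. It is an illustration only (not a QLogic or SPEC tool), written with mpi4py; launching one rank per core adds more message-generating processes, which is exactly the load discussed above.

```python
# Hypothetical message-rate microbenchmark (an illustration, not a QLogic or
# SPEC tool): even-numbered ranks stream small messages to their odd-numbered
# partners, so running one rank per core makes the aggregate message rate grow
# with the core count.
# Requires mpi4py and NumPy; run with e.g.: mpirun -np 8 python msgrate.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
assert comm.Get_size() % 2 == 0, "run with an even number of ranks"
rank = comm.Get_rank()
peer = rank + 1 if rank % 2 == 0 else rank - 1   # pair each rank with a neighbour

N_MSGS = 100_000
buf = np.zeros(8, dtype=np.uint8)                # tiny 8-byte payload

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(N_MSGS):
    if rank % 2 == 0:
        comm.Send(buf, dest=peer, tag=0)         # sender half of the pair
    else:
        comm.Recv(buf, source=peer, tag=0)       # receiver half of the pair
comm.Barrier()
elapsed = MPI.Wtime() - t0

if rank == 0:
    pairs = comm.Get_size() // 2
    print(f"{pairs * N_MSGS / elapsed:,.0f} aggregate messages/sec")
```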
Offloading work to adapters requires more power and space and is ultimately a flawed design that doesn’t take advantage of all the advancements on the server side. The competition is fighting against Moore’s Law. QLogic is enabling businesses to capitalize on it.
Efficiency – Treated as a single pipe, the HPC fabric can suffer significant losses in performance as multiple applications contend for resources. For example, if a message-intensive application is running with one-microsecond latency, adding a storage application can increase the overall latency by up to 400 percent (taking a one-microsecond operation to roughly five microseconds).
We use a feature called virtual fabrics with classes of service to eliminate resource contention and optimize performance for every application. Virtual fabrics is the ability to segregate traffic into different priority classes. A user may have different jobs that require different priorities, or may decide to separate different traffic types, such as compute, storage, and management traffic, into different priority classes. QLogic helps users partition traffic flows to make sure that storage traffic doesn't interfere with critical compute traffic.
QLogic’s virtual fabrics capability supports up to 16 service classes simultaneously. If a network administrator understands which applications must be supported and how the workloads occur on the fabric, he or she can use virtual fabrics to automatically optimize the fabric’s resources to ensure maximum performance for every job.
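As a rough illustration of the idea, a virtual-fabrics policy amounts to mapping traffic classes onto InfiniBand's 16 service levels with a guaranteed share for each. The class names, service-level numbers, and bandwidth shares in the sketch below are assumptions chosen for the example, not QLogic's actual configuration format.

```python
# Toy model of "virtual fabrics": map traffic classes to InfiniBand service
# levels (SLs) with a guaranteed bandwidth share. Real deployments configure
# this in the fabric manager; the values below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class VirtualFabric:
    name: str
    service_level: int      # 0-15: InfiniBand defines 16 service levels
    min_bandwidth_pct: int  # share guaranteed when the fabric is contended

POLICY = [
    VirtualFabric("compute",    service_level=0, min_bandwidth_pct=60),
    VirtualFabric("storage",    service_level=4, min_bandwidth_pct=30),
    VirtualFabric("management", service_level=8, min_bandwidth_pct=10),
]

def classify(traffic_type: str) -> VirtualFabric:
    """Pick the virtual fabric (and hence SL/priority) for a traffic source."""
    for vf in POLICY:
        if vf.name == traffic_type:
            return vf
    raise ValueError(f"no virtual fabric defined for {traffic_type!r}")

print(classify("storage"))   # storage traffic gets SL 4 and a 30% guaranteed share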
Another related feature is adaptive routing, which also makes the fabric much more efficient. Most HPC fabrics are designed to enable multiple paths between switches, but standard InfiniBand switches don’t necessarily take advantage of these paths to reduce congestion. As implemented by QLogic, adaptive routing is a capability that shifts network traffic from over-utilized links to less utilized links. Adaptive routing leverages intelligence in the switches themselves to maintain awareness of the performance of every available path, and to automatically choose the least congested path for each traffic flow.
QLogic’s implementation of adaptive routing is a great example of how we think through the entire problem before delivering a feature. Although other vendors also have adaptive routing, there are several key differences:
- Fabric intelligence vs. subnet manager intelligence – QLogic’s Adaptive Routing is built into the switch chips themselves, so the decisions about the ideal path are made in the switch. Our competitors’ implementation relies on the subnet manager, which recalculates routes and then sends commands to the switches for execution. It is far faster and more efficient for the switches themselves to make the decision, and this eliminates the chance that the subnet manager itself can become a bottleneck.
- Scalable fabric intelligence – by incorporating adaptive routing directly into its switch chips, QLogic actually allows the path selection intelligence to scale as the fabric grows: as switches are added, we add to the overall pool of knowledge about path characteristics as well. This is not the case with competitive products because they always rely on the subnet manager.
- Managing flows, not packets – QLogic’s approach is more in tune with the realities of fabric operations. QLogic’s switches work on a per-flow basis, not a per-packet basis. Competitive products will continue to send packets down a congested path, or will send packets on an alternate path only to have them arrive at the destination out of order; at that point they need to be reordered, which causes delays. By working on a per-flow basis, QLogic’s fabric software optimizes the transport of the whole flow rather than individual packets: it can automatically re-route part of a flow down a different path when there is congestion, and it reassembles packets from divergent paths in the proper order for optimized processing at the destination.
Another important differentiator is QLogic’s distributed adaptive routing, which increases intelligence as the cluster scales. With distributed adaptive routing, every TrueScale ASIC that we put into the network has its own integrated RISC microprocessor with its own ability to evaluate local conditions and make determinations in conjunction with the other ASICs. So the more switches you add, the more microprocessors you add. This means that you are adding more intelligence to your HPC network. Host adapters are aware of multiple paths to any destination, and use routing intelligence to figure out how to route packets at any given time.
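The following is a minimal sketch of the per-flow, congestion-aware path selection described above. It is a toy model, not QLogic switch firmware: each switch tracks the load on its candidate output ports, pins an established flow to its current port so packets stay in order, and moves the whole flow only when a clearly less congested alternative exists.

```python
# Toy model of per-flow adaptive routing (not QLogic firmware): track port
# load, keep established flows on their current port, and re-route a flow
# only when its port is clearly more congested than an alternative.
CONGESTION_MARGIN = 0.2          # hysteresis: only move a flow for a clear win

class AdaptiveSwitch:
    def __init__(self, ports_to_dest):
        self.ports = ports_to_dest          # candidate output ports per destination
        self.load = {p: 0.0 for plist in ports_to_dest.values() for p in plist}
        self.flow_table = {}                # flow id -> currently chosen port

    def route(self, flow_id, dest):
        candidates = self.ports[dest]
        best = min(candidates, key=lambda p: self.load[p])
        current = self.flow_table.get(flow_id)
        # Move the whole flow only if its current path is clearly congested.
        if current is None or self.load[current] > self.load[best] + CONGESTION_MARGIN:
            self.flow_table[flow_id] = best
            current = best
        return current

sw = AdaptiveSwitch({"nodeB": ["port1", "port2", "port3"]})
sw.load["port1"] = 0.9                            # port1 is busy
print(sw.route(("nodeA", "nodeB", 0), "nodeB"))   # the flow is steered to a quieter port
```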
We also use a feature called dispersive routing to automatically load-balance traffic across the network, effectively leveraging the entire fabric to optimize application performance. Working via the host adapters, QLogic’s dispersive routing distributes traffic over multiple paths to a destination to load-balance the network while ensuring that packets sent via disparate routes arrive in the proper order for processing at their destination.
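The sketch below is an assumption-level illustration of that idea, not QLogic's wire protocol: stripe a flow's packets across several paths to balance load, tag each packet with a sequence number, and reassemble in order at the receiver even when the paths deliver out of order.

```python
# Simplified illustration of dispersive routing (assumptions, not QLogic's
# wire protocol): stripe packets over multiple paths, carry sequence numbers,
# and restore the original order at the destination.
from itertools import cycle
import random

paths = ["path_A", "path_B", "path_C"]
packets = [f"pkt{i}" for i in range(9)]

# Sender: round-robin the packets over the available paths with a sequence number.
striped = [(seq, path, pkt) for (seq, pkt), path in zip(enumerate(packets), cycle(paths))]

# Network: different paths have different delays, so arrival order is scrambled.
arrived = sorted(striped, key=lambda _: random.random())

# Receiver: sequence numbers restore the original order before delivery to the app.
delivered = [pkt for seq, _path, pkt in sorted(arrived, key=lambda x: x[0])]
assert delivered == packets
print(delivered)
```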
MPI Support – While the message passing interface (MPI) is a common denominator in InfiniBand software support, hardware companies usually supply vendor-specific MPI libraries that optimize performance for their products. MPI libraries are developed with different focuses, such as performance or scalability, and open-source MPIs are not as feature-rich as some of the commercial ones. For example, not all MPIs have native QoS support.
Competitors’ fabric software typically only supports Open MPI libraries, so it does not take advantage of custom features in vendor-specific products. For example, another software company will support Open MPI in the OFED stack, but it adds no value to the vendor-specific libraries that are out there. With our software, QLogic enables all features for every vendor-specific MPI, so the advanced routing, path selection, and class-of-service features work in any customer environment without compromising performance.
Topology Flexibility – Fat Tree is the most common topology used in HPC cluster environments, but it’s not the only one. QLogic is the only company to fully support alternative topologies.
Many HPC centers move to Torus or Mesh topologies when the cluster grows to 2000 nodes or more because these topologies require fewer switches and are thus less expensive to use than Fat Tree.
Torus and mesh topologies enable full bandwidth to the nearest neighboring nodes, but they don’t provide as much bandwidth to nodes beyond that. This takes about 50 percent of the cost out of the network, but it also means that applications must be very intelligent in terms of how they distribute the workload across nodes.
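A back-of-the-envelope sketch of that cost argument follows; the 36-port switch radix and the torus port split are illustrative assumptions, not QLogic sizing figures.

```python
# Rough switch-count comparison under assumed parameters (illustrative only).
RADIX = 36                      # ports per switch (assumed)

# Two-tier non-blocking fat tree: k leaf switches (k/2 hosts + k/2 uplinks each)
# plus k/2 spine switches support k^2/2 hosts.
fat_tree_nodes = RADIX ** 2 // 2                    # 648 hosts
fat_tree_switches = RADIX + RADIX // 2              # 54 switches

# 3D torus built from the same switches: assume 12 ports reserved for the six
# torus directions (2 links each) and 24 ports for hosts.
hosts_per_torus_switch = RADIX - 12                 # 24 hosts per switch
torus_switches = fat_tree_nodes // hosts_per_torus_switch   # 27 switches

print(torus_switches / fat_tree_switches)           # 0.5 -> roughly half the switches
```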
QLogic supports these alternative topologies with built-in browsing. Applications can leverage this capability in the QLogic software to quickly build accurate maps of where to distribute workloads for optimum performance. As a result, QLogic’s implementation allows for much better performance in Torus or Mesh topologies than the competition.
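To illustrate why such a map matters, here is a hypothetical sketch of topology-aware placement on a torus. The mapping and distance functions are assumptions for the example, not the QLogic API; the point is simply that neighbouring ranks land on physically adjacent nodes, so their exchanges stay on the full-bandwidth nearest-neighbour links described above.

```python
# Hypothetical topology-aware placement on an assumed 4x4x4 torus of nodes.
TORUS = (4, 4, 4)

def rank_to_coord(rank, dims=TORUS):
    """Lay ranks out in x-y-z order so rank r and r+1 usually share a torus link."""
    x, rem = divmod(rank, dims[1] * dims[2])
    y, z = divmod(rem, dims[2])
    return (x, y, z)

def hop_distance(a, b, dims=TORUS):
    """Minimal torus hops between two nodes (wrap-around links included)."""
    return sum(min(abs(i - j), d - abs(i - j)) for i, j, d in zip(a, b, dims))

# Neighbouring ranks end up one hop apart, keeping nearest-neighbour exchanges local.
print(hop_distance(rank_to_coord(0), rank_to_coord(1)))   # 1
```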
SC Online wishes to thank David Smith for sharing his time and insights.