ACADEMIA
Platform to Announce Enhanced Grid Computing Solutions
By Steve Fisher, Editor --
Next week, Platform Computing will announce significant enhancements to its Grid computing solutions, driven by new versions of LSF and MultiCluster. Based on extensive consultations with over 40 beta customers around the world, Platform has "dramatically" increased the scalability and performance of both LSF and MultiCluster to build one of the industry’s strongest foundations for Enterprise Grid computing. To learn more about LSF 5, Supercomputing Online spoke with James Pang, LSF Product Manager at Platform Computing.
SCO: Please tell us about the new version of Platform's LSF. Of particular interest might be its Grid-enabled architecture, scheduler plug-in SDK, resource allocation limits, things of that nature.
PANG: LSF 5 is a major new re-architecture, designed specifically to accelerate the adoption of Enterprise Grid computing: global sharing of resources across all geographical locations across the enterprise and across all platforms.
LSF 5 is the first and only solution to provide all the elements required to meet the computing needs of today’s Enterprise Grids. LSF 5 is the first to offer massive scalability through its modular, Grid-enabled architecture, the first to provide the performance and scheduling sophistication required to manage resources and data at this scale, and the first to make Enterprise Grid computing easy.
LSF 5 provides an SDK for tuning Grid policies to meet business rules, and introduces new policies for inter- and intra-cluster sharing to ensure the right work is done reliably and on time, while still achieving 100% resource utilization. LSF 5 also introduces new features such as Advance Reservation, to ensure critical work is done reliably and without interference; Cross-queue Fairshare, to ensure user groups and projects get only the share of resources that the business rules dictate; Memory Reservation, for improving throughput of large-memory jobs; and Resource Allocation Limits, for effective partitioning of resource usage to meet service level agreements.
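As a rough illustration of the Cross-queue Fairshare idea mentioned above, the sketch below ranks users by their configured share divided by their usage summed over every queue. This is an assumption-laden toy, not Platform's implementation: the function name `fairshare_order`, the `shares / (1 + usage)` ratio, and the use of CPU hours as the usage measure are all hypothetical choices made for the example.

```python
from collections import defaultdict


def fairshare_order(shares, usage_by_queue):
    """Rank users for dispatch by share divided by usage summed over all queues.

    shares:         {user: configured share} (hypothetical policy input)
    usage_by_queue: {queue: {user: cpu_hours}} (hypothetical accounting data)
    """
    total = defaultdict(float)
    # "Cross-queue": usage in every queue counts against the same user.
    for queue_usage in usage_by_queue.values():
        for user, cpu_hours in queue_usage.items():
            total[user] += cpu_hours
    # A higher ratio means the user has consumed less relative to their share,
    # so that user is ranked first. The +1.0 avoids division by zero.
    return sorted(shares, key=lambda u: shares[u] / (1.0 + total[u]), reverse=True)
```

For example, a user who has run heavily in both a daytime and a night queue would be ranked behind a user with the same share who has run very little in either.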
With Web-enabled access for secure remote management, and a unique Resource Leasing Model for global inter-cluster sharing, it is just as easy to share remote resources as local resources, while maintaining local ownership and control. Site-specific authentication is supported across firewalls, and integration with ClearCase MultiSite makes access to data easy even if the work is done on remote clusters.
SCO: According to a press release I saw, "LSF 5 boosts scalability to over 100 clusters, with over 500,000 jobs per cluster, and more than 200,000 CPUs across the enterprise." How did you accomplish this boost in scalability?
PANG: To our knowledge, no other workload management solution has these capabilities. This significant boost in scalability was driven largely by some of our largest customers, who are pushing the limits of scalability to new heights. There are two things happening: first, jobs are getting larger and more complex, and second, clusters are being consolidated to create larger clusters. This requires more effective management than ever before to achieve business objectives.
With this in mind, we developed a new modular architecture for LSF 5, which introduces new scheduler algorithms and new protocols for inter-cluster sharing. This new architecture splits the work into two processes: a manager and a scheduler. The manager is dedicated to servicing users and agents, to give users fast response time, and to be able to quickly process load information sent from the agents. The scheduler just has to focus on matching the right resources to the right jobs. We also introduced new scheduling algorithms, such as our new algorithm for fairshare, to boost the performance of matching jobs to resources. The new algorithms, which are patent pending, use a resource-centric approach to matching resources to the right jobs quickly. Lastly, we introduced a new protocol for inter-cluster sharing, called Resource Leasing, that allows us to improve the performance of scheduling work across clusters.
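To make the phrase "resource-centric" concrete, here is a minimal sketch of one way a scheduler could walk the available hosts and fill each with the highest-priority pending jobs that fit, rather than walking the job list and searching for a host per job. It is only an illustration under stated assumptions: the `Host` and `Job` types, their fields, and `match_resource_centric` are hypothetical names, and the interview does not describe the actual patent-pending algorithms.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    free_slots: int
    free_mem_mb: int


@dataclass
class Job:
    job_id: int
    slots: int
    mem_mb: int
    priority: int  # higher value is dispatched first (assumed convention)


def match_resource_centric(hosts, pending):
    """Visit each host once and place the highest-priority jobs that fit on it.

    Returns (placements, unplaced): a {job_id: host_name} map and the list of
    jobs that could not be placed, still in priority order.
    Note: host free capacity is decremented in place as jobs are placed.
    """
    placements = {}
    queue = sorted(pending, key=lambda j: -j.priority)
    for host in hosts:
        remaining = []
        for job in queue:
            if job.slots <= host.free_slots and job.mem_mb <= host.free_mem_mb:
                host.free_slots -= job.slots
                host.free_mem_mb -= job.mem_mb
                placements[job.job_id] = host.name
            else:
                remaining.append(job)
        queue = remaining
    return placements, queue
```

In this toy version a job that is too large for any host simply stays in the unplaced list, and smaller lower-priority jobs may run ahead of it; a production scheduler would layer policies such as reservation on top.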
SCO: How does LSF 5 compare to offerings from other vendors?
PANG: LSF 5 is really in a unique position, as the only totally integrated workload management solution that addresses enterprise-wide needs across all platforms (Linux, UNIX, Windows, Apple), all geographical locations, across workstations and HPC servers, with extensibility to desktop computing as well. It is the only solution that provides enterprise-wide scalability, enterprise-wide policy tuning, enterprise-wide manageability and usability, including cluster management.
There are other lower-end and open source products that address the basic queuing needs of smaller, departmental clusters, but LSF 5 is the only solution that provides enterprises with an end-to-end solution to meet growing compute demands, while building a strong foundation for Enterprise Grid computing.
SCO: Any early adopters and feedback that you'd care to share with us?
PANG: We have just completed our largest Beta program ever, with over 40 Beta sites, over a six-month period. Most of these customers have depended on Platform LSF for several years and run some of the largest IT infrastructures in the world; they include AMD, nVidia, Sharcnet, HP (Convex), and SLAC (Stanford Linear Accelerator Center), among others. Some of our largest Beta customers, such as AMD, are already using LSF 5 in production, and will use it to merge several clusters into a single significantly larger cluster.
In addition to comments about stability, performance and scalability, our beta customers are excited about the new features like advance reservation and fairshare functionality, which gives superusers a way to pre-empt other jobs and reserve adequate resources to get their jobs through. Other users like the new open architecture in LSF 5, because it allows them to accommodate custom application integration research projects, while maintaining an extremely stable environment. We’ve also had great feedback about the new web interface and the easy, automated installation, which makes for a very quick, painless upgrade.
More and more enterprises are looking at Enterprise Grid computing as a way to improve ROI and productivity: to simplify chaos and complexity, and to get more mileage out of what they already have through collaboration and sharing with other departments.
SCO: On a more general note, I think it's probably fair to say that Platform is in fact huge in distributed computing these days. Why do you think you've been more successful than other companies in this space? If Platform considers itself number one in distributed computing, what does it have to do to stay atop this rapidly growing market segment?
PANG: Platform has been successful because we listen to, respond to, and work closely with our customers. LSF 5 was built to address the ever-increasing demands on computing resources from our Enterprise customers: more complex designs, shorter time to market, increasing user productivity, maximizing IT efficiency, all of which were pushing the limit of the number of CPUs, clusters, jobs, projects, users, sites, etc. that could be supported.
With LSF 5, we have built the foundation to meet not just the customers' current needs, but their future needs as well. Customers will continue to push the limits of global sharing, and Platform must be there to make global sharing feasible by expanding our scalability limits and supporting heterogeneous platforms and environments, and to make global sharing easy for our users through application integration and easy-to-use management interfaces. Platform will continue to focus on providing integrated solutions for the whole spectrum of distributed computing: workload management, performance management, and service management.
SCO: Is there anything you'd like to add?
PANG: With LSF 5, we have captured the accumulated experience of 10 years of distributed computing expertise, with over 1500 customers and 140 software developers. This exciting milestone would not be possible without the dedication and effort of our world-class development, professional services and support teams, and the support of our customers and partners.