
The world of modern computing potentially offers many helpful methods and tools to scientists and engineers, but the fast pace of change in computer hardware, software, and algorithms often makes practical use of the newest computing technology difficult. The Scientific and Engineering Computation series focuses on rapid advances in computing technologies, with the aim of facilitating transfer of these technologies to applications in science and engineering. It will include books on theories, methods, and original applications in such areas as parallelism, large-scale simulations, time-critical computing, computer-aided design and engineering, use of computers in manufacturing, visualization of scientific data, and human-machine interface technology. 

The series is intended to help scientists and engineers understand the current world of advanced computation and to anticipate future developments that will affect their computing environments and open up new capabilities and modes of computation. 

This volume in the series describes the increasingly successful distributed/parallel system called Beowulf. A Beowulf is a cluster of PCs interconnected by network technology and employing the message-passing model for parallel computation. Key advantages of this approach are high performance for low price, system scalability, and rapid adjustment to new technological advances. 

This book describes how to build, program, and operate a Beowulf system based on the Linux operating system. A companion volume in the series provides the same information for Beowulf clusters based on the Microsoft Windows operating system. 

Beowulf hardware, operating system software, programming approaches and libraries, and machine management software are all covered here. The book can be used as an academic textbook as well as a practical guide for designing, implementing, and operating a Beowulf for those in science and industry who need a powerful system but are reluctant to purchase an expensive massively parallel processor or vector computer. 


Janusz S. Kowalik

We know two things about progress in parallel programming: 


	Like nearly all technology, progress comes when effort is headed in a common, focused direction with technologists competing and sharing results. 

	Parallel programming remains very difficult and should be avoided if at all possible. This argues for a single environment and for someone else to do the programming through built-in parallel function (e.g., databases, vigorous applications sharing, and an applications market). 


After 20 years of false starts and dead ends in high-performance computer architecture, the way is now clear: Beowulf clusters are becoming the platform for many scientific, engineering, and commercial applications. Cray-style supercomputers from Japan are still used for legacy or unpartitionable applications code, but this is a shrinking fraction of supercomputing because such architectures aren't scalable or affordable. If the code cannot be ported or partitioned, however, vector supercomputers at larger centers are still required. Likewise, the Top500 share of proprietary MPPs [1] (massively parallel processors), SMPs (shared memory, multiple vector processors), and DSMs (distributed shared memory) that came from the decade-long government-sponsored hunt for the scalable computer is declining. Unfortunately, the architectural diversity created by the hunt ensured that a standard platform and programming model could not form. Each platform had low volume, huge software development costs, and a lock-in to that vendor. 

Just two generations ago based on Moore's law (1995) [2], a plethora of vector supercomputers, nonscalable multiprocessors, and MPP clusters built from proprietary nodes and networks formed the market. That made me realize the error of an earlier prediction that these exotic shared-memory machines were supercomputing's inevitable future. At the time, several promising commercial off-the-shelf (COTS) technology clusters using standard microprocessors and networks were beginning to be built. Wisconsin's Condor to harvest workstation cycles and Berkeley's NOW (network of workstations) were my favorites. They provided one to two orders of magnitude improvement in performance/price over the proprietary systems, even including their higher operational overhead. 

[1] MPPs are a proprietary variant of clusters or multicomputers. "Multicomputers" is the name Allen Newell and I coined in our 1971 book, Computer Structures, to characterize a single computer system comprising connected computers that communicate with one another via message passing (versus via shared memory). In the 2001 list of the world's Top500 computers, all except a few shared-memory vector and distributed shared-memory computers are multicomputers. "Massive" has been proposed as the name for clusters of over 1,000 computers. 

[2] G. Bell, "1995 Observations on Supercomputing Alternatives: Did the MPP Bandwagon Lead to a Cul-de-Sac?," Communications of the ACM 39, no. 3 (March 1996): 11-15. 

In the past five years, the Beowulf way has emerged. It developed and integrated a programming environment that operates on scalable clusters built from commodity parts, typically based on Intel processors but sometimes on Alphas or PowerPCs. It also leveraged a vendor-neutral operating system (Linux) and helped mature tools such as GNU, MPI, PVM, Condor, and various schedulers. The introduction of Windows Beowulf leverages the large Windows software base, for example, applications, office and visualization tools, and clustered SQL databases. 

Beowulf's lower price and standardization attracted a large user community to a common software base. Beowulf follows the personal computer cycle of innovation: platform availability attracts applications; applications attract users; user demand attracts platform competition and more applications; lower prices come with volume and competition. Concurrently, proprietary platforms become less attractive because they lack software, and hence they live in niche markets. 

Beowulf is the hardware vendor's worst nightmare: there is little profit in Beowulf clusters of commodity nodes and switches. By using COTS PCs, networks, and free Linux/GNU-based operating systems and tools, or Windows, Beowulf enables any group to buy and build its own supercomputer. Once the movement achieved critical mass, the world tipped to this new computing paradigm. No amount of government effort to prop up the ailing domestic industry, and no amount of industry lobbying, could reverse that trend. Today, traditional vector supercomputer companies are gone from the United States, and they are a vanity business in Japan, with less than 10% of the Top500 being vector processors. Clusters beat vector supercomputers, even though about eight scalar microprocessors are still needed to equal the power of a vector processor. 

The Beowulf movement unified the cluster community and changed the course of technical computing by commoditizing it. Beowulf enabled users to have a common platform and programming model independent of proprietary processors, interconnects, storage, or software base. An applications base, as well as an industry based on many low-cost killer microprocessors, is finally forming. 

You are the cause of this revolution, but there's still much to be done! There is cause for concern, however. Beowulf is successful because it is a common base with critical mass. 

There will be considerable pressure to create Linux/Beowulf dialects (e.g., a 64-bit flavor and various vendor binary dialects), which will fragment the community, user attention span, training, and applications, just as proprietary-platform Unix dialects sprang from hardware vendors to differentiate and lock in users. The community must balance this pseudo- and incremental innovation against standardization, because standardization is what gives Beowulf its huge advantage. 

Having described the inevitable appearance of Linux/Beowulf dialects, and the associated pitfalls, I nonetheless strongly advocate Windows Beowulf. Instead of fragmenting the community, Windows Beowulf will significantly increase it. A Windows version will support the large community of people who want Windows tools, layered software, and development style. Already, most users of large systems operate a heterogeneous system that runs both, with Windows supplying a large scalable database and desktop Visual-X programming tools. Furthermore, competition will improve both. Finally, the big gain will come from cross-fertilization with .NET capabilities, which are leading the way to the truly distributed computing that has been promised for two decades. 


Beowulf Becomes a Contender 


In the mid-1980s an NSF supercomputing centers program was established in response to Digital's VAX minicomputers [3]. Although the performance gap between the VAX and a Cray could be as large as 100 [4], the performance per price was usually the reverse: the VAX gave much more bang for the buck. VAXen soon became the dominant computers for researchers. Scientists were able to own and operate their own computers and get more computing resources with their own VAXen, including those that were operated as the first clusters. The supercomputer centers were used primarily to run jobs that were too large for these personal or departmental systems. 

In 1983 ARPA launched the Scalable Computing Initiative to fund over a score of research projects to design, build, and buy scalable, parallel computers. Many of these were centered on the idea of the emerging killer microprocessor. Over forty startups were funded with venture capital and our tax dollars to build different parallel computers. All of these efforts failed. (I estimate these efforts cost between one and three billion dollars, plus at least double that in user programming that is best written off as training.) The vast funding of all the different species, which varied only superficially, guaranteed little progress and no applications market. The user community did, however, manage to defensively create lowest-common-denominator standards to enable programs to run across the wide array of varying architectures. 

[3] The VAX 780 was introduced in 1978. 

[4] VAXen lacked the ability to get 5-20 times the performance that a large, shared Cray provided for single problems. 

In 1987, the National Science Foundation's new computing directorate established the goal of achieving parallelism of 100X by the year 2000. The goal got two extreme responses: Don Knuth and Ken Thompson said that parallel programming was too hard and that we shouldn't focus on it, while others felt the goal should be 1,000,000X! Everyone else either ignored the call or went along quietly for the funding. This call was accompanied by an offer (by me) of yearly prizes to reward those who achieved extraordinary parallelism, performance, and performance/price. In 1988, three researchers at Sandia obtained parallelism of 600X on a 1000-node system, while indicating that 1000X was possible with more memory. The announcement of their achievement galvanized others, and the Gordon Bell prizes continue, with gains of 100% nearly every year. 

Interestingly, a factor of 1000 scaling seems to continue to be the limit for most scalable applications, but 20-100X is more common. In fact, at least half of the Top500 systems have fewer than 100 processors! Of course, the parallelism is determined largely by the fact that researchers are budget limited and have only smaller machines, costing $1,000-$3,000 per node, or parallelism of < 100. If the nodes are in a center, then the per-node cost is multiplied by at least 10, giving an upper limit of 1,000-10,000 nodes per system. If the nodes are vector processors, the number of processors is divided by 8-10 and the per-node price is raised by 100X. 

In 1993, Tom Sterling and Don Becker led a small project within NASA to build a gigaflops workstation costing under $50,000. The so-called Beowulf project was outside the main parallel-processing research community: it was based instead on commodity and COTS technology and publicly available software. The Beowulf project succeeded: a 16-node, $40,000 cluster built from Intel 486 computers ran in 1994. In 1997, a Beowulf cluster won the Gordon Bell Prize for performance/price. The recipe for building one's own Beowulf was presented in a book by Sterling et al. in 1999 [5]. By the year 2000, several thousand-node computers were operating. In June 2001, 33 Beowulfs were in the Top500 supercomputer list (www.top500.org). Today, in the year 2001, technical high schools can buy and assemble a supercomputer from parts available at the corner computer store. 

Beowulfs formed a do-it-yourself cluster computing community using commodity microprocessors, local area network Ethernet switches, Linux (and now Windows 2000), and tools that have evolved from the user community. This vendor-neutral 


[5] T. Sterling, J. Salmon, D. J. Becker, and D. V. Savarese, How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters, MIT Press, Cambridge, MA, 1999. 

related to each other than they are now. I can finally see the environment that I challenged the NSF computer science research community to build in 1987! 

By 2010 we can expect several interesting paths that Beowulf could host for more power through parallelism: 


	In situ Condor-scheduled workstations providing de facto clusters, with scaleup of 100-10,000X in many environments 

	Large on-chip caches, with multiple processors to give much more performance for single nodes 

	Disks with embedded processors in a network-attached storage architecture, as opposed to storage area networking that connects disks to nodes and requires a separate system area network to interconnect nodes 

Already in 2001, a relatively large number of applications can utilize Beowulf technology by avoiding parallel programming, including the following: 

	Web and Internet servers that run embarrassingly parallel to serve a large client base 

	Commercial transaction processing, including inherently parallelized databases 

	Monte Carlo simulation and image rendering that are embarrassingly parallel 
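The "embarrassingly parallel" pattern named in the list above can be made concrete with a few lines of code: each worker computes on its own data with no communication, and results are combined only at the end. This is an illustrative sketch, not code from the book; the function names and the choice of a Monte Carlo pi estimate are the author's hypothetical example.

```python
import random
from multiprocessing import Pool

def count_hits(args):
    """Count random points falling inside the unit quarter-circle.
    Each worker gets its own seed, so workers share no state at all."""
    seed, n = args
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def parallel_pi(workers=4, samples_per_worker=100_000):
    """Embarrassingly parallel: workers run independently;
    the only coordination is summing their counts at the end."""
    with Pool(workers) as pool:
        hits = pool.map(count_hits,
                        [(i, samples_per_worker) for i in range(workers)])
    return 4.0 * sum(hits) / (workers * samples_per_worker)

if __name__ == "__main__":
    print(parallel_pi())  # approximately 3.14
```

Because no worker ever waits on another, such workloads scale almost linearly with node count, which is why they were among the first to migrate to Beowulf clusters.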


Great progress has been made in parallelizing applications (e.g., n-body problems) that had challenged us in the past. The most important remaining challenge is to continue on the course to parallelize those applications heretofore deemed the province of shared-memory multiprocessors. These include problems requiring random variable access and adaptive mesh refinement. For example, automotive and aerodynamic engineering, climate and ocean modeling, and applications involving heterogeneous space remain the province of vector multiprocessors. We need a definitive list of challenges to log progress; but, unfortunately, the vector supercomputer community has not provided this list. 

Another challenge must be to make the use of multicomputers for parallel operation as easy as scalar programming. Although great progress has been made by computational scientists working with computer scientists, the effort to adopt, understand, and train computer scientists in this form of parallelism has been minimal. Few computer science departments are prepared to take on this role. 

Based on two decades of no surprises in overall architectures, will there be any unforeseen advances outside of Moore's law to help achieve petaflops? What will high-performance systems look like in two or four more generations of Moore's law, considering processing, storage, networking, and user connections? Will Beowulf evolve to huge (100,000-node) clusters built from less costly nodes? Or will clusters be just part of the international computing Grid? 


Gordon Bell
Microsoft Research 


Within the past three years, there has been a rapid increase in the deployment and application of computer clusters to expand the range of available system capabilities beyond those of conventional desktop and server platforms. By leveraging the development of hardware and software for these widely marketed and heavily used mainstream computer systems, clusters deliver an order of magnitude or more of scaling in computational performance and storage capacity without incurring significant additional R&D costs. Beowulf-class systems, which exploit mass-market PC hardware and software in conjunction with cost-effective commercial network technology, provide users with the dual advantages of unprecedented price/performance and configuration flexibility for parallel computing. Beowulf-class systems may be implemented by the end users themselves from available components. But as their popularity has grown, so too has industry support for commercial Beowulf systems. Today, depending on source and services, Beowulf systems can be installed at a cost of between one and three dollars per peak megaflops, at scales from a few gigaflops to half a teraflops. Equally important is the rapid growth in diversity of application. Originally targeted at the scientific and technical community, Beowulf-class systems have expanded in scope to the broad commercial domain for transaction processing and Web services, as well as to the entertainment industry for computer-generated special effects. Right now, the largest computer under development in the United States is a commodity cluster that upon completion will be at a scale of 30 teraflops peak performance. It is quite possible that, by the middle of this decade, commodity clusters in general and Beowulf-class systems in particular may dominate middle- and high-end computing for a wide range of technical and business workloads. 
It also appears that for many students, their first exposure to parallel computing is through hands-on experience with Beowulf clusters. 
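The quoted price range translates directly into system budgets. The following back-of-envelope sketch uses only the one-to-three-dollars-per-peak-megaflops figure stated above (a 2001-era price); the function name is the author's hypothetical illustration, not anything from the book.

```python
def cluster_cost_range(peak_gflops):
    """Rough low/high installed-cost estimate for a Beowulf system,
    using the quoted $1-$3 per peak megaflops (2001 prices)."""
    megaflops = peak_gflops * 1000          # 1 gigaflops = 1000 megaflops
    return megaflops * 1.0, megaflops * 3.0  # (low, high) in dollars

# The "half a teraflops" upper scale mentioned in the text:
low, high = cluster_cost_range(500)
print(f"${low:,.0f} to ${high:,.0f}")  # $500,000 to $1,500,000
```

By the same arithmetic, a departmental-scale system of a few gigaflops lands in the few-thousand-dollar range, which is what put Beowulf within reach of individual research groups.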

The publication of How to Build a Beowulf by MIT Press marked an important milestone in commodity computing. For the first time, there was an entry-level comprehensive book showing how to implement and apply a PC cluster. The initial goal of that book, which was released almost two years ago, was to capture the style and content of the highly successful tutorial series that had been presented at a number of conferences by the authors and their colleagues. The timeliness of this book and the almost explosive interest in Beowulf clusters around the world made it the most successful book of the MIT Press Scientific and Engineering Computation series last year. While other books have since emerged on the topic of assembling clusters, it remains the most comprehensive work teaching hardware, software, and programming methods. Nonetheless, in spite of its success, How to Build a Beowulf addressed the needs of only a part of the rapidly growing commodity cluster community. And because of the rapid evolution in hardware and software, aspects of its contents have grown stale in a very short period of time. How to Build a Beowulf is still a very useful introduction to commodity clusters and has been widely praised for its accessibility to first-time users. It has even found its way into a number of high schools across the country. But the community requires a much more extensive treatment of a topic that has changed dramatically since that book was introduced. 

In addition to the obvious improvements in hardware, over the past two years there have been significant advances in software tools and middleware for managing cluster resources. The early Beowulf systems ordinarily were employed by one or a few closely associated workers and applied to a small, easily controlled workload, sometimes even dedicated to a single application. This permitted adequate supervision through direct and manual intervention, often by the users themselves. But as the user base has grown and the nature of the responsibilities for the clusters has rapidly diversified, this simple mom-and-pop approach to system operations has proven inadequate in many commercial and industrial-grade contexts. As one reviewer somewhat unkindly put it, How to Build a Beowulf "did not address the hard problems." This was, to be frank, at least in part true, but it reflected the state of the community at the time of publication. Fortunately, the state of the art has progressed to the point that a new snapshot of the principles and practices is not only justified but sorely needed. 

The book you are holding is far more than a second edition of the original How to Build a Beowulf; it marks a major transition from the early modest experimental Beowulf clusters to the medium- to large-scale, industrial-grade PC-based clusters in wide use today. Instead of describing a single depth-first minimalist path to getting a Beowulf system up and running, this new reference work reflects the range of choices that system users and administrators have in programming and managing what may be a larger user base for a large Beowulf clustered system. Indeed, to support the needs of a potentially diverse readership, this new book comprises three major parts. The first part, much like the original How to Build a Beowulf, provides the introductory material, underlying hardware technology, and assembly and configuration instructions needed to implement and initially use a cluster. But even this part extends the utility of this basic-level description to include discussion of and a tutorial on how to use existing benchmark codes to test and evaluate new clusters. The second part focuses on programming methodology. Here we have given equal treatment to the two most widely used programming frameworks: MPI and PVM. This part stands alone (as do the other two) and provides a detailed presentation of parallel programming principles and practices, including some of the most widely used libraries of parallel algorithms. The third and largest part of the new book describes software infrastructure and tools for managing cluster resources. This includes some of the most popular of the readily available software packages for distributed task scheduling, as well as tools for monitoring and administering system resources and user accounts. 

To provide the necessary diversity and depth across a range of concepts, topics, and techniques, I have developed a collaboration among some of the world's experts in cluster computing. I am grateful to the many contributors who have added their expertise to the body of this work to bring you the very best presentation on so many subjects. In many cases, the contributors are the original developers of the software component being described. Many of the contributors have published earlier works on these or other technical subjects and have experience conveying sometimes difficult issues in readable form. All are active participants in the cluster community. As a result, this new book is a direct channel to some of the most influential drivers of this rapidly moving field. 

One of the important changes that has taken place is in the area of the node operating system. When Don Becker and I developed the first Beowulf-class systems in 1994, we adopted the then-inchoate Linux kernel because it was consistent with other Unix-like operating systems employed on a wide range of scientific compute platforms, from workstations to supercomputers, and because it provided a fully open source code base that could be modified as necessary, while at the same time providing a vehicle for technology transfer to other potential users. Partly because of these efforts, Linux is the operating system of choice for many users of Beowulf-class systems and the single most widely used operating system for technical computing with clusters. However, during the intervening period, the single largest source of PC operating systems, Microsoft, has provided the basis for many commercial clusters used for data transaction processing and other business-oriented workloads. Microsoft Windows 2000 reflects years of development and has emerged as a mature and robust software environment with the single largest base of targeted independent software vendor products. Important path-finding work at NCSA and more recently at the Cornell Theory Center has demonstrated that scientific and technical application workloads can be performed on Windows-based systems. While heated debate continues as to the relative merits of the two environments, the market has already spoken: both Linux and Windows have their own large respective user bases for Beowulf clusters. 

As a result of attempting to represent the PC cluster community, which clearly embodies two distinct camps with respect to the node operating system, my colleagues and I decided to develop two versions of the same book simultaneously. Beowulf Cluster Computing with Linux and Beowulf Cluster Computing with Windows are essentially the same book except that, as the names imply, the first assumes and discusses the use of Linux as the basis of a PC cluster while the second describes similar clusters using Microsoft Windows. In spite of this marked difference, the two versions are conceptually identical. The hardware technologies do not differ. The programming methodologies vary in certain specific details of the software packages used but are formally the same. Many but not all of the resource management tools run on both classes of system. This convergence is progressing even as the books are being written. But even where it has not yet occurred, an alternative and complementary package exists and is discussed for the other system type. Approximately 80 percent of the actual text is identical between the two books. Between them, they should cover the vast majority of PC clusters in use today. 

On behalf of my colleagues and myself, I welcome you to the world of low-cost Beowulf cluster computing. This book is intended to facilitate, motivate, and drive forward this rapidly emerging field. Our fervent hope is that you are able to benefit from our efforts and this work. 


Acknowledgments 


I thank first the authors of the chapters contributed to this book: 


David Bailey, Lawrence Berkeley National Laboratory
Peter H. Beckman, Turbolinux
Remy Evard, Argonne National Laboratory
Al Geist, Oak Ridge National Laboratory
William Gropp, Argonne National Laboratory
David B. Jackson, University of Utah
James Patton Jones, Veridian
Jim Kohl, Oak Ridge National Laboratory
Walt Ligon, Clemson University
Miron Livny, University of Wisconsin
Ewing Lusk, Argonne National Laboratory
Karen Miller, University of Wisconsin
Bill Nitzberg, Veridian
Rob Ross, Argonne National Laboratory
Daniel Savarese, University of Maryland
Stephen Scott, Oak Ridge National Laboratory
Todd Tanenbaum, University of Wisconsin
Derek Wright, University of Wisconsin 


Many other people helped in various ways to put this book together. Thanks are due to Michael Brim, Philip Carns, Anthony Chan, Andreas Dilger, Michele Evard, Tramm Hudson, Andrew Lusk, Richard Lusk, John Mugler, Thomas Naughton, John-Paul Navarro, Daniel Savarese, Rick Stevens, and Edward Thornton. 

Jan Lindheim of Caltech provided substantial information related to networking hardware. Narayan Desai of Argonne provided invaluable help with both the node and network hardware chapters. Special thanks go to Rob Ross and Dan Nurmi of Argonne for their advice and help with the cluster setup chapter. 

Paul Angelino of Caltech contributed the assembly instructions for the Beowulf nodes. Susan Powell of Caltech performed the initial editing of several chapters of the book. 

The authors would like to respectfully acknowledge the important initiative and support provided by George Spix, Svetlana Verthein, and Todd Needham of Microsoft that were critical to the development of this book. Dr. Sterling would like to thank Gordon Bell and Jim Gray for their advice and guidance in its formulation. 

Gail Pieper, technical writer in the Mathematics and Computer Science Division at Argonne, was an indispensable guide in matters of style and usage and vastly improved the readability of the prose. 

Introduction 


Thomas Sterling 


Clustering is a powerful concept and technique for deriving extended capabilities from existing classes of components. In nature, clustering is a fundamental mechanism for creating complexity and diversity through the aggregation and synthesis of simple basic elements. The result is no less than the evolution and structure of the universe, the compound molecules that dictate the shape and attributes of all materials, and the form and behavior of all multicellular life, including ourselves. To accomplish such synthesis, an intervening medium of combination and exchange is required that establishes the interrelationships among the constituent elements and facilitates their cooperative interactions, from which the emergent behavior of the compound entity is derived. For compound organizations in nature, the binding mechanisms may be gravity, coulombic forces, or synaptic junctions. In the field of computing systems, clustering is being applied to render new system structures from existing computing elements to deliver capabilities that through other approaches could easily cost ten times as much. In recent years clustering hardware and software have evolved so that today potential user institutions have a plethora of choices in terms of form, scale, environments, cost, and means of implementation to meet their scalable computing requirements. Some of the largest computers in the world are cluster systems. But clusters are also playing important roles in medium-scale technical and commercial computing, taking advantage of low-cost, mass-market PC-based computer technology. These Beowulf-class systems have become extremely popular, providing exceptional price/performance, flexibility of configuration and upgrade, and scalability to provide a powerful new tool, opening up entirely new opportunities for computing applications. 

1.1 Definitions and Taxonomy 


In the most general terms, a cluster is any ensemble of independently operational elements integrated by some medium for coordinated and cooperative behavior. This is true in biological systems, human organizations, and computer structures. Consistent with this broad interpretation, computer clusters are ensembles of independently operational computers integrated by means of an interconnection network and supporting user-accessible software for organizing and controlling concurrent computing tasks that may cooperate on a common application program or workload. There are many kinds of computer clusters, ranging from among the world's largest computers to collections of throwaway PCs. Clustering was among the first computer system architecture techniques for achieving significant improvements in overall performance, user access bandwidth, and reliability. Many research clusters have been implemented in industry and academia, often with proprietary networks and/or custom processing nodes. 

Commodity clusters are local ensembles of computing nodes that are commercially available systems employed for mainstream data-processing markets. The interconnection network used to integrate the compute nodes of a commodity cluster is dedicated to the cluster system and is also commercially available from its manufacturer. The network is dedicated in the sense that it is used internally within the cluster, supporting only those communications required between the compute nodes making up the cluster, its host or master nodes, which are themselves worldly, and possibly the satellite nodes responsible for managing mass storage resources that are part of the cluster. The network of a commodity cluster must not be proprietary to the cluster product of a single vendor but must be available for procurement, in general, for the assembly of any cluster. Thus, all components of a commodity cluster can be bought by third-party systems integrators or by the end-user installation site itself. Commodity clusters employ software that is also available to the general community. Software can be free, repackaged and distributed for modest cost, or developed by third-party independent software vendors (ISVs) and commercially marketed. Vendors may use and distribute as part of their commodity cluster products their own proprietary software, as long as alternate external software is available that could be employed in its place. The twin motivating factors that drive and restrict the class of commodity computers are (1) the use of nonspecialty parts, which exploits the marketplace for cost reduction and stable reliability, and (2) the avoidance of critical unique solutions restricted to a specific cluster product that, if unavailable in the future, would disrupt end-user productivity and jeopardize user investment in code base. 

Beowulf-class systems are commodity clusters that exploit the attributes derived from mass-market manufacturing and distribution of consumer-grade digital electronic components. Beowulfs are made of PCs, sometimes lots of them; cheap EIDE (enhanced integrated drive electronics) hard disks, usually; and low-cost DIMMs (dual inline memory modules) for main memory. A number of different microprocessor families have been used successfully in Beowulfs, including the long-lived Intel x86 family (80386 and above), their AMD binary-compatible counterparts, the Compaq Alpha 64-bit architecture, and the IBM PowerPC series. Beowulf systems deliver exceptional price/performance for many applications. They use low-cost or no-cost software to manage the individual nodes and the ensemble as a whole. A large part of the scientific and technical community using Beowulf has employed the Linux open source operating system, while many of the business and commercial users of Beowulf favor the widely distributed commercial Microsoft Windows operating system. Both types of Beowulf system use middleware that is a combination of free open software and commercial ISV products. Many of these tools have been ported to both environments, although some are still restricted to one or the other. The nodes of Beowulfs are either uniprocessors or symmetric multiprocessors (SMPs) of a few processors. The price/performance sweet spot appears to be dual-processor SMP nodes, although performance per microprocessor is usually less than for single-processor nodes. Beowulf-class systems are by far the most popular form of commodity cluster today. 

At the other end of the cluster spectrum are the constellations. A constellation is a cluster of large SMP nodes scaled such that the number of processors per node is greater than the number of such nodes making up the entire system. This is more than an arbitrary distinction. Performance of a cluster for many applications is derived through program and system parallelism. For most commodity clusters and Beowulf systems, the primary parallelism exploited is internode parallelism. But for constellations, the primary parallelism is intranode, meaning most of the parallelism used is within the node. Generally, processors within an SMP node are more tightly coupled through shared memory and can exploit finer-grained parallelism than can Beowulf clusters. But shared-memory systems require the use of a different programming model from that of distributed-memory systems, and therefore programming constellations may prove rather different from programming Beowulf clusters for optimal performance. Constellations are usually restricted to the largest systems. 







1.2 Opportunities and Advantages 


Commodity clusters and Beowulf-class systems bring many advantages to scalable parallel computing, opening new opportunities for users and application domains. Many of these advantages are a consequence of superior price/performance over many other types of system of comparable peak capabilities. But other important attributes exhibited by clusters are due to the nature of their structure and method of implementation. Here we highlight and expand on these, both to motivate the deployment and to guide the application of Beowulf-class systems for myriad purposes. 


Capability Scaling. 


More than even cost effectiveness, a Beowulf system's principal attribute is its scalability. Through the aggregation of commercial off-the-shelf components, ensembles of specific resources deemed critical to a particular mode of operation can be integrated to provide a degree of capability not easily acquired through other means. Perhaps best known in high-end computing circles is peak performance measured in flops (floating-point operations per second). Even modest Beowulf systems can attain a peak performance between 10 and 100 gigaflops. The largest commodity cluster under development will achieve 30 teraflops peak performance. But another important capability is mass storage, usually through collections of hard disk drives. Large commodity disks can contain more than 100 gigabytes, but commercial database and scientific data-intensive applications both can demand upwards of 100 terabytes of on-line storage. In addition, certain classes of memory-intensive applications, such as those manipulating enormous matrices of multivariate data, can be processed effectively only if sufficient hardware main memory is brought to bear on the problem. Commodity clusters provide one method of accumulating sufficient DRAM (dynamic random access memory) in a single composite system for these large datasets. We note that while clusters enable aggregation of resources, they do so with limited coupling, both logical and physical, among the constituent elements. This fragmentation within integrated systems can negatively impact performance and ease of use. 


Convergence Architecture.


Although not anticipated by their originators, commodity clusters and Beowulf-class systems have evolved into the de facto standard for parallel computer structure, having converged on a communitywide system architecture. Since the mid-1970s, the high-performance computing industry has dragged its small user and customer base through a series of often-disparate parallel architecture types, requiring major software rework across successive generations. These changes were often a consequence of individual vendor decisions and resulted in low customer confidence and a strong reticence to invest in porting codes to a system that could easily be obsolete before the task was complete and incompatible with any future generation of systems. Commodity clusters employing communitywide message-passing libraries offer a common structure that crosses vendor boundaries and system generations, ensuring software investment longevity and providing customer confidence. Through the evolution of clusters, we have witnessed a true convergence of parallel system architectures, providing a shared framework in which hardware and software suppliers can develop products with the assurance of customer acceptance and application developers can devise advanced user programs with the confidence of continued support from vendors. 


Price/Performance. No doubt the single most widely recognized attribute of Beowulf-class cluster systems is their exceptional cost advantage compared with other parallel computers. For many (but not all) user applications and workloads, Beowulf clusters exhibit a performance-to-cost advantage of as much as an order of magnitude or more compared with massively parallel processors (MPPs) and distributed shared-memory systems of equivalent scale. Today, the cost of Beowulf hardware is approaching one dollar per peak megaflops using consumer-grade computing nodes. The implication of this is far greater than merely a means of saving a little money. It has caused a revolution in the application of high-performance computing to a range of problems and users who would otherwise be unable to work within the regime of supercomputing. It means that for the first time, computing is playing a role in areas of industry, commerce, and research previously unaided by such technology. The low cost has made Beowulfs ideal educational platforms, enabling the training in parallel computing principles and practices of many more students than previously possible. More students are now learning parallel programming on Beowulf-class systems than on all other types of parallel computer combined. 


Flexibility of Configuration and Upgrade.


Depending on their intended user and application base, clusters can be assembled in a wide array of configurations, with very few constraints imposed by commercial vendors. For those systems configured at the final site by the intended administrators and users, a wide choice of components and structures is available, making possible a broad range of systems. Where clusters are to be dedicated to specific workloads or applications, the system structure can be optimized for the required capabilities and capacities that best suit the nature of the problem being computed. As new technologies emerge or additional financial resources become available, the flexibility of clusters makes it possible to upgrade existing systems with new component technologies, a midlife kicker that extends the life and utility of a system by keeping it current. 


Technology Tracking. 


New technologies most rapidly find their way into those products likely to provide the most rapid return: mainstream high-end personal computers and SMP servers. Only after substantial lag time might such components be incorporated into MPPs. Clustering, however, provides an immediate path to integration of the latest technologies, even those that may never be adopted by other forms of high-performance computer systems. 


High Availability. 


Clusters provide multiple redundant identical resources that, if managed correctly, can provide continued system operation through graceful degradation even as individual components fail. 


Personal Empowerment.


Because high-end cluster systems are derived from readily available hardware and software components, installation sites, their system administrators, and users have more control over the structure, elements, operation, and evolution of this system class than over any other. This sense of control and flexibility has been a strong attraction for many, especially those in the research community, and has been a significant motivation for many installations. 


Development Cost and Time. The emerging cluster industry is being fueled by the very low cost of development and the short time to product delivery. Based on existing computing and networking products, vendor-supplied commodity clusters can be developed through basic systems integration and engineering, with no component design required. Because the constituent components are manufactured for a much larger range of user purposes than is the cluster market itself, the cost to the supplier is far lower than custom elements would otherwise be. Thus commodity clusters provide vendors with the means to respond rapidly to diverse customer needs, with low cost to first delivery. 







1.3 A Short History 


Cluster computing originated within a few years of the inauguration of the modern electronic stored-program digital computer. SAGE was a cluster system built for NORAD under Air Force contract by IBM in the 1950s based on the MIT Whirlwind computer architecture. Using vacuum tube and core memory technologies, SAGE consisted of a number of separate standalone systems cooperating to manage early warning detection of hostile airborne intrusion of the North American continent. Early commercial applications of clusters employed paired loosely coupled computers, with one computer performing user jobs while the other managed various input/output devices. 

Breakthroughs in enabling technologies occurred in the late 1970s, both in hardware and software, which were to have significant long-term effects on future cluster computing. The first generations of microprocessors were designed with the initial development of VLSI (very large scale integration) technology, and by the end of the decade the first workstations and personal computers were being marketed. The advent of Ethernet provided the first widely used local area network technology, creating an industry standard for a modestly priced multidrop interconnection medium and data transport layer. Also at this time, the multitasking Unix operating system was created at AT&T Bell Labs and extended with virtual memory and network interfaces at the University of California, Berkeley. Unix was adopted in its various commercial and public domain forms by the scientific and technical computing community as the principal environment for a wide range of computing system classes from scientific workstations to supercomputers. 

During the decade of the 1980s, increased interest in the potential of cluster computing was marked by important experiments in research and industry. A collection of 160 interconnected Apollo workstations was employed as a cluster to perform certain computational tasks by the National Security Agency. Digital Equipment Corporation developed a system comprising interconnected VAX 11/750 computers, coining the term "cluster" in the process. In the area of software, task management tools for employing workstation farms were developed, most notably the Condor software package from the University of Wisconsin. Different strategies for parallel processing were explored during this period by the computer science research community. From this early work came the communicating sequential processes model, more commonly referred to as the message-passing model, which has come to dominate much of cluster computing today. 

An important milestone in the practical application of the message-passing model was the development of PVM (Parallel Virtual Machine), a library of linkable functions that could allow routines running on separate but networked computers to exchange data and coordinate their operation. PVM (developed by Oak Ridge National Laboratory, Emory University, and the University of Tennessee) was the first widely deployed distributed software system available across different platforms. By the beginning of the 1990s, a number of sites were experimenting with clusters of workstations. At the NASA Lewis Research Center, a small cluster of IBM workstations was used to simulate the steady-state behavior of jet aircraft engines in 1992. The NOW (network of workstations) project at UC Berkeley began operating the first of several clusters there in 1993, which led to the first cluster to be entered on the Top500 list of the world's most powerful computers. Also in 1993, Myrinet, one of the first commercial system area networks, was introduced for commodity clusters, delivering improvements in bandwidth and latency an order of magnitude better than the Fast Ethernet local area network (LAN) most widely used for the purpose at that time. 

The first Beowulf-class PC cluster was developed at the NASA Goddard Space Flight Center in 1994, using early releases of the Linux operating system and PVM running on 16 Intel 100 MHz 80486-based personal computers connected by dual 10 Mbps Ethernet LANs. The Beowulf project developed the necessary Ethernet driver software for Linux and additional low-level cluster management tools and demonstrated the performance and cost effectiveness of Beowulf systems for real-world scientific applications. That year, based on experience with many other message-passing software systems, the first Message-Passing Interface (MPI) standard was adopted by the parallel computing community to provide a uniform set of message-passing semantics and syntax. MPI has become the dominant parallel computing programming standard and is supported by virtually all MPP and cluster system vendors. Workstation clusters running Sun Microsystems' Solaris operating system and NCSA's PC cluster running the Microsoft NT operating system were being used for real-world applications. 

In 1996, the DOE Los Alamos National Laboratory and the California Institute of Technology with the NASA Jet Propulsion Laboratory independently demonstrated sustained performance of over 1 Gflops for Beowulf systems costing under $50,000 and were awarded the Gordon Bell Prize for price/performance for this accomplishment. By 1997, Beowulf-class systems of over a hundred nodes had demonstrated sustained performance of greater than 10 Gflops, with a Los Alamos system making the Top500 list. By the end of the decade, 28 clusters were on the Top500 list, with a best performance of over 200 Gflops. In 2000, both DOE and NSF announced awards to Compaq to implement their largest computing facilities, clusters of 30 Tflops and 6 Tflops, respectively. 







1.4 Elements of a Cluster 


A Beowulf cluster comprises numerous components of both hardware and software. Unlike pure closed-box turnkey mainframes, servers, and workstations, the user or hosting organization has considerable choice in the system architecture of a cluster, whether it is to be assembled on site from parts or provided by a systems integrator or vendor. A Beowulf cluster system can be viewed as being made up of four major components, two hardware and two software. The two hardware components are the compute nodes that perform the work and the network that interconnects the nodes to form a single system. The two software components are the collection of tools used to develop user parallel application programs and the software environment for managing the parallel resources of the Beowulf cluster. The specification of a Beowulf cluster reflects user choices in each of these domains and determines the balance of cost, capacity, performance, and usability of the system. 

The hardware node is the principal building block of the physical cluster system. After all, it is the hardware node that is being clustered. The node incorporates the resources that provide both the capability and capacity of the system. Each node has one or more microprocessors that provide the computing power of the node, combined on the node's motherboard with the DRAM main memory and the I/O interfaces. In addition, the node will usually include one or more hard disk drives for persistent storage and local data buffering, although some clusters employ diskless nodes to reduce both cost and power consumption, as well as to increase reliability. 

The network provides the means for exchanging data among the cluster nodes and coordinating their operation through global synchronization mechanisms. The subcomponents of the network are the network interface controllers (NICs), the network channels or links, and the network switches. Each node contains at least one NIC that performs a series of complex operations to move data between the external network links and the user memory, conducting one or more transformations on the data in the process. The channel links are usually passive, consisting of a single wire, multiple parallel cables, or optical fibers. The switches interconnect a number of channels and route messages between them. Networks may be characterized by their topology, their bisection and per-channel bandwidth, and the latency for message transfer. 
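The latency and bandwidth figures that characterize a network combine in a common first-order model of message transfer time: t(n) = latency + n/bandwidth for an n-byte message. The short sketch below applies this model with illustrative numbers (a 50-microsecond latency and 12.5 MB/s bandwidth, roughly Fast Ethernet class; these are assumptions for the example, not measurements of any particular network) to show how small messages are dominated by startup latency while large ones approach the bandwidth limit.

```python
# First-order model of message transfer time over a cluster network:
#   t(n) = latency + n / bandwidth
# The default figures are illustrative assumptions, not measured values.

def transfer_time(n_bytes, latency_s=50e-6, bandwidth_Bps=12.5e6):
    """Time to move an n-byte message: fixed startup cost plus serialization."""
    return latency_s + n_bytes / bandwidth_Bps

def half_performance_size(latency_s=50e-6, bandwidth_Bps=12.5e6):
    """Message size at which startup overhead and data movement take equal time."""
    return latency_s * bandwidth_Bps

for n in (1, 1024, 1024 * 1024):
    print(f"{n:>8} bytes -> {transfer_time(n) * 1e6:10.1f} us")
print(f"crossover size = {half_performance_size():.0f} bytes")
```

The crossover size at which startup and data movement contribute equally (625 bytes under these assumed figures) is one reason fine-grained communication is expensive on clusters relative to shared-memory machines.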

The software tools for developing applications depend on the underlying programming model to be used. Fortunately, within the Beowulf cluster community, there has been convergence on a single dominant model: communicating sequential processes, more commonly referred to as message passing. The message-passing model implements concurrent tasks or processes on each node to do the work of the application. Messages are passed between these logical tasks to share data and to synchronize their operations. The tasks themselves are written in a common language such as Fortran or C++. A library of communication services is called by these tasks to accomplish data transfers with tasks being performed on other nodes. While many different message-passing languages and implementation libraries have been developed over the past two decades, two have emerged as dominant: PVM and MPI (with multiple library implementations available for MPI). 
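In a production Beowulf code these tasks would be MPI or PVM processes spread across nodes; as a self-contained illustration of the model itself, the following sketch uses Python's standard multiprocessing module (a convenience for this example, not a cluster tool) to run a master task and a worker task that cooperate purely by sending and receiving messages, the same pattern MPI expresses with its send and receive calls.

```python
# Hedged illustration of the message-passing model using only the Python
# standard library; a real Beowulf application would use MPI or PVM to
# exchange messages between tasks on physically separate nodes.
from multiprocessing import Process, Pipe

def worker(conn):
    # Each task computes on its own private data and communicates
    # explicitly: no shared memory, only messages.
    data = conn.recv()                    # receive work from the master task
    conn.send(sum(x * x for x in data))   # send back a partial result
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send([1, 2, 3, 4])             # message out: this task's share of work
    print(parent.recv())                  # message in: prints 30
    p.join()
```

Note that the worker never touches the master's memory; all sharing happens through explicit messages, which is precisely the property that lets the same model span physically separate nodes.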

The software environment for the management of resources gives system administrators the necessary tools for supervising the overall use of the machine and gives users the capability to schedule and share the resources to get their work done. Several schedulers are available and discussed in this book. For coarse-grained job stream scheduling, the popular Condor scheduler is available. PBS and the Maui scheduler handle task scheduling for interactive concurrent elements. For lightweight process management, the new Scyld Bproc scheduler will provide efficient operation. PBS also provides many of the mechanisms needed to handle user accounts. For managing parallel files, there is PVFS, the Parallel Virtual File System. 





1.5 Description of the Book 


Beowulf Cluster Computing is offered as a fully comprehensive discussion of the foundations and practices for the operation and application of commodity clusters with an emphasis on those derived from mass-market hardware components and readily available software. The book is divided into three broad topic areas. Part I describes the hardware components that make up a Beowulf system and shows how to assemble such a system as well as take it out for an initial spin using some readily available parallel benchmarks. Part II discusses the concepts and techniques for writing parallel application programs to run on a Beowulf using the two dominant communitywide standards, PVM and MPI. Part III explains how to manage the resources of Beowulf systems, including system administration and task scheduling. Each part is standalone; any one or pair of parts can be used without the need of the others. In this way, you can just jump into the middle to get to the necessary information fast. To help in this, Chapter 2 (the next chapter) provides an overview and summary of all of the material in the book. A quick perusal of that chapter should give enough context for any single chapter to make sense without your having to have read the rest of the book. 

The Beowulf book presents three kinds of information to best meet the requirements of the broad and varied cluster computing community. It includes foundation material for students and people new to the field. It also includes reference material in each topic area, such as the major library calls to MPI and PVM or the basic controls for PBS. And, it gives explicit step-by-step guidance on how to accomplish specific tasks such as assembling a processor node from basic components or installing the Maui scheduler. 

This book can be used in many different ways. We recommend just sitting down and perusing it for an hour or so to get a good feel for where the information is that you would find most useful. Take a walk through Chapter 2 to get a solid overview. Then, if you're trying to get a job done, go after that material germane to your immediate needs. Or if you are a first-time Beowulf user and just learning about cluster computing, use this as your guide through the field. Every section is designed both to be interesting and to teach you how to do something new and useful. 

One major challenge was how to satisfy the needs of the majority of the commodity cluster community when a major division exists across the lines of the operating system used. In fact, at least a dozen different operating systems have been used for cluster systems. But the majority of the community use either Linux or Windows. The choice of which of the two to use depends on many factors, some of them purely subjective. We therefore have taken the unprecedented action of offering a choice: we've crafted two books, mostly the same, but differing between the two operating systems. So, you are holding either Beowulf Cluster Computing with Windows or Beowulf Cluster Computing with Linux. Whichever works best for you, we hope you find it the single most valuable book on your shelf for making clusters and for making clusters work for you. 

Thomas Sterling 


Commodity cluster systems offer an alternative to the technical and commercial computing market for scalable computing systems for medium- and high-end computing capability. For many applications they replace previous-generation monolithic vector supercomputers and MPPs. By incorporating only components already developed for wider markets, they exploit the economy of scale not possible in the high-end computing market alone and circumvent the significant development costs and lead times typical of earlier classes of high-end systems, resulting in a price/performance advantage that may exceed an order of magnitude for many user workloads. In addition, users have greater flexibility of configuration, upgrade, and supplier, ensuring longevity of this class of distributed system and user confidence in their software investment. Beowulf-class systems exploit mass-market components such as PCs to deliver exceptional cost advantage with the widest space of choice for building systems. Beowulfs integrate widely available and easily accessible low-cost or no-cost system software to provide many of the capabilities required by a system environment. As a result of these attributes and the opportunities they imply, Beowulf-class clusters have penetrated almost every aspect of computing and are rapidly coming to dominate the medium to high end. 

Computing with a Beowulf cluster engages four distinct but interrelated areas of consideration: 


	hardware system structure, 

	resource administration and management environment, 

	distributed programming libraries and tools, and 

	parallel algorithms. 


Hardware system structure encompasses all aspects of the hardware node components and their capabilities, the dedicated network controllers and switches, and the interconnection topology that determines the system's global organization. The resource management environment is the battery of system software and tools that govern all phases of system operation from installation, configuration, and initialization, through administration and task management, to system status monitoring, fault diagnosis, and maintenance. The distributed programming libraries and tools determine the paradigm by which the end user coordinates the distributed computing resources to execute simultaneously and cooperatively the many concurrent logical components constituting the parallel application program. Finally, the domain of parallel algorithms provides the models and approaches for organizing a user's application to exploit the intrinsic parallelism of the problem while operating within the practical constraints of effective performance. 

This chapter provides a brief and top-level overview of these four main domains that constitute Beowulf cluster computing. The objective is to provide sufficient context for you to understand any single part of the remaining book and how its contribution fits into the broader form and function of commodity clusters. 


2.1 A Taxonomy of Parallel Computing 


The goal of achieving performance through the exploitation of parallelism is as old as electronic digital computing itself, which emerged from the World War II era. Many different approaches and consequent paradigms and structures have been devised, with many commercial or experimental versions being implemented over the years. Few, however, have survived the harsh rigors of the data processing marketplace. Here we look briefly at many of these strategies, to better appreciate where commodity cluster computers and Beowulf systems fit and the tradeoffs and compromises they represent. 

A first-tier decomposition of the space of parallel computing architectures may be codified in terms of coupling: the typical latencies involved in performing and exploiting parallel operations. This may range from the most tightly coupled fine-grained systems of the systolic class, where the parallel algorithm is actually hardwired into a special-purpose ultra-fine-grained hardware computer logic structure with latencies measured in the nanosecond range, to the other extreme, often referred to as distributed computing, which engages widely separated computing resources potentially across a continent or around the world and has latencies on the order of a hundred milliseconds. Thus the realm of parallel computing structures encompasses a range of 10^8 when measured by degree of coupling and, by implication, granularity of parallelism. In the following list, the set of major classes in order of tightness of coupling is briefly described. We note that any such taxonomy is subjective, rarely orthogonal, and subject to debate. It is offered only as an illustration of the richness of choices and the general space into which cluster computing fits. 


Systolic computers are usually special-purpose hardwired implementations of fine-grained parallel algorithms exploiting one-, two-, or three-dimensional pipelining. Often used for real-time post-sensor processing, digital signal processing, image processing, and graphics generation, systolic computing is experiencing a revival through adaptive computing, exploiting the versatile FPGA (field programmable gate array) technology that allows different systolic algorithms to be programmed into the same FPGA medium at different times. 


Vector computers exploit fine-grained vector operations through heavy pipelining of memory bank accesses and arithmetic logic unit (ALU) structure, hardware support for gather-scatter operations, and amortizing instruction fetch/execute cycle overhead over many basic operations within the vector operation. The basis for the original supercomputers (e.g., Cray), vector processing is still a formidable strategy in certain Japanese high-end systems. 


SIMD (single instruction, multiple data) architecture exploits fine-grained data parallelism by having many (potentially thousands) of simple processors performing the same operation in lock step but on different data. A single control processor issues the global commands to all slaved compute processors simultaneously through a broadcast mechanism. Such systems (e.g., MasPar-2, CM-2) incorporated large communications networks to facilitate massive data movement across the system in a few cycles. No longer an active commercial area, SIMD structures continue to find special-purpose application for post-sensor processing. 


Dataflow models employed fine-grained asynchronous flow control that depended only on data precedence constraints, thus exploiting a greater degree of parallelism and providing a dynamic adaptive scheduling mechanism in response to resource loading. Because they suffered from severe overhead degradation, however, dataflow computers were never competitive and failed to find market presence. Nonetheless, many of the concepts reflected by the dataflow paradigm have had a strong influence on modern compiler analysis and optimization, reservation stations in out-of-order instruction completion ALU designs, and multithreaded architectures. 


PIM (processor-in-memory) architectures are only just emerging as a possible force in high-end system structures, merging memory (DRAM or SRAM) with processing logic on the same integrated circuit die to expose high on-chip memory bandwidth and low latency to memory for many data-oriented operations. Diverse structures are being pursued, including system on a chip, which places DRAM banks and a conventional processor core on the same chip; SMP on a chip, which places multiple conventional processor cores and a three-level coherent cache hierarchical structure on a single chip; and Smart Memory, which puts logic at the sense amps of the DRAM memory for in-place data manipulation. PIMs can be used as standalone systems, in arrays of like devices, or as a smart layer of a larger conventional multiprocessor. 

MPPs (massively parallel processors) constitute a broad class of multiprocessor architectures that exploit off-the-shelf microprocessors and memory chips in custom designs of node boards, memory hierarchies, and global system area networks. Ironically, MPP was first used in the context of SIMD rather than MIMD (multiple instruction, multiple data) machines. MPPs range from distributed-memory machines such as the Intel Paragon, through shared memory without coherent caches such as the BBN Butterfly and CRI T3E, to truly CC-NUMA (non-uniform memory access) such as the HP Exemplar and the SGI Origin2000. 


Clusters are ensembles of off-the-shelf computers integrated by an interconnection network and operating within a single administrative domain and usually within a single machine room. Commodity clusters employ commercially available networks (e.g., Ethernet, Myrinet) as opposed to custom networks (e.g., IBM SP-2). Beowulf-class clusters incorporate mass-market PC technology for their compute nodes to achieve the best price/performance. 


Distributed computing, once referred to as metacomputing, combines the processing capabilities of numerous, widely separated computer systems via the Internet. Whether accomplished by special arrangement among the participants, by means of disciplines referred to as Grid computing, or by agreements of myriad workstation and PC owners with some commercial (e.g., DSI, Entropia) or philanthropic (e.g., SETI@home) coordinating host organization, this class of parallel computing exploits available cycles on existing computers and PCs, thereby getting something for almost nothing. 


In this book, we are interested in commodity clusters and, in particular, those employing PCs for best price/performance, specifically, Beowulf-class cluster systems. Commodity clusters may be subdivided into four classes, which are briefly discussed here. 


Workstation clusters: ensembles of workstations (e.g., Sun, SGI) integrated by a system area network. They tend to be vendor specific in hardware and software. While they exhibit superior price/performance over MPPs for many problems, they can cost a factor of 2.5 to 4 more than comparable PC-based clusters.


Beowulf-class systems: ensembles of PCs (e.g., Intel Pentium 4) integrated with commercial COTS local area networks (e.g., Fast Ethernet) or system area networks (e.g., Myrinet), running widely available low-cost or no-cost software for managing system resources and coordinating parallel execution. Such systems exhibit exceptional price/performance for many applications.


Cluster farms: existing local area networks of PCs and workstations serving either as dedicated user stations or as servers that, when idle, can be employed to perform pending work from outside users. Exploiting job stream parallelism, software systems (e.g., Condor) have been devised to distribute queued work while precluding intrusion on user resources when required. These systems deliver lower performance and effectiveness because the resources are integrated by a shared network, as opposed to the dedicated networks incorporated by workstation clusters and Beowulfs.


Superclusters: clusters of clusters, still within a local area such as a shared machine room or separate buildings on the same industrial or academic campus, usually integrated by the institution's backbone network infrastructure. Although usually within the same Internet domain, the clusters may be under separate ownership and administrative responsibility. Nonetheless, organizations are striving to find ways to exploit the opportunity of partnering multiple local clusters to realize very large scale computing, at least part of the time.


2.2 Hardware System Structure 


The most visible and discussed aspects of cluster computing systems are their physical components and organization. These deliver the raw capabilities of the system, take up considerable room on the machine room floor, and yield their excellent price/performance. The two principal subsystems of a Beowulf cluster are its constituent compute nodes and its interconnection network that integrates the nodes into a single system. These are discussed briefly below. 


2.2.1 Beowulf Compute Nodes 


The compute or processing nodes incorporate all hardware devices and mechanisms responsible for program execution, including performing the basic operations, holding the working data, providing persistent storage, and enabling external communications of intermediate results and user command interface. Five key components make up the compute node of a Beowulf cluster: the microprocessor, main memory, the motherboard, secondary storage, and packaging. 

The microprocessor provides the computing power of the node, with its peak performance measured in Mips (millions of instructions per second) and Mflops (millions of floating-point operations per second). Although Beowulfs have been implemented with almost every conceivable microprocessor family, the two most prevalent today are the 32-bit Intel Pentium 3 and Pentium 4 microprocessors and the 64-bit Compaq Alpha 21264 family. We note that the AMD devices (including the Athlon), which are binary compatible with the Intel Pentium instruction set, have also found significant application in clusters. In addition to the basic floating-point and integer arithmetic logic units, the register banks, and the execution pipeline and control logic, the modern microprocessor, comprising on the order of 20 to 50 million transistors, includes a substantial amount of on-chip high-speed memory, called cache, for rapid access to data. Cache is organized in a hierarchy, usually with two or three layers: the closest to the processor is the fastest but smallest, while the most distant is relatively slower but has much more capacity. These caches buffer data and instructions from main memory and, where data reuse or spatial locality of access is high, can deliver a substantial percentage of peak performance. The microprocessor usually interfaces with the remainder of the node by two external buses: one specifically optimized as a high-bandwidth interface to main memory, and the other in support of data I/O.

Main memory stores the working dataset and programs used by the microprocessor during job execution. Because main memory is based on DRAM technology, in which a single bit is stored as a charge on a small capacitor accessed through a dedicated switching transistor, data read and write operations can be significantly slower to main memory than to cache. However, recent advances in main memory design have improved memory access speed and have substantially increased memory bandwidth. These improvements have been facilitated by advances in memory bus design such as RAMbus.

The motherboard is the medium of integration that combines all the components of a node into a single operational system. Far more than just a large printed circuit board, the motherboard incorporates a sophisticated chip set almost as complicated as the microprocessor itself. This chip set manages all the interfaces between components and controls the bus protocols. One important bus is PCI, the primary interface between the microprocessor and most high-speed external devices. Initially a 32-bit bus operating at 33 MHz, the most recent variation operates at 66 MHz on 64-bit data, thus quadrupling its potential throughput. Most system area network interface controllers are connected to the node by means of the PCI bus. The motherboard also includes a substantial read-only memory (which can be updated) containing the system's BIOS (basic input/output system), a set of low-level services, primarily related to the function of the I/O and basic bootstrap tasks, that defines the logical interface between the higher-level operating system software and the node hardware. Motherboards also support several other input/output ports, such as those for the user's keyboard/mouse/video monitor and the now-ubiquitous universal serial bus (USB) port that is replacing several earlier distinct interface types. Nonetheless, the vestigial parallel printer port can still be found, whose specification goes back to the days of the earliest PCs more than twenty years ago.
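The claim that the newer PCI variant quadruples throughput follows from simple arithmetic: peak bus bandwidth is the clock rate times the data-path width. The short sketch below computes the theoretical peaks; delivered rates are lower in practice because of protocol overhead.

```python
# Peak PCI throughput = bus clock x data-path width.
# These are theoretical peaks; arbitration and protocol
# overhead reduce the rates actually delivered.

def pci_peak_mbytes_per_s(clock_mhz, width_bits):
    """Peak bus throughput in MBytes/s for a given clock and width."""
    return clock_mhz * (width_bits / 8)

original = pci_peak_mbytes_per_s(33, 32)   # classic PCI: 132 MBytes/s
extended = pci_peak_mbytes_per_s(66, 64)   # 66 MHz, 64-bit: 528 MBytes/s
print(original, extended, extended / original)   # ratio is 4.0
```

Doubling both the clock and the width multiplies the peak by four, which is exactly the "quadrupling" cited above.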

Secondary storage provides high-capacity persistent storage. While main memory loses all its contents when the system is powered off, secondary storage fully retains its data in the powered-down state. While many standalone PCs include several classes of secondary storage, some Beowulf systems may have nodes that keep only what is necessary for holding a boot image for initial startup, all other data being downloaded from an external host or master node. Omitting local disks can go a long way toward improving reliability and reducing per-node cost; however, it forgoes the opportunity for low-cost, high-bandwidth mass storage. Depending on how the system ultimately is used, either choice may be optimal. The primary medium for secondary storage is the hard disk, based on a magnetic medium little different from an audio cassette tape. This technology, almost as old as digital computing itself, continues to expand in capacity at an exponential rate, although access speed and bandwidth have improved only gradually. The two primary contenders, SCSI (small computer system interface) and EIDE (enhanced integrated drive electronics), are differentiated by somewhat higher speed and capacity in the first case and lower cost in the second. Today, a gigabyte of EIDE disk storage costs the user a few dollars, while the list price for SCSI in a RAID (redundant array of independent disks) configuration can be as high as $100 per gigabyte (the extra cost does buy more speed, density, and reliability). Most workstations use SCSI, and most PCs employ EIDE drives, which can be as large as 100 GBytes per drive. Two other forms of secondary storage are the venerable floppy disk and the optical disk. The modern 3.5-inch floppy (they don't actually flop anymore, since they now come in a hard rather than a soft case), also more than twenty years old, holds only 1.4 MBytes of data and should have been retired long ago.
Because of its ubiquity, however, it continues to hang on and is ideal as a boot medium for Beowulf nodes. Largely replacing floppies are the optical CD (compact disk), CD-RW (compact disk read/write), and DVD (digital versatile disk). The first two hold approximately 600 MBytes of data, with access times of a few milliseconds. (The basic CD is read only, but CD-RW disks are writable, although at a far slower rate.) Most commercial software and data are now distributed on CDs because they are very cheap to create (actually cheaper than a glossy one-page double-sided commercial flyer). DVD technology also runs on current-generation PCs, providing direct access to movies.

Packaging for PCs originally was in the form of the pizza box: a low, flat unit, usually placed on the desk with a fat monitor sitting on top. Some small early Beowulfs were configured with such packages, usually with as many as eight of these boxes stacked one on top of another. But by the time the first Beowulfs were implemented in 1994, tower cases, vertical floor-standing units (sometimes placed on the desk next to the video monitor), were replacing pizza boxes because of their greater flexibility in configuration and their extensibility (with several heights available). Several generations of Beowulf clusters still are implemented using this low-cost, robust packaging scheme, leading to such expressions as "pile of PCs" and "lots of boxes on shelves" (LOBOS). But the single limitation of this strategy was its low density (only about two dozen boxes could be stored on a floor-to-ceiling set of shelves) and the resulting large footprint of medium- to large-scale Beowulfs. Once the industry recognized the market potential of Beowulf clusters, a new generation of rack-mounted packages was devised and standardized (e.g., 1U, 2U, 3U, and 4U, with 1U boxes having a height of 1.75 inches) so that it is possible to install a single floor-standing rack with as many as 42 processors, coming close to doubling the processing density of such systems. Vendors providing complete turnkey systems, as well as hardware system integrators (bring-your-own software), are almost universally taking this approach. Yet for small systems where cost is critical and simplicity a feature, towers will pervade small labs, offices, and even homes for a long time. (And why not? On those cold winter days, they make great space heaters.)

Beowulf cluster nodes (i.e., PCs) have seen enormous, even explosive, growth over the past seven years since Beowulfs were first introduced in 1994. We note that the entry date for Beowulf was not arbitrary: the level of hardware and software technologies based on the mass market had just (within the previous six months) reached the point that ensembles of them could compete for certain niche applications with the then-well-entrenched MPPs and provide price/performance benefits (in the very best cases) of almost 50 to 1. The new Intel 100 MHz 80486 made it possible to achieve as much as 5 Mflops per node for select computationally intense problems, and 10 Mbps Ethernet network controllers and network hubs had become sufficiently inexpensive that they could be employed as dedicated system area networks. Equally important were the availability of the inchoate Linux operating system, with the all-important attribute of being free and open source, and the availability of a good implementation of the PVM message-passing library. Of course, the Beowulf project had to fill in a lot of the gaps, including writing most of the Ethernet drivers distributed with Linux and other simple tools, such as channel bonding, that facilitated the management of these early modest systems. Since then, the delivered floating-point performance per processor has grown by more than two orders of magnitude, while memory capacity has grown by more than a factor of ten. Disk capacities have expanded by as much as 1000X. Thus, Beowulf compute nodes have witnessed an extraordinary evolution in capability. By the end of this decade, node floating-point performance, main memory size, and disk capacity are all expected to grow by another two orders of magnitude.

One aspect of node structure not yet discussed is symmetric multiprocessing (SMP). Modern microprocessor design includes mechanisms that permit more than one processor to be combined, sharing the same main memory while retaining full coherence across separate processor caches, thus giving all processors a consistent view of shared data in spite of their local copies in dedicated caches. While large industrial-grade servers may incorporate as many as 512 processors in a single SMP unit, a typical configuration for PC-based SMPs is two or four processors per unit. The ability to share memory with uniform access times should be a source of improved performance at lower cost. But both design and pricing are highly complicated, and the choice is not always obvious. Sometimes the added complexity of SMP design offsets the apparent advantage of sharing many of the node's resources. Also, performance benefits from tight coupling of the processors may be outweighed by contention for main memory and possible cache thrashing. An added difficulty is attempting to program at two levels: message passing between nodes and shared memory between processors of the same node. Most users don't bother, choosing to remain with a uniform message-passing model even between processors within the same SMP node.


2.2.2 Interconnection Networks 


Without the availability of moderate-cost short-haul network technology, Beowulf cluster computing would never have happened. Interestingly, the two leaders in cluster dedicated networks were derived from very different precedent technologies. Ethernet was developed as a local area network for interconnecting distributed single user and community computing resources with shared peripherals and file servers. Myrinet was developed from a base of experience with very tightly coupled processors in MPPs such as the Intel Paragon. Together, Fast and Gigabit Ethernet and Myrinet provide the basis for the majority of Beowulf-class clusters. 

A network is a combination of physical transport and control mechanisms associated with a layered hierarchy of message encapsulation. The core concept is the message. A message is a collection of information organized in a format (order and type) that both the sending and the receiving processes understand and can correctly interpret. One can think of a message as a movable record. It can be as short as a few bytes (not including the header information) or as long as many thousands of bytes. Ordinarily, the sending user application process calls a library routine that manages the interface between the application and the network. Performing a high-level send operation causes the user message to be packaged with additional header information and presented to the network kernel driver software. Additional routing information is added and format conversions are performed prior to actually sending the message. The lowest-level hardware then drives the communication channel's lines with the signal, and the network switches route the message appropriately in accordance with the routing information encoded in the bits at the head of the message packet. Upon receipt at the receiving node, the process is reversed, and the message is eventually loaded into the user application name space to be interpreted by the application code.
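The encapsulation step described above can be sketched in miniature: a user payload is wrapped with a small fixed-format header before transmission, and the receiver strips the header to recover the payload. The header fields here (a destination node id and a payload length) are hypothetical choices for illustration; a real network stack applies several such layers, one per protocol level.

```python
import struct

# Hypothetical 8-byte header: destination node id and payload length,
# packed in network (big-endian) byte order.
HEADER = struct.Struct("!II")

def encapsulate(dest, payload):
    """Prefix the payload with a routing header, as each layer does."""
    return HEADER.pack(dest, len(payload)) + payload

def decapsulate(packet):
    """Strip the header on receipt, recovering routing info and payload."""
    dest, length = HEADER.unpack_from(packet)
    return dest, packet[HEADER.size:HEADER.size + length]

pkt = encapsulate(7, b"working data")
dest, msg = decapsulate(pkt)
print(dest, msg)   # 7 b'working data'
```

The sender and receiver agree on the header format in advance, which is precisely the "format that both processes understand" requirement stated above.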

The network is characterized primarily in terms of its bandwidth and its latency. Bandwidth is the rate at which the message bits are transferred, usually cited in terms of peak throughput as bits per second. Latency is the length of time required to send the message. Perhaps a fairer measure is the time from the sending application process to the receiving application process, taking into consideration all of the layers of translation, conversion, and copying involved; but vendors often quote the shorter time between their network interface controllers. To complicate matters, both bandwidth and latency are sensitive to message length and message traffic. Longer messages make better use of network resources and deliver improved network throughput. Shorter messages reduce transmit, receive, and copy times to provide an overall lower transfer latency but yield lower effective bandwidth. Higher total network traffic (i.e., number of messages per unit time) increases overall network throughput, but the resulting contention and the delays it incurs result in longer effective message transfer latency.
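A simple first-order model captures this sensitivity to message length: transfer time is a fixed startup latency plus the message length divided by the peak bandwidth, so effective bandwidth approaches the peak only for long messages. The numbers below (50 microseconds startup, 100 Mbps peak, roughly Fast Ethernet class) are illustrative assumptions, not measurements.

```python
# First-order transfer model: time = latency + bits / peak_bandwidth.
LATENCY_S = 50e-6     # assumed startup latency: 50 microseconds
PEAK_BPS = 100e6      # assumed peak bandwidth: 100 Mbps

def transfer_time(nbytes):
    """Seconds to move one message of nbytes under the model."""
    return LATENCY_S + (8 * nbytes) / PEAK_BPS

def effective_bandwidth(nbytes):
    """Delivered bits per second, including the startup cost."""
    return 8 * nbytes / transfer_time(nbytes)

for size in (64, 1024, 1_000_000):    # short, medium, long messages
    print(size, round(effective_bandwidth(size) / 1e6, 1), "Mbps")
```

Under these assumptions a 64-byte message achieves well under a tenth of the peak, while a 1 MByte message delivers nearly all of it, which is why long messages "make better use of network resources."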

More recently, an industrial consortium has developed a new networking model known as VIA (Virtual Interface Architecture). The goal of this network class is to support a zero-copy protocol, avoiding the intermediate copying of the message in operating system space and permitting direct application-to-application message transfers. The result is significantly reduced message transfer latency. Emulex has developed the cLAN network product, which provides a peak bandwidth in excess of 1 Gbps and, for short messages, exhibits a transfer latency on the order of 7 microseconds.

2.3 Node Software 


A node in a cluster is often (but not always) an autonomous computing entity, complete with its own operating system. Beowulf clusters exploit the sophistication of modern operating systems both for managing the node resources and for communicating with other nodes by means of their interconnection network. 

Linux has emerged as the dominant Unix-like operating system. Its development was anything but traditional; it was started by a graduate student (Linus Torvalds) in Finland and contributed to by a volunteer force of hundreds of developers around the world via the Internet. Recently Linux has received major backing from large computer vendors including IBM, Compaq, SGI, and HP. Linux is a full-featured multiuser, multitasking, demand-paged virtual memory operating system with advanced kernel software support for high-performance network operation.


2.4 Resource Management 


Except in the most restrictive of cases, matching the requirements of a varied workload to the capabilities of the distributed resources of a Beowulf cluster system demands the support and services of a potentially sophisticated software system for resource management. The earliest Beowulfs were dedicated systems used by (at most) a few people and controlled explicitly, one application at a time. But today's more elaborate Beowulf clusters, possibly comprising hundreds or even thousands of processors and shared by a large community of users, both local and at remote sites, need to balance contending demands and available processing capacity to achieve rapid response for user programs and high throughput of cluster resources. Fortunately, several such software systems are available to provide systems administrators and users alike with a wide choice of policies and mechanisms by which to govern the operation of the system and its allocation to user tasks.

The challenge of managing the large set of compute nodes that constitute a Beowulf cluster involves several tasks to match user-specified workload to existing resources. 


Queuing. 


User jobs are submitted to a Beowulf cluster by different people, potentially from separate locations, who are possibly unaware of requirements being imposed on the same system by other users. A queuing system buffers the randomly submitted jobs, entered at different places and times and with varying requirements, until system resources are available to process each of them. Depending on priorities and specific requirements, different distributed queues may be maintained to facilitate optimal scheduling. 


Scheduling. 


Perhaps the most complex component of the resource manager, the scheduler has to balance the priorities of each job with the demands of other jobs, the existing system compute and storage resources, and the governing policies dictated for their use by system administrators. Schedulers need to contend with such varied requirements as large jobs needing all the nodes; small jobs needing only one or at most a few nodes; interactive jobs during which the user must be available and in the loop, for such things as real-time visualization of results or performance debugging during program development; and high-priority jobs that must be completed quickly (such as medical imaging). The scheduler determines the order of execution based on these independent priority assessments and the solution to the classic bin-packing problem: What jobs can fit on the machine at the same time?
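The bin-packing flavor of the problem can be sketched with a deliberately simplified first-fit scheduler: jobs are considered in priority order, and a job is started whenever enough free nodes remain for it. The job list and node count below are invented for illustration; production schedulers such as PBS and Maui add backfill, time limits, and fairness policies far beyond this.

```python
# Toy first-fit scheduler: (priority, name, nodes_needed) jobs are
# packed into a fixed pool of nodes, highest priority first.

def schedule(jobs, total_nodes):
    """Return (names of jobs started this cycle, nodes left free)."""
    free = total_nodes
    started = []
    for priority, name, need in sorted(jobs, reverse=True):
        if need <= free:          # first fit: start the job if it fits
            started.append(name)
            free -= need
    return started, free

jobs = [(10, "imaging", 4), (5, "big-sim", 14), (7, "viz", 2), (1, "batch", 8)]
started, free = schedule(jobs, 16)
print(started, free)   # ['imaging', 'viz', 'batch'] 2
```

Note how the highest-priority jobs are placed first, yet the large "big-sim" job is skipped because it no longer fits, exactly the packing tension the text describes; real schedulers must also decide when to hold nodes idle so that such large jobs are not starved.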


Resource Control. 


A middleware component, resource control puts the programs on the designated nodes, moves the necessary files to the respective nodes, starts jobs, suspends jobs, terminates jobs, and offloads result files. It notifies the scheduler when resources are available and handles any exception conditions across the set of nodes committed to a given user job. 


Monitoring. 


The ongoing status of the Beowulf cluster must be continuously tracked and reported to a central control site such as a master or host node of the system. Such issues as resource availability, task status on each node, and operational health of the nodes must be constantly monitored to aid in the successful management of the total system in serving its incident user demand. Some of this information must continuously update the system operator's status presentation, while other statistics and status parameters must be employed directly by the automatic resource management system.


Accounting. 


In order to assess billing or at least to determine remaining user allocation of compute time (often measured in node hours), as well as to assess overall system utilization, availability, and demand response effectiveness, records must be automatically kept of user accounts and system work. This is the primary tool by which system administrators and managers assess effectiveness of scheduling policies, maintenance practices, and user allocations. 


While no single resource management system addresses all of these functions optimally for all operational and demand circumstances, several tools have proven useful in operational settings and are available to users and administrators of Beowulf-class cluster systems. An entire chapter is dedicated to each of these in Part III of this book; here they are discussed only briefly.


Condor supports distributed job stream resource management emphasizing capacity or throughput computing. Condor schedules independent jobs on cluster nodes to handle large user workloads and provides many options in scheduling policy. This venerable and robust package is particularly well suited for managing both workloads and resources at remote sites. 


PBS is a widely used system for distributing parallel user jobs across parallel Beowulf cluster resources and providing the necessary administrative tools for professional systems supervision. Both free and commercially supported versions of this system are available, and it is professionally maintained, providing both user and administrator confidence. 


Maui is an advanced scheduler incorporating sophisticated policies and mechanisms for handling a plethora of user demands and resource states. This package actually sits on top of other lower-level resource managers, providing added capability. 


PVFS manages the secondary storage of a Beowulf cluster, providing parallel file management shared among the distributed nodes of the system. It can deliver faster response and much higher effective disk bandwidth than conventional use of NFS (network file system). 


2.5 Distributed Programming 


Exploitation of the potential of Beowulf clusters relies heavily on the development of a broad range of new parallel applications that effectively take advantage of the parallel system resources to permit larger and more complex problems to be explored in a shorter time. Programming a cluster differs substantially from programming a uniprocessor workstation or even an SMP. This difference arises in part because the sharing of information between nodes of a Beowulf cluster can take much longer than between the processors of a tightly coupled system, because the fragmented memory space of a distributed-memory Beowulf imposes substantially more overhead than that required by shared-memory systems, and because a Beowulf may have many more nodes than a typical 32-processor SMP. As a consequence, the developer of a parallel application code for a Beowulf must take these and other sources of performance degradation into consideration to achieve effective scalable performance for the computational problem.

A number of different models have been employed for parallel programming and execution, each emphasizing a particular balance of needs and desirable traits. The models differ in part by the nature and degree of abstraction they present to the user of the underlying parallel system, varying in generality and specificity of control. But one model has emerged as the dominant strategy: the communicating sequential processes model, more often referred to as the message-passing model. Through this methodology, the programmer partitions the problem's global data among the set of nodes and specifies the processes to be executed on each node, each working primarily on its respective local data partition. Where information from other nodes is required, the user establishes logical paths of communication between cooperating processes on separate nodes. The application program for each process explicitly sends and receives messages passed between itself and one or more other remote processes. A message is a packet of information containing one or more values in an order and format that both processes involved in the exchange understand. Messages are also used for synchronizing concurrent processes in order to coordinate the execution of the parallel tasks on different nodes.

Programmers can use low-level operating system kernel interfaces to the network, such as Unix sockets or remote procedure calls. Fortunately, however, an easier way exists. Two major message-passing programming systems have been developed to facilitate parallel programming and application development. These take the form of linkable libraries that can be used in conjunction with conventional languages such as Fortran or C. Benefiting from prior experience with earlier such tools, PVM has a significant following and has been used to explore a broad range of semantic constructs and distributed mechanisms. PVM was the first programming system to be employed on a Beowulf cluster, and its availability was critical to this early work. MPI, the second and more recently developed programming system, was the product of a communitywide consortium. MPI is the model of choice for the majority of the parallel programming community on Beowulf clusters and on other forms of parallel computer as well, even shared-memory machines. There are a number of open and commercial sources of MPI, with new developments, especially in the area of parallel I/O, being incorporated in implementations of MPI-2. Together, MPI and PVM represent the bulk of parallel programs being developed around the world, and both systems are represented in this book.

Of course, developing parallel algorithms and writing parallel programs involves a lot more than just memorizing a few added constructs. Entire books have been dedicated to this topic alone (including three in this series), and it is a focus of active research. A detailed and comprehensive discourse on parallel algorithm design is beyond the scope of this book. Instead, we offer specific and detailed examples that provide templates that will satisfy many programming needs. Certainly not exhaustive, these illustrations nonetheless capture many types of problems.


2.6 Conclusions 


Beowulf cluster computing is a fascinating microcosm of parallel processing, providing hands-on exposure and experience with all aspects of the field, from low-level hardware to high-level parallel algorithm design and everything in between. While many solutions are readily available to provide much of the necessary services required for effective use of Beowulf clusters in many roles and markets, many challenges remain in realizing the full potential of commodity clusters. Research and advanced development are still an important part of the work surrounding clusters, even as they are effectively applied to many real-world workloads. The remainder of this book serves two purposes: it represents the state of the art for those who wish ultimately to extend Beowulf cluster capabilities, and it guides those who wish immediately to apply these existing capabilities to real-world problems.

Node Hardware 


Thomas Sterling 


A Beowulf is a network of nodes, with each node a low-cost personal computer. Its power and simplicity derive from exploiting the capabilities of the mass-market systems that provide both the processing and the communication. This chapter explores all of the hardware elements related to computation and storage. Communication hardware options will be considered in detail in Chapter 5.

Few technologies in human civilization have experienced such a rate of growth as that of the digital computer and its culmination in the PC. Its low cost, ubiquity, and sometimes trivial application often obscure its complexity and precision as one of the most sophisticated products derived from science and engineering. In a single human lifetime, over the fifty-year history of computer development, performance and memory capacity have grown by a factor of almost a million. Where once computers were reserved for the special environments of carefully structured machine rooms, now they are found in almost every office and home. A personal computer today outperforms the world's greatest supercomputers of two decades ago at less than one ten-thousandth the cost. It is the product of this extraordinary legacy that Beowulf harnesses to open new vistas in computation.

Hardware technology changes almost unbelievably rapidly. The specific processors, chipsets, and three-letter acronyms (TLAs) we define today will be obsolete in a very few years. The prices quoted will be out of date before this book reaches bookstore shelves. On the other hand, the organizational design of a PC and the functions of its primary components will last a good deal longer. The relative strengths and weaknesses of components (e.g., disk storage is slower, larger, cheaper and more persistent than main memory) should remain valid for nearly as long. Fortunately, it is now easy to find up-to-date prices on the Web; see Appendix C for some places to start. 

This chapter concentrates on the practical issues related to the selection and assembly of the components of a Beowulf node. You can assemble the nodes of a Beowulf yourself, let someone else (a system integrator) do it to your specification, or purchase a turnkey system. In any case, you'll have to make some decisions about the components. Many system integrators cater to a know-nothing market, offering a few basic types of systems, for example, office and home models with a slightly different mix of hardware and software components. Although these machines would work in a Beowulf, with only a little additional research you can purchase far more appropriate systems for less money. Beowulf systems (at least those we know of) have little need for audio systems, speakers, joysticks, printers, frame grabbers, and the like, many of which are included in the standard home or office models. High-performance video is unnecessary except for specialized
