Glen.Gardner@GridToys.com

Beowulf Clusters

11/2013

My current cluster is running CentOS 6.4 using the Plain Vanilla Cluster configuration kit. It consists of six ASUS AT5IONT-I motherboards with 4 GB of DDR3 RAM each. Each node boots from a single 30 GB solid state disk drive. The head node has a second 30 GB SSD serving /home via NFS to the computing nodes. Two of the computing nodes have additional 30 GB drives and are configured to act as parallel I/O servers for my choice of OrangeFS or the Fraunhofer Global File System, giving the parallel file system a capacity of up to 60 GB. The maximum parallel file system throughput is about 160 MB/s on large sustained writes. The network fabric is Gigabit Ethernet configured for jumbo frames. An outboard server with multiple large-capacity hard drives is used for archival storage. MPICH-3.0.4 is installed for MPI. The Linpack benchmark comes to about 68.9 GFLOPS (without the use of the GPUs) with all six computational nodes in use. The NVIDIA drivers for the ION graphics chipset (16 GPU cores on each board) are installed on all nodes, with full support for OpenGL and CUDA.
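
For anyone reproducing a similar layout, here is a rough sketch of the head-node side of the NFS and jumbo-frame setup described above. It is only illustrative: the interface name (eth0), the 10.0.0.0/24 addressing, the export options, and the hostname head01 are assumptions, not the values actually used on this cluster.

    # /etc/exports on the head node: share /home with the compute nodes
    /home   10.0.0.0/255.255.255.0(rw,sync,no_root_squash)

    # Enable jumbo frames on the gigabit interface (CentOS 6 style):
    # add MTU=9000 to /etc/sysconfig/network-scripts/ifcfg-eth0, or set it live:
    ip link set dev eth0 mtu 9000

    # Re-export, then verify the share is visible from a compute node
    exportfs -ra
    showmount -e head01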

The original chassis, as used with these motherboards in an earlier cluster, oriented the motherboards vertically. Since the motherboards were fanless and the horizontal chassis frame was open to vertical air flow, convection cooling proved adequate and no cooling fans were installed. I reused that chassis in this build, but without the rack panel mount, so that the chassis would stand upright. The end result was that, with the chassis now vertical and the motherboards now horizontal, there was insufficient convective airflow around the boards. This caused the CPUs to overheat and trigger an alarm after roughly 30 minutes of operation. My solution was to use cable ties to strap two 120 mm ultra-quiet cooling fans to the side of the chassis to force a bit of horizontal air flow across the motherboards, and the cluster now runs nice and cool.



11/2013

I recently needed to come up with a base configuration for a small Beowulf cluster. The initial development was done on VirtualBox. After the configuration had developed sufficiently, I stood up a six node cluster (see above) using spares from an older cluster build (see below) and used it to complete the configuration.

I created a customized kickstart DVD based on CentOS 6.4 to install a base operating system, then created a scripted configuration package to configure the head node and all of the computing nodes in the cluster. It proved to be simple, easy, and effective, so I thought I would share it with the community in the hope that others may find it equally useful. The project is called "The Plain Vanilla Cluster" or "PVC". It consists of a set of files and instructions for creating the kickstart DVD from the CentOS 6.4 installation DVD, plus a configuration package made up of a setup script and quite a few configuration files. It is reasonably well documented.
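
To give a flavor of the approach, here is a minimal CentOS 6.4 kickstart sketch. It is not the actual PVC kickstart: the password hash, timezone, partitioning, package list, and the /root/pvc-setup/setup.sh path are all placeholders.

    install
    cdrom
    lang en_US.UTF-8
    keyboard us
    rootpw --iscrypted <hash-goes-here>
    timezone America/New_York
    bootloader --location=mbr
    clearpart --all --initlabel
    autopart
    reboot

    %packages
    @base
    nfs-utils
    %end

    %post
    # Run the node configuration script after the base install
    # (path and script name are illustrative, not the real PVC layout)
    /bin/sh /root/pvc-setup/setup.sh
    %end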

Once you have customized the setup script for each node in the cluster, it is possible to bring up a large working Beowulf cluster in less than an hour. Click HERE to visit the PVC web page.

One of my early Beowulf clusters (10/2003): http://www.mini-itx.com/projects/cluster/



12/2010

An upgraded version of the mini cluster was built in December of 2010. The cluster was made up of 13 nodes using ASUS AT5IONT-I fanless mini-itx motherboards with Intel Atom D525 1.8 GHz dual core 64-bit processors. Hyperthreading was enabled. The NVIDIA ION graphics adapters on the little ASUS motherboards support CUDA and OpenCL with 16 GPU cores per motherboard, so there is a big aggregate potential for processing in small machines of this type. Each node had 4 GB of 800 MHz DDR3 RAM.

All 12 compute nodes booted from 30 GB solid state drives. Another machine, with a 500 GB solid-state RAID 0 array backed up to a hard disk drive, was the NFS server for the common data directory. The interconnect was Gigabit Ethernet. The operating system was Fedora 13 Linux. The development node was housed in a 2U rackmount case and was equipped with a Blu-Ray optical drive and a 120 GB SSD. The compute nodes were installed in the aluminum rack mount chassis from the original cluster. Additionally, a portable USB/E-SATA Blu-Ray drive was used for the initial installation of the operating system during development. A recovery tool was built from a small portable USB/E-SATA 120 GB solid-state drive with Fedora 13 Linux installed on it, which held the disk image used by all nodes in the cluster so that a node could be installed or recovered easily. Once the configuration was developed and the disk image created, the recovery tool was used to install the entire cluster.
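
Conceptually, the recovery step boils down to writing the master disk image onto a node's SSD and then giving the node its own identity before first boot. A hedged sketch of that step is below; the device names, image path, and hostname are placeholders, not the actual layout of the recovery drive.

    # Write the master image to the node's 30 GB SSD (destructive!)
    dd if=/images/node-master.img of=/dev/sda bs=4M conv=fsync
    sync

    # Mount the freshly written root filesystem and set this node's hostname
    # (Fedora 13 keeps it in /etc/sysconfig/network)
    mount /dev/sda1 /mnt
    sed -i 's/^HOSTNAME=.*/HOSTNAME=node05/' /mnt/etc/sysconfig/network
    umount /mnt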

Each compute node of the cluster was powered by an 85 watt DC-DC converter. Power distribution for the cluster was divided into two sub-units of six nodes each, and each sub-unit was powered by a single 600 watt 12 VDC power supply. A single 1500 VA UPS provided protection from power line transients and brownouts.

In terms of power use, the new cluster was not as frugal as the original mini-itx cluster: idle power use was approximately 350 watts, with peak power use approaching 600 watts. That is roughly twice the power use, but with twice the number of CPUs and much faster processing speeds. In terms of watts per GFLOPS, the new cluster was more efficient than the early mini-itx cluster by a factor of 10 on CPU floating-point performance alone. If the NVIDIA GPUs on the AT5IONT-I motherboards were factored in, the new cluster was more than 50 times as efficient in terms of watts per GFLOPS.

The mini-cluster ran a daily parallel-processing job using GRASS GIS for over a year. I constructed some AVI animations of science data downloaded from NASA’s Direct Readout Laboratory. They were made using GRASS GIS and mencoder. The cluster was removed from service in late 2012 to free up rack space for other projects.
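
For what it is worth, stitching rendered frames into an AVI with mencoder is a one-liner. This is just an illustrative invocation, not the exact command used for those animations; the frame rate, codec, and file names are placeholders.

    mencoder "mf://frames/*.png" -mf fps=10:type=png -ovc lavc -lavcopts vcodec=mpeg4 -o animation.avi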

The initial build and burn-in of the cluster was flawless. Everything worked, and there were no failures, no quirks, no bugs. This was really good hardware.

In operation of the 13-node cluster, two power supplies failed after about a year and a half. The original requirement for this cluster was to have 10 nodes operational after one year of continuous operation. The failed nodes were simply shut off (to be repaired later from stock parts and reused), and the remainder of the processing goal was accomplished without interruption.


Here are a few words to the wise: when you build a cluster, make a lifetime buy of everything that you need to keep it operational for the design lifetime of the hardware. There is little difference between "server motherboards" and "workstation motherboards" other than features. One notable exception is "gamer motherboards", which push the limits on reliability in favor of flat-out speed (imho they are basically junk). The industry quality standard is an AQL of 1% or better, so you can expect your failure rate to be very low initially, after burning in the hardware and weeding out the defective bits. In general, modern computer electronics are designed to maintain new-equipment specifications for up to two years. After that, you are trying to beat the statistics, which will likely not work in your favor. In practice, you can push the hardware to three years. Beyond that you will see a rise in failure rates, and erratic hardware behavior due to aging components. From the moment that you first turn a computer on, the hardware is degrading and nothing can stop it.

The processing that I was doing with this cluster required a minimum of 10 operational nodes for one year. With that in mind, I built one head node and 12 computing nodes. I also bought spare motherboards, power supplies, RAM, and disk drives for two more nodes, so that I could reasonably expect to maintain all 13 nodes for a two year period, virtually guaranteeing that the cluster would meet the requirement for ten operational nodes after one year. This is pretty much a hard and fast rule, and it scales up linearly: if you build a bigger cluster, you need to make sure you have enough spares on hand to meet whatever the mission lifetime requirements are. Plan on having at least 10%-12% spares on hand to keep the cluster alive for two years after the initial burn-in and replacement of bad nodes.

Heaven help you if you need to operate the hardware much longer than its expected design lifetime. After about three years of operation, advancing technology will catch up with you: in terms of dollar cost per GFLOPS, and the man-hours spent fighting a battle you have already lost, you will most likely find it cheaper to build a new, smaller, faster, more efficient cluster than to pay for the electrical power to run a large, obsolete, inefficient existing cluster while vainly trying to maintain quirky, aging, degrading hardware.

10/2003

I began construction of the cluster in October 2003. The initial development was done on three VIA EPIA V8000A mini-itx motherboards. After the operating system and configuration were developed, I upgraded it to six nodes. I operated it at a six node capacity for a while, then expanded it to twelve nodes in December of that year. Eventually I was persuaded by several people that it ought to be published online, so I submitted it to the projects page at mini-itx.com. This machine really created a fuss when it was published there early in 2004; for whatever reason, this little computer became an unexpected viral sensation on the internet. It was slashdotted on the first day it was published online, and on the day after the release I had quite a lot of email, which continued for several years afterward.

A MINI ITX BEOWULF CLUSTER! In late 2003 I began building a MINI-ITX based Beowulf cluster named PROTEUS. The envisioned configuration was 12 nodes using VIA EPIA V8000A 800 MHz motherboards with 256 MB RAM. The head node booted from a 120 GB HDD that was soon replaced with a 200 GB IDE hard disk drive. The remaining eleven nodes booted from 340 MB IBM microdrives in a compact flash form factor. These worked extremely well and used very little power. As a test, one node was eventually configured to boot from a 512 MB compact flash card. It proved to be less reliable than the microdrives (CF tended to fail after about a year of daily use) and was extremely slow on writes compared to the microdrives. The microdrives were installed in CF-to-IDE adapter boards so that any drive could be replaced easily. The head node had a spare CF-to-IDE adapter which could be used to install an operating system image onto a microdrive so that a failed drive could be replaced quickly and easily. The little machine ran FreeBSD 4.8 and MPICH 1.2.5.2. The mini-cluster provided a theoretical peak performance of 4.8 GFLOPS.

Since the mini-itx motherboards had 100 Mbit Ethernet adapters, the original network employed a 100 Mbit switch. Later, five inexpensive Gigabit Ethernet switches were used to construct a data concentrator to assure full Gigabit throughput to an external file server during parallel I/O operations.

I created a very short (and poor quality) AVI video of the cluster operating at the time that I was initially experimenting with the data concentrator. You can see four five-port gigabit switches on the right side of the rack shelf, and an eight port gigabit switch on top of the power supply at the left. The switches ran so hot that I had to set up some fans behind them to keep cool air circulating over them. The 7U rack mount conversion also came along not long after this video was made. Click HERE to see the AVI video.

After the first year of operation I reconfigured the chassis into a rack mountable version that fit into a 7U rack space. Pictures of the rackmount version are here and here. Immediately below the 12-node 7U rack panel are an additional two mini-itx computing nodes in a single 1U chassis. The rack mount proved to be a very efficient use of rack space. (You can fit 64 mini-itx boards in a 28U rack space, but would also need 12U for the 12 VDC power supplies.) Below the 1U two-node chassis (barely visible at the bottom of the picture) was the head node, with a VIA M-10000 mini-itx motherboard, 512 MB RAM, two 300 GB HDDs, and a DVD R/W optical drive, for a total of 15 cluster nodes. The new head node hosted all of the configuration and system images for the compute nodes.

Eventually the cluster was configured as a diskless cluster to netboot from the head node so that the configuration for the entire cluster could be controlled in a single directory tree on the head node (which worked very well).
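
Netbooting the compute nodes only requires the head node to answer DHCP/PXE requests and export a root tree over NFS. The sketch below uses Linux-style syntax and made-up addresses and paths purely for illustration; the FreeBSD setup this cluster actually ran differs in detail.

    # dhcpd.conf on the head node
    subnet 10.0.0.0 netmask 255.255.255.0 {
        range 10.0.0.101 10.0.0.114;
        next-server 10.0.0.1;                       # TFTP server = head node
        filename "pxelinux.0";                      # boot loader served over TFTP
        option root-path "10.0.0.1:/diskless/root"; # NFS root for the nodes
    }

    # /etc/exports on the head node: the nodes mount their root tree read-only
    /diskless/root  10.0.0.0/255.255.255.0(ro,no_root_squash)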

At various times other computers running Fedora 3, Fedora 4, and PCBSD were joined to the cluster to form a heterogeneous grid, with results that were useful but more or less uninspiring. Proteus was eventually taken offline in late 2009 and remained unused until it was cannibalized to build an upgraded cluster. In all, the original mini-itx cluster served very well as a testbed for many ideas, for a very long time. It was also a fun computer to operate.

During the initial build and burn-in, two new motherboards were replaced due to erratic behavior, and a bad batch of 256 MB memory modules was replaced in 6 nodes (yes, I got my money back).

Over the life of the cluster, two motherboards failed. One board developed thermal issues after several years and would crash under a heavy computing load. The other motherboard failed after being seriously abused by me in testing (when I discovered that the bus connected to the IDE port was NOT buffered, and NOT protected against having the output pins forced to +5V).

Two IBM microdrives (out of the original 15) failed. One failed when it was hot-swapped to see if it would tolerate it (it did not); the second failed, apparently due to age and wear.

One Compact flash card failed after the first year due to being used as a disk drive replacement (this was expected, but I wanted to test it).

When the decommissioned cluster was disassembled in 2010, the power supplies were inspected and failed/leaking capacitors were found on ALL of the 12 VDC to ATX power supplies. Amazingly, not one ever completely failed, but they were all clearly past the end of their operational life and were likely responsible for some operational problems.

I had no idea of the stir that this project was going to cause on the internet until I posted it on mini-itx.com so long ago. I'm still getting email in the year 2013, and the number of links and postings on the net connected with this machine continued to increase for four years after it was built. Do a web search for "mini-cluster" or "mini-itx cluster" and see for yourself. I get the impression that there are a lot of Beowulf/HPC fans out there.