Hopeless, a system for building disk-less clusters

1. Motivation

In our work with clusters we found that one of the main causes of system down-time is disk failure. This is the main motivation behind this work. Moreover, a disk-less system is easier to reconfigure and update, since all the information is centralised in just one place. Classical disk-less systems are quite difficult to manage when it comes to handling the system images distributed to the cluster nodes. Here we describe hopeless, a disk-less system that uses unionfs to reduce redundancy in building the image that will be distributed to the nodes.

2. Unionfs

With unionfs it's possible to mount different directories on the same mount point. Files in a directory with higher priority shadow those in the lower-priority ones. Thanks to unionfs we can cleanly separate the base distribution, the cluster-wide configuration, the host-specific one and the files written by applications on the individual nodes. This is a clear advantage when upgrading the system or restoring a node to its initial state.
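
As a minimal sketch (the directory names here are made up for illustration), a two-layer union where /overlay shadows /base and receives all writes looks like this:

mount -t unionfs -o dirs=/overlay=rw:/base=ro none /union

This is exactly the mechanism hopeless uses, just with three layers instead of two.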

3. The components

The hopeless system can be built upon any distribution. We focused on Centos 4.2 since it meets the needs of our deployments. We start by supposing that the master node of the cluster is up and running. Let's briefly see the components that build up the system.

The boot environment

Is used to boot up the client nodes. It's a fairly standard infrastructure for disk-less systems.

Initramfs

This is where the action happens: we mount the remote file-systems and set up the union.

Root file system

The bulk of files for the hosts.

Cluster wide overlay

Here we override the files from the root file system in a way that is general enough for the whole cluster.

Host specific overlay

This mount has two functions. Firstly, it serves as read/write space, since it holds mutable directories like var and tmp plus application-specific data. Secondly, it holds the host-specific configuration. Of course these two functions may be split into different mounts if needed.

4. The boot environment

You can find many references on how to set up a disk-less boot environment on the Net, so we just briefly review what is needed and the configuration files involved. Of course the cluster members have to be configured to boot via PXE. So let's see the services that are needed.

DHCP server

has to be configured to assign the IP (and other bring-up information like network mask, server IP, etc.) to the nodes. It's configured by the file /etc/dhcpd.conf.
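
As a sketch, a dhcpd.conf entry for one PXE client could look like the following (the MAC address is a placeholder, while the IPs match the example set-up described at the end of this article):

ddns-update-style none;

subnet 192.168.233.0 netmask 255.255.255.0 {
    next-server 192.168.233.50;
    filename "pxelinux.0";

    host mickyslave001 {
        hardware ethernet 00:11:22:33:44:55;
        fixed-address 192.168.233.51;
    }
}

next-server and filename tell the PXE firmware where to download the boot-loader from.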

TFTP server

serves the PXElinux bootstrap, the kernel and the initramfs image to the clients.

PXElinux

is the boot-loader that fires up Linux on the client machines. It uses the configuration file /tftpboot/pxelinux.cfg/default, which can (and has to) be carefully crafted to provide the correct parameters to the initramfs image.
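
A minimal pxelinux.cfg/default could look like this (the kernel and initramfs file names are assumptions; the important part is that the append line carries the initrd= parameter plus whatever else your set-up expects):

default hopeless
prompt 0

label hopeless
    kernel vmlinuz
    append initrd=initrd.img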

NFS server

the files for the client nodes are shared by using the NFS protocol. It's configured using the file /etc/exports.
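
A sketch of the corresponding /etc/exports (the export paths are assumptions consistent with the layout used later in this article; no_root_squash is needed because the client nodes access the shares as root):

/hopeless/roots    192.168.233.0/255.255.255.0(rw,no_root_squash,sync)
/hopeless/clients  192.168.233.0/255.255.255.0(rw,no_root_squash,sync)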

Authentication services

or, broadly speaking, the distribution of other administrative information to the clients (for example host names). In this example we use NIS since it's easy to set up and use, but it's definitely uncool these days. Other solutions like LDAP can make sense in large-scale installations.

5. Initramfs

Initramfs is the mechanism that provides an early user-space in Linux 2.6. It is the successor to the old initrd: it's rather simpler to set up (it's just a cpio archive) and the switch to the final root is straight-forward. We started from the standard boilerplate generated by mkinitrd:

#!/bin/nash

mount -t proc /proc /proc
setquiet
echo Mounted /proc filesystem

echo Mounting sysfs
mount -t sysfs none /sys

echo Creating /dev
mount -o mode=0755 -t tmpfs none /dev
mknod /dev/console c 5 1
mknod /dev/null c 1 3
mknod /dev/zero c 1 5
mkdir /dev/pts
mkdir /dev/shm

echo Starting udev
/sbin/udevstart
echo -n "/sbin/hotplug" > /proc/sys/kernel/hotplug

echo "Loadin modules"
/bb/setup_modules.sh
/sbin/udevstart

echo Creating root device
mkrootdev /dev/root
umount /sys

echo Mounting root filesystem
/bb/setup_hopeless.sh

mount -t tmpfs --bind /dev /sysroot/dev
#mount -t nfs --bind /dev/hopeless/clients /sysroot/dev/hopeless/clients
#mount -t nfs --bind /dev/hopeless/roots /sysroot/dev/hopeless/roots

echo Switching to new root
switchroot /sysroot
umount /initrd/dev

As you can see, there are basically two additions to the standard initramfs start-up file: the loading of modules and the setup of the root file-system in the hopeless way. If you need more information about how mkinitrd works, please have a look at the documentation that comes with that package. Before we dig into the hopeless-specific configuration, please note the two commented mount statements with the --bind option: they are useful if you want to have direct access to the mounts underlying the final union-based root file-system (otherwise they are not accessible anymore, since the ad-interim file-system used in this phase is discarded by switchroot).

We use a statically-linked busybox that provides all the utilities (and some more) that we are going to use during the set-up of the final root file-system. The memory used by it is discarded once switchroot is done, so it has no consequences once the final system is in place. /bb/setup_modules.sh is not very interesting: it just loads the kernel modules that will be useful during the root file-system setup. Let's take a look now at the more interesting part, the /bb/setup_hopeless.sh script:

#!/bb/bin/sh

export PATH=/bb/bin:/bb/sbin:/bb/usr/bin:/bb/usr/sbin:$PATH

. /bb/config

# determine the IP
ifconfig $IFACE up
sleep 3
udhcpc -i $IFACE -q -s /bb/dhcpdc.sh
IP=`cat /bb/MYIP`
HOSTNAME=`cat /bb/HOSTNAME`
if [ "$IP" = "" ]
then
    echo "Cannot get IP!"
    ifconfig -a
    sleep 1000
fi
echo "Got IP $IP, hostname $HOSTNAME"

# make the directories where to mount
mkdir -p /dev/hopeless/clients
mkdir -p /dev/hopeless/roots

# mount the client specific dir
mount -o nolock,tcp,rw $SERVER:$CDIR /dev/hopeless/clients
if [ ! -e /dev/hopeless/clients/$IP/hopeless.root ]
then
    echo '!!! Cannot mount client dir or missing file hopeless.root in it'
    sleep 1000
fi
if [ ! -e /dev/hopeless/clients/$IP/hopeless.overlay ]
then
    echo '!!! Missing file hopeless.overlay in client dir'
    sleep 1000
fi

# mount the generic root dir and the cluster wide overlay
export MY_ROOT=`cat /dev/hopeless/clients/$IP/hopeless.root`
export MY_OVERLAY=`cat /dev/hopeless/clients/$IP/hopeless.overlay`
mount -o nolock,tcp,rw $SERVER:$RDIR /dev/hopeless/roots

# do the union and check that it looks healthy
mount -t unionfs -o dirs=/dev/hopeless/clients/$IP=rw:/dev/hopeless/roots/$MY_OVERLAY=ro:/dev/hopeless/roots/$MY_ROOT=ro none /sysroot
#sleep 3

# assure we don't try to fsck the remote fs
rm -f /sysroot/.autofsck
echo fastboot > /sysroot/fastboot

Here is a rather precise dissection of what's going on:

  1. The configuration variables are sourced from the file /bb/config (see the sketch after this list). In the future we plan to fetch these values from the kernel command line or via DHCP, to make the initramfs image more immutable. We also plan to use this hook to mount the NFS shares from different machines (by defining the SERVER variable in a host-specific way), giving us a simple but effective way of load-balancing, since the NFS protocol is limited to the range of tens of clients per server.

  2. We determine our IP address and host name using the small but nice DHCP client in busybox.

  3. We mount the host-specific overlay. A file in its root directory is used to point to the cluster-wide overlay and the base root file-system that have to be used (see the sketch after this list). This enables us to choose between different kinds of nodes, for example compute and storage ones.

  4. Next the cluster-wide overlay and the base root file-system are mounted.

  5. We set up the final root file-system in one, rather long, mount command.

  6. To avoid a file-system check on a remotely mounted NFS share, we play some tricks with the Centos bring-up system to make sure fsck won't be called on the root file-system.
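
For reference, here is what the pieces mentioned above could look like. The variable names come straight from the script, while the values are assumptions matching the example set-up at the end of this article:

# /bb/config -- sourced by setup_hopeless.sh
IFACE=eth0
SERVER=192.168.233.50
CDIR=/hopeless/clients
RDIR=/hopeless/roots

The pointer files in the host-specific overlay (point 3) just contain directory names relative to $RDIR; again, the names here are made up:

echo root-c42 > /hopeless/clients/192.168.233.51/hopeless.root
echo overlay-c42 > /hopeless/clients/192.168.233.51/hopeless.overlay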

After these steps the SysV init scripts from Centos start, caged in the unionfs-based environment we have just set up.

6. Root File System

In our first deployments based on hopeless we used the server's root file system as the one for the cluster clients. This enabled us to do updates and maintenance just on the server machine. Such an approach played well in small clusters, but the larger ones tend to become, after hardware upgrades and additions, a mix of very different architectures (for example it's quite usual to have a mix of 32-bit Xeons and 64-bit Opterons). So we decided to have a separate root file-system, which is quite easy to manage by using the powerful yum utility to resolve package dependencies. For example, to set up a basic root file-system the following instructions are more than enough:

export ROOT=`pwd`/root
mkdir root
# create an empty RPM database in the new root
rpm --initdb --root $ROOT
# install the base packages, letting yum resolve the dependencies
yum --installroot=$ROOT install bash
yum --installroot=$ROOT install openssh-server
yum --installroot=$ROOT install ypbind
yum --installroot=$ROOT install passwd

With more yum statements we can install any RPM we need.

7. Cluster wide overlay

Here we place all the configuration files that are common to all the machines in the cluster: examples are the NIS client set-up and, more generally, every file under /etc that differs from the base distribution's defaults but is identical on every node.

There is another important file that we add at this priority level in the unionfs mount list: we have to make a small modification to the /etc/init.d/netfs file. On shutdown it unmounts the remote file-systems, which results in a situation similar to pulling the carpet from under our own feet. We just skip the unmount step, so the shutdown or reboot process does its job gracefully.
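
The exact contents of netfs vary between releases, so the following is just a sketch of the idea: the stop branch must not unmount anything, it just has to clear its lock file and return success.

# excerpt from the modified /etc/init.d/netfs
stop)
        # hopeless: the root file-system itself lives on these NFS
        # mounts, unmounting them here would hang the shutdown
        rm -f /var/lock/subsys/netfs
        ;;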

Another handy trick worth noting: when we are working on the cluster-wide overlay, it's useful to do a unionfs mount that layers it above the basic root file-system and chroot into it on the cluster master node.
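
Something along these lines, using the hypothetical directory names introduced above:

mount -t unionfs -o dirs=/hopeless/roots/overlay-c42=rw:/hopeless/roots/root-c42=ro none /mnt/union
chroot /mnt/union

Every change made inside the chroot lands in the overlay branch, leaving the base root file-system untouched.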

8. Host specific overlay

At this level in the unionfs mount list we have the machine-specific configuration files and the data written by the applications on each machine. Both the former and the latter have to be kept to a minimum.

The less host-specific configuration we have, the fewer troubles we are going to face when doing software upgrades or restoring the original state of a host after some unrecoverable problem has happened (unfortunately this happens more often than we think and hope). To help with this we provide a template and a shell script that farms out a bunch of these overlays (see /hopeless/clients/instantiate.sh and /hopeless/clients/template-minimal).
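
The real script ships in the example tarball; as a rough sketch of the idea (the IPs here are placeholders), instantiating the overlays is mostly a matter of copying the template into a directory named after each client's IP address:

cd /hopeless/clients
for IP in 192.168.233.51 192.168.233.52; do
    cp -a template-minimal $IP
done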

The amount of data written by the applications has to stay low in this configuration, since NFS doesn't scale when the need for distributed I/O is high. In such cases other solutions have to be explored, like machine-specific disks used as scratch space or more scalable cluster-wide file-systems (like GFS or OCFS2).

9. Show me the code!

By clicking the following link you can download a tarball with the configuration files and hopeless-specific components that have been described so far: hopeless-example.tar.gz. Please note that this is not a ready-to-run system whatsoever, but just a useful example (actually it's a backup of one of my experimental systems). It's based on Centos 4.2 and is designed to run on a head node called mickymaster.exadron.com (with IP 192.168.233.50). The configuration files are set up to bring up a client node called mickyslave001.exadron.com (with IP 192.168.233.51) that is equipped with a Broadcom Tigon 3 network card.

10. References and acknowledgements

I would like to thank Matteo Vit and Pierfrancesco Zuccato for all the great help and useful insight they provided me. The documentation of the components described above (unionfs, PXElinux, mkinitrd, busybox, NFS) is a useful starting point if you intend to set up a hopeless-like system.

Feel free to contact me at chripell at gmail dot com. Happy hacking!