In our work with clusters we found that one of the main causes of system down-time is disk failure. This is the main motivation behind this work. Moreover, a disk-less system is easier to reconfigure and update, since all the information is centralised in one place. Classical disk-less systems are quite difficult to manage when it comes to handling the system images distributed to the cluster nodes. Here we describe hopeless, a disk-less system that uses unionfs to reduce redundancy in building the image that will be distributed to the nodes.
With unionfs it is possible to mount several directories on the same mount point: the files in a higher-priority directory shadow those in the lower-priority ones. Thanks to unionfs we can cleanly separate the base distribution, the cluster-wide configuration, the host-specific configuration, and the files written by applications on the individual nodes. This is a clear advantage when upgrading the system or restoring a node to its initial state.
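As a minimal sketch of such a union (the branch names here are just examples, not the paths used later in this article): the first branch has the highest priority and is the only writable one, so all changes land there while the read-only branches show through unchanged.

```shell
# first branch = highest priority; only the rw branch receives writes
mount -t unionfs \
      -o dirs=/host-overlay=rw:/cluster-overlay=ro:/base-root=ro \
      none /mnt/root
# A file present in both /cluster-overlay and /base-root is seen from
# /mnt/root in its /cluster-overlay version; writes to /mnt/root
# (and deletions, via whiteouts) end up in /host-overlay.
```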
The hopeless system can be built on top of any distribution; we focused on Centos 4.2 since it meets the needs of our deployments. In what follows we assume that the master node of the cluster is already up and running. Let's briefly review the components that make up the system.
The PXE boot infrastructure is used to boot up the client nodes. It is a fairly standard setup for disk-less systems.
The initramfs image is where the action goes on: we mount the remote file-systems and set up the union.
The base root file-system provides the bulk of the files for the hosts.
The cluster-wide overlay is where we override files from the root file-system in a way that is general enough for the whole cluster.
The host-specific mount has two functions. First, it serves as read/write space, since it holds mutable directories like var and tmp as well as application-specific data. Second, it holds the host-specific configuration. Of course these two functions may be split into different mounts if needed.
You can find many references on how to set up a disk-less boot environment on the Net, so we only briefly review what is needed and the configuration files involved. Of course the cluster members have to be configured to boot via PXE. Let's look at the services that are needed.
DHCP has to be configured to assign the IP address (and other bring-up information such as the network mask, the server IP, etc.) to the nodes. It is configured through the file /etc/dhcpd.conf.
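A sketch of such a configuration, reusing the addresses from the example system described at the end of this article (the MAC address is made up):

```text
# /etc/dhcpd.conf -- minimal sketch, adapt to your network
ddns-update-style none;

subnet 192.168.233.0 netmask 255.255.255.0 {
    next-server 192.168.233.50;        # TFTP server (the master node)
    filename "pxelinux.0";             # PXE bootstrap to download

    host mickyslave001 {
        hardware ethernet 00:11:22:33:44:55;   # example MAC
        fixed-address 192.168.233.51;
    }
}
```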
TFTP downloads the PXELinux bootstrap, the kernel and the initramfs image to the clients.
PXELinux is the boot-loader that fires up Linux on the client machines. It uses the configuration file /tftpboot/pxelinux.cfg/default, which can (and has to) be carefully crafted to provide the correct parameters to the initramfs image.
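Such a file could look like the following (the kernel and image file names are examples; the ramdisk size depends on your image):

```text
# /tftpboot/pxelinux.cfg/default -- sketch
default hopeless
label hopeless
    kernel vmlinuz-2.6
    append initrd=initrd-hopeless.img ramdisk_size=65536
```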
NFS shares the files for the client nodes. It is configured through the file /etc/exports.
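Assuming the server-side layout used later in this article (/hopeless/roots and /hopeless/clients), the exports file could look like this:

```text
# /etc/exports -- sketch; restrict to the cluster network
/hopeless/roots    192.168.233.0/255.255.255.0(rw,sync,no_root_squash)
/hopeless/clients  192.168.233.0/255.255.255.0(rw,sync,no_root_squash)
```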
NIS or, more broadly, some mechanism for distributing administrative information (for example host names) to the clients. In this example we use NIS since it is easy to set up and use, although it is definitely uncool these days; solutions like LDAP can make sense in some large-scale installations.
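On the client side this boils down to very little (the NIS domain name below is invented for illustration):

```text
# /etc/sysconfig/network (fragment)
NISDOMAIN=hopeless

# /etc/yp.conf
domain hopeless server 192.168.233.50
```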
Initramfs is the mechanism that provides the early user-space in Linux 2.6. It is the successor to the old initrd: it is rather simpler to set up (it is just a cpio archive) and the switch to the final root is straightforward. We started from the standard boilerplate generated by mkinitrd:
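Since an initramfs image is just a gzip-compressed cpio archive, unpacking it for editing and repacking it afterwards are one-liners (the tree and image paths below are examples):

```shell
# unpack an existing image for editing
mkdir /var/tmp/initramfs-tree && cd /var/tmp/initramfs-tree
zcat /tftpboot/initrd.img | cpio -id
# ...add the /bb scripts, edit init, then repack:
find . | cpio -o -H newc | gzip -9 > /tftpboot/initrd-hopeless.img
```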
#!/bin/nash
mount -t proc /proc /proc
setquiet
echo Mounted /proc filesystem
echo Mounting sysfs
mount -t sysfs none /sys
echo Creating /dev
mount -o mode=0755 -t tmpfs none /dev
mknod /dev/console c 5 1
mknod /dev/null c 1 3
mknod /dev/zero c 1 5
mkdir /dev/pts
mkdir /dev/shm
echo Starting udev
/sbin/udevstart
echo -n "/sbin/hotplug" > /proc/sys/kernel/hotplug
echo "Loading modules"
/bb/setup_modules.sh
/sbin/udevstart
echo Creating root device
mkrootdev /dev/root
umount /sys
echo Mounting root filesystem
/bb/setup_hopeless.sh
mount -t tmpfs --bind /dev /sysroot/dev
#mount -t nfs --bind /dev/hopeless/clients /sysroot/dev/hopeless/clients
#mount -t nfs --bind /dev/hopeless/roots /sysroot/dev/hopeless/roots
echo Switching to new root
switchroot /sysroot
umount /initrd/dev
As you can see, there are basically two additions to the standard initramfs start-up file: the loading of the modules and the setup of the root file-system the hopeless way. If you need more information about how mkinitrd works, please have a look at the documentation that comes with that package. Before we dig into the hopeless-specific configuration, note the two commented mount statements with the --bind option: they are useful if you want direct access to the mounts underlying the final union-based root file-system (otherwise they are no longer accessible, since the interim file-system used in this phase is discarded by switchroot).
We use a statically-linked busybox that provides all the utilities (and a few more) that we need while setting up the final root file-system. The memory it uses is discarded once switchroot is done, so it costs nothing once the final system is in place. /bb/setup_modules.sh is not very interesting: it just loads the kernel modules that will be useful during the root file-system setup. Let's now take a look at the more interesting part, the /bb/setup_hopeless.sh script:
#!/bb/bin/sh
export PATH=/bb/bin:/bb/sbin:/bb/usr/bin:/bb/usr/sbin:$PATH
. /bb/config

# determine the IP
ifconfig $IFACE up
sleep 3
udhcpc -i $IFACE -q -s /bb/dhcpdc.sh
IP=`cat /bb/MYIP`
HOSTNAME=`cat /bb/HOSTNAME`
if [ "$IP" = "" ]
then
    echo "Cannot get IP!"
    ifconfig -a
    sleep 1000
fi
echo "Got IP $IP, hostname $HOSTNAME"

# make the directories where to mount
mkdir -p /dev/hopeless/clients
mkdir -p /dev/hopeless/roots

# mount the client specific dir
mount -o nolock,tcp,rw $SERVER:$CDIR /dev/hopeless/clients
if [ ! -e /dev/hopeless/clients/$IP/hopeless.root ]
then
    echo '!!! Cannot mount client dir or missing file hopeless.root in it'
    sleep 1000
fi
if [ ! -e /dev/hopeless/clients/$IP/hopeless.overlay ]
then
    echo '!!! Missing file hopeless.overlay in client dir'
    sleep 1000
fi

# mount the generic root dir and the cluster wide overlay
export MY_ROOT=`cat /dev/hopeless/clients/$IP/hopeless.root`
export MY_OVERLAY=`cat /dev/hopeless/clients/$IP/hopeless.overlay`
mount -o nolock,tcp,rw $SERVER:$RDIR /dev/hopeless/roots

# do the union and check that it looks healthy
mount -t unionfs -o dirs=/dev/hopeless/clients/$IP=rw:/dev/hopeless/roots/$MY_OVERLAY=ro:/dev/hopeless/roots/$MY_ROOT=ro none /sysroot
#sleep 3

# assure we don't try to fsck the remote fs
rm /sysroot/.autofsck
echo fastboot > /sysroot/fastboot
Here is a rather precise dissection of what's going on:
The configuration variables are sourced from the file /bb/config. In the future we plan to fetch these values from the kernel command line or via DHCP, to make the initramfs image more immutable. We also plan to use this hook to mount the NFS shares from different machines (by defining the SERVER variable in a host-specific way): that gives us a simple but effective way of load-balancing, since the NFS protocol scales only to a few tens of clients per server.
We determine our IP address and host name using the small but nice DHCP client in busybox.
We mount the host-specific overlay. Two files in its root directory (hopeless.root and hopeless.overlay) point to the base root file-system and the cluster-wide overlay that have to be used. This lets us choose between different kinds of nodes, for example compute and storage ones.
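For instance, a compute node's pointer files might look like this (the layer names on the second line of each pair are made up for illustration; each names a directory under the roots export):

```text
# /hopeless/clients/192.168.233.51/hopeless.root
centos42-base

# /hopeless/clients/192.168.233.51/hopeless.overlay
compute-overlay
```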
Next the cluster-wide overlay and the base root file-system are mounted.
We set-up the final root file-system in one, rather long, mount command.
To avoid checking a remotely mounted NFS share, we play some tricks with the Centos bring-up system to ensure that fsck is never called on the root file-system.
After these steps the SYSV init scripts from Centos start, caged in the unionfs-based environment we have just set up.
In our first deployments of hopeless we used the server's root file-system as the one for the cluster clients. This allowed us to do updates and maintenance directly on the server machine. Such an approach worked well in small clusters, but the larger ones tend to become, after hardware upgrades and additions, a mix of very different architectures (for example it is quite common to have a mix of 32-bit Xeons and 64-bit Opterons). So we decided to keep a separate root file-system, which is quite easy to manage thanks to the powerful yum utility resolving package dependencies for us. For example, to set up a basic root file-system the following instructions are more than enough:
export ROOT=`pwd`/root
mkdir root
rpm --initdb --root $ROOT
yum --installroot=$ROOT install bash
yum --installroot=$ROOT install openssh-server
yum --installroot=$ROOT install ypbind
yum --installroot=$ROOT install passwd
By using more yum statements we can install every RPM we need.
Here we place all the configuration files that are common for all the machines in the cluster. Examples are:
The list of services that have to be started on every machine (/etc/rc3.d/).
User authentication information (nsswitch.conf, sysconfig/network, etc.).
Common root user public keys and the like (/root/.ssh) that enable quick and easy management (although we must be careful not to open security holes in our cluster when fiddling with these files).
And of course many more depending on the application running in the cluster.
There is another important file that we added at this priority level of the unionfs mount list: we have to make a small modification to /etc/init.d/netfs. On shutdown it unmounts remote file-systems, which results in a situation similar to pulling the carpet from under our own feet. We simply skip the unmount step, so the shutdown or reboot process does its job gracefully.
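A minimal way to do this, assuming the stock Centos 4.2 layout of the script, is to turn the stop branch into a no-op in the overlay's copy:

```text
# sketch of the change in the overlay's etc/init.d/netfs
  stop)
        # hopeless: our root itself lives on NFS -- unmounting remote
        # file-systems here would pull the carpet from under our feet
        ;;
```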
Another handy trick worth noting: when working on the cluster-wide overlay it is useful to do a unionfs mount that layers it above the basic root file-system and chroot into it on the cluster master node.
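On the master this amounts to something like the following (the layer directory names are examples; run as root):

```shell
# layer the cluster-wide overlay (rw) over the base root (ro)
mount -t unionfs \
      -o dirs=/hopeless/roots/overlay=rw:/hopeless/roots/root=ro \
      none /mnt/maint
chroot /mnt/maint /bin/sh
# ...run yum, edit configuration files: all changes land in the overlay...
exit
umount /mnt/maint
```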
At this level of the unionfs mount list we have machine-specific configuration files and the data written by applications on each machine. Both have to be kept to a minimum.
The less host-specific configuration we have, the fewer troubles we will face when upgrading the software or restoring the original state of a host after some unrecoverable problem (which, unfortunately, happens more often than we think and hope). To ease this we provide a template and a shell script that farms out a bunch of overlays of this kind (see /hopeless/clients/instantiate.sh and /hopeless/clients/template-minimal).
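The real instantiate.sh ships with the example tarball; a minimal sketch of the idea (the directory layout and the @IP@ placeholder convention are assumptions, not the actual script) could look like:

```shell
#!/bin/sh
# instantiate: stamp out one host-specific overlay per client IP from a
# template tree, replacing an assumed @IP@ placeholder in its files.
instantiate() {
    template=$1; base=$2; shift 2
    for ip in "$@"; do
        dest=$base/$ip
        mkdir -p "$dest"
        cp -a "$template/." "$dest/"
        # substitute the placeholder in every file that carries it
        for f in $(grep -rl '@IP@' "$dest"); do
            sed "s/@IP@/$ip/g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
        done
    done
}
```

For example, `instantiate template-minimal /hopeless/clients 192.168.233.51 192.168.233.52` would create one overlay directory per node.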
The amount of data written by the applications has to stay low in this configuration, since NFS doesn't scale when the demand for distributed I/O is high. In such cases other solutions have to be explored, like machine-specific disks used as scratch space or more scalable cluster-wide file-systems (such as GFS or OCFS2).
By clicking the following link you can download a tarball with the configuration files and the hopeless-specific components described so far: hopeless-example.tar.gz. Please note that this is not a ready-to-run system whatsoever, just a useful example (it is actually a backup of one of my experimental systems). It is based on Centos 4.2 and is designed to run on a head node called mickymaster.exadron.com (with IP 192.168.233.50). The configuration files are set up to bring up a client node called mickyslave001.exadron.com (with IP 192.168.233.51) that is equipped with a Broadcom Tigon 3 network card.
I would like to thank Matteo Vit and Pierfrancesco Zuccato for all the great help and useful insights they provided. Here are some useful links that you should visit if you intend to set up a hopeless-like system:
Linux disk-less how-to: http://www.tldp.org/HOWTO/Diskless-HOWTO.html
Unionfs site: http://www.fsl.cs.sunysb.edu/project-unionfs.html
Centos distribution site: http://www.centos.org/
Feel free to contact me at chripell at gmail dot com. Happy hacking!