Administration

This section describes how the cluster works and how it is administered internally; it is normally of no interest to regular users.

Cloning

All software installations, as well as any other maintenance, take place on tau only. All the other nodes are obtained from it by cloning: an exact copy of tau's operating system is made, and some configuration files are then altered to reflect the correct values for the target computer.

The cloning process is automated by several scripts found in the /root/Scripts directory, of which two are the most important:

  • rsync-files should be executed on the source node (tau), followed by the name of the destination computer; for example, to clone the node called delta, the command would be: rsync-files delta. It simply rsyncs the / partition of the source to the target.
  • clone is executed on the target node after the above command has completed. It takes one argument, the name of the target; for example, clone delta. A reboot is needed once it is done.
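The contents of rsync-files are not reproduced on this page. As a rough illustration of what "rsyncing the / partition" can look like, here is a minimal sketch. The function name build_rsync_cmd, the choice of options and the root@ login are all assumptions, and the real script may well differ. The function only prints the command; it does not run it.

```shell
#!/bin/sh
# Hypothetical sketch; the real /root/Scripts/rsync-files may differ.
# Prints (does not execute) an rsync command that copies / to TARGET.
build_rsync_cmd() {
    target="$1"
    if [ -z "$target" ]; then
        echo "usage: build_rsync_cmd TARGET" >&2
        return 1
    fi
    # -a             archive mode (permissions, owners, times, symlinks, devices)
    # -H             preserve hard links
    # -x             stay on the / filesystem (skips /proc, /sys and other mounts)
    # --delete       remove files on the target that no longer exist on the source
    # --numeric-ids  copy uid/gid numbers instead of mapping them by name
    echo "rsync -aHx --delete --numeric-ids / root@${target}:/"
}

build_rsync_cmd delta
```

Printing the command first makes it possible to review it before anything on the target is overwritten.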

The complete sequence of steps to clone a node, say delta, is:

  1. Open two root terminals on tau
  2. From one of them ssh to delta before the start of rsync. This is necessary because the rsync will alter the target host key and ssh will complain about a changed host key if attempted after the start of the rsync.
  3. In both terminals change directory to Scripts: cd ~/Scripts
  4. Start the rsync on tau: ./rsync-files delta
  5. After the rsync is complete and the prompt reappears, in the other terminal (the one logged in to delta) run: ./clone delta. Take care to execute this command on the target node and not on the source node!
  6. Watch for any errors produced by the clone script. When it is done, simply run reboot on the target computer.
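The steps above can be condensed into a short checklist. The helper below, clone_checklist, is a hypothetical convenience and not one of the cluster scripts; it only prints the commands for a given node, in the order and terminal described above, and executes nothing.

```shell
#!/bin/sh
# Print the cloning procedure for one node. Nothing is executed;
# clone_checklist is a hypothetical helper, not one of the cluster scripts.
clone_checklist() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: clone_checklist NODE" >&2
        return 1
    fi
    cat <<EOF
terminal A (on tau):    cd ~/Scripts
terminal B (on tau):    ssh $node            # log in BEFORE the rsync starts
terminal B (on $node):  cd ~/Scripts
terminal A (on tau):    ./rsync-files $node
terminal B (on $node):  ./clone $node        # only after rsync-files has finished
terminal B (on $node):  reboot
EOF
}

clone_checklist delta
```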

Keeping the systems up to date

The operating system on the cluster should be updated at least once a week; this is a simple operation performed with the yum update command. A reboot is normally advisable after the update, especially if a new kernel has been installed. Two kernels are always kept on the system, the current one and the previous one, so if the system refuses to boot with the new kernel, one can switch back to the old one using the GRUB menu at boot time. Updates should be performed on tau only; they are propagated to the rest of the cluster using the syncall script.
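One way to tell whether an update installed a new kernel is to compare the running kernel with the most recently installed kernel package. The sketch below assumes an RPM-based system where the kernel package is named kernel and where the uname -r output matches the package name without its kernel- prefix. The needs_reboot function is a hypothetical helper, not part of the cluster scripts.

```shell
#!/bin/sh
# Succeeds (exit 0) when the running kernel differs from the most recently
# installed kernel package. Hypothetical helper; the names are assumptions.
needs_reboot() {
    running="$1"      # e.g. the output of: uname -r
    newest_pkg="$2"   # e.g. first field of: rpm -q --last kernel | head -n 1
    [ "kernel-$running" != "$newest_pkg" ]
}

# Only query the package database where a "kernel" package is present.
if command -v rpm >/dev/null 2>&1 && rpm -q kernel >/dev/null 2>&1; then
    # --last lists packages by install time, most recent first.
    newest=$(rpm -q --last kernel | head -n 1 | awk '{print $1}')
    if needs_reboot "$(uname -r)" "$newest"; then
        echo "Most recently installed kernel ($newest) is not running: reboot advised."
    else
        echo "Already running the most recently installed kernel."
    fi
fi
```

This covers only the kernel; updates to other core components may still warrant a reboot.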

After each operation that alters the operating system (an update, a software installation or removal, a configuration change, or enabling or disabling some system software), a line should be added to the /etc/stfp.release file. This file is not part of the operating system; it is a local feature that keeps track of the changes made to the system and, more importantly, allows for synchronization between the master node tau and the clones via the syncall script. Lines in the file are in reverse chronological order, so the current operation should always be added at the top of the file, moving the previous contents one line down. Lines are never erased from this file.
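Adding a line at the top of a file while keeping everything below it can be scripted. The following is a minimal sketch: the function name prepend_release_entry and the "date: message" layout of the entry are assumptions, since the format of the lines is not specified here. New entries should follow whatever layout the existing ones use.

```shell
#!/bin/sh
# Prepend one entry to a release file, keeping all earlier lines.
# Hypothetical helper; the "date: message" layout is an assumption.
prepend_release_entry() {
    file="$1"
    message="$2"
    if [ -z "$file" ] || [ -z "$message" ]; then
        echo "usage: prepend_release_entry FILE MESSAGE" >&2
        return 1
    fi
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    {
        # The new entry goes first ...
        printf '%s: %s\n' "$(date +%Y-%m-%d)" "$message"
        # ... followed by the older entries, unchanged.
        if [ -f "$file" ]; then
            cat "$file"
        fi
    } > "$tmp"
    # Write back into the original file so its owner and mode are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (run as root on tau):
#   prepend_release_entry /etc/stfp.release "yum update, new kernel installed"
```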