Control Center


Overview

This tool provides a single-page overview of the entire cluster, including the computers and the UPS units. It also allows various actions to be initiated, such as powering computers on or off or manipulating Matlab workers on Matlab DCS nodes. It is only available from inside the laboratory, either from the BluePC network or from the cluster nodes themselves.

Besides information displayed as text, the page uses dots in four colours with the following meanings: green for normal or operational status; red for failure or service unavailable; blue for partially operational status; orange for expected failure. Each line in a table describes a single computer or UPS unit; it may also offer some actions which act on that particular computer.

Nodes table

The columns displayed in the computers table are:

  • Host or computer name, followed by the processor type, which can be Intel or AMD; normally this is of no importance to users since programs run identically on both architectures. Intel computers have their table lines drawn with a light red background and AMD ones with a light blue background. Nodes are displayed in alphabetical order. The 'always on' attribute may also appear here; computers without this attribute can be powered on or off by users through the interface.
  • Status tells whether the computer is currently powered On or Off. The check is made over Ethernet, so a computer which is powered on might appear as off when its Ethernet wiring is not functional.
  • Temp or core (CPU) temperature, followed by a green or red dot which denotes a normal or abnormal reading. Probably due to differences in the way CPUs are mounted on their mainboards, the normal readings for Intel and AMD computers are different; when idle, the Intel nodes have temperatures around 25-30 degrees while the AMD ones are around 35-40 degrees.
  • Rack/UPS; here the rack name is printed for computers currently powered on with a dot indicating the state of the corresponding UPS; more details about each UPS are available in the UPS table at the bottom of the page.
  • Connectivity; the status of the Infiniband and IPMI connections is reported here. An Infiniband connection is available for all but two nodes in the cluster and should be up on all computers which are powered on; for the ones currently off it will appear in orange instead of red since it is an expected failure. IPMI connectivity is available on all nodes and should always be up, regardless of whether the node is powered on or off. The only exception is immediately after a node is powered on, when its IPMI interface goes down for some 10 seconds.
  • Matlab connectivity; one node (eta) has the Matlab DCS Server installed and is reported with a green dot if working and a red dot otherwise. Eight more compute nodes run 8 Matlab DCS workers each: nodes with all workers started are reported in green; nodes with some (but not all) workers started are reported in blue; nodes which are powered on and have workers configured but none started are reported in red; and nodes which have workers configured but are currently off are reported in orange.
  • Disposition or reservation status; computers which are powered on but are missing the 'always on' attribute have an indication of their power-off time here. Normally all the computers which are not always on are powered off at 7pm every evening unless they are reserved. Unreserved computers which are currently on will have an entry here reading 'Will stop today at 7pm'; those reserved will have 'Reserved until' followed by a date and hour. Please note that they will not be automatically powered off immediately after the reservation ends but at the first 7pm afterwards. Computers which are currently on can be reserved here for 1 or 3 days; reservations are cumulative, so for example clicking on 'Reserve for 3 days' for a computer which is already reserved will increase its reservation time by 72 hours. Reservations can also be cancelled from here, but users should only cancel reservations which they made themselves. Nothing is printed here for computers which are always on or for computers which are currently powered off. It is not possible to reserve a computer which is powered off; it must be powered on first and can then be reserved.
  • Actions; here various actions affecting that particular computer can be performed. Any computer which is currently off will have a Power On action which will immediately power it on; the boot process takes 3-4 minutes. Computers which are not always on, are currently powered on and are not reserved have a Power Off action which is also immediate; use with care. Computers with Matlab DCS workers configured also have actions to start or stop Matlab DCS workers, depending on whether the workers are already started or not.
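
The reservation rules in the Disposition column can be illustrated with a short sketch. This is not the Control Center's actual code; the function names are hypothetical. It assumes that a reservation ending exactly at 7pm is powered off at that same moment, and that an expired reservation is treated like no reservation at all; the page does not state either case.

```python
from datetime import datetime, timedelta
from typing import Optional

SHUTDOWN_HOUR = 19  # 7pm, the daily power-off time stated in the text


def next_shutdown(after: datetime) -> datetime:
    """Return the first 7pm at or after the given moment (assumed boundary rule)."""
    candidate = after.replace(hour=SHUTDOWN_HOUR, minute=0, second=0, microsecond=0)
    if candidate < after:
        candidate += timedelta(days=1)
    return candidate


def extend_reservation(now: datetime,
                       reserved_until: Optional[datetime],
                       days: int) -> datetime:
    """Reservations are cumulative: extend an active one, else start from now."""
    active = reserved_until is not None and reserved_until > now
    base = reserved_until if active else now
    return base + timedelta(days=days)


def power_off_time(now: datetime, reserved_until: Optional[datetime]) -> datetime:
    """When a powered-on, not 'always on' computer would be switched off."""
    if reserved_until is None or reserved_until <= now:
        return next_shutdown(now)          # 'Will stop today at 7pm'
    return next_shutdown(reserved_until)   # first 7pm after the reservation ends


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 14, 0)
    until = extend_reservation(now, None, 1)     # 'Reserve for 1 day'
    until = extend_reservation(now, until, 3)    # then 'Reserve for 3 days'
    print("Reserved until", until)
    print("Powers off at ", power_off_time(now, until))
```

In this sketch, clicking 'Reserve for 3 days' on an already reserved computer adds 72 hours to the existing end time, and the computer is then powered off at the first 7pm after that end time.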

After the node lines a TOTAL line is also printed which summarizes the above information: how many computers are powered on and off, how many have IPMI connections and the total number of Matlab DCS workers for the entire cluster.
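
The colour rule for the Matlab connectivity dot and the TOTAL line described above can be sketched as follows. The `Node` record, its field names and both functions are hypothetical and only mirror the description on this page. The sketch also assumes that the worker total counts started workers on powered-on nodes, which the page does not specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Dot(Enum):
    GREEN = "normal"
    BLUE = "partially operational"
    RED = "failure"
    ORANGE = "expected failure"


@dataclass
class Node:
    # Hypothetical record; field names are illustrative only.
    name: str
    powered_on: bool
    ipmi_ok: bool = True
    workers_configured: int = 0
    workers_started: int = 0


def matlab_dot(node: Node) -> Optional[Dot]:
    """Dot colour for a worker node, following the four cases in the text."""
    if node.workers_configured == 0:
        return None                # no Matlab DCS workers on this node
    if not node.powered_on:
        return Dot.ORANGE          # configured but node is off: expected failure
    if node.workers_started == 0:
        return Dot.RED             # node is on, no workers started
    if node.workers_started < node.workers_configured:
        return Dot.BLUE            # some, but not all, workers started
    return Dot.GREEN               # all workers started


def totals(nodes: List[Node]) -> Dict[str, int]:
    """Summary figures of the kind shown on the TOTAL line."""
    on = sum(1 for n in nodes if n.powered_on)
    return {
        "powered_on": on,
        "powered_off": len(nodes) - on,
        "ipmi_ok": sum(1 for n in nodes if n.ipmi_ok),
        # Assumption: only started workers on powered-on nodes are counted.
        "matlab_workers": sum(n.workers_started for n in nodes if n.powered_on),
    }


if __name__ == "__main__":
    cluster = [
        Node("node-a", True, True, 8, 8),
        Node("node-b", True, True, 8, 3),
        Node("node-c", False, True, 8, 0),
    ]
    for n in cluster:
        print(n.name, matlab_dot(n))
    print(totals(cluster))
```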

UPS table

A separate table at the bottom of the page shows the status of each of the two UPS units. For each unit the following information is given:

  • Name, which also identifies the rack.
  • Number of nodes connected to this UPS which are powered on and off.
  • Current status, which can be ONLINE (normal use, line power present) or ONBATT (failure, no line power present, UPS is running on internal batteries).
  • Current load, given as a percentage of installed capacity.
  • Battery charge, given as a percentage; it should always be around 100% unless the UPS is running on batteries or recovering after a power failure.
  • Time remaining at current load if running on batteries only, which depends on both the battery charge and the current load; with full batteries, a UPS can power the cluster computers for between 12 and 45 minutes, depending on how many computers are powered on.
  • Internal temperature of the UPS.

External links

Control Center webpage