Mimosa, MCSR’s Pentium4 Beowulf Cluster, has been in continuous production for 104 days. In that time it has accumulated a number of software problems that are causing some jobs (especially big parallel jobs) to crash. We are planning a system-wide reboot on Friday, 29 Apr 11, to correct these problems.
All jobs running at the time of the reboot will be affected. Users with running jobs are advised to log in to mimosa on Thursday and preserve any useful data from partially completed jobs. Users should then log in again on Friday after the reboot to check on jobs and resubmit as needed.
This should be the last reboot mimosa needs; once we bring the system back up on Friday, we expect it to be in continuous production through the end of its service life in July.