MCSR Parallel-O-Gram

E-Newsletter/Blog of the Mississippi Center for Supercomputing Research

Sequoia Update

Posted on: August 29th, 2013 by Brian Hopkins

Sequoia was scheduled for a major maintenance cycle last week. We intended to take the machine down first thing on Monday, August 19, 2013, to perform some hardware repairs and a major cluster-wide OS upgrade.

On Sunday, August 18, 2013, the disk array holding the /ptmp space on sequoia (as well as the /hpc_ptmp space on hpcwoods) failed. We spent much of the week of August 19 attempting to revive it. In the end, we had to replace fourteen drives at once and completely rebuild the filesystem. That meant re-partitioning and re-formatting the array, then restoring all user files from our tape backup system.

At this time, the restore from tape is not yet complete for some users. However, we believe it is past time to reopen the system for use. If you log in to sequoia over the next few days and find files missing from your /ptmp area, please do not panic. Your files are safe on our tape backups, and we will restore them in due course. If you urgently need files that have not yet been restored, please email or call us and we will move those files to the head of the line.

During the downtime, we also performed the planned hardware and software maintenance on sequoia. We have thoroughly tested the machine, and all of the standard-install software works following this upgrade. If software that you installed yourself is not working, please let us know and we will help you repair it.

Certain nodes of sequoia are currently down because of an OS problem. SGI is working on the problem and hopes to have it resolved today.