03 June 2012

172. ECCE and a ROCKS cluster: step by step

This is quite similar to a recent post, but here's a step-by-step, detailed account of how to set up ECCE for remote job submission to a ROCKS 5.4.3 cluster (one front node, 4 subnodes)

Coming soon (give it a week): Setting up a virtualbox machine with ecce for (stubborn) windows and ROCKS/CentOS users.

What isn't shown are all the failed attempts and dead-ends I went through and encountered getting to the point where I had a working system. I compiled ECCE. I compiled tcsh. I tried compiling bsd csh, which required me to compile bmake etc. This stuff looks simple, and it is simple -- but not obvious.

NOTE: From the outside we connect to rocks.university.edu. From inside the cluster the submit node is called rocks.local, and the subnodes are called node0, node1, etc. Refer to this naming if you get confused late.

Step 1. Create the site in ecce
From the terminal, do
ecce -admin
and add a new machine

Don't forget to hit Add/Change queue to make the changes to the queue part take effect. Then hit Add/Change. Oh, and pay attention to the Allocation Account tick box - if it's ticked you can't submit anything unless you add an account.  Important: the machine name you add here is the local name or local IP of the submit node -- it's not the 'public' name or url. We'll add that somewhere else later. Don't forget to select the queue manager (I forgot in the screen shot)

Close.

2. Editing your CONFIG file
Since you're already in the terminal, go to ecce-v6.3/apps/siteconfig

Take a quick peek at your Machines file (no editing):
Machines line
rocks rocks.local Dell beo Intel 40:5 ssh :NWChem:Gaussian-03 MN:RD:SD:UN:PW:Q:TL

Take another look at rocks.Q -- there's probably nothing to edit here either:

rocks.Q# Queue details for rocks
Queues:    nwchem
nwchem|minProcessors:       1
nwchem|maxProcessors:       40
nwchem|runLimit:       100000
nwchem|memLimit:       0
nwchem|scratchLimit:       0
Finally, do some editing of your CONFIG.rocks file.

CONFIG.rocks

NWChem: /share/apps/nwchem/nwchem-6.1/bin/LINUX64/nwchem
Gaussian-03: /share/apps/gaussian/g09/g09
perlPath: /usr/bin/
qmgrPath: /opt/gridengine/bin/lx26-amd64
sourcefile: /home/rocksuser/.cshrc
frontendMachine: rocks.university.edu

SGE {
#$ -S /bin/csh
#$ -cwd
#$ -l h_rt=$walltime
#$ -l h_vmem=$memoryG
#$ -j y
#$ -pe orte $totalprocs  
}

NWChemEnvironment{
            LD_LIBRARY_PATH /usr/lib/openmpi/1.3.2-gcc/lib/
}

NWChemCommand {
        /opt/openmpi/bin/mpirun -n $totalprocs $nwchem $infile > $outfile
}
Gaussian-03Command {
    setenv GAUSS_SCRDIR /tmp
    setenv GAUSS_EXEDIR /share/apps/gaussian/g09/bsd:/share/apps/gaussian/g09/local:/share/apps/gaussian/g09/extras:/share/apps/gaussian/g09
        time /share/apps/gaussian/g09/g09 $infile  $outfile }

Obviously, your variables will be different. NOTE that memory is in gigabyte here. You could also do $memoryM for megabyte. Just adjust your launcher requirements accordingly.

Step 3. Making csh modifications on the ROCKS cluster
On the main node just use the root password (or become sudo) and move /etc/csh.cshrc and /etc/csh.login out of the way (backing them up is a good idea). It doesn't seem like you need to make any changes csh-wise to the subnodes.

Step 4. Finalising our set up
Start ecce the normal way (e.g. run ecce from the terminal)
In the Gateway, start the Machine Browser, highlight 'rocks' and click on Setup Remote Access.
Do what you're told.

Step 5. Submit to your heart's content!

NOTE: the option to set the amount of memory is not shown in the launcher window above: my mistake. You can edit you apps/siteconfig/Machines file and add :MM at the end of the line, e.g.
Dynamic beryllium       Unspecified     Unspecified     Unspecified     18:3    ssh     :NWChem:Gaussian-03     MN:RD:SD:UN:PW:Q:TL:MM

No comments:

Post a Comment