Authors: Victor Travieso and Stuart Rankin
COSMOS was founded in January 1997 by a consortium of leading UK cosmologists, brought together by Stephen Hawking. It is funded by SGI, Intel, HEFCE and PPARC.
At the time of writing (May 2004), COSMOS has just entered its sixth incarnation, a 152-cpu/152GB SGI Altix 3700. This platform is at the forefront of (distributed) shared memory HPC technology (as were the SGI Origin systems preceding it). It allows both shared memory and distributed memory (MPI) cosmology codes to be run with equal ease and with great flexibility and reliability. It operates a single system image across all processors and memory, creating a close similarity in feel to an ordinary Linux workstation, and provides a particularly attractive environment for young researchers acquiring HPC experience.
The application form can be downloaded. Please also see the usage guidelines.
Please note that all login sessions and interactive work are automatically confined to the first 8 cpus, so please be careful not to take an unreasonable proportion of these resources. Watchdog limits the total amount of cpu time any one login session can accumulate, but it is still possible to (accidentally) cause problems.
Please remember when running interactively to set the number of OpenMP threads explicitly to a suitably small value:
| (bash) $ export OMP_NUM_THREADS=N |
| (tcsh) > setenv OMP_NUM_THREADS N |
For example, in Netscape7+ or Mozilla, choose Edit/Preferences from the menu; then open the Advanced/Proxies panel and enter the URL below as your Automatic proxy configuration setting:
http://www.damtp.cam.ac.uk/cosmos/proxyCOSMOS.pac
Once your browser is reconfigured in one of the above ways, establish an SSH tunnel to an allowed host via a command such as:
| ssh -D 9870 userid@allowedhost.knowndept.knowninst.ac.uk |
This form is appropriate for the OpenSSH ssh client running on Unix. You may need to restart your browser for the new proxy settings to take effect. Note that Mozilla and Netscape7+ support multiple profiles - the COSMOS proxy configuration settings could be saved in a new profile, making it easy to switch between the COSMOS configuration and your normal settings.
When the SSH connection is established, and using one of the proxy settings described, all web pages on the DAMTP network, including those from the restricted server, will be fetched through the tunnel, from an allowed host which has the required access privileges. The difference between methods (1) and (2) above is that whereas in (1), only pages from the DAMTP network are fetched through the SSH tunnel, in (2) all web traffic is fetched via this route. (2) suffers several disadvantages: firstly, pages fetched from sites other than DAMTP will take slightly longer to arrive; secondly, browsing will completely stop working if the SSH connection is terminated (until the browser is reconfigured) and, thirdly, it may not be appropriate for all web fetches to be redirected through someone else's network.
More information on this, aimed at DAMTP users, including details of the setup when using PuTTY under Windows, can be found here (draft).
In principle the SSH protocol allows tunnelling of connections from X windows applications through the encrypted channel so that when started on COSMOS they display normally on your local X display. Most of the time, this works transparently so you can e.g. simply type:
| emacs & |
in your COSMOS login window and emacs will appear in its own X window on your screen (this obviously requires you to be running an X server on your local machine, and to be able to display local X programs on it).
If this fails with an error such as:
cosmos:~ 14:54:16$ xterm&
xterm Xt error: Can't open display:
|
then X forwarding may not be enabled by default by your ssh program. You can ensure that it is (for the OpenSSH client) by adding the -X option to the ssh command line (similarly, under Windows turn on X forwarding in the PuTTY preferences). Note that on very recent versions of the OpenSSH client, some applications may still not start (or may die with strange errors) unless you use the -Y option instead.
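For example, with the OpenSSH client one might connect with forwarding enabled as follows (the hostname here is illustrative):
| ssh -X userid@cosmos.damtp.cam.ac.uk |
or, where the -Y behaviour described above is needed:
| ssh -Y userid@cosmos.damtp.cam.ac.uk |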
When X forwarding is enabled, it is not necessary to set DISPLAY manually or to add X cookies - doing this (if done correctly) will simply cause X traffic to travel directly to your local machine outside the encrypted channel (which may fail anyway due to firewalls), but doing it incorrectly may produce an error such as the one below.
Sometimes X forwarding is correctly enabled, but X applications still fail with an error such as:
| X11 connection rejected because of wrong authentication. |
If you see this, and you aren't manually setting DISPLAY on COSMOS or explicitly adding X cookies (see above), then check that your quota on your home directory has not been reached by issuing the command:
| quota -v |
(an asterisk against the first figure of the /dev/cxvm/xvm94-4_cosmos entry indicates that the maximum usage has been attained). When this occurs, no new X cookies can be created in your ~/.Xauthority file, which means that X applications cannot pass the correct authentication data to your X server and so fail to display. The solution is to remedy the quota situation by reducing your usage (by deletion, transferring to your home machines or to other COSMOS filesystems) and then logging in again.
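For example, a minimal sketch of the remedy (the file name is a placeholder):
quota -v                # an asterisk marks a filesystem at its limit
rm ~/some_large_file    # or move the data to another filesystem
|
followed by logging out and in again, so that a fresh X cookie can be written to ~/.Xauthority.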
The system default Intel compilers are icc (C/C++) and ifort (Fortran). As of the upgrade to ProPack 4, these are version 9.0.
The version 8.1 compilers remain available through modules - e.g. the module icomp81 loads the most recent version of the 8.1 compilers (see below).
Because Itanium is a relatively new platform, the IA64 Intel compilers are very much a work in progress, as evidenced by the high rate of new compiler releases. Rather than change the default compilers every few weeks, newer releases are made available by packaging them into modules, which can be loaded and unloaded (switched on and off) by issuing a module command.
The full list of available modules can be printed by issuing the command:
| module avail |
For example, the system default compilers can be replaced by the most recent Intel version 8.1 compilers with the following command (note that this affects only the compilers; the numerical libraries visible remain the system default versions):
| module load icomp81 |
Conversely,
| module unload icomp81 |
restores the previous state. The command:
| module list |
lists the currently loaded modules.
The most recent Intel Math Kernel library is loaded via the module command:
| module load mkl |
Note that the same module commands used when compiling a program should be issued in the job submission script when running the program in the batch queues.
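For example, a sketch of a submission script fragment (the program name and path are placeholders) which loads the same modules at run time as were used at compile time:
module load icomp81
module load mkl
cd /path/to/working/directory
./myprogram
|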
The startup file changes required for the Intel compilers are detailed below, but to always find the complete, current recommended settings, refer to the template files under /home/cosmos/template.
For C-Shell users, check if the following statements are found in your .cshrc file, and add them if not; to activate any changes run the command source ~/.cshrc (or simply log in again):
if (-r /opt/intel/setup.csh) then
source /opt/intel/setup.csh
endif
|
For Bourne-shell (bash) users, check similarly in your bash startup file (e.g. .bashrc):
if [ -r /opt/intel/setup.sh ]; then
. /opt/intel/setup.sh
fi
|
The startup file changes required for modules are detailed below, but to always find the complete, current recommended settings, refer to the template files under /home/cosmos/template.
For C-Shell users, check if the following statements are found in your .cshrc file, and add them if not; to activate any changes run the command source ~/.cshrc (or simply log in again):
if ( ${UNAME} =~ IRIX* ) then
setenv MODROOT /usr/local/inst/opt/modules/modules
module load modules
endif
if ( ${UNAME} == Linux ) then
setenv MODROOT /usr/local/rpm/modules/default
endif
if ($?MODROOT) then
if (-f ${MODROOT}/init/tcsh) then
source ${MODROOT}/init/tcsh
setenv BASH_ENV ${MODROOT}/init/bash
endif
endif
|
For Bourne-shell (bash) users, check similarly in your bash startup file (e.g. .bashrc):
case `uname` in
IRIX*)
MODROOT=/usr/local/inst/opt/modules/modules
;;
Linux)
MODROOT=/usr/local/rpm/modules/default
;;
esac
if [ -f ${MODROOT}/init/bash ]; then
. ${MODROOT}/init/bash
fi
|
The startup file changes required for LSF are detailed below, but to always find the complete, current recommended settings, refer to the template files under /home/cosmos/template.
For C-Shell users, check if the following statements are found in your .cshrc file, and add them if not; to activate any changes run the command source ~/.cshrc (or simply log in again):
if ( -e /home/cosmos/lsf/conf/cshrc.lsf ) then
setenv MANPATH /usr/share/man:/usr/man:/usr/local/man:
source /home/cosmos/lsf/conf/cshrc.lsf
endif
|
For Bourne-shell (bash) users, check similarly in your bash startup file (e.g. .bashrc):
if [ -r /home/cosmos/lsf/conf/profile.lsf ]; then
export MANPATH=/usr/share/man:/usr/man:/usr/local/man:
. /home/cosmos/lsf/conf/profile.lsf
fi
|
This is usually due to a pre-prepared submission script being manually submitted incorrectly to bsub as a command instead of as standard input, i.e. via:
| bsub large.nnnn |
instead of (as suggested by the submission dialogue):
| bsub < large.nnnn |
Note that the bsub command expects to receive the parameters of the job request either through command line options, or as structured (#BSUB) comments in standard input. Any non-option arguments are simply interpreted as a command to run, so in the first case above bsub never looks inside the large.nnnn file for the job details (queue, memory, cpu number etc); instead default values are applied (e.g. the default queue if none is supplied is currently small).
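For illustration, the job parameters can instead be given as command line options, with the command itself as the final argument (the names here are hypothetical):
| bsub -q small -n 4 -J myjob ./myprogram |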
Old users may remember the qsub command of Cray NQE, which performed the same function as bsub in LSF but did not take the <.
| # BSUB -M total_mem_in_KB |
| # BSUB -R "rusage[mem=mem_per_cpu_in_MB:duration=15m:decay=1]" |
The difference between these two statements (apart from the syntax and the fact that one is given in terms of total job memory in KB, and the other in terms of per cpu memory in MB) is as follows. The first (-M) statement imposes an operating system limit on each process of the job which none may exceed without punitive action being taken (by the operating system); this is to protect the system from runaway jobs taking excessive amounts of memory. The second (-R) statement indicates to LSF how much memory will need to be found for the job and how quickly; this is to enable the scheduler to decide intelligently when sufficient space exists for the job to be launched, and also when sufficient space exists for jobs following it to be launched (at slightly later times when the initial job may be yet to achieve its full memory usage). Whereas -M imposes a constraint on a job from launch to exit, -R provides scheduling information which becomes irrelevant soon after launch. Clearly the two values for memory should be consistent, since no individual process should use more memory than will be required by the job as a whole (whatever the flavour of the job), but they are used differently and it is possible to submit jobs where these two statements are inconsistent.
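As a hypothetical illustration of consistent values: a 4-cpu job expected to use 500MB per cpu needs 4 x 500MB = 2000MB, i.e. 2048000KB, in total, giving:
| # BSUB -M 2048000 |
| # BSUB -R "rusage[mem=500:duration=15m:decay=1]" |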
A common error when editing submission scripts manually is omitting to edit the -R statement, or even leaving it out altogether. Both of these can result in LSF taking an incorrect value for the initial memory requirement of the job (possibly a largest-case value taken from the definition of the queue); this can cause undesirable effects such as the job never being scheduled (because the apparent memory needs are oversize and can never be met) and system problems resulting from a job being launched when there is insufficient memory available.
In LSF, making one job wait for another can be done through job dependencies.
These are specified either through the -w command line option to bsub, or equivalently by using a #BSUB -w directive in the submission script.
For example, to arrange for a job to not start until the job with id 1234 has finished (either successfully or with an error code), add
| #BSUB -w 'ended(1234)' |
to the submission script; alternatively, if 1234 must have finished successfully, do:
| #BSUB -w 'done(1234)' |
Equivalently, if you have a script called e.g. small.5678 already produced by the generator (in this case, by the small command), one could do
| bsub -w 'ended(1234)' < small.5678 |
from the command line, and so on.
Often the job ids involved in the dependencies will not be known because the jobs themselves have not yet been submitted. It is also possible to specify the jobs by name: the generator script asks for a job name when it runs, but this is actually supplied to LSF via the -J bsub option. The last example above would become as follows (if job 1234 is named jobname):
| bsub -w 'ended("jobname")' < small.5678 |
Please note the use of quotation marks in the places indicated above.
To submit a sequence of jobs to run one at a time in the order Job1, Job2, ... etc, one might do:
| # BSUB -J Job1 |
in the submission script for the first job (in order to name it);
| # BSUB -J Job2 |
| # BSUB -w 'done("Job1")' |
for the second (to ensure it won't start before Job1 successfully finishes);
| # BSUB -J Job3 |
| # BSUB -w 'done("Job2")' |
etc.
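Putting this together, the whole chain could then be submitted in order from the command line (script names hypothetical):
bsub < small.0001    # contains # BSUB -J Job1
bsub < small.0002    # contains # BSUB -J Job2 and the dependency on Job1
bsub < small.0003    # contains # BSUB -J Job3 and the dependency on Job2
|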
For further information, please see the LSF Documentation.
| F_UFMTENDIAN=type[:unit];type[:unit] |
where type is big for big-to-little endian conversion, and little to specify data already in little endian format (i.e. no conversion).
For example, if all your reads use data in big endian format you can use:
| (bash) $ export F_UFMTENDIAN=big |
| (tcsh) > setenv F_UFMTENDIAN big |
Or, if you need conversion only from a particular file assigned to unit 20:
| (bash) $ export F_UFMTENDIAN=big:20 |
| (tcsh) > setenv F_UFMTENDIAN big:20 |
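The two forms can be combined; for example, to treat only units 10 and 20 as big endian (a sketch based on the syntax above, assuming the comma-separated unit list accepted by the Intel compilers):
| (bash) $ export F_UFMTENDIAN="little;big:10,20" |
| (tcsh) > setenv F_UFMTENDIAN "little;big:10,20" |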
For a detailed description of the different methods, please refer to the Intel Fortran Manual.
The following C code may also be useful: a minimal sketch of a routine for swapping the byte order of 4-byte words (such as Fortran unformatted reals or integers) by hand.
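#include <stdint.h>

/* A sketch: reverse the byte order of a 4-byte word
   (big endian <-> little endian). */
uint32_t swap4(uint32_t x)
{
    return ((x >> 24) & 0x000000ffU) |
           ((x >>  8) & 0x0000ff00U) |
           ((x <<  8) & 0x00ff0000U) |
           ((x << 24) & 0xff000000U);
}
|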
Programs that need a large stack (for example, failing with a segmentation fault immediately at startup) may require the shell's stack size limit to be raised:
| $ ulimit -Ss unlimited |
For multithreaded (OpenMP) programs built with the Intel compilers, the stack size of the additional threads is controlled separately by the KMP_STACKSIZE environment variable, e.g.:
| $ export KMP_STACKSIZE=2gb |
If a dynamically linked program fails to start, ldd reveals which shared libraries cannot be found:
$ ldd ./myprogram
libguide.so => /opt/intel/compiler70/ia64/lib/libguide.so (0x2000000000048000)
libpthread.so.0 => /lib/libpthread.so.0 (0x2000000000104000)
librt.so.1 => /lib/librt.so.1 (0x20000000000c8000)
libcxa.so.6 => not found
libunwind.so.6 => not found
libc.so.6.1 => /lib/libc.so.6.1 (0x2000000000494000)
/lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
|
Loading the module corresponding to the compiler version used to build the program makes the missing libraries visible:
$ module load intel-compilers.8.0.66_46
$ ldd ./myprogram
libguide.so => /usr/local/rpm/cmplrs/8.0.66_46/lib/libguide.so (0x2000000000048000)
libpthread.so.0 => /lib/libpthread.so.0 (0x2000000000104000)
librt.so.1 => /lib/librt.so.1 (0x20000000000d0000)
libcxa.so.6 => /usr/local/rpm/cmplrs/8.0.66_46/lib/libcxa.so.6 (0x2000000000178000)
libunwind.so.6 => /usr/local/rpm/cmplrs/8.0.66_46/lib/libunwind.so.6 (0x2000000000494000)
libc.so.6.1 => /lib/libc.so.6.1 (0x20000000004c8000)
/lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
libdl.so.2 => /lib/libdl.so.2 (0x20000000001e4000)
|
$ cat > jobscript
cd /path/to/working/directory
module load mkl
dplace -x2 ./myprogram
Control-D
$ large jobscript
|
When adding your own directories to LD_LIBRARY_PATH, append to the existing value rather than overwriting it, i.e. change:
| export LD_LIBRARY_PATH=/my/path |
to
| export LD_LIBRARY_PATH=/my/path:$LD_LIBRARY_PATH |
The usefulness of dplace relies on there being local memory available to the processes of a job to begin with. The main (non-aux) batch queues attempt to encourage this by confining different jobs to disjoint sets of CPUs, and the use of dplace is generally recommended when using these queues. However, the presence of jobs with a large memory versus CPU ratio can spoil this (by stealing the memory local to other CPUs and forcing other jobs to use remote memory) - the "ideal" ratio on the current system is 1GB per CPU (since each 2-CPU Altix node contains 2GB RAM).
dplace can also be positively harmful, if used incorrectly. E.g., if two processes wish to perform work on the same CPU, allowing one process to migrate to a less busy CPU will probably result in better overall performance. However if both processes are bound via dplace to the same CPU, so that migration is impossible, they must share the cycles of a single CPU (and each run half as fast). The main (non-aux) batch queues keep their jobs separated, so processes from unrelated jobs cannot share a CPU, but the aux queue is designed specifically to make use of spare CPU cycles wherever they exist. Job processes in this queue must therefore be able to migrate to the least busy CPUs, and for this reason use of dplace is prevented in this queue.
The separation of different (non-aux) jobs does not prevent interference between processes of the same job when dplace is used incorrectly. A typical parallel job will contain a number of non-working "shepherd" processes, in addition to the processes performing the actual work. The latter should be equal in number to the number of CPUs allocated to the job by the batch system. Since dplace binds processes to CPUs simply in the order of their creation, it is vital that the non-working shepherds are skipped over and not bound - otherwise, because the number of CPUs available for binding is smaller than the number of processes being bound, it is highly probable that two workers will end up sharing a CPU, with predictable severe damage to performance. Unfortunately, it isn't necessarily obvious how to avoid this disaster for every subspecies of multiprocessor job, so although use of dplace is recommended where it is clear how to do so, in case of doubt not using it is safest (and permitted in the main queues).
The following table describes the "correct" use of dplace for various types of multiprocessor job. The yellow rows have not yet been tested on real jobs at the time of writing. The red rows may not provide much advantage over not using dplace at all due to uncertainty in the order of process creation (performance should be compared in the two cases). Note that for single-processor jobs there is no virtue in using dplace, because the job is confined to one CPU by the batch system.
| Type of job | Example with dplace | True job size (cpus) | No. shepherds |
|---|---|---|---|
| Simple parallel (OpenMP) | export OMP_NUM_THREADS=N; dplace -x2 ./a.out | N ≥ 2 | 1 |
| Simple parallel (MPI) | mpirun -np M dplace -s1 a.out | M ≥ 2 | 2 |
| Serial farm | dplace -ec 0 ./a.out.1 & dplace -ec 1 ./a.out.2 & ... dplace -ec M-1 ./a.out.M & wait | M ≥ 2 | 0 |
| Parallel farm | export OMP_NUM_THREADS=N; dplace -ec 0,x,1,...,N-1 ./a.out.1 & dplace -ec N,x,N+1,...,2N-1 ./a.out.2 & ... dplace -ec (M-1)N,x,...,MN-1 ./a.out.M & wait | M x N (N ≥ 2) | M |
| Hybrid parallel* | export OMP_NUM_THREADS=N; mpirun -np 4 dplace -x 481 cosmomc | 4N (N ≥ 2) | 6 |
| Hybrid parallel** | export OMP_NUM_THREADS=N; mpirun -np 8 dplace -x 130561 cosmomc | 8N (N ≥ 2) | 10 |
* Mileage may vary. This depends on timing, unfortunately: 481 is 111100001 in binary, which implies that processes 1 and 6-9 inclusive are to be skipped. But in this case the shepherds could in principle be spawned in a different chronological order, which could make this bitmask incorrect, although the disaster of two workers bound to the same CPU should be avoided by virtue of the fact that at least the right number of processes is being skipped. Similarly, -x2 (a skip bit mask of 10) implies that the second process created in a simple OpenMP job should be skipped, as this is (in the current Intel OpenMP implementation used in SGI ProPack 4) the single shepherd process for a simple OpenMP application.
** If this is actually beneficial, please let me know! 130561 is 11111111000000001 in binary, but any number smaller than 2^(8N+9) with 9 1's in its binary representation (e.g. 511) may be just as good, from the remarks above.
Is it trying to read or change a file which may be offline (i.e. contents have been transferred to tape)?
Remember that on some filesystems, older files are automatically migrated to tape.
For batch jobs, remember to check that the files required to be accessible are online, as described in the Quick start instructions. Another frequent scenario is the use of scp or sftp to transfer old data off the system - in both cases, offline files will be recalled individually on their first access, which is very inefficient and slow (the same tape will probably load and unload many times).
The commands dmfind or dmls are DMF-aware versions of find and ls respectively, which can be used to locate or list the DMF-state of migrated files.
Here is how to efficiently recall an entire directory in preparation for submitting a job or running a command which will need to access the files there.
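For example (a minimal sketch; the directory path is illustrative), change to the directory and issue one bulk recall, rather than letting each file be recalled individually on first access:
cd /path/to/directory
dmls -l      # inspect the DMF state of the files (OFL = offline)
dmget * &    # recall all offline files in a single bulk request
|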
An additional function of watchdog is to provide memory monitoring and cpu allocation features, which are needed for efficient operation of the Altix but are missing or unreliable in basic LSF.
It's easy to start ignoring watchdog messages, but for batch jobs these are only produced if a significant discrepancy is detected (e.g. a 25% variance between real and advertised memory usage for a large memory job) which may have performance implications, both for the particular job being scanned and for the system as a whole. Please don't ignore them - if you don't understand why you have received a message, or think you have received one inappropriately, please contact cosmos_sys.
The obvious way to run several independent sub-jobs within a single batch job is to create a basic jobscript firing each sub-job in turn with an & (see the user guide); e.g.
export OMP_NUM_THREADS=4
cd /path/to/directory1
./myprogram1 &
cd /path/to/directory2
./myprogram2 &
cd /path/to/directory3
./myprogram3 &
wait
|
for a 3x4=12 cpu job. Note however the two non-trivial features: each sub-job is launched in the background with a trailing &, and the final wait is essential, since without it the script (and hence the whole batch job) would exit as soon as the sub-jobs had been launched, rather than when they complete.
Note that jobs with a complicated structure (such as farms of M, N-cpu jobs) used to confuse watchdog when it tried to work out the number of cpus actually being used, but this issue is now resolved.
runCosmomc [options] <params_file> <number_of_chains>
Options:
--jobname <jobname> Job name (default: <params_file>)
--queue <queuename> COSMOS queue (default: small for cpus <= 8
large for cpus > 8)
--threads <threads> OpenMP threads per chain (default: 1)
--runtime <minutes> Wall clock run time in minutes
--size <megabytes> Total memory required in MB
--dplace Use dplace (default: yes)
--progname <progname> Name of the program binary (default: cosmomc)
Abbreviations and short forms of the options are possible.
Note that most of the above options have sensible defaults, so you can probably get away with simply:
runCosmomc -t 2 -r 120 params 4
for an 8-cpu vanilla cosmomc job with 4 chains (and 2 threads) using params.ini lasting 2 hours. A job not using OpenMP threads could omit the -t option (it defaults to 1); in the above example that would create a 4-cpu job. Also, omitting the -r option will result in the default value for the queue being used for the runtime limit (which for both large and small is currently 8 hours).
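Equivalently, using the long option forms listed above (the job name myrun is illustrative):
| runCosmomc --threads 2 --runtime 120 --jobname myrun params 4 |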
Each cosmomc job consists of M MPI processes (one per chain), each of which runs N OpenMP working threads, where N is given by OMP_NUM_THREADS, or the params.ini file option num_threads (if the latter is non-zero). Thus, the total number of working threads, and therefore the number of cpus to request for such a job to run properly, is MxN.
Prior to 15th December 2004, the automatic job submission script (qscribe) assumed that when a job requested X cpus, OMP_NUM_THREADS should also be set to X (and silently did so). For the majority of (non-hybrid) jobs, i.e. jobs which are either pure MPI or pure OpenMP, this was correct. However, in the case of a hybrid code like cosmomc this made it easy to launch vastly bigger jobs than intended (because OMP_NUM_THREADS could easily end up equal to the desired total size, which is then multiplied by the number of MPI threads to produce the actual overall size of the job). The most recent version of qscribe asks separately for the number of OpenMP threads; however, users modifying older scripts for use with cosmomc should take care to explicitly specify a reasonable size for OMP_NUM_THREADS in their prototype script, e.g.:
export OMP_NUM_THREADS=4
cd /path/to/directory
mpirun -np 2 dplace -x25 ./cosmomc params.ini
|
for an 8-cpu job; beware that a non-zero num_threads value in params.ini seems to override the value of OMP_NUM_THREADS. (See the next paragraph for an explanation of the use of dplace.)
Like other hybrid jobs, cosmomc creates a number of non-working shepherd processes in addition to its working threads (for a job of M MPI threads, where OMP_NUM_THREADS, or num_threads, is N, the total number of shepherds is M+2). For best performance it is recommended that dplace be used to bind the worker processes to specific cpus (this reduces overhead incurred by moving between cpus, reduces interference from "free-roaming" lower priority jobs, and increases the likelihood of memory accesses staying local to each cpu). The existence of shepherds implies that there are more processes making up the job than cpus allocated, since the latter are chosen to be only as numerous as the number of working processes. Ideally, the working processes will be placed (or bound) 1-1 to the allocated cpus, with the shepherds left to roam. The worst case scenario is multiple working processes bound to the same cpu, which would clearly lead to two or more job threads running at less than half speed; the same would probably then be true of the job as a whole (because the unhampered threads would have to wait for the slow ones to catch up). The upshot of this is that although for best performance we would like to tell dplace to skip placement for all shepherd processes, and only for the shepherds, we must at least skip placement of as many processes as there are shepherds to ensure that each cpu has no more than one process bound to it.
Below are listed the best-guess dplace cosmomc launch command lines for various sizes of job. Notice that they depend only on the number of MPI threads chosen, and not on the number of OpenMP threads. In theory these should give better performance than not using dplace at all. The order in which processes are created is significant, unfortunately, so these bitmasks may not always lead to only shepherds being skipped as processes are bound to each available cpu in turn. However they contain enough binary 1s to ensure at least that no cpu has more than one process attached to it, thus avoiding the worst case scenario. Other dplace commands are liable to have disastrous effects on performance, if this aspect is wrong. For more information, please see this faq.
| No. MPI threads | Example | No. shepherds |
|---|---|---|
| 1 | mpirun -np 1 dplace -x5 cosmomc params.ini | 3 |
| 2 | mpirun -np 2 dplace -x25 cosmomc params.ini | 4 |
| 3 | mpirun -np 3 dplace -x113 cosmomc params.ini | 5 |
| 4 | mpirun -np 4 dplace -x481 cosmomc params.ini | 6 |
| 8 | mpirun -np 8 dplace -x130561 cosmomc params.ini | 10 |
| 16 | mpirun -np 16 dplace -x8589803521 cosmomc params.ini | 18 |
| 32 | mpirun -np 32 dplace -x36893488138829168641 cosmomc params.ini | 34 |
NB When SGI ProPack is upgraded, the above numbers may change.