Frequently Asked Questions


Authors: Victor Travieso and Stuart Rankin
Last updated:


Contents:

  1. What is COSMOS?
  2. How do I apply for a COSMOS account?
  3. How do I log in/submit a job to COSMOS?
  4. I haven't used COSMOS recently, is there anything I need to know?
  5. What if I want to run interactively?
  6. Why can't I access some web documentation?
  7. Why can't I start X applications on COSMOS?
  8. What compiler should I be using?
  9. Why can't I find the Intel compilers?
  10. Why is module not working?
  11. Why is top filling the screen with rubbish?
  12. Why are LSF commands (e.g. bjobs, bsub) not working?
  13. Why has my job entered the small queue with strange parameters?
  14. I manually edited a submission script to change the memory requirements - why won't LSF schedule the job?
  15. How do I arrange that a batch job won't be started before another is finished?
  16. How can I convert data in Big Endian format (e.g. from SGI Origin) to Little Endian in order to read it with a Fortran program on the Altix?
  17. What is causing this compiler error: "(Insert error here). Severe!"
  18. What is the cause of segmentation faults?
  19. Why am I getting "Error while loading shared libraries..." when running the program?
  20. My application is displaying "Unaligned Access" errors when running. What is the problem?
  21. Why is mpirun refusing to work?
  22. What is dplace and how do I use it?
  23. Why is my job or command not doing anything?
  24. What is watchdog and why is it emailing me?
  25. How do I farm several smaller jobs to make one larger multiprocessor job?
  26. How do I run cosmomc on COSMOS?
  27. My files disappeared!
  28. I deleted an important file - how do I get it back?

What is COSMOS?

COSMOS is a supercomputing facility dedicated to research in cosmology.

It was founded in January 1997 by a consortium of leading UK cosmologists, brought together by Stephen Hawking. It is funded by SGI, Intel, HEFCE and PPARC.

At the time of writing (May 2004), COSMOS has just entered its sixth incarnation, a 152cpu/152GB SGI Altix 3700. This platform is at the forefront of (distributed) shared memory HPC technology (as were the SGI Origin systems preceding it). It allows both shared memory and distributed memory (MPI) cosmology codes to be run with equal ease and with great flexibility and reliability. It operates a single system image across all processors and memory, creating a close similarity in feel to an ordinary linux workstation, and provides a particularly attractive environment for young researchers acquiring HPC experience.

How do I apply for a COSMOS account?

Normally it is necessary for all potential users of COSMOS to have a link to one of the consortium sites at Cambridge, Durham, Imperial College, Manchester, Oxford, Portsmouth or Sussex, and to obtain the support of a consortium investigator.

The application form can be downloaded. Please also see the usage guidelines.

How do I log in/submit a job to COSMOS?

Please see the Quick Start instructions.

I haven't used COSMOS recently, is there anything I need to know?

SGI ProPack 4 was adopted in May 2006. For more information, please refer to this page.

What if I want to run interactively?

This is permissible for development and testing, but not for production codes, as described in the user guide.

Please note that all login sessions and interactive work are automatically confined to the first 8 cpus, so please be careful not to take an unreasonable proportion of these resources. Watchdog limits the total amount of cpu time any one login session can accumulate, but it is still possible to (accidentally) cause problems.

Please remember when running interactively:

To set OMP_NUM_THREADS explicitly to a value of 8 or less when running an OpenMP code.
If the number of cpus is not set explicitly, OpenMP jobs will by default size themselves according to the total size of the machine, which can massively overload the first 8 cpus. E.g., for 1<N≤8:

(bash) $ export OMP_NUM_THREADS=N
(tcsh) > setenv OMP_NUM_THREADS N

Not to use dplace.
The interactive cpus are shared by all logins, so jobs should not be bound to cpus.

Why can't I access some web documentation?

Some web pages referred to from the COSMOS web documentation (like this) may give a Forbidden error when viewed from machines outside the local DAMTP network. The reason for this is that some proprietary technical documentation cannot be made viewable to the entire world, and an effort has been made to confine access to registered COSMOS hosts.

These pages are served from a different web server, configured to give access only to the set of IP addresses corresponding to known, registered COSMOS hosts (i.e. the same set from which ssh logins are possible).

Unfortunately this is insufficient when, for example, all web traffic from your site is routed through a proxy (which is probably not registered), or you wish to read the documentation from a home machine or laptop which has not been registered (and which may not even have a static IP address which can be registered).

Bad solution:  Use ssh -X to connect to a registered host, or (worse) COSMOS itself, then launch a browser on the remote registered host, displaying it locally. This can be very slow and frustrating, particularly over low bandwidth connections (forget it over a modem), although it is possible as an emergency manoeuvre. However, please don't run browsers on COSMOS itself. NEVER connect to a registered host using an insecure method like telnet or rlogin and from there to COSMOS via SSH - the COSMOS password will be completely unencrypted and exposed on the first leg of its journey from your local machine.

Good solution: Configure port forwarding in your local SSH client so that some or all of the traffic from/to your local web browser is automatically redirected over an SSH connection to a host from which all pages are visible (e.g. a registered host or COSMOS itself). From the point of view of the restricted web server, all your requests to read pages would then be coming from an allowed host.

There is more than one way to use port forwarding to accomplish this. Your local web browser needs to support SOCKS proxy hosts at minimum, and preferably also PAC scripts (modern versions of Netscape, Internet Explorer and Mozilla satisfy these requirements, except that IE apparently doesn't on the Mac).
  1. The neatest way is to use a Proxy Auto Configuration (PAC) script so that only COSMOS-related traffic uses the SSH tunnel (thus avoiding the need to reconfigure your browser when it is no longer desirable to channel all your web traffic through the remote network).

    For example, in Netscape7+ or Mozilla, choose Edit/Preferences from the menu; then open the Advanced/Proxies panel and enter the URL below as your Automatic proxy configuration setting:

    http://www.damtp.cam.ac.uk/cosmos/proxyCOSMOS.pac.

  2. Alternatively, choose Manual proxy configuration and enter localhost as SOCKS Host and 9870 for Port.

Once your browser is reconfigured in one of the above ways, establish an SSH tunnel to an allowed host via a command such as:

ssh -D 9870 userid@allowedhost.knowndept.knowninst.ac.uk

This form is appropriate for the OpenSSH ssh client running on Unix. You may need to restart your browser for the new proxy settings to take effect. Note that Mozilla and Netscape7+ support multiple profiles - the COSMOS proxy configuration settings could be saved in a new profile, making it easy to switch between the COSMOS configuration and your normal settings.

When the SSH connection is established, and using one of the proxy settings described, all web pages on the DAMTP network, including those from the restricted server, will be fetched through the tunnel, from an allowed host which has the required access privileges. The difference between methods (1) and (2) above is that whereas in (1), only pages from the DAMTP network are fetched through the SSH tunnel, in (2) all web traffic is fetched via this route. (2) suffers several disadvantages: firstly, pages fetched from sites other than DAMTP will take slightly longer to arrive; secondly, browsing will completely stop working if the SSH connection is terminated (until the browser is reconfigured) and, thirdly, it may not be appropriate for all web fetches to be redirected through someone else's network.
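
As a quick check that the tunnel itself is up (a hedged sketch: the host name is the same illustrative one as above, and a local curl with SOCKS support is assumed to be installed), fetch a COSMOS page through the proxy from a second local window:

(local window 1) $ ssh -D 9870 userid@allowedhost.knowndept.knowninst.ac.uk
(local window 2) $ curl --socks5 localhost:9870 http://www.damtp.cam.ac.uk/cosmos/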

More information on this aimed at DAMTP users, including details of the setup when using putty under Windows, can be found here (draft).

Why can't I start X applications on COSMOS?

In principle the SSH protocol allows tunnelling of connections from X windows applications through the encrypted channel so that when started on COSMOS they display normally on your local X display. Most of the time, this works transparently so you can e.g. simply type:

emacs &

in your COSMOS login window and emacs will appear in its own X window on your screen (this obviously requires you to be running an X server on your local machine, and to be able to display local X programs on it).

If this fails with an error such as:

cosmos:~ 14:54:16$ xterm&
xterm Xt error: Can't open display:

then X forwarding may not be enabled by default by your ssh program. You can ensure that it is (for the OpenSSH client) by adding the -X option to the ssh command line (similarly, under Windows turn on X forwarding in the putty preferences). Note that on very recent versions of the OpenSSH client, some applications may still not start (or may die with strange errors) unless you use the -Y option instead.
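
If you connect regularly from the same machine, X forwarding can also be switched on permanently in your local OpenSSH client configuration rather than remembering -X/-Y each time. A sketch only; the Host pattern below is illustrative and should match the name you actually use to reach COSMOS:

# in ~/.ssh/config on your local machine
Host cosmos*
    ForwardX11 yes
    # on recent OpenSSH clients this is the configuration equivalent of -Y
    ForwardX11Trusted yes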

When X forwarding is enabled, it is not necessary to set DISPLAY manually or to add X cookies - doing so will (if done correctly) simply send X traffic directly to your local machine outside the encrypted channel (which may fail anyway due to firewalls), while doing it incorrectly may produce an error such as the one below.

Sometimes X forwarding is correctly enabled, but X applications still fail with an error such as:

X11 connection rejected because of wrong authentication.

If you see this, and you aren't manually setting DISPLAY on COSMOS or explicitly adding X cookies (see above), then check that your quota on your home directory has not been reached by issuing the command:

quota -v

(an asterisk against the first figure of the /dev/cxvm/xvm94-4_cosmos entry indicates that the maximum usage has been reached). When this occurs, no new X cookies can be created in your ~/.Xauthority file, which means that X applications cannot pass the correct authentication data to your X server and so fail to display. The solution is to reduce your usage (by deleting files, or transferring them to your home machine or to other COSMOS filesystems) and then to log in again.

What compiler should I be using?

As you might expect from an environment closely resembling RedHat Linux, the GCC compilers gcc/g77 (3.2.3) are available. However, although these can be used to build minor single-processor programs and utilities, codes run for serious work on COSMOS should use the Intel compilers unless there are overriding compatibility reasons for using GCC. For more comment see the development guide.

The system default Intel compilers are icc (C/C++) and ifort (Fortran). As of the upgrade to ProPack 4, these are version 9.0.

The version 8.1 compilers remain available through modules - e.g. the module icomp81 loads the most recent version of the 8.1 compilers (see below).

Because Itanium is a relatively new platform, the IA64 Intel compilers are very much a work in progress, as evidenced by the high rate of new compiler releases. Rather than change the default compilers every few weeks, newer releases are made available by packaging them into modules, which can be loaded and unloaded (switched on and off) by issuing a module command.

The full list of available modules can be printed by issuing the command:

module avail

For example, the system default compilers can be replaced by the most recent Intel version 8.1 compilers with the following command (note that this affects only the compilers; the numerical libraries visible remain the system default versions):

module load icomp81

Conversely,

module unload icomp81

restores the previous state. The command:

module list

lists the currently loaded modules.

The most recent Intel Math Kernel library is loaded via the module command:

module load mkl

Note that the same module commands used when compiling a program should be issued in the job submission script when running the program in the batch queues.
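
For example, if a program was built with the 8.1 compilers and MKL loaded, the corresponding job script might contain lines such as the following before the program is launched (a sketch; the directory and program name are illustrative):

module load icomp81
module load mkl
cd /path/to/working/directory
./myprogram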

Why can't I find the Intel compilers?

Note there are a number of different sets of compilers on the system (see this question) but all users should be able to see the default Intel 9.0 compilers, icc and ifort. If these commands aren't found, they are probably not in your PATH. This can happen if you are missing the necessary settings in your startup files, because e.g. you have modified the default startup settings or you possess an ordinary DAMTP or old COSMOS account that has not been adapted for the new COSMOS.

The startup file changes required for the Intel compilers are detailed below, but to always find the complete, current recommended settings, refer to the template files under /home/cosmos/template.

For C-Shell users, check if the following statements are found in your .cshrc file, and add them if not; to activate any changes run the command source ~/.cshrc (or simply log in again):

if (-r /opt/intel/setup.csh) then
    source /opt/intel/setup.csh
endif

If you use bash, the same applies to the following lines in your .bashrc file (activate changes by running . ~/.bashrc - note the initial "." -  or log in again):

if [ -r /opt/intel/setup.sh ]; then
    . /opt/intel/setup.sh
fi
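
After adding the lines and re-sourcing the startup file (or logging in again), a quick check that the compilers are now on your PATH (a sketch; the exact version string printed depends on the installed compiler release):

$ which icc ifort
$ ifort -V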

Why is top filling the screen with rubbish?

The default behaviour of top is to print a summary information line for each CPU at the head of the display. On a large system like COSMOS, this fills the screen so that the main display becomes invisible. To work around this known issue, press t to eliminate the summary info. To stop having to do this every time, press W, which will store the setting in ~/.toprc.

Why is module not working?

If the module command is not in your PATH, or you don't seem to be finding any modules (e.g. when typing module avail), it is probably because you are missing the necessary settings in your startup files. This can happen if you have modified the default startup settings or you possess an ordinary DAMTP or old COSMOS account that has not been adapted for the new COSMOS.

The startup file changes required for modules are detailed below, but to always find the complete, current recommended settings, refer to the template files under /home/cosmos/template.

For C-Shell users, check if the following statements are found in your .cshrc file, and add them if not; to activate any changes run the command source ~/.cshrc (or simply log in again):

if ( ${UNAME} =~ IRIX* ) then
    setenv MODROOT /usr/local/inst/opt/modules/modules
    module load modules
endif

if ( ${UNAME} == Linux ) then
    setenv MODROOT /usr/local/rpm/modules/default
endif

if ($?MODROOT) then
    if (-f ${MODROOT}/init/tcsh) then
      source ${MODROOT}/init/tcsh
      setenv BASH_ENV ${MODROOT}/init/bash
    endif
endif

If you use bash, the same applies to the following lines in your .bashrc file (activate changes by running . ~/.bashrc - note the initial "." -  or log in again):

case `uname` in
    IRIX*)
         MODROOT=/usr/local/inst/opt/modules/modules
        ;;
    Linux)
         MODROOT=/usr/local/rpm/modules/default
        ;;
esac
if [ -f ${MODROOT}/init/bash ]; then
   . ${MODROOT}/init/bash
fi

Why are LSF commands (e.g. bjobs, bsub) not working?

If LSF commands such as bjobs, bsub etc aren't found, they are probably not in your PATH. This can happen if you are missing the necessary settings in your startup files, because e.g. you have modified the default startup settings or you possess an ordinary DAMTP or old COSMOS account that has not been adapted for the new COSMOS.

The startup file changes required for LSF are detailed below, but to always find the complete, current recommended settings, refer to the template files under /home/cosmos/template.

For C-Shell users, check if the following statements are found in your .cshrc file, and add them if not; to activate any changes run the command source ~/.cshrc (or simply log in again):

if ( -e /home/cosmos/lsf/conf/cshrc.lsf ) then
   setenv MANPATH /usr/share/man:/usr/man:/usr/local/man:
   source /home/cosmos/lsf/conf/cshrc.lsf 
endif

If you use bash, the same applies to the following lines in your .bashrc file (activate changes by running . ~/.bashrc - note the initial "." -  or log in again):

if [ -r /home/cosmos/lsf/conf/profile.lsf ]; then
   export MANPATH=/usr/share/man:/usr/man:/usr/local/man:
   . /home/cosmos/lsf/conf/profile.lsf
fi
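
Once the LSF settings are in place, the LSF commands should resolve and the queues should be visible, e.g. (a quick check; output not shown):

$ which bsub bjobs
$ bqueues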

Why has my job entered the small queue with strange parameters?

It may happen that a job is found to have entered the small queue (when possibly another queue was intended) with unexpected parameters, which may result in the job never being scheduled, or other undesired effects.

This is usually due to a pre-prepared submission script being manually submitted incorrectly to bsub as a command instead of as standard input, i.e. via:

bsub large.nnnn

instead of (as suggested by the submission dialogue):

bsub < large.nnnn

Note that the bsub command expects to receive the parameters of the job request either through command line options, or as structured (#BSUB) comments in standard input. Any non-option arguments are simply interpreted as a command to run, so in the first case above bsub never looks inside the large.nnnn file for the job details (queue, memory, cpu number etc); instead default values are applied (e.g. the default queue, if none is supplied, is currently small).

Old users may remember the qsub command of Cray NQE, which performed the same function as bsub in LSF but did not take the <.

I manually edited a submission script to change the memory requirements - why won't LSF schedule the job?

In general we don't recommend manually editing submission scripts generated previously by the queuename commands and submitting them to create new jobs, or writing submission scripts from first principles, as there are a number of subtleties which the automatic method takes care of, and which manual methods can easily miss. One such subtlety is the fact that the memory requirements of a job need to be stated in TWO places:
# BSUB -M total_mem_in_KB
# BSUB -R "rusage[mem=mem_per_cpu_in_MB:duration=15m:decay=1]"

The difference between these two statements (apart from the syntax, and the fact that one is given in terms of total job memory in KB and the other in terms of per-cpu memory in MB) is as follows. The first (-M) statement imposes an operating system limit on each process of the job which none may exceed without punitive action being taken (by the operating system); this protects the system from runaway jobs taking excessive amounts of memory. The second (-R) statement tells LSF how much memory will need to be found for the job and how quickly; this enables the scheduler to decide intelligently when sufficient space exists for the job to be launched, and also when sufficient space exists for the jobs following it (at slightly later times, when the initial job may not yet have reached its full memory usage). Whereas -M imposes a constraint on a job from launch to exit, -R provides scheduling information which becomes irrelevant soon after launch. Clearly the two values should be consistent, since no individual process should use more memory than will be required by the job as a whole (whatever the flavour of the job), but they are used differently and it is possible to submit jobs in which the two statements disagree.

A common error when editing submission scripts manually is omitting to edit the -R statement, or even leaving it out altogether. Both of these can result in LSF taking an incorrect value for the initial memory requirement of the job (possibly a largest-case value taken from the definition of the queue); this can cause undesirable effects such as the job never being scheduled (because the apparent memory needs are oversize and can never be met) and system problems resulting from a job being launched when there is insufficient memory available.
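
As a concrete illustration of keeping the two statements consistent (a hypothetical example, not a script generated by the queue commands): an 8-cpu job expected to use 16000 MB (16 GB) in total would specify 16000000 (KB) for -M and 16000/8 = 2000 (MB per cpu) for -R:

# BSUB -n 8
# BSUB -M 16000000
# BSUB -R "rusage[mem=2000:duration=15m:decay=1]"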

How do I arrange that a batch job won't be started before another is finished?

In LSF, this can be done through job dependencies.

These are specified either through the -w command line option to bsub, or equivalently by using a #BSUB -w directive in the submission script.

For example, to arrange for a job to not start until the job with id 1234 has finished (either successfully or with an error code), add

#BSUB -w 'ended(1234)'

to the submission script; alternatively, if 1234 must have finished successfully, do:

#BSUB -w 'done(1234)'

Equivalently, if you have a script called e.g. small.5678 already produced by the generator (in this case, by the small command), one could do

bsub -w 'ended(1234)' < small.5678

from the command line, and so on.

Often the job ids involved in the dependencies will not be known because the jobs themselves have not yet been submitted. It is also possible to specify the jobs by name: the generator script asks for a job name when it runs, but this is actually supplied to LSF via the -J bsub option. The last example above would become as follows (if job 1234 is named jobname):

bsub -w 'ended("jobname")' < small.5678

Please note the use of quotation marks in the places indicated above.

To submit a sequence of jobs to run one at a time in the order Job1, Job2, ... etc, one might do:

# BSUB -J Job1

in the submission script for the first job (in order to name it);

# BSUB -J Job2
# BSUB -w 'done("Job1")'

for the second (to ensure it won't start before Job1 successfully finishes);

# BSUB -J Job3
# BSUB -w 'done("Job2")'

etc.
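
If the submissions themselves are scripted, the job id printed by bsub (in its usual "Job <nnnn> is submitted to queue ..." message) can be captured and fed into the dependency of the next submission. A hedged sketch, with illustrative script names:

JOBID=$(bsub < large.1111 | sed 's/Job <\([0-9]*\)>.*/\1/')
bsub -w "ended($JOBID)" < large.2222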

For further information, please see the LSF Documentation.

How can I convert data in Big Endian format (e.g. from SGI Origin) to Little Endian in order to read it with a Fortran program on the Altix?

There are a variety of ways to achieve this easily. If you want to avoid recompiling the program, you can use the environmental variable F_UFMTENDIAN to specify a list of I/O units and type of conversion desired. The syntax is:

F_UFMTENDIAN=type[:unit];type[:unit]

where type is big for big to little endian conversion, and little to specify data in little endian format (i.e. no conversion).

For example, if all your reads use data in big endian format you can use:

(bash) $ export F_UFMTENDIAN=big
(tcsh) > setenv F_UFMTENDIAN big

Or, if you need conversion only from a particular file assigned to unit 20:

(bash) $ export F_UFMTENDIAN=big:20
(tcsh) > setenv F_UFMTENDIAN big:20

For a detailed description of the different methods, please refer to the Intel Fortran Manual.

The following C-code may also be useful.

What is causing this compiler error: "(Insert error here). Severe!"

The error could be caused by some illegal construction in your code, so read the error message carefully and try to relate it to a particular section in the source.

More commonly, you may have stumbled upon a compiler bug (they are, sadly, quite common). If this seems to be the case, please email us with enough information for us to reproduce the error (e.g. the error message, the path to your source and Makefile, and the compiler version used). We will look into it and, if necessary, submit a bug report using the COSMOS Intel Premier Support accounts. In some cases we may be able to find a workaround that allows the code to compile correctly in spite of the bug. The actual fix must come from Intel in the form of a patch (which usually takes a minimum of four weeks).

What is the cause of segmentation faults?

The most common cause when running interactively is an inadequate stack size. You can increase the stack up to the system maximum of 8 GB with:

$ ulimit -Ss unlimited

If an OpenMP application gets a segmentation fault immediately upon execution, you may need to increase the thread's private stack by setting the KMP_STACKSIZE environmental variable:

$ export KMP_STACKSIZE=2gb

- though this is unlikely to be necessary, as KMP_STACKSIZE should already be set (via the cosmolib module) to a respectable 1 GB. The value can be given with a unit modifier (e.g. 2gb for 2000 MB).

More general segmentation faults are usually caused by out-of-bounds memory references in arrays or subroutine calls. For these types of error, the most efficient way to proceed is to recompile the code for debugging using '-g' and to run the program interactively under a debugger to locate the illegal access. You can refer here for a brief introduction to debugging on COSMOS.
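
A minimal sketch of that workflow using gdb (file and program names are illustrative; the debugging introduction linked above may recommend a different debugger, such as Intel's idb):

$ ifort -g -O0 -o myprogram myprogram.f90
$ gdb ./myprogram
(gdb) run
(gdb) backtrace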

Why am I getting "Error while loading shared libraries..." when running the program?

This normally means that a required run time library cannot be found by the dynamic linker when the program starts. You can check the run time libraries needed by your application with ldd, and adjust LD_LIBRARY_PATH (manually or via modules) accordingly. E.g.:

$ ldd ./myprogram
        libguide.so => /opt/intel/compiler70/ia64/lib/libguide.so (0x2000000000048000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x2000000000104000)
        librt.so.1 => /lib/librt.so.1 (0x20000000000c8000)
        libcxa.so.6 => not found
        libunwind.so.6 => not found
        libc.so.6.1 => /lib/libc.so.6.1 (0x2000000000494000)
        /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)

Here, myprogram was compiled with the Intel Fortran Compiler version 8; however, the OpenMP run time library (libguide) is being picked up from the default compiler (version 7, at the time of writing) path, and the version 8 specific libraries (libcxa.so.6, libunwind.so.6) are not found. Loading the compiler module adds the necessary paths to the environment and resolves the missing and incorrect libraries:

$ module load intel-compilers.8.0.66_46
$ ldd ./myprogram
        libguide.so => /usr/local/rpm/cmplrs/8.0.66_46/lib/libguide.so (0x2000000000048000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x2000000000104000)
        librt.so.1 => /lib/librt.so.1 (0x20000000000d0000)
        libcxa.so.6 => /usr/local/rpm/cmplrs/8.0.66_46/lib/libcxa.so.6 (0x2000000000178000)
        libunwind.so.6 => /usr/local/rpm/cmplrs/8.0.66_46/lib/libunwind.so.6 (0x2000000000494000)
        libc.so.6.1 => /lib/libc.so.6.1 (0x20000000004c8000)
        /lib/ld-linux-ia64.so.2 => /lib/ld-linux-ia64.so.2 (0x2000000000000000)
        libdl.so.2 => /lib/libdl.so.2 (0x20000000001e4000)

Note that if a module is loaded in the shell from where you are submitting a job to the queue, the module environment will normally be inherited by the job script. However, if you launch the program from a different shell window or you log in again at a later time you will need to reload the module. It is advisable to explicitly load the module in the job script. By doing so, you ensure that the correct libraries are loaded and also set a reminder to yourself of which compiler version was used to compile the program. E.g.:

$ cat > jobscript
cd /path/to/working/directory
module load mkl
dplace -x2 ./myprogram
Control-D
$ large jobscript

The module command inside the job script will always work if you have the default COSMOS startup environment. If you encounter problems refer to this question.

My application is displaying "Unaligned Access" errors when running. What is the problem?

This is actually a warning message, not an error. The program should still run correctly and produce the right results.

Unaligned accesses are caused by data that isn't stored with the proper alignment to memory boundaries. This can cause small delays when loading the data from memory, but it shouldn't result in a big performance loss unless the loads are located inside a time consuming loop.

Proper alignment can be ensured by reordering the declaration of variables in the code and by instructing the compiler to explicitly solve alignment problems (options -align -pad for the Intel Fortran Compiler).
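
For example, to recompile a Fortran code with the alignment options mentioned above (file names illustrative):

$ ifort -align -pad -o myprogram myprogram.f90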

Why is mpirun refusing to work?

Mpirun creates the equivalent of a fresh login session for every job, so if any of your startup settings overwrite, say, the library paths, any of the settings introduced via modules will be lost.

To check if this is causing the problems try running ldd on your application or try running the program without mpirun and look for linker errors. If everything looks fine you will need to check your startup files for any lines that set the LD_LIBRARY_PATH and add the current environment to it. E.g. change:

export LD_LIBRARY_PATH=/my/path

to

export LD_LIBRARY_PATH=/my/path:$LD_LIBRARY_PATH

If running ldd shows missing libraries refer to this question.
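
Because the problem only shows up in the fresh session that mpirun creates, it can also help to inspect the library resolution inside that environment rather than in your login shell. A hedged sketch (assuming, as in the examples elsewhere in this FAQ, that mpirun will launch an arbitrary command):

$ mpirun -np 1 ldd ./myprogram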

What is dplace and how do I use it?

dplace is one of the SGI NUMA tools. It allows the user to bind the important (i.e. working) processes belonging to an application to specific CPUs, so that memory initially allocated locally by such processes remains local (the processes being unable to migrate to more distant CPUs). Not doing this can lead to an unnecessarily high ratio of remote to local memory accesses, and thus degraded performance.

The usefulness of dplace relies on there being local memory available to the processes of a job to begin with. The main (non-aux) batch queues attempt to encourage this by confining different jobs to disjoint sets of CPUs, and the use of dplace is generally recommended when using these queues. However, the presence of jobs with a large memory versus CPU ratio can spoil this (by stealing the memory local to other CPUs and forcing other jobs to use remote memory) - the "ideal" ratio on the current system is 1GB per CPU (since each 2-CPU Altix node contains 2GB RAM).

dplace can also be positively harmful, if used incorrectly. E.g., if two processes wish to perform work on the same CPU, allowing one process to migrate to a less busy CPU will probably result in better overall performance. However if both processes are bound via dplace to the same CPU, so that migration is impossible, they must share the cycles of a single CPU (and each run half as fast). The main (non-aux) batch queues keep their jobs separated, so processes from unrelated jobs cannot share a CPU, but the aux queue is designed specifically to make use of spare CPU cycles wherever they exist. Job processes in this queue must therefore be able to migrate to the least busy CPUs, and for this reason use of dplace is prevented in this queue.

The separation of different (non-aux) jobs does not prevent interference between processes of the same job when dplace is used incorrectly. A typical parallel job will contain a number of non-working "shepherd" processes, in addition to the processes performing the actual work. The latter should be equal in number to the number of CPUs allocated to the job by the batch system. Since dplace binds processes to CPUs simply in the order of their creation, it is vital that the non-working shepherds are skipped over and not bound - otherwise, because the number of CPUs available for binding is smaller than the number of processes being bound, it is highly probable that two workers will end up sharing a CPU, with predictable severe damage to performance. Unfortunately, it isn't necessarily obvious how to avoid this disaster for every subspecies of multiprocessor job, so although use of dplace is recommended where it is clear how to do so, in case of doubt not using it is safest (and permitted in the main queues).

The following table describes the "correct" use of dplace for various types of multiprocessor job. Some rows have not yet been tested on real jobs at the time of writing; others (see the notes marked * and ** below) may not provide much advantage over not using dplace at all, due to uncertainty in the order of process creation (performance should be compared in the two cases). Note that for single-processor jobs there is no virtue in using dplace, because the job is confined to one CPU by the batch system.

Simple parallel (OpenMP) - true job size: N cpus (N ≥ 2); shepherds: 1
    export OMP_NUM_THREADS=N
    dplace -x2 ./a.out

Simple parallel (MPI) - true job size: M cpus (M ≥ 2); shepherds: 2
    mpirun -np M dplace -s1 a.out

Serial farm - true job size: M cpus (M ≥ 2); shepherds: 0
    dplace -ec 0 ./a.out.1 &
    dplace -ec 1 ./a.out.2 &
    ...
    dplace -ec M-1 ./a.out.M &
    wait

Parallel farm - true job size: M x N cpus (N ≥ 2); shepherds: M
    export OMP_NUM_THREADS=N
    dplace -ec 0,x,1,...,N-1 ./a.out.1 &
    dplace -ec N,x,N+1,...,2N-1 ./a.out.2 &
    ...
    dplace -ec (M-1)N,x,...,MN-1 ./a.out.M &
    wait

Hybrid parallel* - true job size: 4N cpus (N ≥ 2); shepherds: 6
    export OMP_NUM_THREADS=N
    mpirun -np 4 dplace -x 481 cosmomc

Hybrid parallel** - true job size: 8N cpus (N ≥ 2); shepherds: 10
    export OMP_NUM_THREADS=N
    mpirun -np 8 dplace -x 130561 cosmomc

* Mileage may vary. This depends on timing, unfortunately: 481 is 111100001 in binary, which implies that processes 1 and 6-9 inclusive are to be skipped. But in this case the shepherds could in principle be spawned in a different chronological order, which could make this bitmask incorrect, although the disaster of two workers bound to the same CPU should be avoided by virtue of the fact that at least the right number of processes is being skipped. Similarly, -x2 (a skip bit mask of 10) implies that the second process created in a simple OpenMP job should be skipped, as this is (in the current Intel OpenMP implementation used in SGI ProPack 4) the single shepherd process for a simple OpenMP application.
** If this is actually beneficial, please let me know! 130561 is 11111111000000001 in binary, but any number smaller than 2^(8N+9) with 9 1's in its binary representation (e.g. 511) may be just as good, from the remarks above.

Why is my job or command not doing anything?

If there is a batch job which has not yet started to run, use the command bjobs -pl to find the reason it is still pending in the queue. Otherwise, if the job or command is running but not apparently doing work, see below.

Is it trying to read or change a file which may be offline (i.e. contents have been transferred to tape)?

Remember that on some filesystems, older files are automatically migrated to tape.

For batch jobs, remember to check that the files the job needs to access are online, as described in the Quick start instructions. Another frequent scenario is the use of scp or sftp to transfer old data off the system - in both cases, offline files will be recalled individually on first access, which is very inefficient and slow (the same tape will probably be loaded and unloaded many times).

The commands dmfind or dmls are DMF-aware versions of find and ls respectively, which can be used to locate or list the DMF-state of migrated files.

Here is how to efficiently recall an entire directory in preparation for submitting a job or running a command which will need to access the files there.
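
A hedged sketch of what such a bulk recall typically looks like with the standard DMF user commands (assuming dmget is available alongside dmls; please check the instructions linked above for the locally recommended procedure):

$ dmls -l /path/to/data        # files shown as offline (e.g. state OFL) live on tape
$ dmget /path/to/data/*        # queue a single bulk recall for the whole directory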

What is watchdog and why is it emailing me?

Watchdog is a script run by cron every 10 minutes. Its two basic functions are:

  1. Monitoring resource usage in the 8cpu/8GB interactive sector of the machine and enforcing the interactive use time limit;
  2. Monitoring resource usage by batch jobs and sending email warnings if any are behaving unexpectedly (i.e. in a way that indicates a possible user error or problem that may affect global system performance). In the case of aux queue jobs only, errant jobs may be terminated if too many resources are being taken away from the higher priority queues.

An additional function is to provide memory monitoring and cpu allocation features, which are needed for efficient operation of the Altix but missing or unreliable in basic LSF.

It's easy to start ignoring watchdog messages, but for batch jobs these are only produced if a significant discrepancy is detected (e.g. a 25% variance between real and advertised memory usage for a large memory job) which may have performance implications, both for the particular job being scanned and for the system as a whole. Please don't ignore them - if you don't understand why you have received a message, or think you have received one inappropriately, please contact cosmos_sys.

How do I farm several smaller jobs to make one larger multiprocessor job?

It is possible to submit jobs to the batch queueing system which are composed of several sub-jobs - e.g. to run 8 serial programs simultaneously as a single 8-cpu batch job. Note that there should be no need to do precisely this with the new queueing system, since the small queue accepts single-cpu jobs, but in principle a job of size MxN can be submitted, composed of M distinct N-cpu sub-jobs. This is often referred to as "farming" jobs (a kind of trivial parallelism).

The obvious way to do this is to create a basic jobscript firing each sub-job in turn with an & (see the user guide); e.g.

export OMP_NUM_THREADS=4
cd /path/to/directory1
./myprogram1 &
cd /path/to/directory2
./myprogram2 &
cd /path/to/directory3
./myprogram3 &
wait

for a 3x4=12 cpu job. Note however the two non-trivial features:

  1. OMP_NUM_THREADS must be set in the user script (unless the sub-jobs are MPI), otherwise the submission script generator will set this to 12, resulting in a massively overlarge job (36-cpu in this example);
  2. the final wait command is needed to stop the script exiting prematurely - if this happens, the batch job itself will exit, leaving the sub-jobs running effectively interactively.

Note that jobs with a complicated structure (such as farms of M, N-cpu jobs) used to confuse watchdog when it tried to work out the number of cpus actually being used, but this issue is now resolved.

How do I run cosmomc on COSMOS?

Cosmomc is an example of a hybrid MPI/OpenMP code: each chain is evolved inside a single MPI thread, and each of these uses CAMB, which distributes work within a chain over OpenMP threads. This introduces some complications when submitting a cosmomc job to the batch queues.

runCosmomc
Although it is perfectly possible to submit cosmomc like any other job, the considerations described in the rest of this answer can be handled most simply by using the script runCosmomc to submit the job to the queues. runCosmomc is used as follows:

runCosmomc [options] <params_file> <number_of_chains>

Options:
          --jobname <jobname>       Job name (default: <params_file>)
          --queue <queuename>       COSMOS queue (default: small for cpus <= 8
                                                           large for cpus  > 8)
          --threads <threads>       OpenMP threads per chain (default: 1)
          --runtime <minutes>       Wall clock run time in minutes
          --size <megabytes>        Total memory required in MB
          --dplace                  Use dplace (default: yes)
          --progname <progname>     Name of the program binary (default: cosmomc)
Abbreviations and short forms of the options are possible.

Example:

runCosmomc --job JOB1 -q small -t 2 -r 120 -s 100 --dplace params 4

will submit a cosmomc job using the binary called cosmomc and the file params.ini found in the current working directory; the job will use 4 chains with 2 OpenMP threads each (so 8 cpus overall), will last for 120 minutes, will use 100MB of memory in total, and will be submitted to the small queue with the name JOB1. The correct use of dplace in the job is taken care of by passing the --dplace option to runCosmomc - this is actually on by default. To disable dplace use --nodplace; however, use of dplace is strongly recommended for performance.

Note that most of the above options have sensible defaults, so you can probably get away with simply:

runCosmomc -t 2 -r 120 params 4

for an 8-cpu vanilla cosmomc job with 4 chains (and 2 threads each) using params.ini and lasting 2 hours. A job not using OpenMP threads could omit the -t option (it defaults to 1); in the above example that would create a 4-cpu job. Also, omitting the -r option means the queue's default runtime limit is used (currently 8 hours for both large and small).

Overall cpu number and thread numbers
A cosmomc binary using a CAMB built with OpenMP will spawn M working MPI threads via mpirun -np M cosmomc, each of which will then split into N working OpenMP threads, where N is specified either through the environment variable OMP_NUM_THREADS, or the params.ini file option num_threads (if the latter is non-zero). Thus, the total number of working threads, and therefore the number of cpus to request for such a job to run properly, is MxN.

Prior to 15th December 2004, the automatic job submission script (qscribe) assumed that when a job requested X cpus, OMP_NUM_THREADS should also be set to X (and silently did so). For the majority of (non-hybrid) jobs, i.e. jobs which are either pure MPI or pure OpenMP, this was correct. However in the case of a hybrid code like cosmomc this made it easy to launch vastly bigger jobs than intended (because OMP_NUM_THREADS could easily end up equal to the desired total size, which is then multiplied by the number of MPI threads to produce the actual overall size of the job). The most recent version of qscribe asks separately for the number of OpenMP threads, however users modifying older scripts for use with cosmomc should take care to explicitly specify a reasonable size for OMP_NUM_THREADS in their prototype script, e.g.:

export OMP_NUM_THREADS=4
cd /path/to/directory
mpirun -np 2 dplace -x25 ./cosmomc params.ini 

for an 8-cpu job; beware that a non-zero num_threads value in params.ini seems to override the value of OMP_NUM_THREADS. (See the next paragraph for an explanation of the use of dplace.)

Extra (shepherd) processes complicating cpu placement
The second complication when setting up cosmomc to run on COSMOS is the appearance of additional non-working, or "shepherd", processes. Each pure OpenMP job (in the current version of SGI ProPack, version 4) produces 1 auxiliary ("shepherd") process with the same name as the program, which does not itself perform significant work, and which therefore should not contribute to the number of cpus allocated to the job. In the pure MPI case, there is also an auxiliary process with the same name as the program, plus the mpirun process. This becomes more complicated in the hybrid case, where in addition to the two MPI shepherds, one more is created for each of the MPI threads using OpenMP; so in the above case, where we use mpirun -np M cosmomc and OMP_NUM_THREADS (or num_threads) is N, the total number of shepherds is M+2.

For best performance it is recommended that dplace be used to bind the worker processes to specific cpus (this reduces overhead incurred by moving between cpus, reduces interference from "free-roaming" lower priority jobs, and increases the likelihood of memory accesses staying local to each cpu). The existence of shepherds implies that there are more processes making up the job than cpus allocated, since the latter are chosen to be only as numerous as the number of working processes. Ideally, the working processes will be placed (or bound) 1-1 to the allocated cpus, with the shepherds left to roam. The worst case scenario is multiple working processes bound to the same cpu, which would clearly lead to two or more job threads running at less than half speed; the same would probably then be true of the job as a whole (because the unhampered threads would have to wait for the slow ones to catch up). The upshot of this is that although for best performance we would like to tell dplace to skip placement for all shepherd processes, and only for the shepherds, we must at least skip placement of as many processes as there are shepherds to ensure that each cpu has no more than one process bound to it.

Below are listed the best-guess dplace cosmomc launch command lines for various sizes of job. Notice that they depend only on the number of MPI threads chosen, and not on the number of OpenMP threads. In theory these should give better performance than not using dplace at all. The order in which processes are created is significant, unfortunately, so these bitmasks may not always lead to only shepherds being skipped as processes are bound to each available cpu in turn. However they contain enough binary 1s to ensure at least that no cpu has more than one process attached to it, thus avoiding the worst case scenario. Other dplace commands are liable to have disastrous effects on performance, if this aspect is wrong. For more information, please see this faq.

No. MPI threads Example No. shepherds
1 mpirun -np 1 dplace -x5 cosmomc params.ini 3
2 mpirun -np 2 dplace -x25 cosmomc params.ini 4
3 mpirun -np 3 dplace -x113 cosmomc params.ini 5
4 mpirun -np 4 dplace -x481 cosmomc params.ini 6
8 mpirun -np 8 dplace -x130561 cosmomc params.ini 10
16 mpirun -np 16 dplace -x8589803521 cosmomc params.ini 18
32 mpirun -np 32 dplace -x36893488138829168641 cosmomc params.ini 34

NB When SGI ProPack is upgraded, the above numbers may change.
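
The skip masks in the table all follow the pattern 1 + (2^M - 1) x 2^(M+1), where M is the number of MPI threads (this is simply an observation about the numbers above, not an official formula - see the remarks on process creation order in the dplace FAQ). A small shell sketch for computing the mask for other values of M, using bc so that large values do not overflow:

M=16
echo "1 + (2^$M - 1) * 2^($M + 1)" | bc
# prints 8589803521, matching the 16-thread row of the table above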