COSMOS User Guide


NB: THIS IS THE OLD COSMOS WEBSITE. PLEASE SEE THE NEW WEBSITE FOR UP TO DATE INFO - http://www.cosmos.damtp.cam.ac.uk/


Started by Stuart Rankin May 2004

Last updated: by Tristram Scott, April, 2013.

Note: from November 2012, ALL users applying for a new or renewed account on COSMOS@DiRAC systems should follow the SAFE registration procedures for DiRAC, described on the DiRAC Wiki page. The COSMOS User application form above only needs to be filled in by members of the COSMOS Consortium.


Quick start

For security reasons, all COSMOS@DiRAC access is restricted to secure shell (SSH) logins, which must recognise both you and the workstation from which you are logging on remotely. The application form and the SAFE registration pages have a field in which the (fully qualified) domain names or IP addresses of your workstation(s) can be supplied; additional hosts can be added on request later by contacting cosmos_sys.

New users will be contacted with details of their userid and password. Once these have been received, use e.g. any SSH (ver.2) client to log in to COSMOS as:

ssh -X userid@universe.damtp.cam.ac.uk

or

ssh -X userid@cosmos.damtp.cam.ac.uk

Note that from within the DAMTP (wired) network, it is usually sufficient to say:

ssh -X universe

or
ssh -X cosmos

All COSMOS users, including those from DAMTP, will have their accounts preconfigured to use the COSMOS application stack in an optimal way. Please note that, for technical reasons, BASH is the default shell for all users, without exception. (This does not preclude anyone from running C-shell scripts, of course.)

All COSMOS accounts share the environment set up using resource files symbolically linked from /home/cosmos/template/

Personal preferences (shortcuts, aliases, paths, etc.) can be customised using .bashrc.local. If any changes are made to the existing files, log in again to make them effective.
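As an illustration, a personal .bashrc.local might look like the sketch below. The alias names, the ~/bin directory and the choice of editor are invented for this example and are not part of the COSMOS template files.

```shell
# ~/.bashrc.local : personal additions (hypothetical example)

# Aliases are only expanded in interactive shells by default.
alias ll='ls -l'
alias myjobs='showq | grep "$USER"'

# Put a personal bin directory first in the search path, without duplicating it.
case ":$PATH:" in
  *":$HOME/bin:"*) ;;
  *) PATH="$HOME/bin:$PATH" ;;
esac
export PATH

# Preferred editor (an arbitrary choice for this example).
export EDITOR=vi
```

The case statement keeps PATH from growing each time the file is read.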

X-windows applications should work transparently across the (encrypted) SSH connection (if this doesn't seem to work, please ensure that you use ssh flag '-X'). See Using SSH for more details.

The basic features of the COSMOS filesystems are as follows:

  1. Each user has a backed-up home directory for long-term storage under /home/cosmos/users (the quota is about 45GB and 100K files); please be sensible and clean up your folders regularly.
  2. Each user can write to a non-backed up directory under /home/cosmos-tmp intended for medium-term scratch work.

Initially, interactive use is confined (transparently) on cosmos or universe to the first 12 cpus, which are shared by all active users; there are special interactive queues designed to handle larger development/analysis jobs. See below.

Batch jobs are submitted using job submission scripts (jobscripts, for short). A jobscript is a simple file, called e.g. myjob.sub, containing the commands necessary to request system resources and to start the job. The job can then be submitted with the command:

msub myjob.sub

After submission the job status can be monitored via the command showq.

Jobs can be killed before or during execution via the command canceljob.
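A minimal jobscript might look something like the sketch below, which writes myjob.sub from the shell. The #MSUB lines use generic Moab syntax, and the job name, CPU count, walltime and the program my_program are assumptions for illustration; the exact resource request form expected on COSMOS may differ. The queue name small and the 8hr limit are taken from the batch queue table.

```shell
# Write a minimal jobscript to myjob.sub (sketch; directive values are assumptions).
cat > myjob.sub <<'EOF'
#!/bin/bash
# Job name (hypothetical).
#MSUB -N mytest
# Queue to run in.
#MSUB -q small
# Number of CPUs; generic Moab syntax, may differ on COSMOS.
#MSUB -l procs=4
# Wall-clock limit, within the 8hr queue maximum.
#MSUB -l walltime=01:00:00

# Match the thread count to the number of CPUs requested.
export OMP_NUM_THREADS=4

# my_program is a placeholder for your own executable.
./my_program > my_program.log 2>&1
EOF

# Submit it with:
#   msub myjob.sub
```

Each directive has its comment on a separate line, so the scheduler never has to parse trailing text after an option.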

See below for the more detailed examples of job submission.


Secure Shell (SSH)

Logging in

COSMOS

  1. Obtain your user id and register the local machine from which you wish to connect with the Cambridge administration.

  2. Issue the command:

    ssh userid@cosmos.damtp.cam.ac.uk

  3. Enter your password when prompted.

  4. X clients running on COSMOS will normally display without any setenv DISPLAY, xhost or xauth preliminaries.

UNIVERSE

universe.damtp.cam.ac.uk is the facility's second compute and login system, identical to cosmos.damtp.cam.ac.uk. COSMOS users can access this system via SSH in the same way as cosmos.damtp.cam.ac.uk itself.

Why use Secure Shell?

Secure Shell is being actively developed as a secure replacement for the common UNIX commands, rlogin, rsh, rcp, and rdist. It uses strong authentication methods to establish secure communications with a remote computer.
  1. By default all information transmitted over the network after the initial machine-to-machine connection is strongly encrypted, including the user's password and any data sent back by e.g. an X application.
  2. SSH is easy to use and indeed much more straightforward to use with X windows than rlogin or rsh.
  3. By default, all X programs are directed down the secure (encrypted) channel to the local machine, and are thus also safe from prying eyes whilst in transit. The DISPLAY at the remote end can still be set in the usual way if desired, in which case the X connections will be directed along normal, insecure channels. Only in the case of old DGL applications on COSMOS such as buttonfly (which are not pure X applications) might this be desirable.

Obtaining Secure Shell

Most sites nowadays have some form of SSH installed centrally. The current implementation on COSMOS will operate with most of these.

A suite of free SSH tools for Win32 platforms (Win95 and later) is also available.


Running jobs


Interactive use

We support several different classes of interactive use, to allow code development, debugging, job monitoring or post-processing.

Login nodes (cosmos and universe - at the time of writing)

These are intended for most common interactive tasks such as code development, compilation, optimization, job submission and data analysis. Use of login nodes is probably sufficient for the majority of users and tasks, but it is important for all to understand the limitations of using login nodes for interactive work.

Most simply, one can run programs straightforwardly on the command line of a login node. Note that provided you have an X server on your local machine, and you enable X-forwarding in your SSH connection (e.g. the -X or -Y options to ssh), then X-windows applications launched on a login node should display on your screen.

The login nodes are similar in terms of hardware to the batch compute nodes, so it is possible to run small MPI jobs on them for testing purposes using shared memory. However, the login nodes are finite, shared resources and any such use must respect other users. In particular, parallel jobs must be short (i.e. minutes), use no more than 2-4 cores and no more than 2GB of memory per core, and should be niced (prefixed with nice -19) so as not to impact interactive responsiveness.
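A short test run within these limits could be wrapped as in the sketch below. The function name run_niced, the mpirun launcher, the process count and ./my_test are assumptions for illustration. The sketch uses nice -n 19, which requests the same niceness as nice -19 in POSIX form.

```shell
# Run any command at the lowest scheduling priority (niceness 19).
run_niced() {
    nice -n 19 "$@"
}

# Intended use on a login node (placeholders, not run here):
#   run_niced mpirun -np 4 ./my_test

# Quick check: nice with no arguments prints the niceness it is running at.
run_niced nice
```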

If you find that you need to make such runs more often than occasionally, or for longer periods, then it may be more appropriate to employ the batch-interactive use described below - antisocial monopolisation of a login node will probably receive harsh treatment from the system administrators.

Please note that interactive use is not appropriate for production code runs which should be performed via the batch queues.

Please take care to set the environment variable OMP_NUM_THREADS to an appropriately small value when testing OpenMP jobs interactively (otherwise it is possible to launch enough worker threads to fill the entire system on only 8 cpus), and don't use dplace (to facilitate sharing). To set OMP_NUM_THREADS to N in bash, the default shell for all users:

$ export OMP_NUM_THREADS=N
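To guard against an oversized value being inherited from elsewhere, the setting can be capped before a test run, as in the sketch below. The limit of 4 threads is an illustrative choice, not a COSMOS setting, and the sketch assumes OMP_NUM_THREADS holds a single integer.

```shell
# Cap OMP_NUM_THREADS at a small value before an interactive OpenMP test.
# max_threads=4 is an illustrative limit, not a COSMOS setting.
max_threads=4
requested=${OMP_NUM_THREADS:-$max_threads}

if [ "$requested" -gt "$max_threads" ]; then
    requested=$max_threads
fi

export OMP_NUM_THREADS=$requested
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```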

Batch queues

Queue    Access       Min CPUs  Default CPUs  Max CPUs  Max memory  Max real time  Max CPUs
                      per job   per job       per job   per job     per job        per queue

super    restricted*  32        64            128       384GB       8hr            144
large    all          16        16            64        256GB       8hr            144
small    all          1         4             16        96GB        8hr            144
express  all          1         4             16        128GB       2hr            148

* For access to the super queue, please contact cosmos_sys with details of your job requirements

Notes

How to view job status

To examine job status, use the showq command:
showq

How to kill a job

First find the jobid from showq. Then:

canceljob jobid

By default, this sends SIGTERM and then after a short delay SIGKILL.

More generally, to send, e.g., SIGKILL direct to jobid, do either:

COSMOS Hierarchical Fairshare policy

Fair use of resources is controlled by MOAB's Hierarchical Fairshare feature.


General guidelines

Please note the (soft) quotas applying to each filesystem (above); the hard quotas are slightly higher and allow up to 7 days' use in excess of the soft limit. To check your own quota information, issue the command:

quota -v

Files which are no longer needed for local processing must be deleted, or moved off the facility - please contact cosmos_sys for advice on how to do this for large amounts of data. All users must take care that their codes only output necessary data and use compression if appropriate to reduce the size of output files (e.g. using gzip, or bzip2).
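As a small illustration of the compression step, the sketch below creates a stand-in output file, compresses it with gzip and checks the result. The file name run_output.dat and its contents are invented for the example.

```shell
# Create a sample text output file (stand-in for real simulation output).
out=run_output.dat
i=0
while [ "$i" -lt 2000 ]; do
    echo "step $i energy 0.000000"
    i=$((i + 1))
done > "$out"

before=$(wc -c < "$out")

# Compress in place; gzip replaces run_output.dat with run_output.dat.gz.
gzip -9 -f "$out"
after=$(wc -c < "$out.gz")

echo "size before: $before bytes, after: $after bytes"

# Verify the archive is intact before relying on it.
gzip -t "$out.gz"
```

Repetitive text output of this kind usually compresses well; binary data that is already compact may gain much less.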

Data unrelated to code running on COSMOS must not be stored on its filesystems.


Usage guidelines

This section details the guidelines by which COSMOS users have agreed to abide. There are two distinct categories of user guidelines: (i) those required by the University of Cambridge and (ii) those specified by the CCC consortium and the COSMOS team. The guidelines presume that all users will cooperate in sharing this resource efficiently and politely.

University of Cambridge IT conditions of use

The rules made by the University of Cambridge are largely in common with those of any academic institution. For COSMOS users a summary of these is provided for convenience in the following web page:

University of Cambridge Information Technology Syndicate rules

By logging onto COSMOS, users automatically agree to abide by these rules and guidelines. The user application form assumes that they have been read and understood.

Further conditions of use

A. General

The following additional conditions of use are consistent with regulations set by national centres such as the EPCC. In addition, because of our matching funds arrangement with SGI/Intel, users are obliged to acknowledge our sponsors.
  1. Sharing of accounts or passwords is not permitted under any circumstances. (Project members have group access to each other's files so this is not required).
    Please note that it is a vital part of security that passwords are chosen properly. Dictionary words in particular (in any language) offer little if any protection against modern cracking programs. Passwords must be at least 8 characters in length, contain non-alphanumeric characters and avoid any elements derived from personal information such as name, nationality, institution, location etc, or from media references or car registration plates. Tests will occasionally be performed to detect weak passwords.

    Note that the security of the facility rests primarily on the security of user passwords. Do NOT write your COSMOS password on a Post-it note and stick it to your screen! If you access the facility from X windows, do NOT run the command xhost + as this is how we were hacked the last time it happened. (The user's home institution was hacked first, incidentally.)

  2. Users should maintain a collaborative link to one of the three CCC centres. Inactive users may be suspended (or resources reduced) after six months.
  3. Users may only make use of COSMOS for the purposes outlined in their User or Project Application forms and they are obliged to inform the CCC when this work has been completed.
  4. All users are under obligation to furnish the CCC with a report of the progress of their work as requested, on at least an annual basis.
  5. Publications of results from work performed on COSMOS must note the use of this COSMOS@DiRAC facility which is supported by STFC/DBIS UK, while also including the following acknowledgement to our sponsors:

B. Interactive use

The primary function of COSMOS is to perform large-scale numerical simulations; the following code of conduct for interactive use ensures the focus on this key purpose:
  1. COSMOS is available for interactive use - that is, compiling programs, pre- and post-processing data, submitting batch jobs etc. - only during weekday office hours; these are defined to be 10am-6pm, Monday-Friday.
  2. Project members may log on to monitor and submit jobs to batch queues outside of office hours.
  3. All significant cpu intensive applications must be submitted to the batch queues. Interactive jobs taking longer than 30 minutes of cumulative processor time will be terminated automatically - note that these are confined by the operating system to an 8-cpu sector of the machine, and that the express queue exists for test jobs.
  4. Small jobs which can be performed on workstations at local institutions should not be submitted to COSMOS.

C. Sharing batch queues

  1. Fairness: Individual users and project members must ensure that they are not using an unfair share of resources, particularly when working interactively (the batch queueing system enforces fair usage of resources automatically).
  2. Efficiency: Users must ensure that their jobs use computer resources efficiently, fulfilling stated scalability criteria. They must also optimize their code and use efficient algorithms. Processor usage efficiency is also easily monitored on the Altix.
  3. Users should not in general stack up more than ten jobs in any one queue.
  4. A job submitted under a specific batch queue must conform at run-time to the stated processor range and memory and cpu limits. If not, the job may be terminated automatically by the watchdog program, or by the system administrator if global performance is adversely affected by the unexpected demand on resources.
  5. Batch jobs will not necessarily run in the order in which they have been submitted: MOAB implements a "hierarchical fairshare" policy to ensure fair usage and to prevent single users or projects dominating COSMOS for extended periods of time.
  6. Jobs should not automatically resubmit themselves. This is to avoid the development of infinite loops and other antisocial behaviour by failing scripts. Instead requests which continue earlier jobs should be submitted explicitly, and exit if the restart is not successful.
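The last point can be sketched as a guard at the top of a continuation jobscript: the script checks for a usable restart file and exits if there is none, and it never calls msub itself. The function name check_restart, the file checkpoint.dat, the program my_program and its --restart option are all hypothetical.

```shell
# Continuation jobscript guard: refuse to run unless a usable restart file exists.
# checkpoint.dat is a hypothetical file name.
check_restart() {
    restart_file=$1
    # -s is true only if the file exists and is not empty.
    if [ ! -s "$restart_file" ]; then
        echo "restart file '$restart_file' missing or empty; not continuing" >&2
        return 1
    fi
    return 0
}

# In the jobscript itself (placeholders, shown as comments):
#   check_restart checkpoint.dat || exit 1
#   ./my_program --restart checkpoint.dat
# The script does not call msub; the next continuation is submitted by hand.
```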

D. Disks and Tapes

Further information on storage is available.

The large temporary partition on COSMOS is available to all users and project members. Consequently, they must not exceed their fair share of disk space and so interfere with the work of others.

  1. Users must not substantially exceed their recommended user or project allocation of disk space without prior permission.
  2. Don't store data on /tmp.
  3. Be sure to recall data back from tape before you run your batch job.

E. Constructive attitudes

Please note that we cannot offer the same level of user support and help as the heavily staffed national centres.

There is a well-defined mechanism by which to request help from the COSMOS team if you are experiencing difficulties - see Getting help.


More information