Submitting jobs under LSF
 



Contents and links

Introduction
Finding queue information
How to submit a job
Find the status of jobs
How to kill a job
How to suspend a job
How to resume a job
Other utilities
Platform LSF Documentation


Introduction

The new batch system is based on a different product, Platform LSF, which replaces Cray NQE and offers benefits such as job-level run-time resource monitoring and load balancing. As a result, old qsub scripts (those containing #QSUB style headers) will not work; they need to be regenerated using the updated automatic writer (see How to submit a job).

Please check that you can run the command

bqueues

and see a list of queues. If this command produces an error, there may be a system or an account problem - in any event please email cosmos_sys.
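The check above can be wrapped in a small shell function, for instance at the top of a submission script. This is only a sketch: the function name check_lsf is hypothetical and not part of LSF, and it assumes nothing more than bqueues being on the PATH when the LSF environment is set up correctly.

```shell
# check_lsf: hypothetical helper, not part of LSF.
# Succeeds only if bqueues can be found and runs without error.
check_lsf() {
    if ! command -v bqueues >/dev/null 2>&1; then
        echo "bqueues not found: LSF environment may not be set up" >&2
        return 1
    fi
    if ! bqueues >/dev/null 2>&1; then
        echo "bqueues failed: possible system or account problem" >&2
        return 1
    fi
    echo "LSF queues are visible"
}
```

Calling check_lsf before any submission step gives an early, readable failure instead of an error part-way through.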

The LSF queues are broadly similar to the former NQE queues, with project-specific queues of various sizes (cam-long, cam-spec etc), a high priority daytime queue with universal access (day-test), and low priority universal access reserve queues (res-long, res-med etc). For an overview of the principles behind these please see COSMOS batch queues.


Finding queue information

Please refer to this summary table of queue parameters and this table of queue run-times.

For more detail on a specific queue do e.g.

bqueues -l cam-long

At the top of the output you should see a description string summarising the intent of the queue and the size of the expected jobs therein:

QUEUE: cam-long
  -- CAM Long (Cambridge project only) queue. 16 cpus, 8GB memory, 8 hours maximum per cpu.

Near the bottom, you should see the DISPATCH_WINDOW and USERS parameters:

DISPATCH_WINDOW: 1:18:00-2:17:59 3:18:00-4:17:59 5:18:00-6:23:59 0:0:0-0:17:59

USERS: cam/ 

which shows when cam-long is active, i.e. able to launch jobs, and from whom it can accept jobs. Times are written day:hour:minute with day 0 being Sunday, so 0:0:0 is 00:00 on Sunday and 1:09:00 is 9am on Monday; the first weekly window for this queue therefore runs from Monday 18:00 until Tuesday 17:59. The USERS line shows that cam users are allowed to submit to this queue.

These times represent the same weekly project time slots as previously. As before, jobs can be submitted any time; however at the end of a project slot, jobs are no longer killed. They acquire instead an increased susceptibility to suspension, which is automatically applied to lower priority jobs when the load becomes excessively high.
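The DISPATCH_WINDOW notation described above can be turned into day names with a few lines of awk. The following is a minimal sketch, assuming every window has the full day:hour:minute-day:hour:minute form shown in the cam-long example; the function name decode_windows is hypothetical.

```shell
# decode_windows: hypothetical helper, not part of LSF.
# Reads whitespace-separated windows of the form day:hour:min-day:hour:min
# (day 0 = Sunday) on stdin and prints one readable line per window.
decode_windows() {
    awk '
    BEGIN { split("Sun Mon Tue Wed Thu Fri Sat", name, " ") }
    # Turn "1:18:00" into "Mon 18:00"
    function pretty(t,   p) {
        split(t, p, ":")
        return sprintf("%s %02d:%02d", name[p[1] + 1], p[2], p[3])
    }
    {
        for (i = 1; i <= NF; i++) {
            split($i, ends, "-")
            print pretty(ends[1]) " to " pretty(ends[2])
        }
    }'
}
```

For example, echo "1:18:00-2:17:59 0:0:0-0:17:59" | decode_windows prints "Mon 18:00 to Tue 17:59" followed by "Sun 00:00 to Sun 17:59".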


How to submit a job

Create a simple file containing the necessary job start commands, e.g.

cat > jobscript
cd /home/cosmos-tmp/sjr20/mpijob
mpirun -np 16 mpiprog < input.dat > output.dat
<Control-D>

Note that the above example is an MPI job, but follow exactly the same procedure for a job using shared-memory parallelism (the OMP_NUM_THREADS environment variable will be set automatically by the next command).
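The same job script can also be written non-interactively with a shell here-document, which may be convenient when job scripts are generated by another script. A minimal sketch, reusing the directory and program names from the example above; the helper name write_jobscript is hypothetical.

```shell
# write_jobscript FILE: hypothetical helper that writes the example
# job start commands into FILE using a here-document.
# The quoted 'EOF' stops the shell from expanding anything in the body.
write_jobscript() {
    cat > "$1" <<'EOF'
cd /home/cosmos-tmp/sjr20/mpijob
mpirun -np 16 mpiprog < input.dat > output.dat
EOF
}
```

Running write_jobscript jobscript produces the same two-line file as the interactive cat session.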

Then to submit this to cam-long, do

cam-long jobscript

The automatic writer script will ask questions about the resources the job will need. It explains the implications of these choices and under what circumstances the job might be killed, or not start as a result of the values given.

Please be as accurate as possible.

Giving overlarge numbers may mean that LSF cannot find a suitable window in which to dispatch your job. On the other hand, giving numbers which are wild underestimates in order to ensure dispatch has the potential to bring COSMOS down - don't do this.

Please note that memory requirements are currently used as a guide for scheduling only - a job will not be killed for exceeding its stated memory limit (COSMOS has never done this).

Finally the script optionally submits the job to the chosen queue, or saves the submission script for manual editing or deferred submission.

Note that to submit such an automatically written script, possibly after manual edits, one should do

bsub < cam-long.xxxx

whereas in NQE one might have done qsub cam-long.xxxx (without the <).


Find the status of jobs

To examine job status, use

bjobs

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
274     sjr20   PEND  cam-super  cosmos.amtp                        Sep  4 20:15
275     sjr20   PEND  cam-super  cosmos.amtp                        Sep  4 20:16 

to list all your own jobs and

bjobs -u all

to list the jobs of all users. Appending the JOBID to bjobs restricts attention to the corresponding job, i.e.

bjobs 275

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
275     sjr20   PEND  cam-super  cosmos.amtp                        Sep  4 20:16 

Adding the -l option produces more information, e.g. to find out why LSF has not dispatched job 275 (whose STATus is PENDing) one could do

bjobs -l 275

which produces among other verbose information:

 PENDING REASONS:
 User has reached the per-user job slot limit of the queue;

Jobs are initially pending (PEND), while they are awaiting scheduling and dispatch, then when dispatched for execution they enter the RUN state (see table below).

LSF job status values

STAT            Explanation
PEND            Job is pending (not yet started).
PSUSP           Job has been suspended by the user while pending.
RUN             Job is currently running.
USUSP           Job has been suspended by the user while running.
SSUSP           Job has been suspended by the system while running.
DONE            Job has exited normally (exit value 0).
EXIT            Job has exited abnormally (exit value non-zero).
UNKWN or ZOMBI  Indicates some system problem. Please contact cosmos_sys.
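When many jobs are listed, it can help to summarise how many are in each state. The sketch below assumes the default bjobs column layout shown earlier, with a numeric JOBID in the first column and STAT in the third; the function name count_states is hypothetical.

```shell
# count_states: hypothetical helper, not part of LSF.
# Reads bjobs-style output on stdin and prints each STAT value with the
# number of jobs in that state, sorted by state name.
# Only lines whose first field is a numeric JOBID are counted, so the
# header line and any informational messages are ignored.
count_states() {
    awk '$1 ~ /^[0-9]+$/ { n[$3]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

Typical use would be bjobs -u all | count_states. For the two-job listing shown earlier this prints "PEND 2".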


How to kill a job

First find the jobid from bjobs. Then:

bkill jobid

By default, this sends SIGTERM and then after a short delay SIGKILL.

More generally, to send a specific signal, e.g. SIGKILL, directly to jobid, do either:

bkill -s 9 jobid

or

bkill -s KILL jobid

Sending the SIGSTOP signal to sequential jobs or the SIGTSTP to parallel jobs is the same as using bstop.

Sending the SIGCONT signal is the same as using bresume.
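Because the default bkill sends SIGTERM before SIGKILL, a job script has a brief opportunity to tidy up after itself. The fragment below is a minimal sketch of that idea using the standard shell trap builtin; the scratch directory name and the cleanup function are hypothetical, and it assumes the job script itself receives the SIGTERM.

```shell
# Hypothetical job-script fragment: tidy up when bkill delivers SIGTERM.
scratch=${TMPDIR:-/tmp}/myjob.$$      # assumed scratch location

cleanup() {
    rm -rf "$scratch"
    echo "job terminated, scratch removed" >&2
}

# SIGKILL cannot be trapped, so only TERM is handled here.
# 143 is the conventional exit status for death by SIGTERM (128 + 15).
trap 'cleanup; exit 143' TERM
```

Note that a shell runs a trap handler only once the command currently in the foreground has returned, so the handler should be quick enough to finish before SIGKILL arrives.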


How to suspend a job

Jobs can be suspended by the owner while pending (status becomes PSUSP) or while running (status becomes USUSP) using bstop.

To suspend job jobid do

bstop jobid

Running sequential jobs are sent the SIGSTOP signal and running parallel jobs the SIGTSTP signal in order to suspend them.

Alternatively bkill -s STOP can be used to achieve the same effect.


How to resume a job

Jobs which are in either the PSUSP or USUSP states can be resumed by the owner using bresume.

To resume job jobid do

bresume jobid

Jobs that were suspended while running (USUSP) are sent the SIGCONT signal.

Alternatively bkill -s CONT can be used to achieve the same effect.


Other utilities

Another useful utility is bmod (see man bmod) which modifies job parameters after submission.

For additional information on all the above utilities, refer to Running Jobs with Platform LSF.

