Submitting jobs under LSF
The new system is based on an entirely different product, Platform LSF (previously it was Cray NQE), which offers many benefits such as job-level run-time resource monitoring and load balancing. This immediately implies that any old qsub scripts (containing #QSUB-style headers) won't work - they will need regenerating using the updated automatic writer (see below).
Please check that you can run the command

bqueues

and see a list of queues. If this command produces an error, there may be a system or an account problem - in either event please email cosmos_sys.
The LSF queues are broadly similar to the former NQE queues, with project-specific queues of various sizes (cam-long, cam-spec etc), a high priority daytime queue with universal access (day-test), and low priority universal access reserve queues (res-long, res-med etc). For an overview of the principles behind these please see COSMOS batch queues.
For more detail on a specific queue do e.g.
bqueues -l cam-long
At the top of the output you should see a description string summarising the intent of the queue and the size of the expected jobs therein:
QUEUE: cam-long -- CAM Long (Cambridge project only) queue. 16 cpus, 8GB memory, 8 hours maximum per cpu.
Near the bottom, you should see the DISPATCH_WINDOW and USERS parameters:
DISPATCH_WINDOW: 1:18:00-2:17:59 3:18:00-4:17:59 5:18:00-6:23:59 0:0:0-0:17:59
USERS: cam/
which shows when cam-long is active, i.e. able to launch jobs, and from whom it can accept jobs. Midnight on Sunday is 0:0:0, 9am on Monday is 1:09:00 etc, so that the first weekly window for this queue is from Monday 18:00 until Tuesday 17:59. Also from the above we see that cam users are allowed to submit to this queue.
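The day:hour:minute encoding is easy to decode by hand, but for clarity here is a small sketch (our own helper, not an LSF tool) that translates a single window into readable form:

```shell
# Translate an LSF dispatch window such as "1:18:00-2:17:59" into words.
# Day 0 is Sunday, day 1 is Monday, and so on.
decode_window() {
    local days=(Sunday Monday Tuesday Wednesday Thursday Friday Saturday)
    local start=${1%-*} end=${1#*-}
    echo "${days[${start%%:*}]} ${start#*:} to ${days[${end%%:*}]} ${end#*:}"
}

decode_window 1:18:00-2:17:59    # Monday 18:00 to Tuesday 17:59
```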
These times represent the same weekly project time slots as previously. As before, jobs can be submitted any time; however at the end of a project slot, jobs are no longer killed. They acquire instead an increased susceptibility to suspension, which is automatically applied to lower priority jobs when the load becomes excessively high.
Create a simple file containing the necessary job start commands, e.g.
cat > jobscript
cd /home/cosmos-tmp/sjr20/mpijob
mpirun -np 16 mpiprog < input.dat > output.dat
<Control-D>
Note that the above example is an MPI job, but follow exactly the same procedure for a job using shared-memory parallelism (the OMP_NUM_THREADS environment variable will be set automatically by the next command).
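The same file can also be written non-interactively with a here-document, which is convenient inside setup scripts (the paths and program names are the same illustrative ones as above):

```shell
# Write the jobscript without interactive input; the terminating EOF line
# plays the role of Control-D above.
cat > jobscript <<'EOF'
cd /home/cosmos-tmp/sjr20/mpijob
mpirun -np 16 mpiprog < input.dat > output.dat
EOF
```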
Then to submit this to cam-long, do
The automatic writer script will ask questions about the resources the job will need. It explains the implications of these choices and under what circumstances the job might be killed, or not start as a result of the values given.
Giving overlarge numbers may mean that LSF cannot find a suitable window in which to dispatch your job. On the other hand, giving numbers which are wild underestimates in order to ensure dispatch has the potential to bring COSMOS down - don't do this.
Please note that memory requirements are currently used as a guide for scheduling only - a job will not be killed for exceeding its stated memory limit (COSMOS has never done this).
Finally the script optionally submits the job to the chosen queue, or saves the submission script for manual editing or deferred submission.
Note that to submit such an automatically written script, possibly after manual edits, one should do

bsub < cam-long.xxxx

whereas in NQE one might have done qsub cam-long.xxxx (without the <).
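For reference, such a submission script is an ordinary shell script with #BSUB directive lines at the top. The sketch below shows the general shape; the directive names are standard LSF, but the particular resource values and the filename are illustrative only, not what the writer necessarily produces:

```shell
# Hypothetical example of a hand-written LSF submission script.
cat > cam-long.example <<'EOF'
#BSUB -q cam-long            # destination queue
#BSUB -n 16                  # number of cpus
#BSUB -o job.%J.out          # stdout file; %J expands to the job id
cd /home/cosmos-tmp/sjr20/mpijob
mpirun -np 16 mpiprog < input.dat > output.dat
EOF
# Submit with:  bsub < cam-long.example
```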
To examine job status, use

bjobs

to list all your own jobs and

bjobs -u all

to list the jobs of all users. A typical listing looks like:

JOBID USER  STAT QEUE      FROM_HOST   EXEC_HOST JOB_NAME SUBMIT_TIME
274   sjr20 PEND cam-super cosmos.amtp                    Sep  4 20:15
275   sjr20 PEND cam-super cosmos.amtp                    Sep  4 20:16

Appending the JOBID to bjobs restricts attention to the corresponding job, e.g. bjobs 275 gives

JOBID USER  STAT QEUE      FROM_HOST   EXEC_HOST JOB_NAME SUBMIT_TIME
275   sjr20 PEND cam-super cosmos.amtp                    Sep  4 20:16
Adding an option -l produces more information, e.g. to find out why LSF has not dispatched job 275 (whose STATus is PENDing) one could do
bjobs -l 275
which produces among other verbose information:
PENDING REASONS: User has reached the per-user job slot limit of the queue;
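Listings like these lend themselves to simple scripting. As an illustration, the sketch below extracts the ids of pending jobs with awk; the listing is captured in a variable here so the example is self-contained, whereas in practice one would use listing=$(bjobs) (the job names and hosts shown are made up):

```shell
# Hypothetical captured listing; real output comes from bjobs.
listing='JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
274 sjr20 PEND cam-super cosmos.amtp mpijob Sep 4 20:15
275 sjr20 RUN cam-super cosmos.amtp comp07 mpijob Sep 4 20:16'

# Print the job ids whose STAT column (field 3) is PEND, skipping the header.
pending=$(echo "$listing" | awk 'NR > 1 && $3 == "PEND" { print $1 }')
echo "$pending"
```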
Jobs are initially pending (PEND), while they are awaiting scheduling and dispatch, then when dispatched for execution they enter the RUN state (see table below).
LSF job status values

||PEND ||Job is pending (not yet started). ||
||PSUSP ||Job has been suspended by the user while pending. ||
||RUN ||Job is currently running. ||
||USUSP ||Job has been suspended by the user while running. ||
||SSUSP ||Job has been suspended by the system while running. ||
||DONE ||Job has exited normally (exit value 0). ||
||EXIT ||Job has exited abnormally (exit value non-zero). ||
||UNKWN ||Indicates some system problem. Please contact cosmos_sys. ||
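These status values are easy to branch on in a monitoring script. A minimal sketch (the helper name is ours; the status string would come from parsing bjobs output for the job of interest):

```shell
# Map an LSF job status value to a short human-readable description.
describe_status() {
    case "$1" in
        PEND|PSUSP)   echo "waiting to start" ;;
        RUN)          echo "running" ;;
        USUSP|SSUSP)  echo "suspended" ;;
        DONE)         echo "finished, exit value 0" ;;
        EXIT)         echo "finished, non-zero exit value" ;;
        *)            echo "problem state - contact cosmos_sys" ;;
    esac
}

describe_status RUN     # running
```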
First find the jobid from bjobs. Then do

bkill jobid

By default, this sends SIGTERM and then, after a short delay, SIGKILL.
More generally, to send, e.g., SIGKILL directly to jobid, do either

bkill -s 9 jobid

or

bkill -s KILL jobid
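The two forms are equivalent because 9 and KILL name the same signal; bash's kill builtin can translate between the numeric and symbolic forms:

```shell
kill -l 9       # prints the name of signal 9: KILL
kill -l TERM    # prints the number of SIGTERM: 15
```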
Sending the SIGSTOP signal to sequential jobs, or the SIGTSTP signal to parallel jobs, is the same as using bstop.
Sending the SIGCONT signal is the same as using bresume.
Jobs can be suspended by the owner while pending (status becomes PSUSP) or while running (status becomes USUSP) using bstop.
To suspend job jobid do

bstop jobid
Running sequential jobs are sent the SIGSTOP signal and running parallel jobs the SIGTSTP signal in order to suspend them.
Alternatively bkill -s STOP can be used to achieve the same effect.
To resume job jobid do

bresume jobid
Running jobs are sent the SIGCONT signal.
Alternatively bkill -s CONT can be used to achieve the same effect.
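The effect of these signals can be seen on any local process; for a sequential job this is exactly what bstop and bresume do behind the scenes:

```shell
# Start a throwaway background process and suspend/resume it by hand.
sleep 30 &
pid=$!
kill -STOP "$pid"            # as bstop would
sleep 1                      # give the state change a moment to land
stopped=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "after STOP: $stopped"  # process state begins with T (stopped)
kill -CONT "$pid"            # as bresume would
kill "$pid"                  # clean up the throwaway process
```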
Another useful utility is bmod (see man bmod) which modifies job parameters after submission.
For additional information on all the above utilities, refer to Running Jobs with Platform LSF.
Return to Batch queue information.