Welcome to talisman
|
Photo: the newer dual-core talisman nodes (d1–d8).
|
Photo: the main talisman nodes (master, amulet, and nodes 1–11).
The talisman cluster is a collection of 21 Debian- and Ubuntu-based
GNU/Linux computers. In total it has 30 processors with an average
speed of about 3 GHz, 53 GB of RAM, and 1.6 TB of hard-disk space. Each node can
be used individually, and the entire system can also be used
as a workstation cluster, with communication handled by MPI.
If you want an account on talisman, or have
any questions not answered on this page, please contact the
talisman admin.
Rules and regulations
There are certain rules and regulations that we expect
users to follow. These are designed to ensure fair usage of
the cluster.
- You must obey the DAMTP
computing rules and the rules
of the Information Strategy and Services Syndicate.
- Do not run code on an already-occupied processor core
(competing for resources makes everything run slower). If
only one core of a dual-core node is occupied, running your
code on the free one is fine.
- Any job longer than 15 minutes, and all jobs on
master, should be niced (to at
least 10).
- No user should manually start jobs on more than half the
processors. If you want to submit more jobs than this, you must
use the queueing system.
- The preceding rules can be broken when necessary,
e.g. if you are off to a conference and need results quickly.
However, if you do break the rules, you must email the
talisman admin to let us know.
- All code should be compiled with the maximum level of
optimisation possible. This also means using machine
optimised libraries (e.g. for linear algebra and fast
Fourier transforms). See below for further details.
- You must not use
talisman for projects like
distributed.net or SETI@home.
If we detect a breach of these rules, you are highly likely to
have your account disabled and/or your running jobs killed.
How to use talisman
You can access the cluster by ssh to
talisman.damtp.cam.ac.uk. This will give you a
login shell on the master node of the cluster. From there you
can reach the other nodes:
| Node | Type | Processor(s) | Memory | Hard-disk space |
| master | i686 | Intel Pentium 4 (3.2 GHz) | 2 GB | 500 GB |
| node1 | i686 | Intel Pentium 4 (2.8 GHz) | 1 GB | 80 GB |
| node2 | i686 | Intel Pentium 4 (2.8 GHz) | 1 GB | 80 GB |
| node3 (decommissioned) | i686 | Intel Pentium 4 (2.8 GHz) | 1 GB | 80 GB |
| node4 (decommissioned) | i686 | Intel Pentium 4 (2.53 GHz) | 1 GB | 80 GB |
| node5 (decommissioned) | i686 | Intel Pentium 4 (2.53 GHz) | 1 GB | 80 GB |
| node6 | i686 | Intel Pentium 4 (2.53 GHz) | 1 GB | 80 GB |
| node7 | i686 | Intel Pentium 4 (3.2 GHz) | 2 GB | 80 GB |
| node8 | i686 | Intel Pentium 4 (3.2 GHz) | 2 GB | 80 GB |
| node9 (decommissioned) | i686 | Intel Pentium 4 (3.2 GHz) | 2 GB | 80 GB |
| node10 | i686 | Intel Pentium 4 (3.2 GHz) | 2 GB | 80 GB |
| node11 | i686 | Intel Pentium 4 (2.8 GHz) | 1 GB | 80 GB |
| amulet | ia64 | Intel Itanium 2 (2 x 1.3 GHz) | 4 GB | 40 GB |
| noded1 | x86_64 | Intel Pentium D (2 x 3.2 GHz) | 4 GB | 80 GB |
| noded2 | x86_64 | Intel Pentium D (2 x 3.2 GHz) | 4 GB | 80 GB |
| noded3 | x86_64 | Intel Pentium D (2 x 3.2 GHz) | 4 GB | 80 GB |
| noded4 | x86_64 | Intel Pentium D (2 x 3.2 GHz) | 4 GB | 80 GB |
| noded5 | x86_64 | Intel Pentium D (2 x 3.2 GHz) | 4 GB | 80 GB |
| noded6 | x86_64 | Intel Pentium D (2 x 3.2 GHz) | 4 GB | 80 GB |
| noded7 | x86_64 | Intel Pentium D (2 x 3.2 GHz) | 4 GB | 80 GB |
| noded8 | x86_64 | Intel Pentium D (2 x 3.2 GHz) | 4 GB | 80 GB |
Compiling your code
For code to be run on the cluster, please compile it on the
cluster as shown in the table below:
| Nodes | Compile on | icc options | gcc options |
| master, node1–node11 | master | -O3 -xN -ip | -O3 -ffast-math -fomit-frame-pointer -funroll-loops -march=pentium4 -mfpmath=sse,387 -ftree-vectorize |
| amulet | amulet | -O3 -ip | -O3 -ffast-math -funroll-loops |
| noded1–noded8 | noded1 | -O3 -xP -ip | -O3 -ffast-math -funroll-loops -march=nocona -mfpmath=sse,387 -ftree-vectorize |
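For example, a build on master for use on master and node1–node11 might look like the following sketch (myprog.c is a placeholder name; the flags are those from the table above):
# icc build
icc -O3 -xN -ip myprog.c -o myprog_i686
# equivalent gcc build
gcc -O3 -ffast-math -fomit-frame-pointer -funroll-loops -march=pentium4 -mfpmath=sse,387 -ftree-vectorize myprog.c -o myprog_i686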
To attempt to automatically parallelize your code, so that
it uses both processors of a dual-processor node
simultaneously, you could try giving the
-parallel option to icc. If you're
manually parallelizing using pthreads, you
might want the extra option -pthread. If you're
using OpenMP, use the
-openmp option with icc or the
-fopenmp option with gcc. Some
tutorials on using either pthreads or
OpenMP may be found in the links section below.
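As a sketch (myprog.c is again a placeholder, and the serial flags are those from the table above), the threaded variants look like:
icc -O3 -xN -ip -parallel myprog.c -o myprog    # let icc attempt auto-parallelization
icc -O3 -xN -ip -pthread myprog.c -o myprog     # hand-written pthreads code
icc -O3 -xN -ip -openmp myprog.c -o myprog      # OpenMP with icc
gcc -O3 -ffast-math -funroll-loops -fopenmp myprog.c -o myprog    # OpenMP with gcc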
To run large parallel distributed-memory code, use MPI. See the MPI section below for details.
Data storage
talisman is intended for processing, rather
than for storage. We take no backups and make no guarantees
of reliability. We do not impose a disk quota system; if disk
space fills up, the biggest offenders will be required to free
up space.
Your home directory on talisman is stored on
master and shared across all nodes via NFS. To
transfer data between your DAMTP home directory and your
talisman home directory you can use
scp or rsync. Local per-node
scratch space is available on each of the nodes under
/local/uid/, and is quicker and larger
than the home directory space.
Although it is not recommended practice, it is possible to
directly access the local storage on the nodes from the DAMTP
network by using rsync with the option -e
"ssh talisman nice ssh". Thanks to Jim
McElwaine for suggesting this.
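For example, from a DAMTP machine (a sketch; all paths and the node name are placeholders):
# copy a results directory from your talisman home directory to your DAMTP home directory
rsync -av talisman.damtp.cam.ac.uk:results/ ~/results/
# pull data straight off a node's local scratch space, using the two-hop trick described above
rsync -av -e "ssh talisman nice ssh" noded3:/local/uid/ ~/scratch-copy/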
Running your code
There are two different ways in which you might want to run
code on the cluster. The first is the simplest: choose an
unoccupied processor of the correct type (sysload
gives you the load averages), then ssh to that
node and run your job, preferably nice-ed. The
second way is to use the queueing
system.
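For the first approach, a typical session might look something like this (node7 and myprog are placeholders):
sysload                  # find a node with a free processor
ssh node7
nice -n 10 ./myprog &    # run the job nice-ed on the chosen node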
Queueing system
We run the Sun GridEngine 6.0
queueing system on talisman. The queueing system
takes all the hassle out of finding a vacant node to run code
on, especially if you have a large number of jobs (say 100) to
submit. GridEngine automatically avoids nodes where there's
already code running, and as soon as any node becomes vacant
it starts running jobs on it. This can make it difficult to
find a free node manually if the queue's in use, in which case
use the queue yourself and GridEngine will let everyone's code
have a turn.
Queueing serial (non-parallel) jobs
In the simplest form, you submit a job to the queueing
system using qsub; for example, qsub
$HOME/a.out. This puts the job "run
a.out" into the queue. You can see what's
in the queue by typing qstat (or qstat
-r to also see which queues it will run on), and you
can remove an item in the queue by typing qdel
jobnum. Please don't be put off if the queue
looks full, as GridEngine has an idea of fairness and will
push an infrequent user's job to the front of the queue. Once
your job gets run, its stdout and
stderr will be piped to files in your home
directory with suffixes .ojobnum and
.ejobnum respectively.
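In summary (1234 stands for whatever job number qsub reports):
qsub $HOME/a.out    # submit a job
qstat               # list the queue
qstat -r            # also show which queues each job will run on
qdel 1234           # remove job 1234 from the queue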
Now for a more interesting example (this is how I use the
queueing system for serial jobs). First, write a shell script
that contains the commands you want to run. For example, I
use something like:
#!/bin/bash
# architecture-specific binary directory, so the same script runs on any node
bindir="/home/talisman/ejb48/bin_$(uname -m)"
# intermediate files live on local scratch; the final result goes back home
$bindir/prog1 0.0 10.0 1000 > /local/ejb48/temp1.dat
$bindir/prog2 1000 < /local/ejb48/temp1.dat > /local/ejb48/temp2.dat
$bindir/prog3 0.0 10.0 /local/ejb48/temp1.dat /local/ejb48/temp2.dat > /home/talisman/ejb48/1000.dat
rm /local/ejb48/temp1.dat /local/ejb48/temp2.dat
(Of course, I have lots of these scripts with
slightly different values.) Note that I select my binary
directory based on uname -m, so that I run x86_64
binaries on x86_64 machines, and so on, meaning that the same
script will work when run on any of talisman's nodes. Note,
also, the use of the local scratch space to store temporary
files. To submit a job like this to the queue to be run on
master or
node1–node11, use qsub
script (you may need qsub -b y
/full/path/of/script in some circumstances). To
submit a job to be run on
noded1–noded8 in addition to
these, use qsub -q noded.q script. To
submit a job to run on any node (including
amulet), use qsub -q
noded.q,amulet.q. To get a job to run only on (say)
noded1–noded8, follow the job
submission with qalter -q noded.q jobnum,
where jobnum is the number of the job you
just submitted. You can also request specific resources; for
example, to submit a job that needs 1.5GB of RAM, use
qsub -l mem_free=1.5G. See the qsub
man page for further details.
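To summarise the submission options above (1234 again stands for a job number):
qsub script                        # master and node1-node11
qsub -q noded.q script             # noded1-noded8 in addition
qsub -q noded.q,amulet.q script    # any node, including amulet
qalter -q noded.q 1234             # restrict an already-submitted job to noded1-noded8
qsub -l mem_free=1.5G script       # request 1.5 GB of free RAM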
Queueing parallel jobs
The above is only for serial (i.e. non-parallel) jobs. To
submit a parallel job, you need to use a Parallel Environment,
or PE. There are two to choose from on
talisman.
The first is for shared-memory parallel programs, such as
those using OpenMP or PThreads. For
either of these, when submitting the job using
qsub you use the command-line option -pe
openmp num_threads.
Num_threads is either the number of
threads you would like, or a range. For example, -pe
openmp 2-4 would request a parallel job with between 2
and 4 threads (all running on the same node, of course). The
environment variable NSLOTS is set to the actual
number of threads allocated to this job, and you must
not use more than this. For an OpenMP program, setting
OMP_NUM_THREADS=$NSLOTS in your job's batch
file will ensure this.
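As a sketch, an OpenMP job script (my_openmp_prog is a placeholder), submitted with qsub -pe openmp 2-4 script, might look like:
#!/bin/bash
# use exactly as many threads as GridEngine allocated to this job
export OMP_NUM_THREADS=$NSLOTS
$HOME/bin_$(uname -m)/my_openmp_prog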
For real parallel jobs,
i.e. distributed-memory jobs using MPI, we must use a
different parallel environment. In this case, use -pe
mpi num_processes. The MPI command to run the
code in this case is simply mpirun
program. There is no need to mess around with
host files or the like, as this is all done for you by the
queueing system.
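As a sketch, the corresponding MPI job script (my_mpi_prog is a placeholder), submitted with qsub -pe mpi 8 script, can be as simple as:
#!/bin/sh
# the queueing system supplies the host list, so mpirun needs no extra options
mpirun $HOME/my_mpi_prog
(Use a wrapper like the one in the next section if you need architecture-specific binaries.)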
MPI on talisman
In this section, only a simple example of how to run an MPI
program on talisman is given. For further
details, see the links section below.
To compile your code, use mpicc for C, or
mpif77 or mpif90 for Fortran. For
details of how to use these in Makefiles, or how to change the
compiler used, please see the mpicc manpage.
To run your code, use mpirun -hostfile
hostfile -np num_processes prog.
Hostfile should be a list of hostnames,
formatted as described in the mpirun manpage, and
num_processes is the number of processes
to run. This is made far easier by using the queueing system,
as described above.
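For example, to run a job by hand on the dual-core nodes without the queueing system (a sketch; mpi_prog.c is a placeholder, and the hostfile uses the simple one-hostname-per-line form; see the mpirun manpage for the full format):
# compile on noded1 so the binary matches the x86_64 nodes listed below
mpicc -O3 mpi_prog.c -o mpi_prog
cat > hosts.txt <<EOF
noded1
noded2
EOF
mpirun -hostfile hosts.txt -np 4 ./mpi_prog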
Here is a real-world example, submitted using the queueing
system. First, the program, mpi_test.c (note
that this is missing error checking, which should never be
omitted):
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    int rank, size, processor_name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &processor_name_len);
    printf("Message from %.*s (aka %d of %d): you will be assimilated!\n",
           processor_name_len, processor_name, rank+1, size);
    MPI_Finalize();
    return 0;
}
We compile this using icc, so run
OMPI_CC=icc mpicc -O3 -ip -xN mpi_test.c -o
mpi_test_i686 on master. We then repeat
this on noded1 with -xP instead of
-xN and x86_64 instead of
i686, and finally on amulet with no
-xN and ia64 instead of
i686. We next write a machine-independent
wrapper, mpi_test.sh:
#!/bin/sh
exec "$HOME/mpi_test_$(uname -m)" "$@"
and make this executable by running chmod a+x
mpi_test.sh. We now submit this to the queue. For
this simple case, we'll do it all in one go: qsub -b y
-q noded.q,amulet.q -pe mpi 1- mpirun ~/mpi_test.sh.
If all goes well, we get four output files in our home
directory, of which three are empty (the errors), and one
contains the output from our MPI program.
Software libraries
There are a number of useful libraries installed on the
cluster, which you are very strongly encouraged to use.
Locally installed libraries live in /opt/. If
you want other libraries installed, please let us know.
Intel MKL
The
Intel Math Kernel library contains many useful routines;
most relevantly machine optimised versions of LAPACK, BLAS, and FFTs. These
are the fastest versions of these libraries available for
Intel processors. The libraries live in
/opt/intel/mkl/version/lib/arch,
and there is extensive documentation in
/opt/intel/mkl/version/doc/. The LAPACK
User's Guide can be found on netlib.
The structure of the libraries to link has recently changed
with MKL version 10. Please refer to the Intel
Manual for details. If you're in a hurry, icc -xN
-ip -O3 prog.c -o prog
-L/opt/intel/mkl/version/lib/arch -lmkl
-openmp may work.
FFTW
FFTW is the Fastest
Fourier Transform in the West, or nearly. It is slower than
the Intel FFTs in quite a few cases, but is still respectable,
and is portable to other platforms. The latest version is
FFTW3, which can be linked using -lfftw3.
The Intel MKL provides an FFTW3 interface. To use it,
first check the FFTW
to MKL wrapper page to make sure the function you want to
use is supported, and then just change #include
<fftw3.h> to #include
<mkl_fftw3.h>.
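For example, to build against FFTW3 directly (a sketch; fft_prog.c is a placeholder, and this assumes the FFTW headers and library are on the default search paths, otherwise add the appropriate -I and -L flags):
gcc -O3 -ffast-math fft_prog.c -o fft_prog -lfftw3 -lm
To use the MKL wrapper instead, swap the header as described above and link against the MKL as in the previous section.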
Other libraries
If you would like any other libraries installing, please
contact talisman admin.
Other Computing Resources
- Condor: a
system for running jobs on idle workstations across PWF
machines in the University. Condor is intended to provide a
significant computational resource for researchers in the
University, particularly those who have a need for high
throughput computing.
Useful Links
Parallel programming
POSIX threads (Pthreads)
OpenMP
-
OpenMP, from Lawrence Livermore National Laboratory.
MPI