The following pages give an overview of the development tools available on COSMOS. They are intended as a gentle introduction to the software and libraries available on the Altix, and should contain enough information to get you started. Note, however, that this document is far from comprehensive and is not intended as a substitute for the official manuals. Throughout the text we have indicated links to the relevant documentation, all of which is available online. You should refer to these links for more complete information on the following topics.
This guide is a work in progress, and we expect to update it frequently as new software becomes available. If you spot any errors or know of a topic that you think should be included in the guide, please contact us on cosmos_sys. Constructive feedback is always appreciated.
1) Configure your development environment:
On COSMOS the default and the only recommended way of setting up the environment is by use of the environment modules. We have an extensive set of modules to support various compilers, scientific libraries and applications (See software, development and visualization pages).
The most commonly-used environments (to access FFTW, GSL and some other libraries) are collated into the cosmolib module. It is also a prerequisite for all other modules, so it should always be loaded first.
Most of the cosmolib libraries are compiled with the latest Intel Compilers, which require the use of the 'latest' module.
Thus the default environment on COSMOS is set by loading two modules (cosmolib and latest) in the .bashrc file:
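# in your ~/.bashrc
module load cosmolib
module load latest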
This should be set by default, so if it is not, please contact us as soon as possible: your environment needs updating.
If you have a code that used to work with version 9 of the compilers but now fails with the newer version, please email us and let us know about it. You can still access the version 9.1 environment by loading the 'icomp91' module.
2) Compile your application with Intel compilers. E.g.:
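For example (file names are placeholders):
$ icc -O2 myprogram.c -o myprogram
$ ifort -O2 myprogram.f90 -o myprogram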
3) Do a test run interactively (preferably with a small/toy data set). On COSMOS, there are 8 processors reserved for interactive use, to allow rapid interactive response and the ability to test run parallel programs. You just need to launch your application from the shell, e.g.:
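For example, for a serial or OpenMP test (names and thread/process counts are illustrative):
$ export OMP_NUM_THREADS=4
$ ./myprogram
or, for a small MPI test:
$ mpirun -np 4 ./myprogram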
Interactive jobs are limited to 30 minutes of CPU time. If you want to test a longer job, you can now use the 'express' queue with a maximum of 8 processors and 2 hours.
4) If there are any problems, recompile and run the program under a debugger interactively. E.g.:
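$ ifort -g -O0 myprogram.f90 -o myprogram     # disable optimization when compiling for debugging
$ idb ./myprogram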
Most of the development software on COSMOS presented here has a module
interface. That is, it either exists in a number of
incompatible versions, or uses non-standard paths in the directory
hierarchy, and access to it is not enabled by default.
The interface to the different packages is managed via environment
modules. This makes it easier to maintain several versions of the same
software while avoiding interference between similar components. To access a particular
software package or library with a module interface you should first load the corresponding module
using:
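$ module load <module_name>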
This will set up the necessary environment variables and paths
required by the package. You can load as many modules as you require in a
session, but be aware that if you load multiple modules with overlapping
components (e.g. two different versions of the same compiler) the module
loaded last takes precedence. To avoid confusion you can unload a module once it is no
longer needed by typing:
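$ module unload <module_name>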
There are various commands available to manage modules. Among the
most useful ones are:
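$ module list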
To see all the modules that you have currently loaded. And:
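$ module avail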
To see a list of all the modules available on COSMOS. Use 'module -help' for a complete list
of commands. The default developer environment on COSMOS is based around the Intel Compilers.
A Makefile is a collection of instructions to automate the compilation of programs. Having one for your program is a great time saver if you are developing code yourself and often changing source files, and especially if you are planning to share the code with other people.
Creating a Makefile is very easy: you just need to specify your source files
and the rules that build your program. Once you have a working Makefile, you
can build your program by invoking 'make' or 'make all'.
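As an illustration, a minimal Fortran Makefile might look like this (a sketch only; the compiler, flags and file names are placeholders, and the COSMOS templates listed below are more complete):
FC     = ifort
FFLAGS = -O2
OBJS   = main.o utils.o

all: myprogram

# rule commands below must be indented with a tab character
myprogram: $(OBJS)
	$(FC) $(FFLAGS) -o $@ $(OBJS)

%.o: %.f90
	$(FC) $(FFLAGS) -c $<

clean:
	rm -f $(OBJS) myprogram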
There are many tutorials online about creating and managing Makefiles, see for instance the Introduction to Makefiles from the GNU make manual.
We have compiled a few templates that you can edit and use to build your programs on COSMOS:
C Makefile template
Fortran90 Makefile template
Example C application
CVS is a popular open source version control system. It works by keeping a master copy (CVS repository) of your source files. When you want to work on the application, you check out the latest version in the repository into a working copy, which you then edit and modify at will. If you're happy with the changes, you can update the repository with your working copy by committing the changes and logging a description of the modifications.
By putting your source files under the control of CVS, you can easily keep track of changes made during development, manage different versions of the code or contributions by different people, and on COSMOS, you can also simplify the task of backing up your sources by keeping the CVS repository on a backed-up filesystem (e.g. /home/cosmos/).
You can set up the root of your CVS repository in your COSMOS home directory by setting the environment variable CVSROOT; when this variable is set, all the cvs commands operate on that directory. A CVS repository is created with the command 'cvs init':
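$ export CVSROOT=$HOME/cvsroot     # the repository path is a placeholder
$ cvs init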
Once you have a working CVS repository, you can start adding projects to it. If you already have sources and Makefiles in a directory, you can place them under CVS by going into the application directory and importing it into CVS. The import command below will copy the contents of the application directory to the CVS root, with the project name PROJECT. USER can be any tag you want, like your user id.
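A sketch of such an import ('start' is an arbitrary release tag):
$ cd myapplication
$ cvs import -m "Initial import" PROJECT USER start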
Every time you commit something to the repository, CVS will fire up an editor and ask you to enter some description of the changes made. You can select your favourite editor by setting the CVSEDITOR environment variable. For example:
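$ export CVSEDITOR=emacs     # or your preferred editor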
To work on a project, you 'check out' a working copy from the repository, which you can then use to build your application or to edit your source files. If you make modifications to the sources that you want to keep, you can 'commit' them to the repository. CVS will ask you to enter a brief description of the changes and will update the repository with the incremental difference from the working copy, assigning it a new version number. In this way, you can later review or undo any changes you have made. Some useful commands in CVS are:
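For example (PROJECT and the file names are placeholders):
$ cvs checkout PROJECT     # get a working copy of PROJECT
$ cvs update               # sync your working copy with the repository
$ cvs add newfile.c        # schedule a new file for addition
$ cvs commit               # commit your changes, prompting for a log message
$ cvs diff                 # show your local changes
$ cvs log myfile.c         # show the change history of a file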
There are many other features and commands in CVS which are beyond the scope of this guide. An excellent CVS FAQ and Manual can be found at Ximbiot.
There are several compilers for the Itanium2 architecture available on COSMOS. We strongly recommend the use of the Intel compilers for the best performance. Note that the Intel compilers are evolving rapidly, and new patches appear regularly; as part of the process of making them more robust and reliable on the Altix platform, it is important to collect and report to Intel any bugs that you might encounter. If you have a program that fails to compile and you suspect that the code is legal, it might be due to a compiler bug. Please send an email to cosmos_sys with details of the code (preferably including a toy example) and the error message; if you have your own Premier support account with Intel and submit a report, please let us know the details.
The Itanium architecture follows a philosophy of less logic, more resources. This means that the chip has enough functional units and registers to sustain a very high performance (6 operations per clock cycle) but the compiler must do a very good job to generate optimal code. Careful use of optimization flags and close analysis of the optimizations done by the compiler is often necessary to avoid performance losses.
The latest version of the Intel compiler suite, Intel Composer XE, contains numerous enhancements over the previous releases. It is only available on x86_64 architectures (universe), so for the old cosmos machine version 11.1 is the last one available.
Codes that used to build well under previous versions should also compile with
the latest compilers, but changes in the run-time library routines and in the
compiler behaviour may mean that your code will need some modifications before
it can be compiled with the new compilers.
There are several flags that determine the type and level of optimizations that the compiler can do on your code. Here we list the options that have shown the most significant impact in achieving good performance levels. Note that a fixed set of options cannot be given as the 'best choice', since the final result will depend heavily on the particular program. You may need to experiment with different combinations before you get a satisfactory result. Sometimes the compiler will need some help to deal with specific loops. Be sure to look at the optimization report to identify factors inhibiting optimizations. For a comprehensive list of optimization options and more detailed explanations refer to the Intel optimization guide, which can be found here; there is also a step-by-step tutorial from Intel on code optimization.
-O3 : Enables -O2 plus more aggressive loop and floating point optimizations. It also turns on prefetching.
-ftz : Flushes denormalized numbers to zero (ON with -O3).
-fno-alias : Assumes no aliasing between pointers (i.e. they don't overlap in memory). Allows the compiler to find more opportunities for optimal pipelining.
-fno-fnalias : Instructs the compiler not to assume aliasing within functions.
-ip : Enables optimizations across procedures/subroutines (e.g. inlining) in the same source file.
-ipo : Enables inter-procedural optimizations across multiple files.
-align : Ensures proper alignment of data on memory boundaries for faster loads.
-auto : Allocates local variables on the stack (ON with -openmp).
-prof_gen : Enables profile guided optimizations (PGO). Requires three phases: compilation with -prof_gen, running the program, and recompiling with -prof_use. Has the greatest impact on codes that make heavy use of branches.
-opt_report : Generates an optimization report, detailing changes to the code in the different optimization phases. A useful starting point is "-opt_report -opt_report_file OR.out -opt_report_phase hlo -opt_report_phase ecg_swp"; then examine "OR.out" for pipelining failures and loop transformations.
-O2 : A more conservative approach to optimization. Useful in combination with selected loop and floating point optimizations when accuracy is an issue or -O3 is degrading performance.
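As an illustration, a typical optimized build might look like this (the flag combination and file names are placeholders, not a universal recommendation):
$ ifort -O3 -ipo -fno-alias -opt_report -opt_report_file OR.out myprogram.f90 -o myprogram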
The previous version of the Intel Compiler (11.1) is still available on both universe and cosmos and works fine for the time being. It remains on universe for compatibility and on cosmos as the default compiler. If you need to access the 11.1 compilers, you can use the following module command (11.1.075 was the last released version of the 11.1 line):
$ module load icomp/11.1.075
The most common optimization options behave as with version 12 above.
The earlier version of the Intel Compiler (10.1, the oldest being kept) is still available and should work for the time being, but the use of the 10.1 compilers is strongly discouraged. If you still need to access the old
compilers, you can use the following module command:
$ module load icomp/10.1.026
The GNU suite of compilers is, of course, fully compatible with both the x86_64 and ia64 architectures. The C/C++ compilers are mature and robust, and most
software distributions for Linux will autodetect gcc and configure
themselves with the appropriate options (you can override this behaviour
by setting appropriate environment variables or giving specific
command line options to the configure script, e.g. to build the
application with the Intel C compilers, although you may need further
changes to the scripts). The gfortran compiler will compile
Fortran77 code, as well as Fortran90, 95 and 2003. Since version 4.3 the GNU compilers support OpenMP directives, but the Intel compilers are much more mature in this regard.
In general, gcc/gfortran will not generate optimal code for either the Xeon or the Itanium2 chip, and for computationally intensive applications performance could be as low as 40% of that achieved with the Intel compilers. If you need to use the GNU compilers for compatibility reasons, the following flags may help improve performance.
-O3 : Highest level of optimization for gcc/g77. Enables most optimization flags.
-ffast-math : Allows certain floating point optimizations that don't conform to the IEEE standard.
-funroll-loops : Might improve speed by unrolling iterative DO-loops and do-while loops (-funroll-all-loops unrolls all loops).
-finline-functions : Allows inlining of small functions (ON with -O3).
-fprefetch-loop-arrays : Prefetches arrays inside loops.
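For example (a sketch; file names and the exact flag mix are illustrative):
$ gfortran -O3 -ffast-math -funroll-loops -fprefetch-loop-arrays myprogram.f90 -o myprogram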
Debuggers are an essential tool to identify and fix programming
errors. It is highly recommended that you familiarize yourself with one of them
and use it regularly when faced with unexpected behaviour of your program -
even for simple bugs, the use of a debugger is preferable to peppering your
code with print statements. Although the list of commands might
seem daunting to the novice user, there are only a handful of them that
you will need to solve most problems. In addition, all the debuggers
on COSMOS can be used via an intuitive GUI that greatly simplifies
the task of interacting with the debugger. To debug your program you must
first compile your code with the -g flag, and then launch the
application from inside the debugger. On COSMOS you may need to increase
the stack size before launching your application.
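For example, with the bash shell you can raise the stack limit before the run:
$ ulimit -s unlimited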
The Intel Debugger is distributed with the Intel
compilers, and is accessible when the appropriate compiler module is
loaded. You can start the debugger by typing idb. The
Intel Debugger presents a command line interface and can operate natively (recommended), or emulate the behaviour of the Unix debugger (DBX) or
the GNU debugger (invoke via 'idb -gdb'). An example session is
started by e.g.:
$ ifort -g -O0 myprogram.f90 -o myprogram
$ idb myprogram
(idb) run
Although idb does support debugging optimized code, it is better to disable optimization when compiling for debugging purposes (hence the '-g -O0' options above). Debugging of
parallel programs is fully supported (if messy) for MPI codes, and somewhat limited
for OpenMP codes.
In the latter case you won't be able to examine shared variables and
locks (so it is not very useful for spotting race conditions). We expect better OpenMP
support in future releases.
Debugging with the Intel Debugger from the comfort of a GUI is fully supported on x86_64 platforms (via Eclipse), so it will be available on the next COSMOS system, which is planned for deployment in Q2 2010.
Extensive documentation is available from Intel, including a short tutorial for a quick introduction.
The GNU Debugger (GDB) supports C/C++ and Fortran77/90 debugging, and is compatible with code generated via GNU and Intel compilers. GDB can be used to debug parallel (MPI) and threaded programs, although OpenMP is not currently supported. The command line interface is very rich, and the documentation can be accessed by consulting the 'man' pages or looking at the official manual from the FSF online. Additionally, there are plenty of tutorials and guides available on the web. The GDB tutorial is a good starting point.
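A minimal session might look like this (file names are placeholders):
$ ifort -g -O0 myprogram.f90 -o myprogram
$ gdb ./myprogram
(gdb) run
(gdb) backtrace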
The Data Display Debugger (DDD) is a graphical application that allows
you to interact easily with an inferior debugger. DDD works best when used
with the GNU Debugger (the default when invoked with ddd). You can
also use it with idb, although there are some minor glitches that may be
confusing for the novice user, and you may need to interact with the debugger via
the command line to access all the functionality of idb. There is extensive
documentation available for DDD, including a very useful step-by-step tutorial.
To use ddd with the Intel debugger you can use the following command:
$ ddd -dbx --debugger "idb -dbx -fullname" myprogram &
Or, if you want to run the "myprogram" with arguments (for example: '--ini-file=params.ini --out-file=myprog.out'), then call the debugger like this:
$ ddd -dbx --debugger "idb -dbx -fullname" --args myprogram --ini-file=params.ini --out-file=myprog.out &
Profiling your application can help you understand why the program is not running as fast as you expected and will give you pointers as to which parts of the code are causing the slow down. Using performance analysis tools you can quickly identify performance bottlenecks and hot-spots (parts of the program where most of the time is spent) and guide the optimization effort accordingly. In particular, using performance data in conjunction with the optimization reports from the Intel compilers will tell you if there are parts of the application where the compiler needs some help to generate optimal code. Optimizing an application can be very time consuming, so it is essential to focus on the areas that will have a significant impact on the overall performance of the program.
Qprof is a simple profiling utility to generate a breakdown of the time spent in
various subroutines or lines of your code. It only requires you to set an environment
variable before running your program, although more useful information will be displayed by
compiling the code with debugging symbols '-g'. Qprof works with any version of the Intel compilers.
Usage example:
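For example (a sketch; the path to qprof.so depends on the installation):
$ LD_PRELOAD=/path/to/qprof.so ./myprogram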
If you prefer to set the variable manually, just set or export LD_PRELOAD to the full path of qprof.so:
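$ export LD_PRELOAD=/path/to/qprof.so     # installation-dependent path
$ ./myprogram
$ unset LD_PRELOAD                        # stop profiling subsequent commands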
The behaviour of qprof can be controlled by a number of environmental variables. More information here.
Running histx to sample the default event (CPU_CYCLES) is done with:
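A sketch (the exact options are described in the histx documentation):
$ histx -o profile.myprogram ./myprogram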
This will create the file profile.program.pid with the
experiment results. You can then use iprep to process the output using:
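A sketch, where <pid> is the process id appended by histx:
$ iprep profile.myprogram.<pid>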
The output of the default run will look very similar to that
obtained by a standard profiler (e.g. gprof or prof on Unix systems). You can
specify other events to be sampled using the '-e' option. In addition, you can
relate events to particular lines in your source code by compiling with
debugging symbols (-g) and running 'histx -l'. You can find some examples of
histx use in the package documentation.
Pfmon is a low level tool to access the Performance Monitoring Unit (PMU) of the Itanium chip. With pfmon you can access the hardware counters available on the Itanium to sample all the performance events available (over 300 events) in sets of 4 events at a time. This is a very powerful tool with a rich set of options, but requires careful use to extract valuable information from the large number of counters available. To monitor a program you just need to invoke pfmon on the unmodified binary with the events that you wish to sample (if no events are specified the default is CPU_CYCLES). For example, to count the number of cycles, number of instructions retired, and number of no-ops retired you can type:
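A sketch (check the exact event names with 'pfmon -l', as shown below):
$ pfmon -e CPU_CYCLES,IA64_INST_RETIRED,NOPS_RETIRED ./myprogram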
You can access information about a particular event using the '-i' option, and list events matching a particular pattern using '-l'. For example,
to see the L2 cache related events that can be counted use:
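$ pfmon -l | grep -i L2
$ pfmon -i L2_MISSES     # detailed information about one event (event name illustrative)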
There is much more to pfmon than we can cover here. If you plan to use it for performance analysis be sure to read the pfmon user guide and the Itanium2 specific features. There are higher level wrappers to pfmon that allow easy sampling of events with a drill-down approach and quick interpretation of results.
A good alternative to pfmon is to use the higher-level interfaces provided by these two perl wrappers. In essence, these tools invoke pfmon to do the actual sampling, but allow you to progress in a drill-down fashion by having important events predefined and processing the counting results into meaningful statistics.
We recommend that you start with one of these wrappers for performance analysis. In particular, i2prof.pl will allow you to characterize the performance of your application easily and quickly by using the predefined event groups. A good starting point is:
|
This will run your program under pfmon as many times as necessary to collect relevant events and will process the output to present useful statistics such as the number of instructions per cycle, percentage of no-ops, percentage of stalls, etc. The tutorial is also a good starting point to understand how to interpret these and more detailed statistics and how to relate them back to your source code.
Using the GNU Profiler (gprof) you can get an execution profile of your application detailing which subroutines are consuming most of the run time. In order to use gprof you must first compile the application to generate profile information with the -p flag and then run it once to generate the output file (normally this will create the file gmon.out). You can then invoke gprof with the name of your application to see the time profile. e.g.:
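For example (file names are placeholders):
$ ifort -p -g myprogram.f90 -o myprogram
$ ./myprogram
$ gprof ./myprogram gmon.out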
Using numerical libraries is the easiest way of achieving a high performance in your application. On COSMOS there is a variety of libraries that have been designed and tuned specifically for the Altix/Itanium2 architecture delivering considerable performance improvements over freely available or general purpose numerical codes.
Some of the libraries have been extended with OpenMP directives to
provide parallel execution, so the benefits of multiprocessing can be easily
obtained just by linking with the SMP library and setting the OMP_NUM_THREADS environment
variable to the number of processors required. Note however that SMP
libraries have limited scalability on NUMA platforms like the Altix, due to the
unavoidable higher latencies of remote memory accesses. In general, this means that
setting the number of threads to a value higher than 12 to 16 will not give
further performance improvements, and may even increase the run time.
If you need a special purpose numerical library not listed here, you can email us your request and we will consider installing it on COSMOS.
To link your program with the default MKL a typical compilation line would be:
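$ ifort myprogram.f90 -o myprogram -mkl     # '-mkl' is understood by recent Intel compilers; file names are placeholders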
The parallel version of MKL is thread-safe, which means you can use it to add thread-level parallelism within MPI processes (a hybrid type of parallelisation).
In general, choosing the parallel or sequential version of the MKL library depends on the way you intend to parallelize your program. There are several considerations you need to take into account and most likely you will need to experiment. In the simplest form, you use parallel MKL to perform linear algebra manipulations, using BLAS/LAPACK calls. In this way the serial code is 'auto-parallelised', without any further ado:
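A sketch (the thread count is illustrative):
$ ifort myprogram.f90 -o myprogram -mkl=parallel
$ export OMP_NUM_THREADS=8
$ ./myprogram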
While making a purely MPI-parallelised program, it is better to avoid any confusion and use the sequential version of MKL:
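A sketch, here linking against SGI MPT with -lmpi (adjust to your MPI environment):
$ ifort myprogram.f90 -o myprogram -mkl=sequential -lmpi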
(...need examples here...)
Note:
'-lmkl_lapack' is now obsolete and should be removed from all Makefiles. It is worth reading the MKL documentation for a detailed description of the library and examples of use. Note that the correct environment should already be set up by default, as described above.
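For SCSL, the usual serial and OpenMP link lines look like this (a sketch; the names -lscs and -lscs_mp follow SGI's convention and should be checked against the local installation):
$ ifort myprogram.f90 -o myprogram -lscs
$ ifort myprogram.f90 -o myprogram -lscs_mp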
The second option gives you access to the OpenMP enabled version of the SCSL library.
FFTW is a specialized library for FFTs with C and Fortran interfaces. It offers rich functionality and a variety of options for many special purpose FFT calculations. It has been designed to deliver a portable set of routines which adapt the computation automatically to achieve good performance across a variety of platforms. It is compiled for multithreading and can deliver scalable performance using OpenMP parallelism in a transparent way. To link with FFTW3 on COSMOS use:
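For example (file names are placeholders):
$ icc myprogram.c -o myprogram -lfftw3 -lm
$ ifort myprogram.f90 -o myprogram -lfftw3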
The old FFTW2 (version 2.1.5) library - still the only release that features MPI parallelism - is also available in both single and double precision implementations. To link against this version use:
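A sketch (the exact library names depend on how FFTW2 was installed; these are the usual defaults):
$ icc myprogram.c -o myprogram -lrfftw -lfftw -lm                    # double precision
$ icc myprogram.c -o myprogram -lsrfftw -lsfftw -lm                  # single precision
$ icc mympiprogram.c -o mympiprogram -lfftw_mpi -lfftw -lmpi -lm     # MPI version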
You can read more about FFTW and access the full documentation on the FFTW webpage.
The SMP (parallel) Fortran, ordinary (single-processor) Fortran, Fortran90 and C libraries from the Numerical Algorithms Group are available on COSMOS. The libraries contain routines for a wealth of numerical problems including linear algebra (LAPACK), differential equations, random numbers, FFTs, sparse solvers, special functions, numerical integration, interpolation, optimisation and statistics.
The NAG SMP Fortran library is optimized for parallel execution via
OpenMP, which allows users to take advantage of multiprocessing in
computationally intensive routines by simply setting an appropriate
number of OpenMP threads (via OMP_NUM_THREADS).
To link with the SMP library use:
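A sketch (the library name -lnagsmp is the usual NAG convention and may differ locally):
$ ifort -openmp myprogram.f90 -o myprogram -lnagsmp -mkl=parallel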
The '-openmp' and '-mkl=parallel' flags ensure that the OpenMP run-time
libraries are linked in as needed. You can access the library documentation here.
This is the single-processor version of the NAG Fortran library; both the SMP and non-SMP libraries contain essentially the same set of routines, however the routines in the ordinary Fortran library will not execute in parallel across different processors (and so will not benefit from parallel speedup). This is not to say that they cannot be called by the individual threads of a parallel code (provided they are thread-safe, so that multiple instances executing simultaneously do not interfere with each other).
There are currently two versions of this library, depending on whether you wish to use BLAS and LAPACK routines provided by NAG, or by Intel MKL: e.g.
With BLAS and LAPACK provided by NAG --
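$ ifort myprogram.f90 -o myprogram -lnag_nag     # -lnag_nag assumed as the self-contained NAG build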
-- or, with BLAS and LAPACK and highly optimised routines provided by MKL --
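$ ifort myprogram.f90 -o myprogram -lnag_mkl -mkl=sequential     # -lnag_mkl assumed as the MKL-based build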
-- or even you may try --
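$ ifort -openmp myprogram.f90 -o myprogram -lnag_mkl -mkl=parallel     # threaded MKL; library name assumed as above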
-- if you would like to employ the threaded version of Intel MKL library. It may or may not help to boost the speed, depending on the array sizes and other matters.
Please see also the manual and the User notes.
It is possible to call the routines in either the SMP or non-SMP NAG Fortran libraries from within a C program. How to do this in general depends on how the library was built and on the development environment (compiler system and support libraries) being used. The basic recipe is to add a statement:
#include <nag.h>
where the header file nag.h
contains
suitable prototypes for the NAG routines in C, and then compile
(using in this case the Intel compiler) with:
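$ icc -openmp myprogram.c -o myprogram -lnagsmp -mkl=parallel     # a sketch; an -I flag for nag.h may also be needed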
-- if using SMP NAG library, or
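$ icc myprogram.c -o myprogram -lnag_nag     # a sketch; library name assumed as above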
-- in the single-threaded case.
Note that 'module load mkl' should no longer be
necessary, as the corresponding MKL libraries should be linked in automatically.
Please see also the C-Header documentation.
Again, there are currently two versions of this library available, depending on whether you wish to use BLAS and LAPACK routines provided by NAG, or by Intel MKL: e.g.
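$ icc myprogram.c -o myprogram -lnagc_nag     # -lnagc_nag assumed as the self-contained NAG C library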
(with BLAS and LAPACK provided by NAG) or
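$ icc myprogram.c -o myprogram -lnagc_mkl -mkl=sequential     # -lnagc_mkl assumed as the MKL-based build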
To link your C application with the NAG C Library use:
|
-- or even you may try --
|
Please see also the manual and User notes.
The most recent version of the library is installed as part of the COSMOLIB library stack. To link against it simply use a command like this:
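$ icc myprogram.c -o myprogram -lgsl -mkl     # a sketch: MKL supplies the CBLAS/LAPACK routines here instead of -lgslcblas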
Here the MKL library provides all the BLAS/LAPACK routines instead of the GSL CBLAS library, for much better performance. Please refer to the GSL Reference Manual pages for complete information on GSL routines and options.
The SGI NUMA tools are part of the Propack distribution and are intended to give the programmer greater control over CPU and memory placement of the application. From the point of view of COSMOS users, the two most important tools are dlook and dplace.
With dplace you can bind processes to a specific CPU to avoid process
migration.
This is used on COSMOS to ensure good performance of the
high-priority (project) queues. In addition, it is convenient to use
dplace with the flag
'-x2' for OpenMP programs to skip the binding of the shepherd thread
created by the run time library, and use '-s1' with MPI applications to avoid
binding the shepherd process created by the SGI implementation of MPI. E.g.:
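$ export OMP_NUM_THREADS=8     # thread count illustrative
$ dplace -x2 ./myprogram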
for OpenMP, and
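$ mpirun -np 8 dplace -s1 ./myprogram     # process count illustrative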
for MPI codes linked against MPT.
Dlook is an SGI application to determine the placement of virtual memory pages on NUMA architectures. Using dlook, you can determine if your application has most of its memory allocated on a local node (desirable), and ensure that parallel (OpenMP) applications allocate memory evenly among the worker threads (by initializing the main data structures inside a parallel region).
To use dlook with your application use:
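$ dlook pid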
where pid is the process id
of the running program you want to monitor. Note that for OpenMP
applications all the threads will share the same memory map, so the
output from dlook will be identical for all of them.
You can send us an email for assistance with programming or software issues on COSMOS. For compilation problems, please report the version of the compiler used and the error message, and preferably indicate the path to the relevant source code.
Subject to the workload, we can also help you port and optimize your application to run on the Altix. If you have an application that requires tuning, please submit a request to cosmos_sys.