COSMOS Developer Guide


Please refer to the new COSMOS web pages for the most up to date information.

Manual started by Victor Travieso, May 2004
Last updated by Andrey Kaliazin, July 18, 2011

The following pages give an overview of the development tools available on COSMOS. They are intended as a gentle introduction to the software and libraries available on the Altix, and should contain enough information to get you started. Note, however, that this document is far from comprehensive and is not intended as a substitute for the official manuals. Throughout the text we have indicated links to the relevant documentation, all of which is available online. You should refer to these links for more complete information on the following topics.

This guide is a work in progress, and we expect to update it frequently as new software becomes available. If you spot any errors or know of a topic that you think should be included in the guide, please contact us on cosmos_sys. Constructive feedback will be appreciated.

Quick start

1) Configure your development environment:

On COSMOS the default and the only recommended way of setting up the environment is by use of the environment modules. We have an extensive set of modules to support various compilers, scientific libraries and applications (See software, development and visualization pages).

The most commonly-used environments (to access FFTW, GSL and some other libraries) are collated into the cosmolib module. It is also the prerequisite for all other modules, so it should always be loaded first.

Most of the cosmolib libraries are compiled with the latest Intel Compilers, which require the use of the latest module.

Thus the default environment on COSMOS is set by loading two modules (cosmolib and latest) in the .bashrc file:

$ module load cosmolib
$ module load latest

This should be set by default, so if it is not, please contact us as soon as possible: your environment needs updating.

If you have a code that used to work with version 9 of the compilers but now fails with the newer version, please email us and let us know about it. You can still access the version 9.1 environment by loading the 'icomp91' module.

2) Compile your application with Intel compilers. E.g.:

$ ifort program.f90 -o myprogram
$ icc   program.c   -o myprogram

3) Do a test run interactively (preferably with a small/toy data set). On COSMOS, 8 processors are reserved for interactive use, to allow rapid interactive response and the ability to test-run parallel programs. You just need to launch your application from the shell, e.g.:

$ ./myprogram file.dat

Interactive jobs are limited to 30 minutes of CPU time. If you want to test a longer job, you can use the 'express' queue, with a maximum of 8 processors and 2 hours of run time.

4) If there are any problems recompile and run the program under the debugger interactively. E.g.:

$ ifort -g -O0 program.f90 -o myprogram
$ idb myprogram
(idb) run file.dat

5) If everything seems to be working fine, create a job script and submit the job to the appropriate queue.


Most of the development software on COSMOS presented here has a module interface. That is, it either exists in a number of incompatible versions, or uses non-standard paths in the directory hierarchy, and access to it is not enabled by default. The interface to the different packages is managed via environment modules. This makes it easier to maintain several versions of the same software while avoiding interference between similar components. To access a particular software package or library with a module interface, you should first load the corresponding module using:

$ module load full-module-name

This will set up the necessary environment variables and paths required by the package. You can load as many modules as you require in a session, but be aware that if you load multiple modules with overlapping components (e.g. two different versions of the same compiler), the module loaded last will take precedence. To avoid confusion you can unload a module once it is no longer needed by typing:

$ module unload full-module-name

There are various commands available to manage modules. Among the most useful ones are:

$ module list

To see all the modules that you have currently loaded. And:

$ module avail

To see a list of all the modules available on COSMOS. Use 'module -help' for a complete list of commands. The default developer environment on COSMOS is based around the Intel Compilers.


A Makefile is a collection of instructions to automate the compilation of programs. Having one for your program is a great time saver if you are developing code yourself and often changing source files, and especially if you are planning to share the code with other people.

Creating a Makefile is very easy, you just need to specify your source files and the rules that build your program. Once you have a working Makefile, you can build your program by invoking 'make' or 'make all'.

There are many tutorials online about creating and managing Makefiles, see for instance the Introduction to Makefiles from the GNU make manual.

We have compiled a few templates that you can edit and use to build your programs on Cosmos:

C Makefile template
Fortran90 Makefile template
Example C application
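As an illustration, a minimal Fortran90 Makefile along the lines of these templates might look like the one below. The file names, flags and target name are hypothetical placeholders - adapt them to your own sources:

```makefile
# Compiler and flags - adjust to the modules you have loaded
FC      = ifort
FFLAGS  = -O2

TARGET  = myprogram
OBJECTS = main.o utils.o

all: $(TARGET)

# Link the object files into the final executable
$(TARGET): $(OBJECTS)
	$(FC) $(FFLAGS) -o $@ $(OBJECTS)

# Compile each Fortran90 source file into an object file
%.o: %.f90
	$(FC) $(FFLAGS) -c $<

# Remove generated files so the next 'make' rebuilds from scratch
clean:
	rm -f $(TARGET) $(OBJECTS) *.mod
```

With this in place, 'make' (or 'make all') builds the program and 'make clean' removes the generated files.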

Using CVS

CVS is a popular open source version control system. It works by keeping a master copy (CVS repository) of your source files. When you want to work on the application, you check out the latest version in the repository into a working copy, which you then edit and modify at will. If you're happy with the changes, you can update the repository with your working copy by committing the changes and logging a description of the modifications.

By putting your source files under the control of CVS, you can easily keep track of changes made during development, manage different versions of the code or contributions by different people, and, on COSMOS, you can also simplify the task of backing up your sources by keeping the CVS repository on a backed-up filesystem (eg. /home/cosmos/).

You can set up the root of your CVS repository in your COSMOS home directory by setting the environment variable CVSROOT - when this variable is set, all the cvs commands operate on that directory. A CVS repository is created with the command 'cvs init':

$ export CVSROOT=/home/cosmos/PROJECT_GROUP/USER_ID/cvs
$ cvs init

Once you have a working CVS repository, you can start adding projects to it. If you already have sources and Makefiles in a directory, you can place them under CVS by going into the application directory and importing it into CVS. The import command below will copy the contents of the application directory to the CVS root, with the project name PROJECT. USER can be any tag you want, like your user id.

Every time you commit something to the repository, CVS will fire up an editor and ask you to enter some description of the changes made. You can select your favourite editor by setting the CVSEDITOR environment variable. For example:

$ export CVSEDITOR=emacs    
$ cvs import PROJECT USER start

To work on a project, you 'check out' a working copy from the repository, which you can then use to build your application or to edit your source files. If you make modifications to the sources that you want to keep, you can 'commit' them to the repository. CVS will ask you to enter a brief description of the changes and will update the repository with the incremental difference from the working copy, assigning it a new version number. In this way, you can later review or undo any changes you have made. Some useful commands in CVS are:

# Working on a project - making a local work copy from the repository
$ cvs checkout PROJECT

# Saving changes back to the repository:
# 'update' makes cvs aware of any changes in the working copy
# 'commit' saves them in the repository
$ cvs update
$ cvs commit

# Working remotely through ssh - eg. from a registered COSMOS host
# replace USER_ID, COSMOS_HOST and CVS_ROOT_DIR
$ export CVS_RSH="ssh"
$ export CVSROOT=":ext:USER_ID@COSMOS_HOST:CVS_ROOT_DIR"

# Examining changes in the working copy to the repository version
$ cvs -Q diff -c

# Checking the history of changes in a file
$ cvs log FILE

There are many other features and commands in CVS which are beyond the scope of this guide. An excellent CVS FAQ and Manual can be found at Ximbiot.


There are several compilers for the Itanium2 architecture available on COSMOS. We strongly recommend the use of the Intel compilers for the best performance. Note that the Intel compilers are evolving rapidly, and new patches appear regularly - as part of the process to make them more robust and reliable on the Altix platform, it is important to collect and report to Intel any bugs that you might encounter. If you have a program that fails to compile and you suspect that the code is legal, it might be due to a compiler bug. Please send an email to cosmos_sys with details of the code (preferably include a toy example) and the error message; if you have your own Premier support account with Intel and submit a report, please let us know the details.

The Itanium architecture follows a philosophy of less logic, more resources. This means that the chip has enough functional units and registers to sustain a very high performance (6 operations per clock cycle) but the compiler must do a very good job to generate optimal code. Careful use of optimization flags and close analysis of the optimizations done by the compiler is often necessary to avoid performance losses.

Intel Compilers v.12 (Intel Composer XE Suite, as it is called now)

The latest version of the Intel compiler suite, Intel Composer XE, contains numerous enhancements over previous releases. (It is only available on x86_64 architectures (universe), so for the old cosmos machine version 11.1 is the last one available.) Codes that built cleanly under previous versions should also compile with the latest compilers, but changes in the run-time library routines and in compiler behaviour may mean that your code will need some modifications before it can be compiled with the new compilers.

There are several flags that determine the type and level of optimizations that the compiler can do on your code. Here we list the options that have shown the most significant impact in achieving good performance levels. Note that a fixed set of options cannot be given as the 'best choice', since the final result will depend heavily on the particular program. You may need to experiment with different combinations before you get a satisfactory result. Sometimes the compiler will need some help to deal with specific loops. Be sure to look at the optimization report to identify factors inhibiting optimizations. For a comprehensive list of optimization options and more detailed explanations refer to the Intel optimization guide, which can be found here; there is also a step-by-step tutorial from Intel on code optimization.

-O3 Enables -O2 plus more aggressive loop and floating point optimizations. It also turns on prefetching.
-ftz Flushes denormalized numbers to zero. (ON with -O3).
-fno-alias Assumes no aliasing between pointers (i.e. they don't overlap in memory). Allows the compiler to find more opportunities for optimal pipelining.
-fno-fnalias Instructs the compiler not to assume aliasing within functions.
-ip Enables optimizations across procedures/subroutines (eg. inlining) in the same source file.
-ipo Enables inter-procedural optimizations across multiple files.
-align Ensures proper alignment of data on memory boundaries for faster loads.
-auto Allocates local variables on the stack. (ON with -openmp).
-prof_gen / -prof_use Enables profile guided optimizations (PGO). Requires three phases: compilation with -prof_gen, running the program, and recompiling with -prof_use. Has a greater impact on codes that make heavy use of branches.
-opt_report Generates an optimization report, detailing changes to the code in different optimization phases. A useful starting point is: "-opt_report -opt_report_file OR.out -opt_report_phase hlo -opt_report_phase ecg_swp". Then examine "OR.out" for pipelining failures and loop transformations.
-O2 A more conservative approach to optimization. Useful in combination with selected loop and floating point optimizations when accuracy is an issue or -O3 is degrading performance.

Intel Compiler 11

The previous version of the Intel Compiler (11.1) is still available on both universe and cosmos and works fine for the time being. It remains on universe for compatibility and on cosmos as the default compiler suite. If you need to access the 11.1 compilers, you can use the following module command (11.1.075 was the last released version of the 11.1 line):

$ module load icomp/11.1.075

The most common optimization options behave as with version 12 above.

Intel older compilers (10.x etc)

The earliest version of the Intel Compiler still kept (10.1) remains available and should work fine for the time being, but the use of the 10.1 compilers is strongly discouraged. If you still need to access the old compilers, you can use the following module command:

$ module load icomp/10.1.026


GNU Compilers

The GNU suite of compilers is, of course, fully compatible with both the x86_64 and ia64 architectures. The C/C++ compilers are mature and robust, and most software distributions for Linux will autodetect gcc and configure themselves with the appropriate options (you can override this behaviour by setting appropriate environment variables or giving specific command line options to the configure script, eg. to build the application with the Intel C compilers, although you may need further changes to the scripts). The gfortran compiler will compile Fortran77 code, as well as Fortran90, 95 and 2003. Since version 4.3 the GNU compilers support OpenMP directives, but the Intel compilers are much more mature in this regard.

In general, gcc/gfortran will not generate optimal code for either the Xeon or the Itanium2 chip, and for computationally intensive applications performance can be as low as 40% of that achieved with the Intel compilers. If you need to use the GNU compilers for compatibility reasons, the following flags may help improve performance.

-O3 Highest level of optimization for gcc/g77. Enables most optimization flags.
-ffast-math Allows certain floating point optimizations that don't conform to the IEEE standard.
-funroll-loops Might improve speed by unrolling iterative DO-loops and DO-WHILE loops (-funroll-all-loops unrolls all loops).
-finline-functions Allows inlining of small functions. (ON with -O3)
-fprefetch-loop-arrays Will prefetch arrays inside loops.



Debuggers are an essential tool for identifying and fixing programming errors. It is highly recommended that you familiarize yourself with one of them and use it regularly when faced with unexpected behaviour of your program - even for simple bugs, the use of a debugger is preferable to peppering your code with print statements. Although the list of commands might seem daunting to the novice user, there are only a handful of them that you will need to solve most problems. In addition, all the debuggers on COSMOS can be used via an intuitive GUI that greatly simplifies the task of interacting with the debugger. To debug your program you must first compile your code with the -g flag, and then launch the application from inside the debugger. On COSMOS you may need to increase the stack size before launching your application.
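For example, the stack size limit for the current shell can be inspected and raised with ulimit before starting the debugger (treat 'unlimited' as a suggestion - pick a value appropriate to your program):

```shell
# Show the current (soft) stack size limit, in kilobytes
ulimit -s

# Raise the limit for this shell session; this affects programs
# launched from this shell, including those run under a debugger
ulimit -s unlimited

# idb myprogram   # then start the debugger as usual
```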

Intel Debugger (idb)

The Intel Debugger is distributed with the Intel compilers, and is accessible when the appropriate compiler module is loaded. You can start the debugger by typing idb. The Intel Debugger presents a command line interface and can operate natively (recommended), or emulate the behaviour of the Unix debugger (DBX) or the GNU debugger (invoked via 'idb -gdb'). An example session is started by e.g.:

$ ifort -g -O0 myprogram.f90 -o myprogram
$ idb myprogram
(idb) run

Although idb does support debugging optimized code, it is better to disable optimization when compiling for debugging purposes (hence the '-g -O0' options). Debugging of parallel programs is fully (if somewhat messily) supported for MPI codes, and somewhat limited for OpenMP codes. In the latter case you won't be able to examine shared variables and locks (e.g. not very useful for spotting race conditions). We expect better OpenMP support in future releases.

Debugging with Intel Debugger in the comfort of GUI interface is fully supported on x86_64 platforms (via Eclipse), so will be available on the next COSMOS system, which is planned for the deployment in Q2'2010.

Extensive documentation is available from Intel, including a short tutorial for a quick introduction.

GNU Debugger (gdb)

The GNU Debugger (GDB) supports C/C++ and Fortran77/90 debugging, and is compatible with code generated via GNU and Intel compilers. GDB can be used to debug parallel (MPI) and threaded programs, although OpenMP is not currently supported. The command line interface is very rich, and the documentation can be accessed by consulting the 'man' pages or looking at the official manual from the FSF online. Additionally, there are plenty of tutorials and guides available on the web. The GDB tutorial is a good starting point.

Data Display Debugger (DDD)

The Data Display Debugger (DDD) is a graphical application that allows you to interact easily with an inferior debugger. DDD works best when used with the GNU Debugger (the default when invoked with ddd). You can also use it with idb, although there are some minor glitches that may be confusing for the novice user, and you may need to interact with the debugger via the command line to access all the functionality of idb. There is extensive documentation available for DDD, including a very useful step-by-step tutorial. To use ddd with the Intel debugger you can use the following command:

$ ddd -dbx --debugger "idb -dbx -fullname" myprogram &

Or, if you want to run the "myprogram" with arguments (for example: '--ini-file=params.ini --out-file=myprog.out'), then call the debugger like this:

$ ddd -dbx --debugger "idb -dbx -fullname" --args myprogram --ini-file=params.ini --out-file=myprog.out &

Performance Analysis

Profiling your application can help you understand why the program is not running as fast as you expected and will give you pointers as to what parts of the code are causing the slowdown. Using performance analysis tools you can quickly identify performance bottlenecks and hot-spots (parts of the program where most of the time is spent) and guide the optimization effort accordingly. In particular, using performance data in conjunction with the optimization reports from the Intel compilers will tell you if there are parts of the application where the compiler needs some help to generate optimal code. Optimizing an application can be very time consuming, so it is essential to focus on the areas that will have a significant impact on the overall performance of the program.

qprof (cosmos only)

Qprof is a simple profiling utility to generate a breakdown of the time spent in various subroutines or lines of your code. It only requires you to set an environment variable before running your program, although more useful information will be displayed by compiling the code with debugging symbols '-g'. Qprof works with any version of the Intel compilers.

Usage example:

$ module load qprof
 $ ./mycommand
 qprof: /tmp/mycommand: 150 total samples
 main:dumb_test.c:6                                                     59
 main:dumb_test.c:7                                                     61
 main:dumb_test.c:8                                                     30

If you prefer to set the variable manually, just set or export LD_PRELOAD to the full path of the qprof shared library:

$ LD_PRELOAD=/home/cosmos/share-ia64/lib/ ./mycommand

The behaviour of qprof can be controlled by a number of environmental variables. More information here.

histx tools - cosmos only (SGI: Performance Analysis and Debugging)

HistX is a set of tools from SGI with similar functionality to the performance analysis programs available on the Origin. Those familiar with the tools on the previous COSMOS machine will find HistX very useful. In brief, the command 'histx' lets you sample selected events throughout the run time of your application without any instrumentation of the code. The output from histx can then be processed with 'iprep' to produce a human-readable report of the results.

Running histx to sample the default event (CPU_CYCLES) is done with:

$ module load sgi
$ histx -f -o profile program

This will create the file with the experiment results. You can then use iprep to process the output using:

$ iprep <

The output of the default run will look very similar to that obtained from a standard profiler (e.g. gprof or prof on Unix systems). You can specify other events to be sampled using the '-e' option. In addition, you can relate events to particular lines in your source code by compiling with debugging symbols (-g) and running 'histx -l'. You can find some examples of histx use in the package documentation.

pfmon - cosmos only (SGI: Performance Analysis and Debugging)

Pfmon is a low level tool to access the Performance Monitoring Unit (PMU) of the Itanium chip. With pfmon you can access the hardware counters available on the Itanium to sample all the performance events available (over 300 events) in sets of 4 events at a time. This is a very powerful tool with a rich set of options, but it requires careful use to extract valuable information from the large number of counters available. To monitor a program you just need to invoke pfmon on the unmodified binary with the events that you wish to sample (if no events are specified the default is CPU_CYCLES). For example, to count the number of cycles, number of instructions retired, and number of no-ops retired you can type:

$ pfmon -e cpu_cycles,ia64_inst_retired,nops_retired program

You can access information about a particular event using the '-i' option, and list events matching a particular pattern using '-l'. For example, to see the L2 cache related events that can be counted use:

$ pfmon -lL2

There is much more to pfmon than we can cover here. If you plan to use it for performance analysis, be sure to read the pfmon user guide and the Itanium2-specific features. There are higher-level wrappers to pfmon that allow easy sampling of events with a drill-down approach and quick interpretation of results.

Perl wrappers to pfmon - cosmos only

A good alternative to pfmon is to use the higher-level interfaces provided by these two perl wrappers. In essence, these tools invoke pfmon to do the actual sampling, but allow you to progress in a drill-down fashion by having important events predefined and processing the counting results into meaningful statistics.

We recommend that you start with one of these wrappers for performance analysis. In particular, they will allow you to characterize the performance of your application easily and quickly by using the predefined event groups. A good starting point is:

$ -p -d pfmon.out -t "efficiency" program

This will run your program under pfmon as many times as necessary to collect relevant events and will process the output to present useful statistics such as the number of instructions per cycle, percentage of no-ops, percentage of stalls, etc. The tutorial is also a good starting point to understand how to interpret these and more detailed statistics and how to relate them back to your source code.

GNU gprof

Using the GNU Profiler (gprof) you can get an execution profile of your application detailing which subroutines are consuming most of the run time. In order to use gprof you must first compile the application with the -p flag to generate profile information, and then run it once to produce the output file (normally this will create the file gmon.out). You can then invoke gprof with the name of your application to see the time profile, e.g.:

$ ifort -p test.f90 -o test
$ ./test
$ gprof -p test

Numerical Libraries

Using numerical libraries is the easiest way of achieving high performance in your application. On COSMOS there is a variety of libraries that have been designed and tuned specifically for the Altix/Itanium2 architecture, delivering considerable performance improvements over freely available or general purpose numerical codes.

Some of the libraries have been extended with OpenMP directives to provide parallel execution, so the benefits of multiprocessing can be easily obtained just by linking with the SMP library and setting the OMP_NUM_THREADS environment variable to the number of processors required. Note however that SMP libraries have limited scalability on NUMA platforms like the Altix, due to the unavoidably higher latencies of remote memory accesses. In general, this means that setting the number of threads to a value higher than 12-16 will not give further performance improvements, and may even increase the run time.
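As a sketch, running an application linked against one of the SMP libraries on 8 processors would look like this (myprogram is a placeholder for your own binary):

```shell
# Ask the OpenMP run time (and hence the SMP library) for 8 threads
export OMP_NUM_THREADS=8
echo "Using $OMP_NUM_THREADS threads"

# ./myprogram   # then launch the application as usual
```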

If you need a special purpose numerical library not listed here, you can email us your request and we will consider installing it on COSMOS.

Intel MKL

See online documentation here 
Version: 10.3
Module: none required - MKL is installed with the Intel compilers
Notes: n/a

The Math Kernel Library contains the following routines:

To link your program with the default MKL a typical compilation line would be:

$ ifort program.f90 -o program -mkl=<sequential|parallel> -lmkl_subset

The parallel version of MKL is thread-safe, which means you can use it for further parallelisation within MPI processes (a hybrid type of parallelisation).

In general, choosing the parallel or sequential version of the MKL library depends on the way you intend to parallelize your program. There are several considerations to take into account and you will most likely need to experiment. In the simplest form, you use parallel MKL to perform linear algebra manipulations using BLAS/LAPACK calls. In this way the serial code is 'auto-parallelised', without any further ado:

$ ifort myprog.f90 -o myprog -mkl=parallel

While making a purely MPI-parallelised program, it is better to avoid any confusion and use the sequential version of MKL:

$ ifort mpiprog.f90 -o mpiprog -mkl=sequential -lmpi

(...need examples here...)


  1. BLAS/LAPACK routines are now internal to the MKL core library, so there is no separate MKL Lapack library and the customary flag '-lmkl_lapack' is now obsolete and should be removed from all Makefiles.
  2. there is no separate subset for FFTs
  3. some subsets, e.g. solver, lapack95, are available only as .a archives for static linking.

It is worth reading the MKL documentation for a detailed description of the library and examples of use. Note that the correct environment should already be set up by default, as described above.

SGI SCSL - cosmos only (now obsolete anyway)

The Scientific Computing Software Library (SCSL) from SGI has been kept on COSMOS for backward compatibility, but it is no longer developed or supported and is likely to disappear altogether in the next COSMOS incarnation. Intel's MKL, the FFTW and GSL libraries are recommended replacements for the SCSL routines. The library covers the following areas:
To link with the SCSL library add the following flag when linking:

$ ifort program.f90 -lscs
$ ifort program.f90 -lscs_mp

The second option gives you access to the OpenMP enabled version of the SCSL library.


FFTW is a specialized library for FFTs with C and Fortran interfaces. It offers rich functionality and a variety of options for many special purpose FFT calculations. It has been designed to deliver a portable set of routines which adapt the computation automatically to achieve good performance across a variety of platforms. It is compiled for multithreading and can deliver scalable performance using OpenMP parallelism in a transparent way. To link with FFTW3 on COSMOS use:

$ ifort program.f90 -lfftw3 -openmp

The old FFTW2 (version 2.1.5) library - still the only release that features MPI parallelism - is also available in both single and double precision implementations. To link against this version use:

#-- eg. serial program --
$ icc program.c -lfftw

#-- eg.  parallel program, using MPI --
$ icc program.c  -lfftw_mpi -lfftw -lmpi

You can read more about FFTW and access the full documentation on the FFTW webpage.

NAG Libraries

The SMP (parallel) Fortran, ordinary (single-processor) Fortran, Fortran90 and C libraries from the Numerical Algorithms Group are available on COSMOS. The libraries contain routines for a wealth of numerical problems including linear algebra (LAPACK), differential equations, random numbers, FFTs, sparse solvers, special functions, numerical integration, interpolation, optimisation and statistics.

NAG SMP Fortran Library (Mark 22)

The NAG SMP Fortran library is optimized for parallel execution via OpenMP, which allows users to take advantage of multiprocessing in computationally intensive routines by simply setting an appropriate number of OpenMP threads (via OMP_NUM_THREADS). To link with the SMP library use:

$ ifort -openmp program.f -o program -lnagsmp -lnag_mkl -mkl=parallel

The '-openmp' and '-mkl=parallel' flags ensure that the OpenMP run-time libraries are called as needed. You can access the library documentation here.

NAG Fortran Library (Mark 22)

This is the single-processor version of the NAG Fortran library; both the SMP and non-SMP libraries contain essentially the same set of routines, but the routines in the ordinary Fortran library will not execute in parallel across different processors (and so will not benefit from parallel speedup). This is not to say that they cannot be called by the individual threads of a parallel code (provided they are thread-safe, so that multiple instances executing simultaneously don't interfere with each other).

There are currently two versions of this library, depending on whether you wish to use BLAS and LAPACK routines provided by NAG, or by Intel MKL: e.g.

With BLAS and LAPACK provided by NAG --

$ ifort -fpic -g -O3 -xHost program.f -o program -lnag_nag

-- or, with BLAS and LAPACK and highly optimised routines provided by MKL --

$ ifort -fpic -g -O3 -xHost program.f -o program -lnag_mkl -mkl=sequential

-- or even you may try --

$ ifort -fpic -g -O3 -xHost program.f -o program -lnag_mkl -mkl=parallel

-- if you would like to employ the threaded version of the Intel MKL library. It may or may not boost the speed, depending on the array sizes and other factors.

Please see also the manual and the User notes.

Using the NAG Fortran Libraries from C

It is possible to call the routines in either the SMP or non-SMP NAG Fortran libraries from within a C program. How to do this in general depends on how the library was built and on the development environment (compiler system and support libraries) being used. The basic recipe is to add a statement:

#include <nag.h> 

where the header file nag.h contains suitable prototypes for the NAG routines in C, and then compile (using in this case the Intel compiler) with:

$ icc -fpic -g -O3 -xHost -openmp program.c -o program -lnagsmp -lnag_mkl -mkl=parallel

-- if using SMP NAG library, or

$ icc -fpic -g -O3 -xHost program.c -o program -lnag_mkl -mkl=sequential

-- in the single-threaded case.

Note that module load mkl should no longer be necessary as the corresponding MKL libraries should be linked with automatically.

Please see also the C-Header documentation.

NAG Fortran90 Library (Release 4)

Again, there are currently two versions of this library available, depending on whether you wish to use BLAS and LAPACK routines provided by NAG, or by Intel MKL: e.g.

$ ifort program.f90 -o program -lnagfl90

(with BLAS and LAPACK provided by NAG) or

$ ifort program.f90 -o program -lnagfl90_mkl -mkl=sequential

(with BLAS and LAPACK and highly optimised routines provided by MKL). Please see also the manual and User notes.

NAG C Library (Mark 09)

To link your C application with the NAG C Library use:

$ icc -fpic -g -O3 -xHost program.c -o program -lnagc_nag -lpthread

-- or even you may try --

$ icc -fpic -g -O3 -xHost program.c -o program  -lnagc_nag -lnag_mkl -mkl=sequential

Please see also the manual and User notes.

GNU Scientific Library (GSL)

The most recent version of the library is installed as part of the COSMOLIB library stack. To link against it simply use command like this:

$ icc program.c -o program -lgsl -mkl=sequential

Here the MKL library provides all the BLAS/LAPACK routines instead of the GSL CBLAS library, for much better performance. Please refer to the GSL Reference Manual pages for complete information on GSL routines and options.

Miscellaneous Tools

NUMA Tools

The SGI NUMA tools are part of the Propack distribution and are intended to give the programmer greater control over CPU and memory placement of the application. From the point of view of COSMOS users, the two most important tools are dlook and dplace.

With dplace you can bind processes to a specific CPU to avoid process migration. This is used on COSMOS to ensure good performance of the high-priority (project) queues. In addition, it is convenient to use dplace with the flag '-x2' for OpenMP programs to skip the binding of the shepherd thread created by the run time library, and use '-s1' with MPI applications to avoid binding the shepherd process created by the SGI implementation of MPI. E.g:

dplace -x2 ./my_openmp_program

for OpenMP, and

 mpirun -np N dplace -s1 ./my_mpi_program

for MPI codes linked against MPT.

Dlook is an SGI application to determine the placement of virtual memory pages on NUMA architectures. Using dlook, you can determine whether your application has most of its memory allocated on a local node (desirable), and ensure that parallel (OpenMP) applications allocate memory evenly among the worker threads (by initializing the main data structures inside a parallel region).

To use dlook with your application use:

$ dlook pid

where pid is the process id of the running program you want to monitor. Note that for OpenMP applications all the threads will share the same memory map, so the output from dlook will be identical for all of them.

Technical documentation

Getting Help

    You can send us an email for assistance with programming or software issues on COSMOS. For compilation problems, please report the version of the compiler used and the error message, and preferably indicate the path to the relevant source code.

Subject to the workload, we can also help you port and optimize your application to run on the Altix. If you have an application that requires tuning, please submit a request to cosmos_sys.