histx 1.1 Documentation

0. Introduction
== ============

histx 1.1 is a small set of tools that can assist with performance analysis
and bottleneck identification in IA-64 applications running on Linux.  It
works strictly at the application level and cannot be used to identify
operating system or I/O bottlenecks.

This documentation briefly describes the tools and how to run them.  It does
not describe how the results produced by these tools should be used for
application analysis and tuning.

Here is a brief description of the tools.  Each tool is described in depth
in later sections.

. lipfpm - obtains PMU event counts for the entire execution of an
  application
. samppm - repeatedly samples PMU event counts during application execution
. dumppm - dumps binary files produced by samppm in human/script readable
  form
. histx  - samples instruction pointer (ip) or callstack on timer or PMU
  related events
. iprep  - sorts, merges, and "pretty print"s one or more raw ip sampling
  reports produced by histx
. csrep  - produces a "butterfly report" from one or more raw callstack
  sampling reports produced by histx

1. lipfpm (Linux IPF Performance Monitor)
== ====== ====== === =========== ========

lipfpm allows one to obtain values of IPF Performance Monitor Event counts
for the entire execution of an application.  Here is a quick sample using
the default event:

% lipfpm /bin/ls
[...normal ls output...]
lipfpm summary
======= =======
CPU Cycles................................................. 2831964
%

One can simultaneously count as many events as the hardware supports;
Itanium and Itanium2 have 4 hardware counter registers.  lipfpm also
contains a database of useful expressions (additions welcome) involving
multiple events, and will display the result if applicable:

% lipfpm -e IA64_INST_RETIRED.THIS /bin/ls
[...normal ls output...]
lipfpm summary
======= =======
Retired IA-64 Instructions -- Retired IA-64 Instructions... 517267
CPU Cycles................................................. 661705
Average number of Itanium instructions per cycle........... 0.781718
%

To see how to run lipfpm, just type "lipfpm":

% lipfpm
usage: lipfpm [-c name] [-e name]* [-f] [-i] [-h] [-k] [-l] [-o path] [-p] command args...
   -c   request named collection of events;
        may not be used with `-i' or `-e' arguments
   -e   specify events to monitor (for event names see Intel documents)
   -f   follow fork (default: off)
   -h   this message (command not run)
   -i   specify events interactively; keys are provided to display the
        next or previous event, to select the current event, and to
        finish.  Typing letters skips to events starting with the same
        letters (some non-alphabetic characters are treated as letters).
   -k   count at privilege level 0 as well (default: off)
   -l   list names of all events (other args ignored)
   -o   send output to path.<command>.<pid> instead of standard output
   -p   produce easier to parse output
%

If you have trouble remembering the names of all the events, they can be
listed with the `-l' option, or the `-i' option can be useful for
interactively finding events of interest.  Note, however, that standard
input should not be redirected when using `-i'; i.e.

% lipfpm -i a.out < input

will not behave as desired.

One of the most useful features of lipfpm is its ability to account
properly for applications that fork, pthreaded applications, and
applications that exec.  Suppose an application called `ptprog' forks once,
the fork child calls exec, and each process creates an extra thread for
computation:

% lipfpm -f -e LOADS_RETIRED -e NOPS_RETIRED -o cnts ptprog
[...normal ptprog output...]
% ls cnts.*
cnts.ptprog.16196    cnts.ptprog.16198    cnts.ptprog.16201
cnts.ptprog.16197    cnts.ptprog.16199    cnts.ptprog.16202
cnts.ptprog.16197-1  cnts.ptprog.16200    cnts.ptprog.16203

A report file is created for each thread.  (libpthread creates an extra
helper thread, so there are 9 in total.)  Since the fork() child exec()ed
the same ptprog and retains its PID, `-1' was appended to the "usual" name
to distinguish the post-exec report.
Subsequent exec()s would result in further suffixes: `-2', `-3', ...

NOTE: lipfpm does not work on statically linked applications.

NOTE: lipfpm does not exhaustively check the set of requested events for
      correctness.  It is the user's responsibility to supply only sets of
      events that the processor is capable of monitoring correctly.

NOTE: To run an MPI program which uses SGI MPT:

      % mpirun -np N lipfpm a.out

2. samppm (SAMPle Performance Monitor)
== ====== ======= =========== ========

lipfpm is a useful tool for obtaining total performance monitor event
counts for the entire execution of a program.  However, one may also wish
to examine the time varying values of various event counts.  samppm allows
one to sample a number of requested event counts at a rate specified on the
command line.  At each sample time, the counts are read and stored in
binary form in the output file along with a time stamp.  The `dumppm'
filter (see below) is then used to convert the binary file to a
human/script readable columnar format.

samppm is invoked in almost exactly the same manner as lipfpm, with the
following differences:

. An output file (prefix) must be given (`-o' argument).
. A sample rate can be given (`-r nnn'); it defaults to 10 `ticks'.
. lipfpm's `-i' mode is not available.
. lipfpm's event count expressions are not available.

To see how to run it, just type "samppm":

% samppm
usage: samppm [-c name] [-e name]* [-f] [-h] [-k] -o path [-r n] command args...
   -c   request named collection of events;
        may not be used with `-e' arguments
   -e   specify events to monitor (for event names see Intel documents)
   -f   follow fork (default: off)
   -h   this message (command not run)
   -k   count at privilege level 0 as well (default: off)
   -o   send binary output to path.<command>.<pid> (REQUIRED)
   -r   sample every n ticks (default 10)
Note: One tick is about 0.977 milliseconds
%

Like lipfpm, samppm produces separate output files for each thread and each
exec, and optionally for each forked child (as requested via the `-f'
argument).
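Both lipfpm and samppm name their reports prefix.<command>.<pid>, with
`-1', `-2', ... appended after each exec.  For scripted post-processing,
the naming scheme can be split apart with a few lines of Python (a
hypothetical helper for illustration, not part of the tools):

```python
import re

def parse_report_name(path):
    """Split a report file name of the form prefix.<command>.<pid>[-<n>]
    into (command, pid, n); <n> counts exec()s (0 before the first exec)."""
    m = re.match(r'^(?P<prefix>.+)\.(?P<command>[^.]+)\.(?P<pid>\d+)'
                 r'(?:-(?P<n>\d+))?$', path)
    if m is None:
        raise ValueError('not a lipfpm/samppm report name: %r' % path)
    return (m.group('command'), int(m.group('pid')),
            int(m.group('n')) if m.group('n') else 0)
```

For example, parse_report_name('cnts.ptprog.16197-1') yields
('ptprog', 16197, 1), i.e. the post-exec report of process 16197.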
The NOTEs in the section on lipfpm apply equally to samppm.

3. dumppm (DUMP Performance Monitor results from samppm)
== ====== ===== =========== ======= ======= ==== =======

dumppm is used to dump the binary file produced by a samppm run into a
human/script readable columnar format.  To see how to use it:

% dumppm -h
usage: dumppm [-c] [-d] [-h] [-l n1,n2,...] [<] infile [>] outfile
   -c   only list event names
   -d   print differences and relative times
   -h   print help
   -l   print only events n1, n2, ... (event 0 is time)

Note that dumppm can operate as a filter, e.g.:

% cat file.prog.1234 | dumppm -d | more
% dumppm -d < file.prog.1234 | more
% dumppm -l 2,3 file.prog.1234 > results
% dumppm -d file.prog.1234 results

The `-c' flag is used to quickly tell one what is contained in the file,
and shows the correspondence between events and output columns.  For
example:

% samppm -e CPU_CYCLES -e LOADS_RETIRED -r 2 -o dat myprog
% dumppm -c dat.myprog.3214
dumppm: event 1 is: CPU_CYCLES
dumppm: event 2 is: LOADS_RETIRED
%

If the `-c' is dropped, the output will consist of 3 columns: time of day,
sampled CPU_CYCLES value, and sampled LOADS_RETIRED value, e.g.:

1042470662.010721 2059042 420371
1042470662.012656 3454206 711769
1042470662.014595 4866156 1006722
...

If the `-d' flag is added, time becomes relative, and differences between
counts are output instead:

0.000000 2059042 420371
0.001935 1395164 291398
0.003874 1411950 294953
...

If a different order of, and possibly a different number of, columns is
needed, the `-l' flag can be used to specify which of the original columns
to output, and in which order.  E.g. to output the change in CPU_CYCLES
followed by relative time, use "-d -l 1,0":

2059042 0.000000
1395164 0.001935
1411950 0.003874
...

Further filtering and plotting of the data is left to the user.

4. histx (HISTogram eXecution)
== ===== ========== =========

histx allows one to discover where time is being spent or where performance
monitor events are occurring in a program.
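Returning briefly to dumppm: the `-d' and `-l' transformations are simple
column arithmetic.  The Python sketch below illustrates the behavior
described above (it is not the tool's implementation).  `-d' makes times
relative to the first sample and replaces each count by its difference from
the previous sample (so the first row shows the raw counts), and `-l'
selects columns in the requested order:

```python
def diff_rows(rows):
    """Emulate dumppm -d: relative times, per-sample count differences."""
    out = []
    t0 = rows[0][0]
    prev = [0] * (len(rows[0]) - 1)     # first output row keeps raw counts
    for t, *counts in rows:
        out.append([t - t0] + [c - p for c, p in zip(counts, prev)])
        prev = counts
    return out

def select_cols(rows, cols):
    """Emulate dumppm -l: keep columns in the given order (column 0 is time)."""
    return [[r[i] for i in cols] for r in rows]

samples = [          # time of day, CPU_CYCLES, LOADS_RETIRED (from the text)
    (1042470662.010721, 2059042, 420371),
    (1042470662.012656, 3454206, 711769),
    (1042470662.014595, 4866156, 1006722),
]
```

Applied to the three sample rows shown earlier, diff_rows reproduces the
`-d' output, and select_cols(diff_rows(samples), [1, 0]) reproduces the
"-d -l 1,0" example.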
It does this by arranging for the thread's ip (instruction pointer) or
callstack to be sampled based on either timer events or overflows of a
single performance counter register.  Like lipfpm, it accounts properly for
applications that fork or exec, and for pthreaded applications.

Unlike some other profiling tools, histx does not require the executable to
be instrumented by a compiler.  In fact, such instrumentation will confuse
histx: the typical result of running histx on an executable compiled with
the common `-p' flag is a crash with no information.  The very high level
optimization known as "function splitting" can also confuse histx.  Thus if
one uses `-prof_use' with the Intel compiler, one must also use `-fnsplit-'
to disable function splitting.

Here is an example of using histx on a program `prog':

% histx -o out prog
% sort -r out.prog.* | head
  402: *Total for thread 30493*
  285: libm.so.6.1:*Total*
  142: libm.so.6.1:cos
  106: a.out:*Total*
   69: libm.so.6.1:acos
   65: libm.so.6.1:asin
   47: a.out:f2 [prog.c:29]
   44: a.out:f1 [prog.c:19]
   15: a.out:_init
   10: ld-linux-ia64.so.2:*Total*
%

Just as with samppm, one must use the `-o file' option to place the output
report(s) into file(s) named `file.<command>.<pid>'.  In the case above,
the report went into a file named "out.prog.31415".

The default type of sampling is ip sampling, and the default event source
is a timer event triggered every time the thread consumes an additional
`tick' of user+system time.  Each line of an ip sampling report has the
form:

<count>: <library>:<function> [<file>:<line>]

<count> is the number of events that occurred within the body of <function>
contained in <library>, which starts at <line> within <file>.  In the case
of the main executable, <library> is shown as "a.out".  If line information
is not available in the executable, the "[<file>:<line>]" part does not
appear.  Totals are also accumulated on a per-library and per-thread basis.

Note that <line> corresponds to the first source line of <function>, and
NOT the line where a sample was taken.  The count attributed to <function>
is the sum of the counts in all bins interior to <function>.
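Given the line format above, a raw ip sampling report is easy to
post-process.  As a sketch (hypothetical Python, not one of the histx
tools), the per-function lines can be parsed and summed per library; on the
sample report above this reproduces the a.out:*Total* line
(47 + 44 + 15 = 106):

```python
import re
from collections import Counter

# Matches "<count>: <library>:<function> [<file>:<line>]"; the bracketed
# part is optional.  Line-level and malformed lines simply fail to match.
LINE = re.compile(r'^\s*(\d+):\s+([^:]+):(\S+)(?:\s+\[(\S+):(\d+)\])?\s*$')

def per_library_totals(report_lines):
    """Sum per-function counts by library, skipping histx's own *Total* lines."""
    totals = Counter()
    for line in report_lines:
        m = LINE.match(line)
        if not m or '*Total' in line:
            continue
        count, library = int(m.group(1)), m.group(2)
        totals[library] += count
    return totals
```

This is only a starting point; iprep (described below) performs this kind
of aggregation, and more, on real reports.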
If it is desired to see counts attributable to individual lines, and if the
executable contains line information (usually produced using the `-g'
compiler flag), one can add the `-l' flag, e.g.:

% histx -l -o out prog
% sort -r out.prog.* | head
  399: *Total for thread 31390*
  278: libm.so.6.1:*Total*
  122: libm.so.6.1:cos
  111: a.out:*Total*
   77: libm.so.6.1:acos
   67: libm.so.6.1:asin
   56: a.out:f1 [prog.c:19]
   56: [prog.c:23] (a.out)
   41: a.out:f2 [prog.c:29]
   41: [prog.c:33] (a.out)

Now additional lines appear in the report, of the form:

<count>: [<file>:<line>] (<library>)

Here, <count> is the number of events that occurred on the instructions
generated by the compiler for <line> of <file> used in building <library>.
From the report above, it is apparent that all 56 events recorded in
function f1() occurred on instructions from line 23 of prog.c.

NOTE: Function level counts are computed by summing over all bins interior
      to a function.  Line level counts are computed by summing over all
      bins intersecting instructions for a particular source line.  The
      two can differ, as instructions from multiple lines can fall into a
      single bin.

NOTE: When performance counter overflows occur, the ip associated with the
      overflow may be off by as many as 48 dynamic instructions.  This
      must be kept in mind when using the reports.

NOTE: Compilers and linkers are known to occasionally generate incorrect
      or invalid line information.  This can confuse histx, or even cause
      a crash when running the application with histx.

histx can do ip sampling at very high rates when using performance monitor
register overflows.  For example, to sample the ip every 10,000 CPU cycles
(i.e. around 100 KHz on a 1 GHz processor system):

% histx -l -o out -e pm:CPU_CYCLES@10000 prog
% sort -r out.prog.* | head
34996: *Total for thread 31410*
28160: libm.so.6.1:*Total*
13533: libm.so.6.1:cos
 8225: libm.so.6.1:asin
 6008: a.out:*Total*
 5706: libm.so.6.1:acos
 5653: a.out:f2 [prog.c:29]
 5653: [prog.c:33] (a.out)
  796: ld-linux-ia64.so.2:*Total*
  696: libm.so.6.1:sin

NOTE: The ip sampling report is deliberately kept as simple as possible.
      It is envisioned that filters (e.g. iprep, see below) will be used
      to transform histx output into something more useful/intuitive.

While ip sampling can tell us which functions and source lines events are
occurring in, it is often important to understand the control flow that led
to a function being called.  This is done by sampling not just the ip when
an event occurs, but the entire callstack.  To activate callstack sampling,
use `-s callstackN', where `N' is the maximum number of frames to record.
(Larger values require more memory.)  E.g.:

% histx -o cs -s callstack10 prog

After the run, cs.prog.18494 contains several callstacks.  The two most
frequent are:

72 5
 libm.so.6.1:acos
 a.out:f2+0x80 [prog.c:33]
 a.out:main+0x2a0 [prog.c:93]
 libc.so.6.1:__libc_start_main+0x230 [../sysdeps/generic/libc-start.c:129]
 a.out:_start+0x80

79 5
 libm.so.6.1:cos
 a.out:f1+0xb0 [prog.c:23]
 a.out:main+0x90 [prog.c:70]
 libc.so.6.1:__libc_start_main+0x230 [../sysdeps/generic/libc-start.c:129]
 a.out:_start+0x80

Each unique callstack is reported in the form:

<count> <height>
 <library>:<function> [<file>:<line>]
 <library>:<function>+<offset> [<file>:<line>]
 ...
 <library>:<function>+<offset> [<file>:<line>]

This indicates that <count> identical callstacks (of `height' <height>)
were sampled.  The first line names the function in which the samples were
taken; each following line names the function, and the call site within it,
from which the function on the line above was called, ending with the
outermost frame.

NOTE: A filter (csrep) is provided to generate a "butterfly" report from
      raw callstack reports.  See below.

To see how to run histx, type "histx":

% histx
usage: histx [-b width] [-f] [-e source] [-h] [-k] [-l] -o file [-s type] command args...
   -b   specify bin bits when using ip sampling: 16, 32 or 64 (default: 16)
   -e   specify event source (default: timer@1)
   -f   follow fork (default: off)
   -h   this message (command not run)
   -k   also count kernel events for pm source (default: off)
   -l   include line level counts in ip sampling report (default: off)
   -o   send output to file.<command>.<pid> (REQUIRED)
   -s   type of sampling (default: ip)
Event sources:
   timer@N        profiling timer events.  A sample is recorded every N
                  ticks.
   pm:<event>@N   performance monitor events.  A sample is recorded
                  whenever the number of occurrences of <event> is N
                  larger than the number of occurrences at the time of the
                  previous sample.
   dlatM@N        A sample is recorded whenever the number of loads whose
                  latency exceeded M cycles is N larger than the number at
                  the time of the previous sample.  M must be a power of 2
                  between 4 and 4096.
Types of sampling:
   ip             Sample instruction pointer.
   callstack[N]   Sample callstack.  N, if given, specifies the maximum
                  callstack depth (default: 8).
Notes:
   A list of valid performance monitor <event> names can be found in Intel
   manuals.
   `command' must not be compiled using the `-p' compiler flag.
   One tick is about 0.977 milliseconds.

For long running programs, where any particular instruction bundle could be
sampled more than 65535 times, the counts attributed to a function or line
may not be accurate.  The `-b' flag must be used in such cases to force the
use of "bins" that can handle counts of up to 2^32-1 or even 2^64-1 before
overflowing.  However, these require more memory per thread.

histx produces a separate output file for each thread in a program.  It
also handles programs that exec and, like lipfpm and samppm, in cases where
the exec'd program name is the same as the original, will produce multiple
reports in files named file.<command>.<pid>-<n>.  histx will also produce
reports for forked child processes if the `-f' flag is given on the command
line.
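The relationship between an event source's parameters and the resulting
sampling rate is simple arithmetic.  A small Python sketch (illustrative
only; the tick length is the value stated above, and the event rate for pm
sources must be estimated by the user, e.g. CPU_CYCLES occurs at the clock
frequency):

```python
TICK_SECONDS = 0.000977   # one tick is about 0.977 milliseconds

def timer_rate_hz(n_ticks):
    """Approximate samples/second for `-e timer@N'."""
    return 1.0 / (n_ticks * TICK_SECONDS)

def pm_rate_hz(event_rate_hz, n):
    """Approximate samples/second for `-e pm:<event>@N', given an estimate
    of how often the event itself occurs."""
    return event_rate_hz / n
```

For example, pm_rate_hz(1e9, 10000) gives the roughly 100 KHz quoted
earlier for pm:CPU_CYCLES@10000 on a 1 GHz processor, while timer@1 samples
at only about 1 KHz.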
NOTE: histx does not work on statically linked executables.

NOTE: To use histx on an MPI program which uses SGI MPT:

      % mpirun -np N \
               histx \
               a.out

NOTE: To use histx on an OpenMP code that has been placed using dplace:

      % dplace \
               histx \
               a.out

NOTE: histx runs "inside" the executable being sampled and makes use of
      every thread's stack.  Threaded programs whose threads almost
      completely consume their stacks may fail when run with histx.  histx
      also uses virtual memory, so programs that consume large amounts of
      memory may crash or fail to allocate needed memory when run with
      histx.

NOTE: A particular version of histx is closely bound to a particular
      version of glibc.  A program may crash when run with histx on a
      system where glibc is incompatible with histx.

NOTE: The ability of histx to capture callstacks is only as good as the
      unwind information in the executable and libraries.  When run under
      histx, a program may crash if bad unwind data is encountered.  This
      is largely a function of the compiler and linker used to build the
      libraries and application.  histx does take steps to avoid crashing
      in this situation, but there are always cases not seen previously.

NOTE: histx can handle programs that use dlopen() and dlclose().  However,
      in callstack sampling mode, histx will not (currently) be able to
      assign a useful name to functions in libraries that have been
      dlclose()d.

5. iprep (IP sampling REPort)
== ===== === ======== =======

iprep is used to create a useful report from one or more raw ip sampling
reports produced by histx.  E.g.

% iprep < ip.prog.12345 > report.12345

or

% iprep -c 5 ip.prog.* > report.all

Output from iprep consists of one or two sections.  First comes a listing
of summed function counts, sorted from largest to smallest, along with
their self and cumulative percentages.  For example:

               Count  Self %  Cum. %  Name
-------------------- ------- ------- --------------------
               12362  29.730  29.730  libm.so.6.1:cos
                7716  18.557  48.286  libm.so.6.1:acos
                6533  15.712  63.998  libm.so.6.1:asin
                6338  15.243  79.241  a.out:f1 [prog.c:19]
                5655  13.600  92.840  a.out:f2 [prog.c:29]
                1401   3.369  96.210  a.out:_init
                 625   1.503  97.713  libm.so.6.1:sin
                 ...

The second section presents the line level data (if available) in a similar
format.  E.g.

               Count  Self %  Cum. %  Name
-------------------- ------- ------- --------------------
                6345  48.989  48.989  [prog.c:23] (a.out)
                5401  41.700  90.689  [prog.c:33] (a.out)
                  96   0.741  91.430  [elf/elf.h:80] (ld-linux-ia64.so.2)
                 ...

iprep will not report counts for functions or lines whose self percentage
is less than a cutoff.  The default cutoff is 0.5%.  A different cutoff
percentage may be given by placing "-c <percentage>" first on the command
line, where 0 <= <percentage> < 100.

6. csrep (CallStack REPort)
== ===== ========== =======

csrep is used to create a "butterfly" report from one or more raw callstack
reports produced by histx.  E.g.

% csrep -c 3 < cs.prog.12345 > butterfly.12345

or

% csrep cs.prog.* > butterfly.all

Output from csrep consists of two sections.  First comes a listing of the
frequency of appearance of each function found in a callstack, e.g.:

Appearance Frequency
========== =========
    97.06% libc.so.6.1:__libc_start_main+0x230 [../sysdeps/generic/libc-start.c:129]
    96.32% a.out:_start+0x80
    48.28% a.out:main+0x2a0 [prog.c:93]
    48.04% a.out:main+0x90 [prog.c:70]
    35.05% libm.so.6.1:cos
    20.10% a.out:f1+0xb0 [prog.c:23]
    19.36% a.out:f2+0xb0 [prog.c:33]
    18.38% a.out:f2+0x80 [prog.c:33]
    17.65% libm.so.6.1:acos
    15.69% a.out:f1+0x80 [prog.c:23]
    13.73% libm.so.6.1:asin
    12.25% a.out:f1 [prog.c:19]
    10.54% a.out:f2 [prog.c:29]

Each percentage is followed by <library>:<function>+<offset>
[<file>:<line>] information.  The second section consists of a "butterfly"
report for each function.
Each has the form:

--------------------------------------------------------------------------------
 xxx.xx% (yyy.yy%)   <caller_1>
 xxx.xx% (yyy.yy%)   <caller_2>
     ...
 xxx.xx% (yyy.yy%)   <caller_n>
 ....... (yyy.yy%)   <function>
 xxx.xx% (yyy.yy%)   <callee_1>
 xxx.xx% (yyy.yy%)   <callee_2>
     ...
 xxx.xx% (yyy.yy%)   <callee_m>

e.g.

--------------------------------------------------------------------------------
 100.00% ( 20.10%)   a.out:main+0x90 [prog.c:70]
 ....... ( 20.10%)   a.out:f1+0xb0 [prog.c:23]
  96.34% ( 19.36%)   libm.so.6.1:cos
   3.66% (  0.74%)   a.out:_init

Each report conveys a lot of information.  Above the function of interest,
there is a line for each caller of the function.  The xxx.xx percentage in
those lines is computed from the formula:

  <# callstacks containing caller_j followed by function>
      / <# callstacks containing function> * 100

so the sum over all callers is 100%.  Each (yyy.yy%) percentage is computed
from:

  <# callstacks containing caller_j followed by function>
      / <total # of callstacks> * 100

Below the function of interest, there is a line for each callee of the
function.  The xxx.xx percentages in those lines are now given by the
formula:

  <# callstacks containing function followed by callee_j>
      / <# callstacks containing function> * 100

and again sum to 100%, while the (yyy.yy%) percentages are computed from:

  <# callstacks containing function followed by callee_j>
      / <total # of callstacks> * 100

This type of report allows one not only to quickly identify the functions
of interest in an application, but also to identify the most likely paths
through the program by which those functions were called.

csrep will not mention functions whose percentage in all recorded
callstacks is less than a cutoff.  The default cutoff is 0.5%.  A different
cutoff percentage may be given by placing "-c <percentage>" first on the
command line, where 0 <= <percentage> < 100.
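The caller-side formulas above can be reproduced from raw callstacks in a
few lines of Python (an illustrative sketch of the arithmetic, not csrep
itself; frames are simplified to bare function names, listed innermost
first as in the raw reports):

```python
from collections import Counter

def butterfly(stacks, function):
    """Compute the caller side of a butterfly report for `function'.

    `stacks' is a list of (count, frames) pairs, frames ordered from the
    sampled (innermost) function out to the outermost caller.  Returns
    {caller: (xxx, yyy)} percentage pairs.  Recursion is ignored: only the
    innermost occurrence of `function' in each stack is considered.
    """
    total = sum(count for count, _ in stacks)
    containing = 0          # callstacks containing `function'
    by_caller = Counter()   # callstacks where caller_j directly calls it
    for count, frames in stacks:
        if function in frames:
            containing += count
            i = frames.index(function)
            if i + 1 < len(frames):     # the next frame out is the caller
                by_caller[frames[i + 1]] += count
    return {c: (100.0 * n / containing, 100.0 * n / total)
            for c, n in by_caller.items()}
```

With hypothetical data such as [(72, ['acos', 'f2', 'main', '_start']),
(79, ['cos', 'f1', 'main', '_start']), (49, ['f1', 'main', '_start'])],
butterfly(stacks, 'f1') reports main as the sole caller: 100.00% of the
callstacks containing f1, and 64.00% of all 200 recorded callstacks.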