------------------------------------------------------
                      pfmon-2.0
 A tool to collect monitoring information for Linux/ia64
------------------------------------------------------
Copyright (c) 2001-2002 Hewlett-Packard Company
Stephane Eranian

Pfmon is a performance monitoring tool designed exclusively for Linux/ia64.
It does not work with Linux/ia32. It is meant as a sample tool to
demonstrate how to use the perfmon subsystem provided by the Linux/ia64
kernel as of version 2.4.0. This tool uses the powerful IA-64 Performance
Monitoring Unit (PMU) to do counting and sampling on unmodified binaries or
for the entire system.

This document is an attempt at providing some documentation on how to use
pfmon. It covers pfmon-2.0 only.

        -----> YOU MUST HAVE at least kernel v2.4.18 <-----

1/ Introduction

Pfmon can be used to monitor unmodified binaries in its per-process mode,
and it can also be used to run system wide monitoring sessions. Such
sessions are active across all processes executing on a given CPU. Pfmon can
launch a system wide session on a dedicated CPU or on a set of CPUs in
parallel. Pfmon can monitor activities happening at the user and/or kernel
level for both types of sessions.

Pfmon can be used to collect basic event counts. It can also be used to
sample program or system execution.

In per-process mode, pfmon can only monitor the first process (task).
Subsequent processes or threads created by that initial process will not be
monitored.

Pfmon can run on any IA-64 CPU model and provides the minimal features
mandated by the architecture, but it also provides model specific
extensions. For instance, on Itanium pfmon has support for the EAR and BTB
features.

Pfmon is based on a generic helper library called libpfm which is included
in this package. The library is not specific to pfmon and can be used
directly by other programs, as is demonstrated in the set of examples also
included in this package.
Both the library and pfmon have a modular architecture which makes it easier
to support new PMU models as they become available.

In the remainder of this document, we describe the key options and features
of pfmon which are available on all CPU (PMU) models. Please refer to the
model specific documentation for advanced features.

2/ pfmon options

The set of command line options provided by pfmon depends on the host PMU.
It is possible to compile pfmon for more than one PMU model, in which case
it will auto-detect the host PMU and provide the corresponding set of
options.

The options common to all PMU models are as follows:

  -h, --help
        display the list of supported options.
  -V, --version
        display pfmon version information.
  -l[regex], --show-event-list[=regex]
        list the events supported by the host PMU.
  -i event, --event-info=event
        get information about a particular event.
  -u, -3, --user-level
        monitor at the user level for all events. Per-event setting is
        possible with --priv-levels.
  -k, -0, --kernel-level
        monitor at the kernel level for all events. Per-event setting is
        possible with --priv-levels.
  -2
        monitor at privilege level 2.
  -1
        monitor at privilege level 1.
  -e, --events=ev1,ev2,...
        select events to monitor. There should be no space between the
        events. The number of events that you can specify depends on the
        underlying PMU model. Four events is typical.
  -I, --info
        list the compiled-in PMU models supported by pfmon and the detected
        host PMU, as well as the sampling output formats.
  --debug
        enable debug prints.
  --verbose
        print more information during execution.
  --outfile=filename
        print counts in a file.
  --append
        when used with --outfile, open the file in append mode.
  --overflow-block
        block the monitored program on overflow notifications (per process
        mode only).
  --system-wide
        create a system wide monitoring session. The default session type
        is per process.
  --cpu-mask=0xn
        bitmask indicating on which CPU to start system wide monitoring.
        When this option is not specified, pfmon will monitor on all CPUs.
  -S format, --smpl-output-info=format
        display information about a sampling output format.
  -t secs, --session-timeout=secs
        duration of the session in seconds. In per process mode, the
        process will get killed if the timeout expires.
  --smpl-outfile=filename
        save sampling results in a file.
  --smpl-entries=val
        size of the sampling buffer in number of entries (default=2048).
  --long-smpl-periods=val1,val2,...
        set sampling periods for each event after user level notification.
  --short-smpl-periods=val1,val2,...
        set sampling period for each event.
  --with-header
        generate a machine description header with results.
  --aggregate-results
        aggregate counts and sampling buffer outputs when running system
        wide monitoring on multiple CPUs.
  --trigger-start-address=addr
        start monitoring only when execution reaches addr (code) for the
        first time. A trigger stop address is not currently supported.
  --priv-levels=lvl1,lvl2,...
        set privilege level per event. lvl can be any combination of
        u or 3 (user), k or 0 (kernel), 1 (priv level 1), 2 (priv level 2).
        Unspecified events will get the global setting, which is user only
        by default.
  --show-time
        show real, user, and system time for the executed command.
  --us-counter-format
        print counters using commas (1,024).
  --eu-counter-format
        print counters using points (1.024).
  --hex-counter-format
        print counters in hexadecimal (0x400).
  --smpl-output-format=fmt
        select fmt as sampling output format, use -I to list formats.
  --long-show-events[=regex]
        display detailed information about matching events in a single
        line (easy to grep).
  --symbol-file=filename
        use the ELF archive filename to look for symbols.
  --sysmap-file=filename
        use the System.map filename to look for symbols.
  --check-events-only
        check that the event combination is valid and exit (no
        measurement).
  --smpl-periods-random=mask1:seed1,...
        randomize both the short and long sampling period.
        The mask indicates the significant bits to keep in the randomly
        generated value. The seed is used to initialize the pseudo-random
        number generator. You can use a different mask and seed per event.
  --trigger-start-delay=secs
        number of seconds before activating monitoring.
  --smpl-print-counts
        print counter results when sampling (off by default).
  --exclude-idle
        exclude idle tasks from system wide monitoring.

3/ Getting event information with pfmon

The list of events supported by pfmon depends on the host PMU. You can get
the list of supported events using the following pfmon option:

        % pfmon -l
        CPU_CYCLES
        IA64_INST_RETIRED
        IA64_TAGGED_INST_RETIRED_PMC8
        IA64_TAGGED_INST_RETIRED_PMC9
        INST_DISPERSED
        EXPL_STOPBITS
        ALL_STOPS_DISPERSED
        IA32_INST_RETIRED
        ISA_TRANSITIONS
        NOPS_RETIRED
        ....

If you specify an argument to the -l option (no space between l and the
argument), it is interpreted as a regular expression and all matching
events will be listed:

        % pfmon -ll1d
        L1D_READ_FORCED_MISSES_RETIRED
        L1D_READ_MISSES_RETIRED
        L1D_READS_RETIRED
        PIPELINE_FLUSH_L1D_WAYMP_FLUSH

You can get more detailed information about each event using the following
option:

        % pfmon -i nops_retired
        Name   : NOPS_RETIRED
        VCode  : 0x30
        Code   : 0x30
        PMD/PMC: [ 4 5 ]
        EAR    : No (N/A)
        Umask  : None
        BTB    : No
        Thres  : 6
        Qual   : [Instruction Address Range] [OpCode match]

Pfmon is case insensitive for event names. Here you see some details about
the event. The first 4 lines are generic and provided on all PMU models
even though the codes may vary:

        - Code is the event code used by the PMU.

        - VCode is a libpfm internal event code which encapsulates the
          event code and other information describing the type of the
          event. For simple events, the two codes are usually identical.

        - PMD/PMC lists the counting monitors on which this event can be
          programmed. Not all events can necessarily be programmed on all
          available counting monitors. This constraint is taken care of by
          the libpfm library.

Here the remaining information is specific to the Itanium 2 PMU.

Even with the -i option, you can use a regular expression for the event:

        % pfmon -i'writes$'
        Name   : L2_DATA_REFERENCES_WRITES
        VCode  : 0x20069
        Code   : 0x69
        PMD/PMC: [ 4 5 6 7 ]
        Umask  : 0010
        EAR    : No (N/A)
        BTB    : No
        MaxIncr: 2 (Threshold [0-1])
        Qual   : [Instruction Address Range] [OpCode Match] [Data Address Range]

On some PMU models (currently Itanium2), the event information contains a
text description of the event.

Events can be specified using their code:

        % pfmon -i 0x45
        Name   : L2_INST_PREFETCHES
        VCode  : 0x45
        Code   : 0x45
        PMD/PMC: [ 4 5 6 7 ]
        Umask  : 0000
        EAR    : No (N/A)
        BTB    : No
        MaxIncr: 1 (Threshold 0)
        Qual   : [Instruction Address Range]
        Group  : None
        Set    : None
        Desc   : L2 Instruction Prefetch Requests

Information about what each event measures can be found in the relevant CPU
model specific micro-architecture documentation.

The architecture requires only two events to be defined by all PMUs:

        - CPU_CYCLES        : the number of elapsed CPU cycles.
        - IA64_INST_RETIRED : the number of instructions retired.

Those two events are guaranteed to exist on all PMUs but their codes may
vary. The PMU specific event names may not be exactly the same; however,
pfmon, and especially the library it uses (libpfm), will always ensure that
those two events can be called by the two names listed above.

As alluded to earlier, pfmon can support more than one PMU in a single
binary. Pfmon also incorporates a generic PMU model which provides only the
features defined by the architecture; this includes the two events. If
pfmon does not have specific support for the host PMU, it will default to
the so called 'Generic' PMU support, if compiled in.
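
The -i output shown in this section is a series of "Key : value" lines,
which makes it easy to post-process. The Python sketch below is not part of
pfmon. It assumes one field per line, as in the listings above, and the
helper names parse_event_info and counting_monitors are made up for
illustration only.

```python
# Sample text copied from the "pfmon -i 0x45" listing in this document.
SAMPLE = """\
Name   : L2_INST_PREFETCHES
VCode  : 0x45
Code   : 0x45
PMD/PMC: [ 4 5 6 7 ]
Umask  : 0000
EAR    : No (N/A)
BTB    : No
MaxIncr: 1 (Threshold 0)
Qual   : [Instruction Address Range]
Group  : None
Set    : None
Desc   : L2 Instruction Prefetch Requests
"""


def parse_event_info(text):
    """Turn 'Key : value' lines into a dict (hypothetical helper)."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only; lines without a colon are skipped.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def counting_monitors(info):
    """Return the PMD/PMC numbers on which the event can be programmed."""
    return [int(tok) for tok in info["PMD/PMC"].strip("[]").split()]


if __name__ == "__main__":
    info = parse_event_info(SAMPLE)
    print(info["Name"], info["Code"], counting_monitors(info))
```

The field set differs between PMU models, so a script built on this should
only rely on the four generic fields (Name, VCode, Code, PMD/PMC).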

You can find out what PMU support is compiled into pfmon as follows:

        % pfmon -I
        detected host CPUs: 4-way 800MHz Itanium (Merced, C0)
        supported PMU models: [itanium2] [itanium] [generic]
        detected host PMU: itanium
        supported sampling outputs: [detailed-itanium] [raw] [compact] [btb] [example]
        pfmlib version: 2.0
        kernel perfmon version: 1.0

It is possible to force pfmon to operate in generic mode, even though it
has support for the host CPU, using the pfmon_gen command:

        % pfmon_gen -I
        forced libpfm to generic support
        detected host CPUs: 4-way 800MHz Itanium (Merced, C0)
        supported PMU models: [itanium2] [itanium] [generic]
        detected host PMU: generic
        supported sampling outputs: [raw] [compact] [example]
        pfmlib version: 2.0
        kernel perfmon version: 1.0

        % pfmon_gen -i CPU_CYCLES
        forced libpfm to generic support
        Name   : CPU_CYCLES
        VCode  : 0x12
        Code   : 0x12
        PMD/PMC: [ 4 5 6 7 ]

pfmon_gen is not a separate command but just a symlink to pfmon. In fact,
pfmon always checks the name it was invoked with. If this name is equal to
'pfmon_gen' and the generic support is compiled in, then pfmon will operate
in generic mode. This feature is useful when moving pfmon to a PMU for
which neither pfmon itself nor libpfm has support yet.

3/ Basic counting with pfmon

In generic mode, pfmon only supports the two architected events listed
above. For comparison, the Itanium PMU supports about 230 events and the
Itanium2 PMU about 470.

No instrumentation of the program is required to monitor the system or a
single process.

a/ simple examples

To collect counts on a specific command, you just need to launch it via
pfmon, just like you would do with the time or strace command:

        % pfmon ls /var/spool
        anacron at cron fax lpd mail mqueue news rwho samba
        slrnpull squid up2date uucp uucppublic vbox voice
        2910724 CPU_CYCLES

When invoked with no particular event, pfmon defaults to CPU_CYCLES.

To monitor specific events, you can type:

        % pfmon -e cpu_cycles,IA64_inst_Retired ls /var/spool
        anacron at cron fax lpd mail mqueue news rwho samba
        slrnpull squid up2date uucp uucppublic vbox voice
        2984546 CPU_CYCLES
        2666884 IA64_INST_RETIRED

As you can see, pfmon is not case sensitive with regard to event names.
More than one event can be measured at a time using a comma separated list
of events. There MUST NOT be a space after the comma.

If the command you want to run takes options, you can clearly distinguish
the options of pfmon from the options of your command using the '--'
symbol:

        % pfmon -e ia64_inst_retired -- ls -ial /dev/null
        210135 crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
        2709704 IA64_INST_RETIRED

Otherwise, pfmon stops parsing arguments as options at the first argument
which does not start with - or --.

b/ privilege levels

By default, pfmon monitors only what is going on at the user level
(application level). This is true for both per-process and system wide
mode. It is possible to monitor at any of the 4 privilege levels provided
by IA-64. It is also possible to monitor at several levels at the same time
by specifying more than one level. The levels can be specified for all
events or on a per-event basis.

To affect all events, you can use any combination of -k (-0), -1, -2,
-u (or -3). To set the level for each event, the --priv-levels option must
be used.

By default, pfmon only measures at the user level:

        % pfmon -e nops_retired ls

counts the number of NOPS_RETIRED when ls is running at the user level only
(equivalent to specifying -u or -3).

        % pfmon -k -e nops_retired ls

counts the number of NOPS_RETIRED when ls is running at the kernel level
only.

        % pfmon -k -u -e nops_retired ls

counts the number of NOPS_RETIRED when ls is running at the kernel level or
user level, i.e. all the time.

It is possible to refine the settings on a per event basis using the
--priv-levels option.
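
The way per-event levels combine with the global setting can be summarized
in a few lines of Python. This is only a sketch of the rule stated in the
option list (an event without its own setting gets the global one, which is
user only unless -k, -u, -1 or -2 changed it). It is not pfmon's code, and
the function name resolve_priv_levels is invented for this illustration.

```python
# Map every accepted spelling to one canonical name: u/3 = user, k/0 = kernel.
ALIASES = {"u": "u", "3": "u", "k": "k", "0": "k", "1": "1", "2": "2"}


def resolve_priv_levels(events, per_event="", global_levels="u"):
    """Return {event: set of levels} for a --priv-levels style argument.

    events        : event names in the order given to -e
    per_event     : the comma separated --priv-levels value, e.g. ",uk"
    global_levels : levels selected by -k/-u/-1/-2 ("u" when none is given)
    """
    fields = per_event.split(",") if per_event else []
    resolved = {}
    for index, event in enumerate(events):
        # An absent or empty field means "use the global setting".
        spec = fields[index] if index < len(fields) and fields[index] else global_levels
        try:
            resolved[event] = {ALIASES[ch] for ch in spec}
        except KeyError as err:
            raise ValueError(f"unknown privilege level {err.args[0]!r}") from None
    return resolved


if __name__ == "__main__":
    events = ["loads_retired", "nops_retired"]
    print(resolve_priv_levels(events, ",uk"))
    print(resolve_priv_levels(events, "uk", global_levels="k"))
```

With the argument ",uk" the first event falls back to the user level and
the second one is measured at both user and kernel levels.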

        % pfmon -e loads_retired,nops_retired ls

Both events are measured at the user level only.

        % pfmon --priv-level=u,k -e loads_retired,nops_retired ls

LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the
kernel level only.

        % pfmon --priv-level=,uk -e loads_retired,nops_retired ls

LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the user
and kernel levels.

        % pfmon -k --priv-level=uk -e loads_retired,nops_retired ls

LOADS_RETIRED is measured at the user and kernel levels, NOPS_RETIRED at
the kernel level only.

c/ counter formats

Pfmon can display the final counts in various formats. There are 4 formats
defined. The default one is shown in the examples above. To make it easier
to read large numbers or to feed the numbers to other programs, pfmon
supports:

--us-counter-format where the thousands, millions, billions are separated
with commas (US and UK style):

        % pfmon --us-counter-format ls -l /dev/null
        crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
        2,292,689 CPU_CYCLES

--eu-counter-format where the thousands, millions, billions are separated
with points (European style):

        % pfmon --eu-counter-format ls -l /dev/null
        crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
        1.703.898 CPU_CYCLES

--hex-counter-format where the counts are shown in hexadecimal format:

        % pfmon --hex-counter-format ls -l /dev/null
        crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
        0x000000000019c164 CPU_CYCLES

d/ saving counts

By default, the counts are printed on the controlling tty. However, it is
possible to save them in a file using the --outfile option:

        % pfmon --outfile=b --hex-counter-format ls -l /dev/null
        crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
        % cat b
        0x000000000016a8b1 CPU_CYCLES

It is possible to include a header with the results using the --with-header
option. It will be printed on the controlling tty or saved in the output
file.
The header contains detailed information about the configuration of the
host machine and on the monitoring session:

        % pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
        crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
        % cat b
        #
        # date: Wed Nov 20 16:03:13 2002
        #
        # hostname: hpljumbo.hpl.hp.com
        #
        # kernel version: Linux 2.4.18 #2 SMP Tue Aug 6 11:54:56 PDT 2002
        #
        # pfmon version: 2.0
        # kernel perfmon version: 1.0
        #
        #
        #
        # page size: 16384 bytes
        # CLK_TCK: 1024 ticks/second
        # CPU configured: 4
        # CPU online: 4
        # physical memory: 6827933696
        # physical memory available: 5606391808
        #
        # host CPUs: 4-way 800MHz Itanium (Merced, C0)
        # PAL_A: 6.6.23
        # PAL_B: 7.7.28
        # Cache levels: 3 Unique caches: 4
        # L1D: 16384 bytes, line 32 bytes, load_lat 2, store_lat 0
        # L1I: 16384 bytes, line 32 bytes, load_lat 2, store_lat 0
        # L2 : 98304 bytes, line 64 bytes, load_lat 6, store_lat 6
        # L3 : 4194304 bytes, line 64 bytes, load_lat 21, store_lat 21
        #
        #
        # captured events:
        # PMD4: CPU_CYCLES, user level(s)
        #
        # monitoring mode: per-process
        #
        #
        # instruction sets:
        # PMD4: CPU_CYCLES, ia32/ia64
        #
        #
        # command: pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
        #
        #
        #
        0x00000000001a8956 CPU_CYCLES

e/ delayed start

By default, pfmon will start monitoring at the first instruction of the
program, i.e., the entry point when the privilege level is limited to user
level. Even when kernel level monitoring is enabled, nothing will be
measured until the process leaves the kernel for the first time, after
fork.

Sometimes, it may be useful to delay the activation of monitoring until a
certain point in the execution is reached. This is the case when the
initialization must not be included in the counts.

Pfmon provides two different ways to delay the point at which monitoring is
turned on with the --trigger-start-address and --trigger-start-delay
options.

The --trigger-start-address option only applies to per-process sessions and
is ignored for system-wide.
It uses a code address to trigger monitoring. Once execution reaches the
bundle address specified with the option, the monitoring will be turned on
and will remain on until the program terminates. The address can be
specified in hexadecimal or a code symbol name can be provided. It is not
possible to specify a kernel address; pfmon will reject any such address.
When an address is explicitly used, pfmon will not try to validate it
except by checking that it is not in the kernel. The delayed start
mechanism will be used only the first time the address is reached.

If main() is at address 0x40000000000004a0, then we can delay monitoring
until main() is reached using:

        % pfmon --trigger-start-address=0x40000000000004a0 -e loads_retired foo
        74 LOADS_RETIRED

or using the symbol table:

        % pfmon --trigger-start-address=main -e loads_retired foo
        74 LOADS_RETIRED

IMPORTANT: Note that pfmon can ONLY look up symbols in the "main" program
and NOT in any dynamically linked libraries. To allow complete coverage,
the program MUST be linked statically.

By contrast, the same program executed without the trigger address gives:

        % pfmon -e loads_retired foo
        1598 LOADS_RETIRED

This example shows that the libc initialization used 1598-74=1524 loads all
by itself.

The --trigger-start-delay option uses time to delay monitoring. You simply
specify a delay in seconds. When the delay expires, monitoring will be
turned on. This option works for both per-process and system-wide
monitoring.

If the monitored process terminates before the delay expires, then nothing
gets measured. This applies to both per-process sessions and system wide
sessions that use a process to delimit the session.

Note that the session effectively starts when monitoring is turned on.
Hence, the --session-timeout is only armed when monitoring is turned on.

The following example will start monitoring 5 seconds into the execution of
foo:

        % pfmon --trigger-start-delay=5 -e loads_retired foo

The following example will start monitoring 5 seconds into the execution of
foo and monitor for 10 seconds after that point:

        % pfmon --trigger-start-delay=5 --session-timeout=10 -e loads_retired foo

f/ getting timing information

It is possible to get a time breakdown of the execution of the monitored
command for both per-process and system-wide mode using the --show-time
option. The output is similar to the time(1) command. For instance:

        % pfmon --show-time -e nops_retired ls /dev/null
        /dev/null

        real 0h00m00.098s user 0h00m00.000s sys 0h00m00.095s

        247913 NOPS_RETIRED

g/ Testing event combinations

Sometimes it is handy to check if some events can be measured
simultaneously without actually starting the monitoring session. The
--check-events-only option of pfmon allows this mode of operation. It will
check that the combination is valid and then exit. If the combination is
invalid, it will print out the reason and return with an exit value of 1,
otherwise the exit value is 0.

On Itanium2, for instance, you can try:

        % pfmon --check-events-only -e loads_retired,stores_retired
        event LOADS_RETIRED and STORES_RETIRED cannot be measured at the same time
        % echo $?
        1

Note that in this mode, you do not need to specify a command to execute.

4/ System wide sessions

When the --system-wide option is used, pfmon operates in system wide mode.
This means that it does not monitor a specific program anymore but instead
all the processes that execute on a specific set of CPUs. In this mode, you
do not need to specify a command.

You do not need to be root to create a system wide session.

A system wide session cannot co-exist with any per-process sessions. But a
system wide session can run concurrently with other system wide sessions as
long as they do not monitor the same set of CPUs. Of course multiple
per-process sessions are possible.
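
System wide sessions select their CPUs with a bitmask (the --cpu-mask
option, where bit n stands for CPUn). The short Python sketch below shows
how such a mask can be built from a list of CPU numbers and how to check
that two masks name disjoint sets of CPUs, which is the condition for two
system wide sessions to run concurrently. The function names are
illustrative and do not come from pfmon or libpfm.

```python
def cpu_mask(cpus):
    """Build a --cpu-mask value from CPU numbers (CPU0 is bit 0)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers start at 0")
        mask |= 1 << cpu
    return mask


def cpus_in_mask(mask):
    """List the CPU numbers selected by a mask, lowest first."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def sessions_can_coexist(mask_a, mask_b):
    """Two system wide sessions must not share any CPU."""
    return mask_a & mask_b == 0


if __name__ == "__main__":
    print(hex(cpu_mask([1])))                # mask selecting CPU1 only
    print(cpus_in_mask(0x3))                 # CPUs selected by 0x3
    print(sessions_can_coexist(0x1, 0x2))    # CPU0 versus CPU1
```

The printed mask can be pasted after --cpu-mask= on the pfmon command line.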

a/ selecting CPUs to monitor

The --cpu-mask option can be used to restrict monitoring to a specific set
of CPUs. When this option is not present, pfmon will automatically launch a
system wide session on all available CPUs as reported by /proc/cpuinfo. So
if the system has 2 available CPUs:

        % pfmon --system-wide -u -e cpu_cycles,ia64_inst_retired
        CPU0 248793 CPU_CYCLES
        CPU0 60710 IA64_INST_RETIRED
        CPU1 26690 CPU_CYCLES
        CPU1 7706 IA64_INST_RETIRED

A system wide session can monitor at any privilege level (kernel, user, or
both).

If you want to restrict monitoring to a specific CPU, you can use the
--cpu-mask option:

        % pfmon --system-wide --cpu-mask=0x2 -u -e cpu_cycles,ia64_inst_retired
        CPU1 17841 CPU_CYCLES
        CPU1 7577 IA64_INST_RETIRED

The CPU mask is a bitmask where each bit represents a CPU. CPUs are
numbered starting at 0, so bit 0 represents CPU0, bit 1 represents CPU1,
and so on. Therefore the above command will only monitor events happening
on CPU1.

More than one bit can be set in the mask. For instance, with
--cpu-mask=0x3, pfmon will monitor on CPU0 and CPU1 at the same time.

b/ delimiting a system wide session

There are three ways to delimit a system wide session. By default, the
session will terminate when the user presses a key. It is also possible to
use a timeout expressed in seconds. Finally, the session can also be
delimited by the execution of a command. It will start when the command
starts and stop when it terminates.

Here are some examples:

Monitor cpu_cycles and instructions retired on the first two CPUs at both
user and kernel levels and wait for a keypress to stop:

        % pfmon --cpu-mask=0x3 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
        CPU0 821818169 CPU_CYCLES
        CPU0 1338893885 IA64_INST_RETIRED
        CPU1 821813442 CPU_CYCLES
        CPU1 1341176908 IA64_INST_RETIRED

Monitor cpu_cycles and instructions retired on the first two CPUs at both
user and kernel levels for 10 seconds:

        % pfmon --session-timeout=10 --cpu-mask=0x3 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
        CPU0 8003156088 CPU_CYCLES
        CPU0 12800683300 IA64_INST_RETIRED
        CPU1 8003106584 CPU_CYCLES
        CPU1 12899764561 IA64_INST_RETIRED

Monitor cpu_cycles and instructions retired on the first two CPUs at the
user level only during the execution of the ls command (here obviously run
on CPU0):

        % pfmon --cpu-mask=0x3 --system-wide -u -e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
        crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
        CPU0 46560 CPU_CYCLES
        CPU0 26839 IA64_INST_RETIRED
        CPU1 7514 CPU_CYCLES
        CPU1 1184 IA64_INST_RETIRED

c/ results aggregation

It is possible to aggregate counts when monitoring more than one CPU:

        % pfmon --aggregate-results --system-wide -k -e cpu_cycles,ia64_inst_retired
        852331455 CPU_CYCLES
        1387206797 IA64_INST_RETIRED

In this case, the per CPU results are summed.

Pfmon does not allow different events to be monitored on different CPUs.
For this you can run separate instances of pfmon with different CPU masks,
using command lines similar to:

        % pfmon --session-timeout=10 --cpu-mask=0x1 --system-wide -k -e cpu_cycles &
        % pfmon --session-timeout=10 --cpu-mask=0x2 --system-wide -k -e ia64_inst_retired &

5/ Dealing with symbols

Whenever an option takes an address (code or data) as argument, it is
possible to directly use a symbol name rather than its address. For
instance, this is true for the --trigger-start-address option.

The user has two ways to indicate where to find the symbol table.
Pfmon can extract the symbol table using an ELF image directly. This is,
for instance, what is done implicitly in per-process mode. Pfmon also
understands the System.map format which is typically used to save the
symbol table of the kernel.

There are a couple of restrictions concerning the symbols. Pfmon cannot
extract symbol information that is coming from dynamically linked libraries
or modules. To avoid this problem, the program must be statically linked
and should not explicitly use dlopen(). If the symbol table has been
stripped, pfmon will not find any symbol.

In case the option requires a code address, pfmon will only look for
matching code symbols. Conversely, if the option requires a data address,
pfmon will only look for matching data symbols.

By default, the symbols are automatically extracted from the command being
run. This is true in per process mode but also in system wide mode when a
command is specified.

In case symbols must be extracted from an alternative ELF archive, the user
must use the --symbol-file option. The filename specified there must be an
ELF/ia64 binary. Note that the Linux/ia64 kernel is also an ELF/ia64
archive; however, for most distributions the kernel image found in
/boot/efi is often compressed. The compression scheme used for Linux/ia64
is different from the one used on Linux/ia32. The compressed image is
simply the ELF/ia64 image compressed with gzip, so it is possible to
decompress it to get the original ELF archive. The main caveat is that most
of the time the compressed image is stripped. Therefore the user must rely
on the corresponding System.map file, usually placed in /boot/efi. In this
case, the user must explicitly specify the location of the System.map file
via the --sysmap-file option.
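
To make the code/data distinction concrete, here is a small Python sketch
of a lookup in a System.map style table. It assumes the usual three-column
layout of such files (hexadecimal address, one-letter symbol type as
printed by nm, symbol name) and treats types T and t as code and everything
else as data. This is a simplification for illustration and not how pfmon
is implemented. The addresses in the sample are made up.

```python
# Hypothetical excerpt: the addresses below are invented for the example.
SAMPLE_SYSMAP = """\
e000000004400000 T _stext
e000000004401230 T sys_getpid
e000000004a00000 D jiffies
"""

CODE_TYPES = {"T", "t"}  # text (code) section symbols in nm notation


def lookup_symbol(sysmap_text, name, want_code=True):
    """Return the address of 'name', or None if no matching symbol exists.

    want_code=True only matches code symbols, want_code=False only matches
    the remaining (data) symbols.
    """
    for line in sysmap_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        address, kind, symbol = parts
        if symbol == name and (kind in CODE_TYPES) == want_code:
            return int(address, 16)
    return None


if __name__ == "__main__":
    address = lookup_symbol(SAMPLE_SYSMAP, "sys_getpid")
    print(hex(address))
    # A data symbol is not returned when a code address is required.
    print(lookup_symbol(SAMPLE_SYSMAP, "jiffies"))
```

A symbol that is missing from the table, for instance because it lives in a
module, simply yields None here.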

Here are a few examples on Itanium:

Count the number of times main() is called in the noploop program:

        % file noploop
        noploop: ELF 64-bit LSB executable, IA-64, version 1, statically linked, not stripped
        % pfmon --checkpoint-func=main -e ia64_inst_retired noploop
        10000

Here the symbol information for main() is directly extracted from noploop
itself.

Count the number of times main() is called in the noploop-stripped program:

        % file noploop-stripped
        noploop-stripped: ELF 64-bit LSB executable, IA-64, version 1, statically linked, stripped
        % pfmon --symbol-file=noploop --checkpoint-func=main -e ia64_inst_retired noploop-stripped
        1000

Here noploop and noploop-stripped are the same program except that the
latter does not have the symbol table anymore.

Count the number of times sys_getpid() is called during the execution of
noploop:

        % pfmon -k --symbol-file=/boot/efi/vmlinux-nostrip --checkpoint-func=sys_getpid -e ia64_inst_retired noploop
        1000

Here we assume that the kernel file vmlinux was not stripped. If the kernel
has been stripped, then we can use the System.map instead:

        % pfmon -k --sysmap-file=/boot/efi/System.map --checkpoint-func=sys_getpid -e ia64_inst_retired noploop
        1000

6/ Basic sampling with pfmon

Pfmon has support for sampling on any event or combination of events.
Samples are collected into a buffer which can then be written to a file or
simply printed on the screen.

a/ principles

Each sample is composed of two parts: a fixed size header which contains
information about the sample, and a variable body which consists of a set
of 64-bit values, each one holding the contents of a PMD register
associated with one of the other events being monitored. All samples record
the same set of PMDs; this set is determined by pfmon based on what is
being measured.

The sampling buffer is controlled by the kernel but its size is
configurable. By default pfmon uses a buffer with 2048 entries. This can be
changed using the --smpl-entries option.
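
As a rough guide to how much memory a given --smpl-entries value implies,
the sketch below multiplies the number of entries by the size of one entry
(fixed header plus 8 bytes per recorded PMD). The 48-byte header used as a
default is an assumption: it is inferred from the sample session header
later in this document, which reports a sampling entry size of 56 bytes
with a single recorded PMD. It may differ with other kernel sampling
formats, so treat the numbers as an estimate.

```python
PMD_BYTES = 8  # each recorded PMD is a 64-bit value


def entry_size(num_pmds, header_bytes=48):
    """Size in bytes of one sample: fixed header plus recorded PMDs.

    header_bytes=48 is inferred (56-byte entry with one PMD), not
    taken from the kernel sources.
    """
    return header_bytes + PMD_BYTES * num_pmds


def buffer_bytes(entries=2048, num_pmds=1, header_bytes=48):
    """Approximate size in bytes of a sampling buffer."""
    return entries * entry_size(num_pmds, header_bytes)


if __name__ == "__main__":
    print(entry_size(1))        # one recorded PMD per sample
    print(buffer_bytes())       # default 2048 entries, one PMD
    print(buffer_bytes(num_pmds=3))
```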

The sampling works as follows:

        1- the user specifies what needs to be recorded.
        2- the user specifies the sampling period and optional
           randomization parameters.
        3- at the end of a period, a sample is recorded into the buffer by
           the kernel.
        4- a new sampling period is reloaded and execution/monitoring
           resumes. We go back to step 3.
        5- if the sampling buffer becomes full, pfmon is notified.
        6- pfmon processes the buffer, i.e., prints and/or saves the
           buffer.
        7- pfmon then notifies the kernel that it is done.
        8- the kernel reloads a new sampling period and
           execution/monitoring resumes. We go back to step 3.

Pfmon (and the kernel) uses two sampling periods instead of just one. The
first one is called short-smpl-period and the second is called
long-smpl-period. The short-smpl-period is used in step 4, that is, when
the sampling buffer is not full after writing the sample. The
long-smpl-period is used in step 8, when the reload occurs after the buffer
became full.

But why do we need 2 periods? As you might imagine, there is some overhead
in recording a sample. This overhead is increased even more when pfmon
needs to get involved to drain the buffer. This operation can take some
time and will inevitably introduce some noise in the measurements in the
form of TLB and/or cache pollution. To try and hide this noise, it is
sometimes beneficial to adjust the sampling period, i.e., make it larger to
ensure that the next sample will not record an event that is the
consequence of the overhead generated by the monitoring but rather a normal
event occurring in the program/system being monitored. So it is expected
that long-smpl-period >= short-smpl-period. Of course, if the two are
equal, this is equivalent to having only one sampling period.

Note that the long-smpl-period is only used to set the distance to the
first sample recorded after the buffer is marked as empty again (step 7).

b/ sampling output formats

There are many ways in which the samples can be saved or printed on the
screen.
Pfmon has support for custom formats. Note that at this point, the kernel
sampling buffer format is fixed. Here the customization happens in the
tool.

Pfmon comes with a set of output formats. Some of them can be used with any
PMU model, others are specific to the Itanium or Itanium 2 PMUs. While all
PMDs on all PMUs are 64 bits wide, what they contain can vary from one PMU
to the other.

You can figure out which formats are available for the host PMU by typing:

        % pfmon -I
        supported PMU models: [itanium2] [itanium] [generic]
        detected host PMU: itanium
        supported sampling outputs: [detailed-itanium] [raw] [compact] [btb] [example]

You can get a short description of what each format does by using the -S
option:

        % pfmon -S detailed-itanium
        Name        : detailed-itanium
        Description : Details each event in clear text
        PMU models  : [itanium]

Some formats are supported on all PMU models, in which case they are listed
as generic:

        % pfmon -S compact
        Name        : compact
        Description : Column-style raw values
        PMU models  : [generic]

Pfmon does not have a format by default, therefore the user MUST provide a
format when starting a sampling session.

        % pfmon --smpl-output-format=compact --long-smpl-periods=100000 ls
        0 14130 0 0x2000000000015771 0x0000582a9cf18e79 0x0010 100000
        1 14130 0 0x2000000000015851 0x0000582a9cf34a40 0x0010 100000
        2 14130 0 0x2000000000015941 0x0000582a9cf4e5e8 0x0010 100000
        3 14130 0 0x2000000000023da0 0x0000582a9cf69db7 0x0010 100000
        ....

For more information about the various formats please refer to the source
code :-<

c/ some simple examples

Suppose you want to record how many instructions are retired every 50000
cycles, i.e., you want to sample based on CPU_CYCLES and record the value
of IA64_INST_RETIRED in each sample.
This can be done as follows:

        % pfmon --smpl-output-format=detailed-itanium \
                --short-smpl-period=50000 --long-smpl-period=50000 -e cpu_cycles,ia64_inst_retired -- ls /dev/null

The two periods are identical in this example because the number of
instructions executed by the ls command is not influenced by the fact that
we monitor. The syntax is such that the 50000 value of short-period applies
to the first event specified in the event list. The same rule applies for
long-period.

With pfmon it is possible to use more than one event as the 'sampling
event'. You can also specify a sampling period for IA64_INST_RETIRED, in
which case we take a sample whenever the first OR second period expires:

        % pfmon --smpl-output-format=detailed-itanium --short-smpl-period=50000,10000 \
                --long-smpl-period=50000,10000 -e cpu_cycles,ia64_inst_retired ls

Here a sample will be recorded every 50000 cpu cycles OR each time 10000
instructions have been retired.

You do not necessarily need to specify both periods. If you specify only
one, pfmon will use its value to initialize the other one, so both periods
end up with the same value.

Let us look at the information in the sampling buffer for the
detailed-itanium format.
For the first example above, we get something like this printed on the
screen:

    /dev/null
    Entry 0 PID:1490 CPU:3 STAMP:0x39e28c5cf782 IIP:0x2000000000004c70 OVFL: 4
        PMD5 : 0x0000000000004708
    Entry 1 PID:1490 CPU:3 STAMP:0x39e28c5f8e0a IIP:0x2000000000026ee0 OVFL: 4 LAST_VAL: 5000
        PMD5 : 0x0000000000007310
    Entry 2 PID:1490 CPU:3 STAMP:0x39e28c6273d2 IIP:0x2000000000025e40 OVFL: 4 LAST_VAL: 5000
        PMD5 : 0x000000000000b5e6
    Entry 3 PID:1490 CPU:3 STAMP:0x39e28c63ef1b IIP:0x2000000000018490 OVFL: 4 LAST_VAL: 5000
        PMD5 : 0x000000000001137f
    Entry 4 PID:1490 CPU:3 STAMP:0x39e28c64c6f5 IIP:0x2000000000024f60 OVFL: 4 LAST_VAL: 5000
        PMD5 : 0x0000000000018a73
    Entry 5 PID:1490 CPU:3 STAMP:0x39e28c6596cb IIP:0x2000000000018490 OVFL: 4 LAST_VAL: 5000
        PMD5 : 0x00000000000222df
    .....

The first line is the output from the ls command. Next you see the entries
extracted from the sampling buffer. Entry 0 is the first entry recorded in
this monitoring session. The first line of each sample (entry) shows the
fixed header. The fields are as follows:

    - PID     : the identity of the process that generated the event
    - CPU     : the CPU number on which the event occurred
    - STAMP   : a time stamp guaranteed to be unique in time per CPU.
    - IIP     : the value of the IP when the event occurred (DANGER, see note below)
    - OVFL    : the counter that triggered the recording of the sample (more than one possible).
    - LAST_VAL: the last value loaded into the first counter which overflowed

VERY IMPORTANT NOTE:

Users are advised NOT TO TRUST the value reported in IIP. Samples get
recorded by forcing a counter overflow, which then triggers an interrupt
that causes the kernel to record the information. Because of the parallel
nature of the architecture and its implementations, it is very likely that
by the time the PMU realizes that there was a counter overflow and generates
the interrupt, the program execution has progressed way beyond the
instruction that caused the event, leading to a skewed IIP.
At best, IIP points to the next bundle, given that interrupts can only be
delivered at bundle boundaries.

After the header, you get the value of PMD5. This register contains the
number of instructions retired for our example. The second event specified
by the user DOES NOT necessarily end up in PMD5. To figure out how the
events were dispatched among the various PMDs, you can use the --with-header
option (described earlier). The header contains a detailed machine and
session description. In our case it would look as follows:

    #
    # date: Wed Nov 20 17:00:43 2002
    #
    # hostname: hpljumbo.hpl.hp.com
    #
    # kernel version: Linux 2.4.18 #2 SMP Tue Aug 6 11:54:56 PDT 2002
    #
    # pfmon version: 2.0
    # kernel perfmon version: 1.0
    #
    #
    #
    # page size: 16384 bytes
    # CLK_TCK: 1024 ticks/second
    # CPU configured: 4
    # CPU online: 4
    # physical memory: 6827933696
    # physical memory available: 5598134272
    #
    # host CPUs: 4-way 800MHz Itanium (Merced, C0)
    # PAL_A: 6.6.23
    # PAL_B: 7.7.28
    # Cache levels: 3 Unique caches: 4
    # L1D: 16384 bytes, line 32 bytes, load_lat 2, store_lat 0
    # L1I: 16384 bytes, line 32 bytes, load_lat 2, store_lat 0
    # L2 : 98304 bytes, line 64 bytes, load_lat 6, store_lat 6
    # L3 : 4194304 bytes, line 64 bytes, load_lat 21, store_lat 21
    #
    #
    # captured events:
    # PMD4: CPU_CYCLES, user level(s)
    # PMD5: IA64_INST_RETIRED, user level(s)
    #
    # monitoring mode: per-process
    #
    #
    # instruction sets:
    # PMD4: CPU_CYCLES, ia32/ia64
    # PMD5: IA64_INST_RETIRED, ia32/ia64
    #
    #
    # command: ./pfmon --with-header --smpl-output-format=detailed-itanium --short-smpl-period=50000 --long-smpl-period=50000 -e cpu_cycles,ia64_inst_retired -- ls /dev/null
    #
    #
    #
    #
    # kernel sampling format: 1.0
    # sampling entry size: 56
    #
    # recorded PMDs: PMD5
    # sampling buffer entries: 2048
    #
    # short sampling rates (base/mask/seed):
    # CPU_CYCLES 50000
    # IA64_INST_RETIRED none
    #
    # long sampling rates (base/mask/seed):
    # CPU_CYCLES 50000
    # IA64_INST_RETIRED none
    #
    #

Near the end of the header, you see in the "captured events"
section: PMD5: IA64_INST_RETIRED. Pfmon records the value of each PMD whose
event has no sampling period defined. For our first example, it means that
it will record the value of the PMD counting the number of instructions
retired.

Let us look at a more complicated example using some of the Itanium specific
events:

    % pfmon --with-header --short-smpl-periods=50000 --long-smpl-periods=50000 \
      -e cpu_cycles,ia64_inst_retired,l2_misses,cpu_cpl_changes -- ls /dev/null

Here cpu_cycles controls the sampling period and each sample includes the
values of the PMDs counting the number of instructions retired
(IA64_INST_RETIRED), the number of L2 misses (L2_MISSES) and the number of
CPU privilege level changes (CPU_CPL_CHANGES):

    entry 0 PID:18723 CPU:3 STAMP:0x23b06dc011261 IIP:0x2000000000024d40 PMD OVFL: 4
        PMD5 : 0x00000000000017d7
        PMD6 : 0x00000000000001de
        PMD7 : 0x0000000000000008

Where the assignments were:

    # captured events:
    # PMD4: CPU_CYCLES, user level(s)
    # PMD5: IA64_INST_RETIRED, user level(s)
    # PMD6: L2_MISSES, user level(s)
    # PMD7: CPU_CPL_CHANGES, user level(s)

Using the compact format instead of the detailed one, you get results that
are formatted such that they can be easily parsed by other tools. The header
contains the description of every column:

    # column 1: entry number
    # column 2: process id
    # column 3: cpu number
    # column 4: instruction pointer
    # column 5: unique timestamp
    # column 6: bitmask of PMDs which overflowed
    # column 7: initial value of PMD which overflowed
    # column 8: PMD5
    # column 9: PMD6
    # column 10: PMD7

and the data is formatted as follows:

When sampling, the counts printed at the end of the session are not very
useful, especially for the counters used as sampling periods. Those should
be discarded and they are NOT saved in the sampling result file.

d/ sampling in system wide mode

Sampling is possible in the same manner for system wide sessions. By
default, the buffer is printed on the controlling tty.
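Since compact-format rows carry the CPU number in column 3 (see the column
list above), samples printed for a system wide session can be split per CPU
by a small script. The following is only a sketch and not part of pfmon: the
function names are hypothetical, and it assumes that every column is either
a 0x-prefixed hexadecimal number or a plain decimal number, as in the
compact sample rows shown earlier in this document.

```python
from collections import defaultdict

# Fixed columns 1..7 of the compact format, in the order given by the header.
FIXED_COLUMNS = ("entry", "pid", "cpu", "ip", "stamp", "ovfl_mask", "last_val")


def parse_compact_line(line):
    """Parse one compact-format row. Return None for header or other lines."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    fields = text.split()
    if len(fields) < len(FIXED_COLUMNS):
        return None
    try:
        # int(x, 0) accepts "0x..." as well as decimal (assumed formats).
        values = [int(field, 0) for field in fields]
    except ValueError:
        return None  # not a sample row, e.g. output of the monitored program
    sample = dict(zip(FIXED_COLUMNS, values))
    # Columns 8 and up hold the recorded PMD values, if any.
    sample["pmds"] = values[len(FIXED_COLUMNS):]
    # Column 6 is a bitmask: bit N set means PMDN overflowed.
    mask = sample["ovfl_mask"]
    sample["overflowed_pmds"] = [bit for bit in range(64) if (mask >> bit) & 1]
    return sample


def group_by_cpu(lines):
    """Group parsed samples by the CPU number found in column 3."""
    per_cpu = defaultdict(list)
    for line in lines:
        sample = parse_compact_line(line)
        if sample is not None:
            per_cpu[sample["cpu"]].append(sample)
    return dict(per_cpu)
```

For instance, group_by_cpu(open("myresults")) would return a dictionary
keyed by CPU number, where 'myresults' stands for any file holding compact
output. For the sample rows shown earlier, the overflow bitmask 0x0010
decodes to PMD4.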
When sampling on more than one CPU at a time, samples for each CPU will be
printed. When sampling results are redirected into a file, you get one file
per CPU. If the file is called 'myresults', then 'myresults.cpu0' contains
the samples captured on CPU0, 'myresults.cpu1' the ones from CPU1, and so
on.

The --aggregate-results option also influences the way samples are saved to
files. When this option is used, samples are merged into a single file. In
our example, they would go into 'myresults'. If you don't use the
--smpl-no-entry-header option, every sample will have the CPU information.

e/ randomization of sampling periods

Pfmon supports randomization of both sampling periods. The user must supply
a bitmask and a seed value using the --smpl-periods-random option. The same
mask and seed apply to both the long and short period for each event. Each
event can have a different mask and seed. Two separate invocations of pfmon
using the same seed and mask arguments are guaranteed to generate the same
"pseudo-random" series of numbers, allowing reproducibility.

The sampling buffer reports the randomized sampling period used to generate
each sample in the LAST_VAL field of the detailed output format. In the
compact format, it appears in the 'initial value of PMD which overflowed'
column.

In the following command, the long (and short) sampling periods are
initially set to 100000 and we activate randomization using a seed of 5.
The mask indicates that we allow the value to vary between 100000 and 100255
(inclusive):

    % pfmon --smpl-periods-random=0xff:5 --long-smpl-period=100000 -e cpu_cycles -- noploop 1000000000
    entry 0 PID:509 CPU:0 STAMP:0xa9b83faf28 IIP:0x4000000000000400 OVFL: 4 LAST_VAL: 100000
    entry 1 PID:509 CPU:0 STAMP:0xa9b8413a4d IIP:0x4000000000000400 OVFL: 4 LAST_VAL: 100005
    entry 2 PID:509 CPU:0 STAMP:0xa9b842c532 IIP:0x4000000000000400 OVFL: 4 LAST_VAL: 100067
    entry 3 PID:509 CPU:0 STAMP:0xa9b8445077 IIP:0x4000000000000400 OVFL: 4 LAST_VAL: 100181
    entry 4 PID:509 CPU:0 STAMP:0xa9b845db4e IIP:0x4000000000000400 OVFL: 4 LAST_VAL: 100064
    entry 5 PID:509 CPU:0 STAMP:0xa9b84766b5 IIP:0x4000000000000400 OVFL: 4 LAST_VAL: 100212
    entry 6 PID:509 CPU:0 STAMP:0xa9b848f1d5 IIP:0x4000000000000400 OVFL: 4 LAST_VAL: 100140

The randomization is visible in the LAST_VAL field, which shows the value
loaded into PMD4 (the PMD which overflowed) for each sample. Hence, 100181
is the number of cycles elapsed between entry 2 and entry 3.

Randomization is important when sampling to avoid getting in lockstep with
the execution and thereby collecting biased results.

6/ Blocking on overflow notifications

Whenever the sampling buffer becomes full and pfmon is notified, you have
the option of either letting the monitored program continue or blocking it.
In both cases, monitoring is off during the processing of the sampling
buffer. By default, pfmon lets the program continue its execution. It is
possible to block the program using the --overflow-block option.

Blocking the program ensures pfmon sees the entire execution. Keeping the
program running ensures that the caches and TLB are kept somewhat warm,
i.e., with some state belonging to the running process, especially on SMP
systems.

7/ Excluding idle tasks in system wide sessions

Pfmon now allows the user to exclude the idle tasks from system wide
monitoring sessions. This only works with a kernel that has perfmon 1.3 or
higher.
Pfmon checks the kernel version and may abort in case the wrong version is
detected.

Linux has one idle task per CPU. This task runs when nothing else can. The
idle task is a kernel-only task with a pid of 0. The pid 0 is used for ALL
idle tasks. They do not show up in ps or top.

When running a system wide session, it may be useful to stop monitoring when
the idle task is running, so that we monitor only the USEFUL execution. Of
course, whether or not the idle task is monitored only matters when
monitoring is active at the kernel privilege level, i.e., when using the -k
or -0 option of pfmon. When monitoring only at the user level, excluding the
idle task has no effect. Similarly, excluding the idle task for a
per-process session has no effect.

For instance, here is what we get without exclusion:

    % pfmon -k --session-timeout=10 --system-wide
    8003084826 CPU_CYCLES

This is run on an 800MHz Itanium CPU, so 10s is 8 billion cycles. But if we
run with exclusion:

    % pfmon --exclude-idle -k --session-timeout=10 --system-wide
    259663 CPU_CYCLES

These are the useful cycles for the 10s period.

8/ Further documentation

You can find a lot of information about the Linux/ia64 kernel in the book:

    'ia-64 linux kernel design and implementation'
    David Mosberger and Stephane Eranian
    Prentice Hall
    ISBN: 0130610143

Also see http://www.lia64.org for the book's web site.

This book contains a chapter about the IA-64 PMU, the design of the kernel
perfmon subsystem and also a small description of pfmon.

More detailed information about the IA-64 architecture, including the PMU,
can be found on the Intel developers' web site at:

    http://developer.intel.com/design/itanium/family/

9/ Support

You can subscribe to the official Linux/ia64 mailing list at
www.linuxia64.org. Alternatively, you can send me an e-mail at
eranian@hpl.hp.com

10/ Bug reports

You can send bug reports to me at eranian@hpl.hp.com. Patches are also
welcome.

12/20/2002
S.Eranian