------------------------------------------------------
       pfmon-2.0 Itanium2 specific documentation
------------------------------------------------------
Copyright (c) 2002 Hewlett-Packard Company
Stephane Eranian

This document describes pfmon features which are specific to the
Itanium 2 (a.k.a. McKinley) PMU. For information about the generic
support, refer to usersguide.txt.

1/ Itanium2 features supported by pfmon

Pfmon provides access to ALL the Itanium2 PMU specific features.
This includes:

    - Event Address Registers (Data & Code & ALAT)
    - Opcode matching (PMC8, PMC9)
    - Address range restrictions (Data & Code) with fine mode and
      inverse code range
    - Branch Trace Buffer (BTB)
    - Event thresholds
    - IA-32 execution monitoring

The Itanium2 specific options of pfmon are as follows:

    --event-thresholds=thr1,thr2,...
                            set event thresholds (no space)
    --opc-match8=val        set opcode match for pmc8
    --opc-match9=val        set opcode match for pmc9
    --btb-tm-tk             capture taken IA-64 branches only
    --btb-tm-ntk            capture not taken IA-64 branches only
    --btb-ptm-correct       capture branch if target predicted correctly
    --btb-ptm-incorrect     capture branch if target is mispredicted
    --btb-ppm-correct       capture branch if path is predicted correctly
    --btb-ppm-incorrect     capture branch if path is mispredicted
    --btb-ds-pred           capture info about branch predictions
    --btb-brt-iprel         capture IP-relative branches only
    --btb-brt-ret           capture return branches only
    --btb-brt-ind           capture non-return indirect branches only
    --btb-all-mispredicted  capture all mispredicted branches
    --irange=start-end      specify an instruction address range constraint
    --drange=start-end      specify a data address range constraint
    --checkpoint-func=addr  a bundle address to use as checkpoint
    --ia32                  monitor IA-32 execution only
    --ia64                  monitor IA-64 execution only
    --insn-sets=set1,set2,...
                            set per event instruction set
                            (setX=[ia32|ia64|both])
    --inverse-irange        inverse instruction range restriction

In the following sections, we review how each feature and its related
options are used.

2/ Event thresholds

Pfmon has support for event thresholds. It is possible to further
refine certain events using a threshold. If an event has a threshold
set to n, the PMU will not count the occurrences of that event unless
it happens more than n times per cycle. So, if the threshold is zero,
which is the default, then ALL occurrences are recorded. But if it is
set to 3, then the counter is incremented by one only when the event
happens more than 3 times in a cycle.

Not all events accept the same range of threshold values. You can
determine the maximum increment per cycle for each event using the
event info (-i) option of pfmon:

    % pfmon -i nops_retired
    Name   : NOPS_RETIRED
    VCode  : 0x50
    Code   : 0x50
    PMD/PMC: [ 4 5 6 7 ]
    Umask  : 0000
    EAR    : No (N/A)
    BTB    : No
    MaxIncr: 6  (Threshold [0-5])
    Qual   : [Instruction Address Range] [OpCode Match]
    Group  : None
    Set    : None
    Desc   : Retired NOP Instructions

The information includes the maximum increment for the event. Here 6
means that the CPU can execute up to 6 nops per cycle, which
corresponds to the two-bundle maximum window of Itanium2. This
combination is possible when using the right template to fill all the
execution units. Next to it you see the allowed values for the
threshold, which go from 0 to the maximum increment minus 1.

Now if you want to count the number of times 6 nops are executed in a
single cycle, you can do:

    % pfmon --event-thresholds=5 -e nops_retired ls /dev/null
                       0 NOPS_RETIRED

Luckily enough, there is no such bundle executed with the invocation
of ls!

You can specify a threshold for every event you use. The thresholds
MUST be specified in the same order as the events.

3/ Opcode matchers (PMC8, PMC9)

The opcode matcher feature allows you to constrain what is being
monitored based on the instruction opcode, opcode pattern or
functional unit.
Pfmon has two options to support this feature:

    --opc-match8: set the value for PMC8 (first opcode matcher)
    --opc-match9: set the value for PMC9 (second opcode matcher)

These options constrain what is included in the measurement but they
do not set what is to be measured, i.e., which event. Many times, the
user just wants to count the number of occurrences of certain
instructions or instruction patterns. For this, you need to combine
PMC8/PMC9 with an event. To count the number of machine instructions
constrained by:

    - PMC8, you need to use IA64_TAGGED_INST_RETIRED_IBRP0_PMC8 or
      IA64_TAGGED_INST_RETIRED_IBRP2_PMC8

    - PMC9, you need to use IA64_TAGGED_INST_RETIRED_IBRP1_PMC9 or
      IA64_TAGGED_INST_RETIRED_IBRP3_PMC9

For instance, if you want to count the number of br.cloop executed in
a program using PMC8, you can do:

    % pfmon --opc-match8=0x1400028003fff1f8 -e IA64_TAGGED_INST_RETIRED_IBRP0_PMC8 ls /dev/null
    /dev/null
                    2134 IA64_TAGGED_INST_RETIRED_IBRP0_PMC8

The two opcode matchers are not symmetrical in what they can
constrain; please refer to the Itanium2 documentation for further
information.

Not all events can be constrained with the opcode matchers. Pfmon will
reject any invalid combination. You can figure out if an event
supports the opcode matcher feature using the event info option of
pfmon:

    % pfmon -i cpu_cycles
    Name   : CPU_CYCLES
    VCode  : 0x12
    Code   : 0x12
    PMD/PMC: [ 4 5 6 7 ]
    Umask  : 0000
    EAR    : No (N/A)
    BTB    : No
    MaxIncr: 1  (Threshold 0)
    Qual   : None
    Group  : None
    Set    : None
    Desc   : CPU Cycles

Here you see on the Qual line that CPU_CYCLES does not support any
constraint at all.
But if we look at NOPS_RETIRED:

    % pfmon -i nops_retired
    Name   : NOPS_RETIRED
    VCode  : 0x50
    Code   : 0x50
    PMD/PMC: [ 4 5 6 7 ]
    Umask  : 0000
    EAR    : No (N/A)
    BTB    : No
    MaxIncr: 6  (Threshold [0-5])
    Qual   : [Instruction Address Range] [OpCode Match]
    Group  : None
    Set    : None
    Desc   : Retired NOP Instructions

You see that this event supports opcode matching: 'OpCode Match'.

Pfmon supports two ways of specifying the value to load into PMC8 or
PMC9: a numerical value or a logical name.

a/ Using a numerical value

The numerical value can be entered in hexadecimal or decimal form.
Internally pfmon does not verify the validity of the value provided by
the user. For instance, if you want to count the number of ld1.* that
you execute when running ls /dev/null, then you can type as follows:

    % pfmon --opc-match8=0x8400000007e7fffb -e ia64_tagged_inst_retired_ibrp0_pmc8 ls /dev/null
    /dev/null
                   34219 IA64_TAGGED_INST_RETIRED_IBRP0_PMC8

b/ Using a logical name

Constructing the value to load into PMC8 or PMC9 is a tedious process
as the structure is quite complicated and the process is prone to
errors. This version of pfmon comes with a primitive configuration
file which at this point is only used for opcode matching. Pfmon
allows you to specify a logical name, i.e., a string, instead of a
numerical value. The configuration file contains a small database of
logical names for the opcode matchers. The database is in clear text
and has a simple name,value structure.

Pfmon supports two configuration files, a system wide file and a user
specific file. Pfmon uses only one of the two. It first looks for a
user specific file called .pfmon.conf in the user's home directory. If
found, it is used; otherwise, pfmon looks at the system wide
configuration file in $prefix/lib/pfmon/pfmon.conf, where prefix
depends on the installation, usually prefix=/usr.

The format of the configuration file is fairly trivial at this point:
it is just a collection of name,value pairs.
There MUST be one pair per line, and the value MUST be in hexadecimal:

    % cat ~/.pfmon.conf
    iload1      0x8400000007e7fffb
    br.cloop    0x1400028003fff1fb

Pfmon does not come with a pre-established configuration file, so it
is up to the user to define the name,value pairs that are of interest.

With the database above, you can then invoke pfmon as follows:

    % pfmon --opc-match8=iload1 -e ia64_tagged_inst_retired_ibrp0_pmc8 ls /dev/null
                   34216 IA64_TAGGED_INST_RETIRED_IBRP0_PMC8

4/ Event Address Registers (EARs)

The Event Address Registers provide a way to capture where cache, TLB,
and ALAT misses occur. For each captured miss, you get the instruction
address, the data address (when relevant), the latency of the miss
(when relevant), and the TLB level at which the miss was resolved
(when relevant).

Let us first look at cache misses. You can filter out which misses you
are interested in based on the miss latency. EARS DO NOT CAPTURE NON
MISSING cache (L1) accesses. For instance you can say that you want
misses that take more than 16 cycles to resolve. The Itanium2 PMU
supports a fixed set of latencies going from 4 to 4096. Of course not
all latencies are possible; they are usually powers of two.

The Itanium2 PMU uses two events to indicate the type of cache misses:
code or data. L1I_EAR_CACHE is used for instruction cache misses and
DATA_EAR_CACHE is used for data cache misses. Similarly, DATA_EAR_ALAT
is used for the ALAT. Theoretically, the latency filter is programmed
in one of the fields of the PMC controlling the monitor. However, to
make it easier to use, the library on which pfmon is built
encapsulates the latency with the event by creating 'virtual events'.
If you list the events using pfmon -l and a regular expression of
'_ear_', you get:

    % pfmon -l_ear_
    DATA_EAR_ALAT
    DATA_EAR_CACHE_LAT1024
    DATA_EAR_CACHE_LAT128
    DATA_EAR_CACHE_LAT16
    DATA_EAR_CACHE_LAT2048
    DATA_EAR_CACHE_LAT256
    DATA_EAR_CACHE_LAT32
    DATA_EAR_CACHE_LAT4
    DATA_EAR_CACHE_LAT4096
    DATA_EAR_CACHE_LAT512
    DATA_EAR_CACHE_LAT64
    DATA_EAR_CACHE_LAT8
    DATA_EAR_EVENTS
    DATA_EAR_TLB_ALL
    DATA_EAR_TLB_FAULT
    DATA_EAR_TLB_L2DTLB
    DATA_EAR_TLB_L2DTLB_OR_FAULT
    DATA_EAR_TLB_L2DTLB_OR_VHPT
    DATA_EAR_TLB_VHPT
    DATA_EAR_TLB_VHPT_OR_FAULT
    L1I_EAR_CACHE_LAT0
    L1I_EAR_CACHE_LAT1024
    L1I_EAR_CACHE_LAT128
    L1I_EAR_CACHE_LAT16
    L1I_EAR_CACHE_LAT256
    L1I_EAR_CACHE_LAT32
    L1I_EAR_CACHE_LAT4
    L1I_EAR_CACHE_LAT4096
    L1I_EAR_CACHE_LAT8
    L1I_EAR_CACHE_RAB
    L1I_EAR_EVENTS
    L1I_EAR_TLB_ALL
    L1I_EAR_TLB_FAULT
    L1I_EAR_TLB_L2TLB
    L1I_EAR_TLB_L2TLB_OR_FAULT
    L1I_EAR_TLB_L2TLB_OR_VHPT
    L1I_EAR_TLB_VHPT
    L1I_EAR_TLB_VHPT_OR_FAULT

You see the events for the TLB, the caches and the ALAT. For instance,
DATA_EAR_CACHE_LAT64 is the event used to capture data cache misses
with a latency of 64 cycles OR more. Similarly, DATA_EAR_TLB_VHPT is
used to capture TLB misses that were resolved by the hardware walker
(VHPT).

The Data EAR events are all subevents of DATA_EAR_EVENTS. Similarly,
the Instruction EAR events are all subevents of L1I_EAR_EVENTS.

You can get detailed information about EAR events using the event info
(-i) option of pfmon:

    % pfmon -i DATA_EAR_CACHE_LAT64
    Name   : DATA_EAR_CACHE_LAT64
    VCode  : 0x405c8
    Code   : 0xc8
    PMD/PMC: [ 4 5 6 7 ]
    Umask  : 000000100
    EAR    : Data (Cache Mode)
    BTB    : No
    MaxIncr: 1  (Threshold 0)
    Qual   : [Instruction Address Range] [OpCode Match] [Data Address Range]
    Group  : None
    Set    : None
    Desc   : Data EAR Cache -- >= 64 Cycles

The EARs are mostly used for sampling, therefore you typically
associate a sampling period with them. You configure a sampling period
with an EAR just like you would do with regular counters. But let us
take a simple example to help visualize the difference.
Let us suppose you want to capture the data cache misses that take 4
cycles or more. The sampling period is set to 2000, which is quite
small but is just used to show the sampling output:

    % pfmon --smpl-output-format=detailed-itanium2 --long-smpl-periods=2000 -e DATA_EAR_CACHE_LAT4 -- ls -l /dev/null
    crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
    entry 0 PID:4195 CPU:0 STAMP:0x3c5512d98bf9 IIP:0x2000000000012620
        PMD OVFL: 4 LAST_VAL: 2000
        PMD2 : 0x40000000000008f8
        PMD3 : 0x0000000000004005, valid Y, latency 5, overflow N
        PMD17: 0x20000000000122e8, valid Y, bundle 0, address 0x20000000000122e0
    18446744073709551088 DATA_EAR_CACHE_LAT4

Here again, we get sampling entries with the usual header. However,
the information in the body of each sample is quite different from
what we saw earlier. With the detailed output format for Itanium2,
pfmon decodes the meaning of each PMD which contains EAR information.
For instance, with EAR and data cache misses, PMD3 contains the
latency of the miss. In entry 0, the miss took 5 cycles to resolve.
The data that was being accessed was at address 0x40000000000008f8
(PMD2) and the instruction which generated the access was at
0x20000000000122e8, WHICH YOU NEED TO INTERPRET as bundle address
0x20000000000122e0 slot 0.

The Itanium2 has a dispersion window of two bundles. The EAR address
in PMD17 is the address of the first bundle in that window. To further
distinguish which of the two bundles caused the miss, you must rely on
the bundle field. Here bundle 0 indicates the miss was coming from the
first bundle. The slot information is encoded in the address field
(low 2 bits) as 0, 1, or 2.
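
The interpretation of the instruction address register can be sketched
in a few lines of Python. This is an illustration only, not pfmon
code: the helper name decode_dear_iaddr is hypothetical, and the
positions of the bundle bit (bit 2) and of the valid bit (bit 3) are
assumptions inferred from the sample above. Only the slot in the low 2
bits is stated in this document; refer to the Itanium2 documentation
for the authoritative layout.

```python
# Illustrative sketch only; not part of pfmon.
# ASSUMED layout of the data EAR instruction address register (PMD17):
#   bits 0-1 : slot (stated in the text)
#   bit  2   : bundle within the two-bundle window (assumption)
#   bit  3   : valid flag (assumption)
#   bits 4-63: address of the first bundle of the window (assumption)

BUNDLE_SIZE = 16  # an IA-64 bundle is 16 bytes


def decode_dear_iaddr(pmd17):
    """Split a raw PMD17 value into its fields (hypothetical helper)."""
    slot = pmd17 & 0x3
    bundle = (pmd17 >> 2) & 0x1
    valid = bool((pmd17 >> 3) & 0x1)
    window = pmd17 & ~0xF  # clear the four low control bits
    return {
        "valid": valid,
        "window": window,
        "bundle": bundle,
        "slot": slot,
        # address of the bundle holding the instruction that missed
        "insn_bundle": window + bundle * BUNDLE_SIZE,
    }


if __name__ == "__main__":
    fields = decode_dear_iaddr(0x20000000000122E8)
    print("valid=%s bundle=%d slot=%d address=0x%x" % (
        fields["valid"], fields["bundle"], fields["slot"], fields["window"]))
```

Applied to the PMD17 value of entry 0, the sketch yields the same
fields as the pfmon decoding: valid, bundle 0, slot 0 and address
0x20000000000122e0.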
If we look at the TLB instead, we get samples that look as follows:

    % pfmon --smpl-output-format=detailed-itanium2 --long-smpl-periods=50 -e DATA_EAR_TLB_VHPT -- ls -l /dev/null
    crw-rw-rw-    1 root     root       1,   3 Mar  1 10:08 /dev/null
    entry 0 PID:4198 CPU:0 STAMP:0x3caf724e1f42 IIP:0x2000000000015580
        PMD OVFL: 4 LAST_VAL: 50
        PMD2 : 0x4000000000001508
        PMD3 : 0x0000000000004068, valid Y, TLB VHPT
        PMD17: 0x2000000000015578, valid Y, bundle 0, address 0x2000000000015570
    18446744073709551588 DATA_EAR_TLB_VHPT

Note that this time the interpretation of PMD3 has changed. In TLB
mode, you specify the level at which you want to capture the misses.
Here we wanted TLB requests that missed in L1 and hit in the VHPT, and
that is what is reflected by PMD3. There is no latency information on
TLB misses. PMD17 contains the address of the instruction that caused
the TLB miss, and PMD2 is the address of the data that was being
accessed.

Cache and TLB misses can also be captured for instructions. Pfmon
operates in the same manner for instructions. The difference is in the
information that is captured. For instance, if you want to capture the
instruction TLB misses that hit in the VHPT, you can do as follows:

    % pfmon --smpl-output-format=detailed-itanium2 --long-smpl-periods=50 -e L1I_EAR_TLB_VHPT -- ls -l /dev/null
    crw-rw-rw-    1 root     root       1,   3 Mar  1 10:08 /dev/null
    entry 0 PID:4204 CPU:0 STAMP:0x3cbd980f1877 IIP:0xe0000000044012a0
        PMD OVFL: 4 LAST_VAL: 50
        PMD0 : 0x200000000005fc42, valid=Y cache line 0x200000000005fc40, TLB VHPT
    entry 1 PID:4204 CPU:0 STAMP:0x3cbd981304f8 IIP:0xe0000000044012a0
        PMD OVFL: 4 LAST_VAL: 50
        PMD0 : 0x2000000000253162, valid=Y cache line 0x2000000000253160, TLB VHPT
    18446744073709551612 L1I_EAR_TLB_VHPT

This time, the set of PMDs used to capture the information is
different, allowing both data and instruction EARs to operate in
parallel. In our example, PMD0 contains the address of the cache line
that caused the TLB miss (which was resolved by the VHPT).
For instruction cache misses, you can do:

    % pfmon --smpl-output-format=detailed-itanium2 --long-smpl-periods=5000 -e L1I_EAR_CACHE_LAT8 -- ls -l /dev/null
    crw-rw-rw-    1 root     root       1,   3 Mar  1 10:08 /dev/null
    entry 0 PID:4207 CPU:0 STAMP:0x3ccebbab52dd IIP:0x20000000002f4880
        PMD OVFL: 4 LAST_VAL: 5000
        PMD0 : 0x20000000002f6101, valid=Y cache line 0x20000000002f6100
        PMD1 : 0x0006000600060006, latency 6, overflow N
    entry 1 PID:4207 CPU:0 STAMP:0x3ccebbb3cb30 IIP:0x20000000000130c0
        PMD OVFL: 4 LAST_VAL: 5000
        PMD0 : 0x2000000000012f81, valid=Y cache line 0x2000000000012f80
        PMD1 : 0x0006000600060006, latency 6, overflow N
    18446744073709547987 L1I_EAR_CACHE_LAT8

This time both PMD0 and PMD1 contain relevant information. PMD0
contains the address of the cache line that caused the miss and PMD1
the latency to resolve it.

5/ Branch Trace Buffer (BTB)

The BTB is used to capture branch events. Depending on the
configuration of the BTB, it is possible to record the source and
target of each branch instruction. It is possible to filter out
branches based on how they were predicted by the hardware, whether
they were taken or not taken, and so on.

Each qualified branch is recorded into the branch buffer and usually
takes two entries (a pair): one for the source (the branch instruction
itself) and one for the target of the branch. The hardware buffer has
a size of 8, meaning that it can hold up to 4 branch events. The
buffer is managed like a ring buffer: once it is full, the oldest
entry is overwritten. The PMD16 register is used to maintain the
index, i.e., where to write next. It also contains a flag indicating
whether or not the buffer wrapped around.

You can count how many branches are captured using the BRANCH_EVENT
event. You MUST use this event if you want to sample with the BTB.
Because the BTB can hold 4 branches, sampling with the BTB means that
at the end of each sampling period, up to the last 4 branches are
recorded.
By default, pfmon will capture ALL branches (taken, not taken,
predicted correctly or mispredicted).

Let us take a look at a simple example:

    % pfmon --smpl-output-format=detailed-itanium2 --long-smpl-periods=5000 -e branch_event -- ls -l /dev/null
    /dev/null
    entry 0 PID:4236 CPU:0 STAMP:0x3dd5f625bad8 IIP:0x2000000000023de0
        PMD OVFL: 4
        PMD9 : 0x200000000000ec4d b=1 mp=0 bru=0 b1=0 valid=Y
               Source Address: 0x200000000000ec40
               Taken=N Prediction: Success
        PMD10: 0x200000000000ec59 b=1 mp=0 bru=0 b1=1 valid=Y
               Source Address: 0x200000000000ec62
               Taken=Y Prediction: Success
        PMD11: 0x2000000000023d82 b=0 mp=1 bru=0 b1=0 valid=Y
               Target Address: 0x2000000000023d80
        PMD12: 0x2000000000023d9d b=1 mp=0 bru=0 b1=1 valid=Y
               Source Address: 0x2000000000023da0
               Taken=N Prediction: Success
        PMD13: 0x2000000000023dcd b=1 mp=0 bru=0 b1=1 valid=Y
               Source Address: 0x2000000000023dd0
               Taken=N Prediction: Success
        PMD14: 0x2000000000023deb b=1 mp=1 bru=1 b1=0 valid=Y
               Source Address: 0x2000000000023de2
               Taken=Y Prediction: FE Failure
        PMD15: 0x2000000000023dc2 b=0 mp=1 bru=1 b1=0 valid=Y
               Target Address: 0x2000000000023dc0
        PMD8 : 0x2000000000023dcd b=1 mp=0 bru=0 b1=1 valid=Y
               Source Address: 0x2000000000023dd0
               Taken=N Prediction: Success
    ....

This time, each entry contains as many as 8 PMDs. Because of wrap
around conditions, there is no guarantee that the buffer will be full.
It depends on the sampling period and how it compares to the size of
the BTB. The BRANCH_EVENT counter is incremented by 1 FOR EACH PAIR OF
ENTRIES (each branch event). So if BRANCH_EVENT is equal to 4, then 4
branches (or 8 entries) are in the BTB.

The branches are recorded one after the other. But because of wrap
around conditions, you can have situations where PMD8 is not
necessarily the first, i.e., the oldest branch event in the buffer.
This can easily be seen in the example above. The detailed-itanium2
output format prints the BTB in sequential order, i.e., in the order
in which the branches occurred.
Note that this is not necessarily true of all output formats. It is
always possible to reconstruct the sequential order if PMD16 is
present in the entry (which pfmon ensures).

If we look at entry 0, PMD9 is the oldest branch in the buffer. It
reports a NON-TAKEN branch source that was located at address
0x200000000000ec40; the slot is not reported. It was predicted
correctly (success) by the hardware. For taken branches, such as the
one reported by PMD10, the address encodes the slot number in the low
2 bits; in this case 0x200000000000ec62 indicates slot 2.

It is possible to vary the kind of branches that are recorded using
the following options:

    --btb-ds-pred           capture info about branch predictions
    --btb-brt-iprel         capture IP-relative branches only
    --btb-brt-ret           capture return branches only
    --btb-brt-ind           capture non-return indirect branches only

These four options relate to the Itanium2 branch architecture. Please
refer to the Itanium2 documentation for further information.

Furthermore, you also have:

    --btb-tm-tk             capture taken IA-64 branches only
    --btb-tm-ntk            capture not taken IA-64 branches only

These are easy to understand!

    --btb-ptm-correct       capture branch if target predicted correctly
    --btb-ptm-incorrect     capture branch if target is mispredicted
    --btb-ppm-correct       capture branch if path is predicted correctly
    --btb-ppm-incorrect     capture branch if path is mispredicted

Same here.

    --btb-all-mispredicted  capture all mispredicted branches

This one is a freebie: it combines the others to capture only the
mispredicted branches.

It is possible to combine BTB and EAR sampling. One interesting case
is when you combine the BTB (taken branches) with the instruction
cache misses. For each cache miss captured, you will get the last 4
branches that led to the miss. So you will have the last few steps in
the path that led to the miss. With this information, one can imagine
possible optimizations such as prefetching.
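
Reconstructing the sequential order of a BTB sample, as mentioned
earlier in this section, amounts to rotating a ring buffer. The Python
sketch below is an illustration only: the function name btb_in_order
is hypothetical, and its two parameters (the next-write index and the
wrapped flag) are assumed to have been extracted from PMD16
beforehand. The bit layout of PMD16 is deliberately left out; see the
Itanium2 documentation for it.

```python
# Illustrative sketch only; not part of pfmon.
# ASSUMPTION: next_index and wrapped were already decoded from PMD16.

BTB_SIZE = 8  # the hardware buffer holds 8 entries (PMD8..PMD15)


def btb_in_order(entries, next_index, wrapped):
    """Return the valid BTB entries, oldest first.

    entries    -- the 8 values in register order PMD8..PMD15
    next_index -- position (0-7) where the hardware would write next
    wrapped    -- True if the buffer wrapped around at least once
    """
    if len(entries) != BTB_SIZE:
        raise ValueError("expected %d BTB entries" % BTB_SIZE)
    if not 0 <= next_index < BTB_SIZE:
        raise ValueError("index out of range")
    if wrapped:
        # The slot about to be overwritten holds the oldest entry.
        return entries[next_index:] + entries[:next_index]
    # No wrap around: only the slots before the index were written.
    return entries[:next_index]


if __name__ == "__main__":
    regs = ["PMD%d" % n for n in range(8, 16)]
    print(btb_in_order(regs, 1, True))
```

With a next-write index of 1 and the wrapped flag set, the sketch
returns PMD9 first and PMD8 last, which is the order printed in the
sample of this section.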
6/ IA-32 monitoring

a/ Introduction

By default, pfmon captures events for both IA-32 and IA-64 programs.
Not all events are functional in IA-32 mode.

The following features are not available when monitoring in IA-32 mode
ONLY:

    - The Branch Trace Buffer (BRANCH_EVENT)
    - Code range restriction (--irange, --checkpoint-func)
    - Data range restriction (--drange)

However, those features are accepted when monitoring for both IA-64
and IA-32 (default). The results will ONLY represent what was
generated by the IA-64 execution.

b/ The --ia32 and --ia64 options

Using the --ia32 option, the user restricts monitoring to execution
occurring while psr.is = 1, i.e., for IA-32 code. Using the --ia64
option restricts monitoring to IA-64 code only, i.e., psr.is = 0.

Note that those options apply to ALL specified events.

c/ Per event instruction set tuning

Pfmon also provides a way to fine-tune the instruction set on a per
event basis using the --insn-sets option. The order in which the
events are listed determines to which event each instruction set
option applies. The first event gets the first instruction set option
specified and so on.

You do not need to specify an instruction set option for every event.
An event for which no instruction set is specified uses whatever the
"global" option, i.e., --ia64 or --ia32, is set to. Note that by
default, pfmon does both IA-64 and IA-32 at the same time.

You can skip certain events, for instance:

    % pfmon --insn-sets=,ia64 -e l2_misses,l2_misses hello

This will have the first l2_misses event use the default mode, i.e.,
IA-64 & IA-32, while the second l2_misses will be configured for IA-64
only. Similarly, the following command:

    % pfmon --insn-sets=ia32 -e l2_misses,l2_misses hello

will set the first l2_misses event for IA-32 only and the second for
both IA-64 and IA-32.
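
The positional matching between events and instruction sets described
above can be modeled with a short Python sketch. It is an illustration
only and does not reflect the actual pfmon implementation: the
function name resolve_insn_sets is hypothetical, and rejecting a list
that is longer than the event list is an assumption of this sketch.

```python
# Illustrative sketch only; not part of pfmon.

ALLOWED_SETS = ("ia32", "ia64", "both")


def resolve_insn_sets(events, insn_sets="", default="both"):
    """Pair each event with its instruction set (hypothetical helper).

    events    -- event names in the order given to -e
    insn_sets -- the raw --insn-sets argument, e.g. ",ia64"
    default   -- the global mode: "both", or "ia32"/"ia64" when
                 --ia32 or --ia64 is used
    """
    fields = insn_sets.split(",") if insn_sets else []
    if len(fields) > len(events):
        # ASSUMPTION: more instruction sets than events is an error.
        raise ValueError("more instruction sets than events")
    resolved = []
    for position, event in enumerate(events):
        # A missing or empty field falls back to the global mode.
        choice = fields[position] if position < len(fields) else ""
        choice = choice or default
        if choice not in ALLOWED_SETS:
            raise ValueError("invalid instruction set: %s" % choice)
        resolved.append((event, choice))
    return resolved


if __name__ == "__main__":
    events = ["l2_misses", "l2_misses"]
    print(resolve_insn_sets(events, ",ia64"))
    print(resolve_insn_sets(events, "ia32"))
```

The two calls mirror the two commands above: the first leaves the
first event in the default mode and restricts the second to IA-64; the
second restricts the first event to IA-32 and leaves the second in the
default mode.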
d/ Some examples

Let us look at a simple example with two hello programs, one an IA-64
binary (hello) and the same program compiled as an IA-32 binary
(hello.x86):

    % file hello
    hello: ELF 64-bit LSB executable, IA-64, version 1, statically linked, not stripped

    % pfmon --insn-sets=ia32,ia64 -e l2_misses,l2_misses hello
    Hello world
                       0 L2_MISSES
                     578 L2_MISSES

Here we measure the same event twice, but the first one is configured
to monitor IA-32 execution whereas the second monitors IA-64. When
running an IA-64 binary, the first counter is 0.

Now let us see what happens with an IA-32 binary:

    % file hello.x86
    hello: ELF 32-bit LSB executable, Intel 80386, version 1, statically linked, not stripped

    % pfmon --insn-sets=ia32,ia64 -e l2_misses,l2_misses hello.x86
    Hello world
                     184 L2_MISSES
                       0 L2_MISSES

Now the first counter reports a non zero value.

e/ Limitations

Linux/ia64 does not currently support processes where both instruction
sets are mixed. However, the dual mode (IA-32, IA-64) is interesting
when running system wide monitoring where all execution is captured.

The Linux/ia64 kernel execution ALWAYS happens in IA-64 mode,
therefore using --ia32 to monitor kernel level execution has no
effect. Similarly, some events are only relevant in one mode. For
instance, IA32_INST_RETIRED only counts IA-32 instructions.
Conversely, IA64_INST_RETIRED will return 0 on an IA-32 program.

7/ Range restrictions

a/ Introduction

Pfmon allows the monitoring to be constrained to a certain range of
data or code addresses and provides the following set of options:

    --irange=start-end|code_symbol          : specify a code address range
    --drange=start-end|data_symbol          : specify a data address range
    --checkpoint-func=code_addr|code_symbol : specify a checkpoint address
    --inverse-irange                        : inverse a code range

The third option is a refinement of the first option as we will see
shortly.

The range can be specified in hexadecimal or decimal. Alternatively,
the range can be specified using symbols from the program.
Pfmon currently supports only one range per type at a time, i.e., you
cannot specify two instruction ranges.

When a range is specified using a numerical value, pfmon does not try
to see if the range represents a valid part of the address space of
the process. It simply performs a sanity check on the bounds. It is
possible to specify code or data ranges inside the kernel.

When symbols are used, pfmon checks that the symbol corresponds to
data for --drange and to code for --irange and --checkpoint-func. For
a code range, pfmon verifies that the bounds are bundle-aligned.

The range can be delimited by two symbols, but pfmon also supports
using a single symbol. In this case, it will use the size of the
symbol which is encoded in the symbol table.

NOTE: earlier versions of the IA-64 GNU toolchain did not generate the
size in the symbol table. In this case, pfmon will try to approximate
the size of a symbol by using the next symbol, given that the symbol
table is sorted by increasing address values. This mechanism is not
always accurate; you can check the numerical values used for the range
by turning on the verbose mode (--verbose). You can also check your
binaries with the readelf -s command.

b/ Itanium 2 PMU limitations

The Itanium 2 PMU imposes some restrictions on the alignment of the
ranges due to the way they are implemented, i.e., using the debug
registers. There is one improvement over the Itanium PMU for code
ranges with the addition of the fine mode. With this mode, a code
range can be specified without using a mask but instead with a start
and end address. However, there are still some limitations on the size
of the range (4KB) in this case. Also, depending on the event, it may
not be possible to use multiple debug register pairs to gain accuracy
when the fine mode cannot be used. Even with multiple pairs, it is
possible that the programmed range will be slightly larger than what
was asked for.
You can determine which mode was used for each range and also by how
much the debug registers will 'bleed' from the specified range by
using the --verbose option of pfmon:

    % pfmon --verbose --irange=0x1000-0x1590 -e ia64_inst_retired /bin/ls /dev/null
    ...
    irange is [0x1000-0x1590)=1424 bytes
    ...
    [0x1000-0x1590): 2 register pair(s), fine_mode
    start offset: -0x0 end_offset: +0x10
    brp0: db0: 0x0000000000001000 db1: plm=0x8 mask=0x00fffffffffff000
    brp2: db4: 0x0000000000001590 db5: plm=0x8 mask=0x00fffffffffff000
    ...

As you can see here, it was possible to use the fine mode because the
size of the range was below the 4KB limit and the range was not
crossing a 4KB page boundary. Due to a bug in the implementation of
the fine mode, there is still some 'bleeding' at the edges as shown by
an end offset of 16 bytes (1 bundle).

Just like for the opcode matcher, not all events support address range
restrictions; you can use the event info option (-i) to verify.

The --drange option works just like the --irange option. In fact, both
can be combined as they rely on distinct sets of debug registers.

IMPORTANT: The program being monitored by pfmon MUST NOT be using the
debug registers.

c/ privilege level mask of range

The range restriction also uses a privilege level mask. It has the
same role as the one for events. Pfmon uses the default global
privilege level to set up the range restrictions. For instance, the
following example:

    % pfmon --irange=main --verbose -eloads_retired,nops_retired,loads_retired noploop 1000000000
    ...
    [0x40000000000004c0-0x4000000000000690): 2 register pair(s), fine_mode
    start offset: -0x0 end_offset: +0x10
    brp0: db0: 0x40000000000004c0 db1: plm=0x8 mask=0x00fffffffffff000
    brp2: db4: 0x4000000000000690 db5: plm=0x8 mask=0x00fffffffffff000
    ...

uses user privilege level only (pfmon default) for the range as
indicated by plm=0x8.
This is even more apparent in the following example:

    % pfmon -k --irange=main --verbose -eloads_retired,nops_retired,loads_retired noploop 1000000000
    ...
    [0x40000000000004c0-0x4000000000000690): 2 register pair(s), fine_mode
    start offset: -0x0 end_offset: +0x10
    brp0: db0: 0x40000000000004c0 db1: plm=0x1 mask=0x00fffffffffff000
    brp2: db4: 0x4000000000000690 db5: plm=0x1 mask=0x00fffffffffff000
    ...

But when privilege level masks are set per event, there can be
confusion as the range is systematically applied to all events.
Therefore pfmon disallows the use of the --priv-levels option when a
range is provided and vice-versa.

d/ inversing code range

It is possible to invert the code range using the --inverse-irange
option. Inverting the code range means that the PMU will count the
events only when they occur outside the specified range.

Let us use our noploop example to demonstrate what happens. First let
us measure the total number of nop instructions retired.

    % pfmon -enops_retired noploop 1000000000
              1000004080 NOPS_RETIRED

Now if we focus on the core loop function:

    % pfmon --irange=noploop -enops_retired noploop 1000000000
              1000000002 NOPS_RETIRED

If we look at main():

    % pfmon --irange=main -enops_retired noploop 1000000000
                      20 NOPS_RETIRED

Now if we invert main(), i.e., count all the nops outside of main():

    % pfmon --inverse-irange --irange=main -enops_retired noploop 1000000000
              1000004071 NOPS_RETIRED

To get to main() and to leave main(), the program executes:

    1000004071 - 1000000002 = 4069 nops

which is close to what you would get using the other possible formula:

    1000004080 - 1000000002 - 20 = 4058 nops

d/ Some examples

Let us look at some more examples which use symbols directly. First
suppose we have a program which contains a data array called B and we
want to know the number of loads from the array:

    % pfmon --verb --drange=B -e loads_retired my_test_program
    ...
    symbol B (data): [0x600000000001c000-0x6000000000024000)=32768 bytes
    drange is [0x600000000001c000-0x6000000000024000)=32768 bytes
    ...
    [0x600000000001c000-0x6000000000024000): 2 register pair(s)
    start offset: -0x0 end_offset: +0x0
    brp0: db0: 0x6000000000020000 db1: plm=0x8 mask=0x00ffffffffffc000 end=0x6000000000023fff
    brp1: db2: 0x600000000001c000 db3: plm=0x8 mask=0x00ffffffffffc000 end=0x600000000001ffff
    ...
               100000000 LOADS_RETIRED

Here pfmon was able to extract the size of B directly from the symbol
table. The array is aligned properly for its size, therefore both
start and end offsets are 0.

Now suppose we want to know the number of loads from B which were
executed in function doit(). We can combine --irange with --drange for
LOADS_RETIRED:

    % pfmon --verb --irange=doit --drange=B -e loads_retired my_test_program
    ...
    symbol doit (code): [0x4000000000003000-0x40000000000030d0)=208 bytes
    irange is [0x4000000000003000-0x40000000000030d0)=208 bytes
    symbol B (data): [0x600000000001c000-0x6000000000024000)=32768 bytes
    drange is [0x600000000001c000-0x6000000000024000)=32768 bytes
    ...
    [0x4000000000003000-0x40000000000030d0): 2 register pair(s), fine_mode
    start offset: -0x0 end_offset: +0x10
    brp0: db0: 0x4000000000003000 db1: plm=0x8 mask=0x00fffffffffff000
    brp2: db4: 0x40000000000030d0 db5: plm=0x8 mask=0x00fffffffffff000
    ...
    [0x600000000001c000-0x6000000000024000): 2 register pair(s)
    start offset: -0x0 end_offset: +0x0
    brp0: db0: 0x6000000000020000 db1: plm=0x8 mask=0x00ffffffffffc000 end=0x6000000000023fff
    brp1: db2: 0x600000000001c000 db3: plm=0x8 mask=0x00ffffffffffc000 end=0x600000000001ffff
    ...
               100000000 LOADS_RETIRED

Here, pfmon extracted the size of function doit() from the symbol
table and it is small enough to qualify for the fine mode. Due to the
fine mode, we still have an end offset of 1 bundle. The run shows that
all of the loads are coming from doit().
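
The relation between the mask and the end= value printed for each
register pair of the data range can be checked with a small Python
sketch. It is an illustration only: the function name pair_range is
hypothetical, and the sketch assumes that the zero bits of the mask
form one contiguous run of "don't care" bits at the low end of a
56-bit mask, as in the masks printed in this section. It does not
apply to fine_mode pairs, where the two registers hold the start and
end addresses instead.

```python
# Illustrative sketch only; not part of pfmon.
# ASSUMPTION: the mask is 56 bits wide and its zero bits are the low
# "don't care" bits of the address comparison.

MASK_WIDTH = 0x00FFFFFFFFFFFFFF


def pair_range(base, mask):
    """Return (start, end, size) covered by one debug register pair."""
    dont_care = ~mask & MASK_WIDTH   # bits ignored by the comparison
    size = dont_care + 1             # number of bytes covered
    start = base & ~dont_care        # base aligned on the range size
    return start, start + size - 1, size


if __name__ == "__main__":
    pairs = [
        (0x6000000000020000, 0x00FFFFFFFFFFC000),  # brp0 of the drange
        (0x600000000001C000, 0x00FFFFFFFFFFC000),  # brp1 of the drange
    ]
    total = 0
    for base, mask in pairs:
        start, end, size = pair_range(base, mask)
        total += size
        print("start=0x%x end=0x%x size=%d" % (start, end, size))
    print("total=%d bytes" % total)
```

For the two pairs used for array B, the sketch reproduces the end=
values of the verbose output, and the two 16384-byte pieces add up to
the 32768 bytes of the requested range.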
e/ The checkpoint-func option

The --checkpoint-func option is a variation of the --irange option; as
such, it cannot be used in conjunction with --irange. It allows a user
to specify a bundle address and can be used to verify that execution
crosses a certain point (bundle). When the bundle is the first of a
function, you can check how many times the function was called. You
need to combine the constraint with the IA64_INST_RETIRED event. The
result then needs to be divided by three to get the number of calls.
Note that pfmon does not require that the bundle be the first of a
function; in fact, it can be any bundle. There is no equivalent of
this option for data.

With this option, you can easily determine the number of times a
particular system call is invoked. For instance, to count the number
of times sys_open() (the function which implements open(2)) is called:

    % pfmon --verb --symbol-file=vmlinux -k --checkpoint-func=sys_open -e ia64_inst_retired ls /dev/null
    ...
    loaded 20105 symbols from ELF file vmlinux
    symbol sys_open (code): [0xe0000000044b8420-0xe0000000044b85b0)=400 bytes
    checkpoint function at 0xe0000000044b8420
    ...
    [0xe0000000044b8420-0xe0000000044b8430): 1 register pair(s)
    start offset: -0x0 end_offset: +0x0
    brp0: db0: 0xe0000000044b8420 db1: plm=0x1 mask=0x00fffffffffffff0 end=0xe0000000044b842f
    ...
                      54 IA64_INST_RETIRED

Here we specified -k to monitor at the kernel level given that
sys_open() is a kernel function. The count is 54, which indicates that
the function was called 18 times (18=54/3). The result is ALWAYS a
multiple of 3 as you have 3 instructions per bundle (predicated off
instructions are counted here).

The use of any other event is possible here if that event supports the
instruction address range restriction (see pfmon -i). But to count the
number of times the function is invoked, you MUST use
IA64_INST_RETIRED.

At this point only one checkpoint per session is supported.
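
The conversion from the IA64_INST_RETIRED count to a number of calls
can be written down as a tiny Python sketch. It is an illustration
only: the function name calls_from_checkpoint is hypothetical, and
treating a count that is not a multiple of 3 as an error is a choice
made for this sketch.

```python
# Illustrative sketch only; not part of pfmon.

SLOTS_PER_BUNDLE = 3  # a bundle holds 3 instructions


def calls_from_checkpoint(inst_retired):
    """Convert an IA64_INST_RETIRED count on one bundle into crossings."""
    calls, remainder = divmod(inst_retired, SLOTS_PER_BUNDLE)
    if remainder:
        # The text states the count is always a multiple of 3.
        raise ValueError("count is not a multiple of 3")
    return calls


if __name__ == "__main__":
    print(calls_from_checkpoint(54))
```

With the count of 54 from the sys_open() run, the sketch gives the 18
calls mentioned above.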
8/ References

The Itanium2 PMU is described in detail in the micro-architecture
manual entitled:

    'Intel Itanium 2 Processor Reference Manual for Software
     Development and Optimization'

Additional information can be found in the IA-64 architecture manuals.

All the documents are available from the Intel Developer web site at:

    http://developer.intel.com/design/itanium/index.htm

06/06/2002
S.Eranian