              The perfmon hardware monitoring interface
              ------------------------------------------
		           Stephane Eranian
			  <eranian@gmail.com>

I/ Introduction

   The perfmon interface provides access to the hardware performance counters
   of major processors. Nowadays, all processors implement some flavor of
   performance counters which capture micro-architectural level information
   such as the number of elapsed cycles, number of cache misses, and so on.

   The interface is implemented as a set of new system calls and a set of
   config files in /sys.

   It is possible to monitor a single thread or a CPU. In either mode,
   applications can count or sample. System-wide monitoring is supported by
   running a monitoring session on each CPU. The interface supports event-based
   sampling where the sampling period is expressed as the number of occurrences
   of event, instead of just a timeout. This approach provides a better
   granularity and flexibility.

   For performance reason, it is possible to use a kernel-level sampling buffer
   to minimize the overhead incurred by sampling. The format of the buffer,
   what is recorded, how it is recorded, and how it is exported to user is
   controlled by a kernel module called a sampling format. The current
   implementation comes with a default format but it is possible to create
   additional formats. There is an kernel registration interface for formats.
   Each format is identified by a simple string which a tool can pass when a
   monitoring session is created.

   The interface also provides support for event set and multiplexing to work
   around hardware limitations in the number of available counters or in how
   events can be combined. Each set defines as many counters as the hardware
   can support. The kernel then multiplexes the sets. The interface supports
   time-based switching but also overflow-based switching, i.e., after n
   overflows of designated counters.

   Applications never manipulates the actual performance counter registers.
   Instead they see a logical Performance Monitoring Unit (PMU) composed of a
   set of config registers (PMC) and a set of data registers (PMD). Note that
   PMD are not necessarily counters, they can be buffers. The logical PMU is
   then mapped onto the actual PMU using a mapping table which is implemented
   as a kernel module. The mapping is chosen once for each new processor. It is
   visible in /sys/kernel/perfmon/pmu_desc. The kernel module is automatically
   loaded on first use.

   A monitoring session is uniquely identified by a file descriptor obtained
   when the session is created. File sharing semantics apply to access the
   session inside a process. A session is never inherited across fork. The file
   descriptor can be used to receive counter overflow notifications or when the
   sampling buffer is full. It is possible to use poll/select on the descriptor
   to wait for notifications from multiple sessions. Similarly, the descriptor
   supports asynchronous notifications via SIGIO.

   Counters are always exported as being 64-bit wide regardless of what the
   underlying hardware implements.

II/ Kernel compilation

    To enable perfmon, you need to enable CONFIG_PERFMON and also some of the
    model-specific PMU modules.

III/ OProfile interactions

    The set of features offered by perfmon is rich enough to support migrating
    Oprofile on top of it. That means that PMU programming and low-level
    interrupt handling could be done by perfmon. The Oprofile sampling buffer
    management code in the kernel as well as how samples are exported to users
    could remain through the use of a sampling format. This is how Oprofile
    works on Itanium.

    The current interactions with Oprofile are:
	- on X86: Both subsystems can be compiled into the same kernel. There
		  is enforced mutual exclusion between the two subsystems. When
		  there is an Oprofile session, no perfmon session can exist
		  and vice-versa.

	- On IA-64: Oprofile works on top of perfmon. Oprofile being a
		    system-wide monitoring tool, the regular per-thread vs.
		    system-wide session restrictions apply.

	- on PPC: no integration yet. Only one subsystem can be enabled.
	- on MIPS: no integration yet.  Only one subsystem can be enabled.

IV/ User tools

    We have released a simple monitoring tool to demonstrate the features of
    the interface. The tool is called pfmon and it comes with a simple helper
    library called libpfm. The library comes with a set of examples to show
    how to use the kernel interface. Visit http://perfmon2.sf.net for details.

    There maybe other tools available for perfmon.

V/ How to program?

   The best way to learn how to program perfmon, is to take a look at the
   source code for the examples in libpfm. The source code is available from:

		http://perfmon2.sf.net

VI/ System calls overview

   The interface is implemented by the following system calls:

   * int pfm_create_context(pfarg_ctx_t *ctx, char *fmt, void *arg, size_t arg_size)

      This function create a perfmon2 context. The type of context is per-thread by
      default unless PFM_FL_SYSTEM_WIDE is passed in ctx. The sampling format name
      is passed in fmt. Arguments to the format are passed in arg which is of size
      arg_size. Upon successful return, the file descriptor identifying the context
      is returned.

   * int pfm_write_pmds(int fd, pfarg_pmd_t *pmds, int n)

      This function is used to program the PMD registers. It is possible to pass
      vectors of PMDs.

   * int pfm_write_pmcs(int fd, pfarg_pmc_t *pmds, int n)

      This function is used to program the PMC registers. It is possible to pass
      vectors of PMDs.

   * int pfm_read_pmds(int fd, pfarg_pmd_t *pmds, int n)

      This function is used to read the PMD registers. It is possible to pass
      vectors of PMDs.

   * int pfm_load_context(int fd, pfarg_load_t *load)

      This function is used to attach the context to a thread or CPU.
      Thread means kernel-visible thread (NPTL). The thread identification
      as obtained by gettid must be passed to load->load_target.

      To operate on another thread (not self), it is mandatory that the thread
      be stopped via ptrace().

      To attach to a CPU, the CPU number must be specified in load->load_target
      AND the call must be issued on that CPU. To monitor a CPU, a thread MUST
      be pinned on that CPU.

      Until the context is attached, the actual counters are not accessed.

   * int pfm_unload_context(int fd)

     The context is detached for the thread or CPU is was attached to.
     As a consequence monitoring is stopped.

     When monitoring another thread, the thread MUST be stopped via ptrace()
     for this function to succeed.

   * int pfm_start(int fd, pfarg_start_t *st)

     Start monitoring. The context must be attached for this function to succeed.
     Optionally, it is possible to specify the event set on which to start using the
     st argument, otherwise just pass NULL.

     When monitoring another thread, the thread MUST be stopped via ptrace()
     for this function to succeed.

   * int pfm_stop(int fd)

     Stop monitoring. The context must be attached for this function to succeed.

     When monitoring another thread, the thread MUST be stopped via ptrace()
     for this function to succeed.


   * int pfm_create_evtsets(int fd, pfarg_setdesc_t *sets, int n)

     This function is used to create or change event sets. By default set 0 exists.
     It is possible to create/change multiple sets in one call.

     The context must be detached for this call to succeed.

     Sets are identified by a 16-bit integer. They are sorted based on this
     set and switching occurs in a round-robin fashion.

   * int pfm_delete_evtsets(int fd, pfarg_setdesc_t *sets, int n)

     Delete event sets. The context must be detached for this call to succeed.


   * int pfm_getinfo_evtsets(int fd, pfarg_setinfo_t *sets, int n)

     Retrieve information about event sets. In particular it is possible
     to get the number of activation of a set. It is possible to retrieve
     information about multiple sets in one call.


   * int pfm_restart(int fd)

     Indicate to the kernel that the application is done processing an overflow
     notification. A consequence of this call could be that monitoring resumes.

   * int read(fd, pfm_msg_t *msg, sizeof(pfm_msg_t))

   the regular read() system call can be used with the context file descriptor to
   receive overflow notification messages. Non-blocking read() is supported.

   Each message carry information about the overflow such as which counter overflowed
   and where the program was (interrupted instruction pointer).

   * int close(int fd)

   To destroy a context, the regular close() system call is used.


VII/ /sys interface overview

   Refer to Documentation/ABI/testing/sysfs-perfmon-* for a detailed description
   of the sysfs interface of perfmon2.

VIII/ debugfs interface overview

  Refer to Documentation/perfmon2-debugfs.txt for a detailed description of the
  debug and statistics interface of perfmon2.

IX/ Documentation

   Visit http://perfmon2.sf.net
