Profiling code on an embedded platform

Edwin Olson (eolson@mit.edu), April 2007

I develop on several embedded platforms. My current platform of choice is an ARM7TDMI device in the form of an LPC2378. I develop primarily in C, but also write assembly as necessary. These hosts run small real-time operating systems with limited capabilities.

There are two things that I wish I could do on my embedded platform that I can do on a proper Linux host.

  1. Run a memory heap validator such as valgrind
  2. Run a profiler

I've got no solution for #1 yet, and I'd love to hear from you if you've got an approach. (Some possibilities: gcc's _FORTIFY_SOURCE, porting libmudflap, etc.) A good simulator would also help out.

I do have an approach to profiling now, however. Really, this is a question of "what's the least amount of work I can do to implement a simple profiler?" Because this is an embedded system in which peripheral systems trigger all processing, simulation is hard to do: many peripherals would have to be faithfully simulated, as would the real-world events! In-situ profiling presents fewer hurdles.

Because the CPU is relatively limited (48 MIPS, give or take), any significant profiling overhead (or code instrumentation) is a show-stopper. These considerations led to my strategy:

Approach

  1. Write a FIQ (fast interrupt) that will periodically sample the PC (program counter), and write the sampled address to a buffer.
  2. Write a user task that periodically polls the sample buffer and sends it to a host. (I send the data via UDP packets, but serial would have worked too.)
  3. Process the samples on the local host to generate useful profiling data.

The implementations of #1 and #2 are strongly platform- and embedded-O/S-dependent, but they're also fairly easy to do! Below is my FIQ handler, for example, which uses LPC timer 0 to generate an interrupt every 100k cycles (or thereabouts). Not shown, of course, are the modification to the exception vectors to point to on_profile_irq, and the normal O/S task that polls profile_out and sends the data to the host.


#define PROFILE_SAMPLES 256
uint32_t profile_buffer[PROFILE_SAMPLES];
volatile uint32_t profile_out;

void profile_init()
{
    PCONP |= (1<<1); // PCTIM0
    
    T0PR   = 1;               // prescale: TC advances once per 2 PCLK ticks
    T0TCR  = (1<<0);          // enable the counter
    T0MR0  = 9000;            // match register: sampling period, in timer ticks
    T0MCR  = (1<<0) | (1<<1); // when we hit MR0, interrupt and reset TC
    T0CTCR = 0;               // timer mode (count PCLK ticks), por favor

    VICIntSelect |= (1<<4);   // route timer 0 (VIC channel 4) to FIQ
    VICIntEnable |= (1<<4);
}

void on_profile_irq() __attribute__ ((interrupt ("FIQ")));
void on_profile_irq()
{
    uint32_t addr;
    asm volatile("sub %0, lr, #4" // FIQ lr points 4 bytes past the
                 : "=r"(addr));   // interrupted instruction

    if (profile_out < PROFILE_SAMPLES)
        profile_buffer[profile_out++] = addr;

    T0IR = (1<<0); // ack the interrupt (T0IR is write-1-to-clear)
}

Running the (slightly) modified firmware and logging the PC samples on the host yields a file with one address per line. It looks like this:


000029bc
000038ec
00005444
00003738
00003528
000038d4
00005424
0000371c
00005458
00003764
00002a0c
00006c24
0000546c
00003918
00006bf0
00003680
000038d0

Each line represents a single sample of the PC. Because I used a FIQ rather than an IRQ, and because FIQs remain enabled during IRQs, I can profile time spent in ISRs as well as in user code. This is great! I can easily collect a thousand samples per second (or so) with no significant load on the CPU.

Processing the data

My solution for processing the data is somewhat inelegant, but it was easy and produces useful and easy-to-interpret results.

I use the standard binutils program "addr2line" to translate each PC sample into a tuple of (source code file, line number, function name). Addr2line is included in all standard GNU toolchains, and will read our file of addresses and produce a new file of tuples. Suppose the file containing PC samples is called "samples.log". I run the following command:

% cat /tmp/samples.log | arm-elf-addr2line -e main.out  -fi > /tmp/profile.dat

Note: it is important to compile your code with -g so that debugging information is included in the image. It is not necessary, however, to disable compile-time optimizations; that's good, because profiling unoptimized code would give less meaningful results!

Addr2line, with the options -fi, produces two lines for every sample: the function name, followed by filename:line number. The output of addr2line looks like:

scheduler_add
../nkern/core/scheduler.c:25
nkern_utime
../nkern/platforms/lpc2378/platform.c:61
_nkern_schedule
../nkern/core/nkern.c:393
scheduler_add
../nkern/core/scheduler.c:40
_nkern_schedule
../nkern/core/nkern.c:314
nkern_in_irq
../nkern/include/nkern.h:131
nkern_yield
../nkern/core/nkern.c:188
sleep_queue_remove_head_if_sleep_done
../nkern/core/sleep_queue.c:43
nkern_utime
../nkern/platforms/lpc2378/platform.c:63
_userspace_reschedule
../nkern/archs/arm7tdmi/arch.c:35
scheduler_pick
../nkern/core/scheduler.c:54

The next step is a short application that reads the output of addr2line (written to a file named "profile.dat") and produces summary data. The program simply tallies the number of hits on each function and on each line number, producing an annotated listing of the CPU time spent on each line of source code. You can easily write such an application yourself, or use mine, Prof.java. My application takes the samples.log file as its first argument; additional arguments are added to the search path used to locate source code.

The most important task is producing output that makes it easy to see where CPU time is being spent. My program sorts functions by CPU usage and outputs their source code (highest CPU time first), with each line labelled with its CPU usage.

Here is some sample output from my program. I'd be very interested in any improvements, so please send 'em to me!

Limitations

There are some serious limitations to my current implementation. Some could be easily fixed; others, not so much.

Small issues

  1. The sampling timer should be randomized to avoid sampling bias (phase-locking with periodic code). This could easily be done by rewriting the MR0 register with a pseudo-random value each time the FIQ finishes.
  2. Too many steps are needed to produce the profiling output (run the firmware, log samples.log, convert it into profile.dat, run Prof.java); this could be remedied with a few wrapper programs and a little UI polish.

The big issue

The output would be far more useful with backtrace information: i.e., the sequence of calls that led to each sample. This data turns out to be very difficult to harvest on an ARM core: the calling convention differs between Thumb and ARM mode, and it is not obvious whether the FIQ interrupted user code or an IRQ. Each case would require a different code path and would make the FIQ much more costly, so I have not pursued this. It would also be necessary to limit the maximum backtrace depth to, say, 10.

But if I had backtrace information, I would modify the "samples.log" file format to contain multiple addresses per line. Each line would correspond to a single sample, with each successive address on a line representing the caller of the previous address. This file couldn't be piped directly through addr2line, but only a short wrapper would be needed. Then, for each function, I would tally the traces that led to each sample, and when outputting the data for a function, print the top N traces that led to its execution.

Feedback

I'm keen on feedback. Send me a note at eolson@mit.edu.