printk()

So, a week in Prague has come to its end. The Embedded Linux Conference Europe was this year co-located with Open Source Summit and offered a lot of interesting talks on various topics.

One of the hottest topics this year was about our most beloved debugging function - prink(). What is so hard with printing? It turns out that printk is quite deadlock-prone and that is not an easy thing to work around in the current infrastructure of the kernel.

A common misconception is that printk() is a fast operation that simply writes the message to the global __log_buf variable. It's not.

A printk() may involve many different subsystems, different contexts or nesting, just to mention a few parts that needs to be handled. For example:

  1. The output needs to go over some output medium (consoles)
    • The monitor
    • Frame buffers
    • UART / Serial console
    • Network console
    • Braille
    • ...
  2. Uses different locking mechanismes
    • The console_lock (described below)
    • The logbuf_lock spinlock
    • Consoles often have their own locks
  3. Wake up waiting applications
    • syslogd
    • journald
    • ...

Besides that, printk() is expected to work in every context, whether it's process, softirq, IRQ or NMI context. With all these locking mechanisms involved, what happens if a printk in process context is interrupted by an NMI, and the NMI also calls printk? In other words, there is a lot of special cases that needs to be handled.

How it works

Historical

Lets look back on how the printing was handled in a pre-history kernel.

SMP (Symmetric Multi Processing) SoCs became common in the late 1990s. Before that, everything was easy and everyone was happy. No NMIs. No races between multiple cores. Simple locking. No Facebook.

As a response to SMP systems, Linux v2.1.80 introduced a spin_lock to printk to avoid race conditions between multiple cores. The solution we came up with was to serialize all prints to the console. If two CPUs called printk() at the same time, the second core has to wait for the first core to finish.

This does not scale well. In fact, it does not scale at all. What about a modern system with 100+ CPUs that all calls printk at the same time? Depending on the console, the printing may take milliseconds and you will surely end up with an unresponsive system.

Nowdays

Now we are doing things differently. The first core that grabs the console_lock is responsible to print all messages in the __log_buf. If another core is calling printk() in meanwhile, it puts its data into __log_buf , tries to grab the lock which is busy, and then simple returns.

As __log_buf continues getting new data, the unlucky core that grabbed the console_lock may end up doing nothing but printing.

The good thing is that we only locks up a single core instead of all cores. The bad thing is that we locks up a single core.

The code

printk()

printk() is defined in kernel/printk/printk.c and does not look much to the world

asmlinkage __visible int printk(const char *fmt, ...)
{
    va_list args;
    int r;

    va_start(args, fmt);
    r = vprintk_func(fmt, args);
    va_end(args);

    return r;
}

It simple calls vprintk_function with its own arguments.

vprintk_func()

vprintk_func() is a function that forward the arguments to different print-functions depending on the current context

__printf(1, 0) int vprintk_func(const char *fmt, va_list args)
{
    if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK)
        return vprintk_nmi(fmt, args);

    if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
        return vprintk_safe(fmt, args);

    if (this_cpu_read(printk_context) & PRINTK_NMI_DEFERRED_CONTEXT_MASK)
        return vprintk_deferred(fmt, args);

    return vprintk_default(fmt, args);
}

The different contexts we have to consider are:

Normal context

If we are on normal context, there is nothing to consider at all, go for the vprintk_default() and just do our thing.

NMI context

In the case that the CPU supports NMIs (Non-Maskable Interrupts, (look for CONFIG_HAVE_NMI and CONFIG_PRINTK_NMI in your .config ), we go for vprintk_nmi().

vprintk_nmi() do a safe copy to a per-CPU buffer, not the global __log_buf.

Since NMIs are not nested by its nature, there is always only one write running. However, NMIs is only for the local CPU, and the buffer might get flushed from another CPU, so we still need to be careful.

"Recursive" context

If the printk() routine is interrupted and we end up in another call to printk from somewhere else, we go for the lock-less vprintk_safe() to prevent a recursion deadlock. vprintk_safe() is using a per-CPU buffer to store the message, just like NMI.

Deferred context

As already said, multiple locks is involved in the call chain of printk(). vprintk_deferred() is using the main logbuf_lock but avoid calling console drivers that might have their own locks. The actual printing is deferred to klogd_work kernel thread.

vprintk_emit()

vprintk_emit() is responsible to write to __log_buf, (but not the only function, cont_flush() also write to __log_buf) and print out the content to all consoles.

asmlinkage int vprintk_emit(int facility, int level,
                const char *dict, size_t dictlen,
                const char *fmt, va_list args)
{

    ...

    <<<<< Strip kernel syslog prefix >>>>>

    ...

    <<<<< log_output() does the actual printing to __log_buf >>>>>
    printed_len = log_output(facility, level, lflags, dict, dictlen, text, text_len);

    ...

    if (!in_sched) {
        /*
         * Try to acquire and then immediately release the console
         * semaphore.  The release will print out buffers and wake up
         * /dev/kmsg and syslog() users.
         */
        if (console_trylock())
            console_unlock();
    }

    return printed_len;
}

The function is quite straight forward. The only thing that looks a little bit strange is

if (console_trylock())
    console_unlock();

Really? Grab the console_lock and immediately unlock it? The thing is that all magic happens in console_unlock().

console_unlock()

The CPU that is grabbing the console_lock is responsible to print to all registered consoles until all new data in __log_buf is printed. This regardless if other CPUs keeps filling the buffer with new data.

In the worst case, this CPU is doing nothing but printing and will never leave this function.

void console_unlock(void)
{
    ...


    <<<<< Endless loop? >>>>>
    for (;;) {

        <<<<< Go through all new messages >>>>>

        ...

        <<<<< Print to all consoles >>>><
        call_console_drivers(ext_text, ext_len, text, len);

        ...
    }

    ...

    <<<<<  Release the exclusive_console once it's used >>>>>
    console_locked = 0;

    ...

    <<<<< Wake up klogd >>>>>
    if (wake_klogd)
        wake_up_klogd();
}

The function is looping until all new messages is printed. For each new message, a call to call_console_drivers() is made. The last thing that we do is waking up the klogd kernel thread that will signal to all userspace application that is waiting on klogctl(2).

call_console_drivers()

call_console_drivers() is asking all registered consoles to print out a message. The console_lock must be held when calling this function.

static void call_console_drivers(const char *ext_text, size_t ext_len,
                 const char *text, size_t len)
{
    struct console *con;

    trace_console_rcuidle(text, len);

    if (!console_drivers)
        return;

    for_each_console(con) {
        if (exclusive_console && con != exclusive_console)
            continue;
        if (!(con->flags & CON_ENABLED))
            continue;
        if (!con->write)
            continue;
        if (!cpu_online(smp_processor_id()) &&
            !(con->flags & CON_ANYTIME))
            continue;
        if (con->flags & CON_EXTENDED)
            con->write(con, ext_text, ext_len);
        else
            con->write(con, text, len);
    }
}

trace_printk()

As we see, there is a lot of logic involved in a simple call to printk() and you should not be surprised if all your printing has impact on your systems performance or timing. But how do we debug if printk() is a no-no? The answer is trace_printk().

This function write (almost) directly to a trace buffer and is therefore a fairly fast operation. The trace buffer is exposed from tracefs, usually mounted at /sys/kernel/tracing.

As a bonus, the messages is merged with other output from ftrace when doing a function trace.

Other things that is good to know about __log_buf

__log_buf

The kernel log buffer is exported as a global symbol called __log_buf. If you have an systems that deadlocks without any output on the console and you may reboot the system without resetting RAM, then you may print the content of __log_buf from the bootloader.

Determine the physical address of __log_buf

[09:59:31]marcus@little:~/git/linux$ grep __log_buf System.map
c14cfba8 b __log_buf

The 0xc14cfba8 is the virtual address of __log_buf. This kernel is compiled for a 32bit ARM with the CONFIG_VMSPLIT_3G set, so the kernel virtual address space start at 0xc0000000. To get the physical address out of the virtual, subtract the offset (0xc14cfba8 - 0xc0000000) and you will end up with 0x014cfba8. Dump this address from your bootloader and you will see your kernel log.

sizeof(__log_buf)

The size of __log_buf is set at compile-time with CONFIG_LOG_BUF_SHIFT. The value defines the size as a power of 2 and is usually set to 16 (64K).

There is also a CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT that is the per-CPU buffer where messages printed from unsafe context are temporary stored. Examples on unsafe context would be NMI and printk recursions. The messages are copied to the main log buffer in a safe context to avoid a deadlock.

This buffer is rarely used but has to be there to avoid the nasty deadlocks. The CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT was introduced in v4.11 and is also expressed as a power of 2.