V4L2 and media controller

The media infrastructure in the kernel is a giant beast handling many different types of devices involving different buses and electrical interfaces. Providing an interface to handle the complexity of the hardware is not an easy task. Most devices have multiple ICs with different communication protocols, so the device drivers tend to be very complex as well.

Video For Linux 2 (V4L2) is the interface for such media devices. V4L2 is the second version of V4L and is not really compatible with it; there is a compatibility mode, but its support is more often than not incomplete. The name Video4Linux is a counterpart to Video4Windows, but the two are not technically related at all.

Here is an example of what a system may look like (borrowed from the Linux kernel documentation)


Media controller

System-on-Chip (SoC) devices often provide a wide range of hardware blocks that can be interconnected in a variety of ways to obtain the desired functionality. To configure these hardware blocks, the kernel provides the Media Controller API, which exposes detailed information about the media device and lets its blocks be interconnected in dynamic and complex ways at runtime, all from userspace.

Each hardware block, called an entity in the media controller framework, has one or more source and sink pads. The API lets the user link source pads to sink pads and set the format of the pads.

Here is the topology exported from my sabresd with an imx219 (camera module) connected:


Let's go through the entities in the picture. All these entities are of course specific to the iMX6 SoC. (Partly taken from the kernel documentation.)

imx219 1-0010

This is the camera sensor. The sensor is controlled with I2C commands and the data stream is over the MIPI CSI-2 interface. The name tells us that the sensor is connected to I2C bus 1. The device has the address 0x10.

The entity has one source pad.
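To make the naming convention concrete, here is a small user-space sketch that splits an entity name such as "imx219 1-0010" into its I2C bus number and device address. parse_i2c_entity_name() is a name I made up for this illustration; it is not part of the kernel or of media-ctl, and it assumes the "<name> <bus>-<address>" layout described above with a hexadecimal address.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel or media-ctl API): parse an entity
 * name of the form "<name> <bus>-<addr>", e.g. "imx219 1-0010".
 * Returns 0 on success and -1 if the name does not match that layout.
 */
int parse_i2c_entity_name(const char *name, int *bus, unsigned int *addr)
{
    /* %*s skips the sensor name, %d reads the bus, %x the hex address */
    if (sscanf(name, "%*s %d-%x", bus, addr) != 2)
        return -1;

    return 0;
}
```

Calling it with "imx219 1-0010" should give bus 1 and address 0x10, which matches what the name tells us.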


imx6-mipi-csi2

This is the MIPI CSI-2 receiver entity. It has one sink pad to receive the MIPI CSI-2 stream (usually from a MIPI CSI-2 camera sensor). It has four source pads, corresponding to the four MIPI CSI-2 demuxed virtual channel outputs. Multiple source pads can be enabled to independently stream from multiple virtual channels.


ipuX_csiY_mux

These are the video multiplexers. They have two or more sink pads to select from either camera sensors with a parallel interface, or from MIPI CSI-2 virtual channels from the imx6-mipi-csi2 entity. They have a single source pad that routes to a CSI (ipuX_csiY entities).


ipuX_csiY

These are the CSI entities. They have a single sink pad receiving from either a video mux or from a MIPI CSI-2 virtual channel as described above.


ipuX_vdic

The VDIC carries out motion compensated de-interlacing, with three motion compensation modes: low, medium, and high motion. The mode is specified with the menu control V4L2_CID_DEINTERLACING_MODE. The VDIC has two sink pads and a single source pad.


ipuX_ic_prp

This is the IC pre-processing entity. It acts as a router, routing data from its sink pad to one or both of its source pads.

The direct sink pad receives from an ipuX_csiY direct pad. With this link the VDIC can only operate in high motion mode.


ipuX_ic_prpenc

This is the IC pre-processing encode entity. It has a single sink pad from ipuX_ic_prp, and a single source pad. The source pad is routed to a capture device node, with a node name of the format "ipuX_ic_prpenc capture".

This entity performs the IC pre-process encode task operations: color-space conversion, resizing (downscaling and upscaling), horizontal and vertical flip, and 90/270 degree rotation. Flip and rotation are provided via standard V4L2 controls.

Like the ipuX_csiY IDMAC source, this entity also supports simple de-interlace without motion compensation, and pixel reordering.


ipuX_ic_prpvf

This is the IC pre-processing viewfinder entity. It has a single sink pad from ipuX_ic_prp, and a single source pad. The source pad is routed to a capture device node, with a node name of the format "ipuX_ic_prpvf capture".

This entity is identical in operation to ipuX_ic_prpenc, with the same resizing and CSC operations and flip/rotation controls. It will receive and process de-interlaced frames from the ipuX_vdic if ipuX_ic_prp is receiving from ipuX_vdic.

Capture video stream from sensor

In order to capture a video stream from the sensor we need to:

  1. Create links between the needed entities
  2. Configure pads to hold the correct image format

To do this, we use the media-ctl [1] tool.

Configure pads

We also need to configure each pad to the right format. This image sensor outputs a raw Bayer format (SRGGB8).

export fmt=SRGGB8_1X8/640x480
media-ctl --set-v4l2 "'imx219 1-0010':0[fmt:$fmt field:none]"
media-ctl --set-v4l2 "'imx6-mipi-csi2':1[fmt:$fmt field:none]"
media-ctl --set-v4l2 "'ipu1_csi0_mux':5[fmt:$fmt field:none]"
media-ctl --set-v4l2 "'ipu1_csi0':1[fmt:$fmt field:none]"

Stream to framebuffer

Now a full pipe is created from imx219 to the video0 device.

GStreamer is a handy multimedia framework that we can use to test the full chain:

gst-launch-1.0 -vvv v4l2src device=/dev/video0 io-mode=dmabuf blocksize=76800 ! "video/x-bayer,format=rggb,width=640,height=480,framerate=30/1" ! queue ! bayer2rgbneon ! videoconvert ! fbdevsink sync=false

ath10k QCA6584 and Wireless network stack

ath10k is the mac80211 wireless driver for the Qualcomm Atheros QCA988x family of chips, and I'm currently working [1] with the QCA6584 chip, which is an automotive-grade radio chip with PHY support for the abgn+ac modes. The connection interface to the chip is SDIO, which is barely supported for now, but my friend and kernel hacker, Erik Strömdahl [2], has got his hands dirty and is currently working on it. There has been some progress: the chip is now able to scan, connect, send and receive data. There are still some issues with the link speed, but that is coming.

He is also the reason why I got interested in the network part of the kernel, which is quite... big.

Even the wireless networking subsystem alone is quite big, and the first thing you meet when you start to dig is a bunch of terms thrown in your face. I will try to briefly describe a few of these terms that are fundamental for wireless communication.

In this post we will discuss the right side of this figure:


IEEE 802.11

We will see 802.11 many times, so the first thing is to know where these numbers come from. IEEE 802.11 is a set of specifications for implementation of wireless networking over several frequency bands. The specifications cover layer 1 (Physical) and layer 2 (Data link) of the OSI model [3].

The Linux kernel MAC subsystem (mac80211) registers IEEE 802.11 compliant hardware devices with

int ieee80211_register_hw(struct ieee80211_hw *hw)

found in .../net/mac80211/main.c

The Management Layer (MLME)

One more thing that we need to cover is the management layer, since all other layers somehow depend on it.

There are three components in the 802.11 management architecture:

  - The Physical Layer Management Entity (PLME)
  - The System Management Entity (SME)
  - The MAC Layer Management Entity (MLME)

The management layer assists you in several ways. For instance, it handles things such as scanning, authentication, beacons, association and much more.


Scanning

Scanning is simply looking for other 802.11 compliant devices in the air. There are two types of scanning: passive and active.

Passive scanning

When performing a passive scan, the radio listens passively for beacons, without transmitting any packets, as it moves from channel to channel, and records all devices that it receives beacons from. Higher frequency bands in the IEEE 802.11a standard do not allow you to transmit anything unless you have heard an Access Point (AP) beacon. On those bands, passive scanning is therefore the only way to be aware of the surroundings.

Active scanning

Active scanning, on the other hand, transmits Probe Request (IEEE80211_STYPE_PROBE_REQ) management packets. This type of scanning also walks from channel to channel, sending a probe request management packet on each channel.

These requests are handled by ieee80211_send_probe_req() in .../net/mac80211/util.c:

void ieee80211_send_probe_req(struct ieee80211_sub_if_data *sdata,
                  const u8 *src, const u8 *dst,
                  const u8 *ssid, size_t ssid_len,
                  const u8 *ie, size_t ie_len,
                  u32 ratemask, bool directed, u32 tx_flags,
                  struct ieee80211_channel *channel, bool scan)


Authentication

The authentication procedure sends a management frame of the authentication type (IEEE80211_STYPE_AUTH). There is not just one type of authentication but plenty of them. The IEEE 802.11 specification only specifies one mandatory authentication type: Open System authentication (WLAN_AUTH_OPEN). Another common authentication type is Shared Key authentication (WLAN_AUTH_SHARED_KEY).

These management frames are handled by ieee80211_send_auth() in .../net/mac80211/util.c:

void ieee80211_send_auth(struct ieee80211_sub_if_data *sdata,
             u16 transaction, u16 auth_alg, u16 status,
             const u8 *extra, size_t extra_len, const u8 *da,
             const u8 *bssid, const u8 *key, u8 key_len, u8 key_idx,
             u32 tx_flags)

Open system authentication

This is the simplest type of authentication: all clients that request authentication will be authenticated. No security is involved at all.

Shared key authentication

In this type of authentication the client and the AP use a shared key, also known as a Wired Equivalent Privacy (WEP) key.


Association

The association is started when the station sends a management frame of the type IEEE80211_STYPE_ASSOC_REQ. In the kernel code this is handled by ieee80211_send_assoc() in .../net/mac80211/mlme.c:

static void ieee80211_send_assoc(struct ieee80211_sub_if_data *sdata)


Reassociation

When the station is roaming, i.e. moving between APs within an ESS (Extended Service Set), it sends a reassociation request of the type IEEE80211_STYPE_REASSOC_REQ to the new AP. Association and reassociation have so much in common that both are handled by ieee80211_send_assoc().
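The management frames mentioned in this section are all told apart by the type and subtype bits of the 802.11 frame control field. Here is a small user-space sketch of that classification. The constant values are written down as I remember them from include/linux/ieee80211.h, so double-check them against your kernel tree; enum mlme_step and classify_mgmt_frame() are names invented for this example and do not exist in mac80211.

```c
#include <stdint.h>

/* Frame control masks and values (host byte order), assumed to match
 * include/linux/ieee80211.h; verify against your own kernel tree. */
#define IEEE80211_FCTL_FTYPE        0x000c
#define IEEE80211_FCTL_STYPE        0x00f0
#define IEEE80211_FTYPE_MGMT        0x0000
#define IEEE80211_STYPE_ASSOC_REQ   0x0000
#define IEEE80211_STYPE_REASSOC_REQ 0x0020
#define IEEE80211_STYPE_PROBE_REQ   0x0040
#define IEEE80211_STYPE_AUTH        0x00B0

/* Hypothetical labels for the MLME steps covered in this post */
enum mlme_step { MLME_OTHER, MLME_SCAN, MLME_AUTH, MLME_ASSOC, MLME_REASSOC };

/* Map a frame control value to the MLME step it belongs to */
enum mlme_step classify_mgmt_frame(uint16_t frame_control)
{
    /* Only management frames are of interest here */
    if ((frame_control & IEEE80211_FCTL_FTYPE) != IEEE80211_FTYPE_MGMT)
        return MLME_OTHER;

    switch (frame_control & IEEE80211_FCTL_STYPE) {
    case IEEE80211_STYPE_PROBE_REQ:
        return MLME_SCAN;
    case IEEE80211_STYPE_AUTH:
        return MLME_AUTH;
    case IEEE80211_STYPE_ASSOC_REQ:
        return MLME_ASSOC;
    case IEEE80211_STYPE_REASSOC_REQ:
        return MLME_REASSOC;
    default:
        return MLME_OTHER;
    }
}
```

A frame control value of 0x0040 would be classified as part of an active scan, while a data frame (type bits 0x0008) falls through to MLME_OTHER.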

MAC (Medium Access Control)

All IEEE 802.11 devices need to implement the management layer (MLME), but the implementation could be in device hardware or in software. These devices are divided into Full MAC devices (hardware implementation) and Soft MAC devices (software implementation). Most devices today are Soft MAC devices.

The MAC layer can be further broken down into two pieces: Upper MAC and Lower MAC. The upper part of the MAC handles the management aspects (all that we covered in the MLME section above), and the lower part handles the time-critical operations such as ACKing received packets.

Linux only handles the upper part of the MAC; the lower part is operated in device hardware. What we can see in the figure is that the MAC layer separates data packets from configuration/management packets. The data packets are forwarded to the network device and travel the same path through the network layer as data packets from all other types of network devices.

The Linux wireless subsystem consists of two major parts, where this, mac80211, is one of them. cfg80211 is the other major part.


cfg80211

cfg80211 is a configuration management service for IEEE 802.11 compliant devices. Both Full MAC and Soft MAC devices need to implement operations to be compatible with the cfg80211 configuration interface in order to let userspace applications configure the device.

The configuration may be done with one of two interfaces, wext and nl80211.

Wireless Extension, WEXT (Legacy)

This is the legacy and ugly way to configure wireless devices. It is still supported only for backward compatibility reasons. Users of this configuration interface are wireless-tools (iwconfig, iwlist).


nl80211

nl80211, on the other hand, is a new netlink interface intended to replace the Wireless Extension (wext) interface. Users of this interface are typically iw and wpa_supplicant.


The whole network stack of the Linux kernel is really complex and optimized for high throughput with low latency. In this post we only covered what the support for wireless devices has added to the stack, which is mainly the mac80211 layer that handles all device management, and the cfg80211 layer that configures the MAC layer. Packets to wireless devices are divided into data packets and configuration/management packets. The data packets follow the same path as for all other network devices, and the management packets go to the cfg80211 layer.

Linux driver for PhoenixRC adapter

Update: Michael Larabel on Phoronix has written a post [3] about this driver. Go ahead and read it as well!

A few years ago I used to build multirotors, mostly quadcopters and tricopters. It is a fun hobby; both building and flying are incredibly satisfying. The first multirotors I built were nicely made with CNC-cut parts. They looked really nice and robust. However, with more than 100 crashes under my belt, the last ones were made out of sticks and a food box. Easy to repair and just as fun to fly.

This hobby requires practice, and even if the most fun way to practice is by flying, it is really time consuming. A more time-efficient way to practice is with a simulator, so I bought PhoenixRC [1], which is a flight simulator. It comes with a USB adapter that lets you connect and fly with your own RC controller. I did not run the simulator much, though. PhoenixRC is Windows software and there was no Linux driver for the adapter. The only instance of Windows I had was on a separate disk that lay on the shelf, and switching disks in your laptop each time you want to fly is simply not going to happen.

This New Year's Eve (2017), my wife became ill and I got some time of my own. Far down in a box I found the adapter and plugged it into my Linux computer. Still no driver.

Reverse engineering the adapter

The reverse engineering was quite simple. It turns out that the adapter only has 1 configuration, 1 interface and 1 endpoint of the in-interrupt type. This simply means that it only has a unidirectional communication path, initiated by sending an interrupt URB (USB Request Block) to the device. If you are not familiar with what configurations, interfaces and endpoints are in terms of USB, please google the USB standard specification.

The data from our URB was only 8 bytes. After some testing with my RC controller I got the following mapping between data and the channels on the controller:

data[0] = channel 1
data[1] = ? (Possibly a switch)
data[2] = channel 2
data[3] = channel 3
data[4] = channel 4
data[5] = channel 5
data[6] = channel 6
data[7] = channel 7
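The mapping above can be sketched as a small user-space decoder. This is not the code of the actual driver; struct pxrc_report and pxrc_decode() are hypothetical names for this illustration, and the byte at data[1] is simply carried along as "unknown" since I never figured out what it is.

```c
#include <stddef.h>
#include <stdint.h>

#define PXRC_REPORT_LEN   8   /* bytes delivered by one interrupt URB */
#define PXRC_NUM_CHANNELS 7

/* Hypothetical container for one decoded report */
struct pxrc_report {
    uint8_t channel[PXRC_NUM_CHANNELS]; /* channel[0] is channel 1 */
    uint8_t unknown;                    /* data[1], possibly a switch */
};

/* Decode one 8-byte report; returns 0 on success, -1 on wrong length */
int pxrc_decode(const uint8_t *data, size_t len, struct pxrc_report *out)
{
    size_t i;

    if (len != PXRC_REPORT_LEN)
        return -1;

    out->channel[0] = data[0];
    out->unknown = data[1];

    /* Channels 2..7 are found in data[2]..data[7] */
    for (i = 1; i < PXRC_NUM_CHANNELS; i++)
        out->channel[i] = data[i + 1];

    return 0;
}
```

Keeping the unknown byte separate means the channel array stays contiguous, which makes it easy to loop over when reporting the values further up.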

So I created a device driver that registered an input device with the following events:

Channel Event

Using a simulator

Heli-X [2] is an excellent cross-platform flight simulator that runs perfectly on Linux. I have now spent several hours with a Goblin 700 helicopter and it is just as fun as I remembered.


Available in Linux kernel 4.17

Of course all code is submitted to the Linux kernel and should be merged in v4.17.

get_maintainers and git send-email

Like many others, I prefer email as a communication channel, especially for patches. GitHub, Gerrit and all the other "nice" and "user-friendly" tools that try to "help" you manage your submissions simply do not fit my workflow.

As you may already know, all patches to the Linux kernel are submitted by email. scripts/get_maintainer.pl (see [1] for more info about the process) is a handy tool that takes a patch as input and gives back a bunch of email addresses. These email addresses are usually passed to git send-email [2] for submission.

I have used various scripts to make the output from get_maintainer.pl fit git send-email, but was not completely satisfied until I found the --to-cmd and --cc-cmd parameters to git send-email:

--to-cmd=<command>
    Specify a command to execute once per patch file which should generate patch file specific "To:" entries. Output of this command must be single email address per line. Default is the value of sendemail.tocmd configuration value.

--cc-cmd=<command>
    Specify a command to execute once per patch file which should generate patch file specific "Cc:" entries. Output of this command must be single email address per line. Default is the value of sendemail.ccCmd configuration value.

I'm very pleased with these parameters. All I have to do is to put these extra lines into my ~/.gitconfig (or use git config):

[sendemail.linux]
    tocmd = "`pwd`/scripts/get_maintainer.pl --nogit --nogit-fallback --norolestats --nol"
    cccmd = "`pwd`/scripts/get_maintainer.pl --nogit --nogit-fallback --norolestats --nom"

To submit a patch, I just type:

git send-email --identity=linux ./0001-my-fancy-patch.patch

and let --to and --cc be populated automatically.



When the system is running out of memory, the Out-Of-Memory (OOM) killer picks a process to kill based on the current memory footprint. In case of OOM, we will calculate a badness score between 0 (never kill) and 1000 for each process in the system. The process with the highest score will be killed. A score of 0 is reserved for unkillable tasks such as the global init process (see [1]) or kernel threads (processes with PF_KTHREAD flag set).


The current score of a given process is exposed in procfs, see /proc/[pid]/oom_score, and may be adjusted by setting /proc/[pid]/oom_score_adj. The value of oom_score_adj is added to the score before it is used to determine which task to kill. The value may be set between OOM_SCORE_ADJ_MIN (-1000) and OOM_SCORE_ADJ_MAX (+1000). This is useful if you want to guarantee that a process is never selected by the OOM killer.

The calculation is simple (nowadays): if a task is using all of its allowed memory, the badness score will be 1000. If it is using half of its allowed memory, the badness score will be 500, and so on. By setting oom_score_adj to -1000, the badness score sums up to <=0 and the task will never be killed by the OOM killer.

There is one more thing that affects the calculation: if the process is running with the capability CAP_SYS_ADMIN, it gets a 3% discount, but that is about it.

The old implementation

Before v2.6.36, the calculation of the badness score tried to be smarter. Besides looking at the total memory usage (task->mm->total_vm), it also considered:

  - Whether the process creates a lot of children
  - Whether the process has been running for a long time, or has used a lot of CPU time
  - Whether the process has a low nice value
  - Whether the process is privileged (CAP_SYS_ADMIN or CAP_SYS_RESOURCE set)
  - Whether the process is making direct hardware access

At first glance, all these criteria look valid, but if you think about it a bit, there are a lot of pitfalls here which make the selection not so fair. For example, a process that creates a lot of children and consumes some memory could be a leaky webserver. Another process that fits the description is the session manager of your desktop environment, which naturally creates a lot of child processes.

The new implementation

This heuristic selection has evolved over time. Instead of looking at mm->total_vm for each task, the task's RSS (resident set size, [2]) and swap space are used. RSS and swap space give a better indication of the amount of memory that we will be able to free if we choose this task. The drawback of using mm->total_vm is that it includes overcommitted memory (see [3] for more information), which is pages that the process has claimed but that have not been physically allocated.

The process is now only counted as privileged if CAP_SYS_ADMIN is set, not CAP_SYS_RESOURCE as before.

The code

The whole implementation of OOM killer is located in mm/oom_kill.c. The function oom_badness() will be called for each task in the system and returns the calculated badness score.

Let's go through the function.

unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
              const nodemask_t *nodemask, unsigned long totalpages)
    long points;
    long adj;

    if (oom_unkillable_task(p, memcg, nodemask))
        return 0;

Looking for unkillable tasks such as the global init process.

p = find_lock_task_mm(p);
if (!p)
    return 0;

adj = (long)p->signal->oom_score_adj;
if (adj == OOM_SCORE_ADJ_MIN ||
        test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
        in_vfork(p)) {
    /* Drop the lock taken by find_lock_task_mm() before bailing out */
    task_unlock(p);
    return 0;
}

If /proc/[pid]/oom_score_adj is set to OOM_SCORE_ADJ_MIN (-1000), do not even consider this task.

points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
    atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);

Calculate a score based on RSS, pagetables and used swap space

if (has_capability_noaudit(p, CAP_SYS_ADMIN))
    points -= (points * 3) / 100;

If it is a root process, give it a 3% discount. We are not mean people after all.

adj *= totalpages / 1000;
points += adj;

Normalize and add the oom_score_adj value

return points > 0 ? points : 1;

Finally, never return 0 for an eligible task, as it is reserved for non-killable tasks.
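To play with the numbers, the walk-through above can be modelled in user space. This is only a sketch under a few assumptions: the memory counters are passed in as plain page counts instead of being read from the mm, model_oom_badness() and model_oom_score() are names made up for this post, and the final scaling to the 0..1000 range (points * 1000 / totalpages) is how I understand the value shown in /proc/[pid]/oom_score to be derived.

```c
#include <stdbool.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/*
 * User-space model of the page-based badness calculation.
 * All sizes are given in pages.
 */
long model_oom_badness(long rss_pages, long swap_pages, long pgtable_pages,
                       bool has_cap_sys_admin, long oom_score_adj,
                       long totalpages)
{
    long points;
    long adj = oom_score_adj;

    /* A task with the minimum adjustment is never considered */
    if (adj == OOM_SCORE_ADJ_MIN)
        return 0;

    /* Base score: RSS, swap and page tables */
    points = rss_pages + swap_pages + pgtable_pages;

    /* 3% discount for CAP_SYS_ADMIN */
    if (has_cap_sys_admin)
        points -= (points * 3) / 100;

    /* Normalize oom_score_adj to pages and add it */
    adj *= totalpages / 1000;
    points += adj;

    /* 0 is reserved for unkillable tasks */
    return points > 0 ? points : 1;
}

/* Assumed scaling of the page-based badness to the 0..1000 range */
long model_oom_score(long badness_pages, long totalpages)
{
    return badness_pages * 1000 / totalpages;
}
```

With 1000000 allowed pages, a task that uses 500000 pages ends up with a score of 500, and the same task with CAP_SYS_ADMIN ends up with 485, which matches the "half of the memory gives 500" rule of thumb and the 3% discount.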



The OOM logic is quite straightforward and seems to have been stable for a long time (v2.6.36 was released in October 2010). The reason why I was looking at the code was that the behavior I saw when experimenting did not correspond to what was written in the man page for oom_score. It turned out that the man page was not updated when the new calculation was introduced back in 2010.

I have updated the man page and the change is available in v4.14 of the Linux man-pages project [4].

commit 5753354a3af20c8b361ec3d53caf68f7217edf48
Author: Marcus Folkesson <marcus.folkesson@gmail.com>
Date:   Fri Nov 17 13:09:44 2017 +0100

    proc.5: Update description of /proc/<pid>/oom_score

    After Linux 2.6.36, the heuristic calculation of oom_score
    has changed to only consider used memory and CAP_SYS_ADMIN.

    See kernel commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10.

    Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>
    Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>

diff --git a/man5/proc.5 b/man5/proc.5
index 82d4a0646..4e44b8fba 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -1395,7 +1395,9 @@ Since Linux 2.6.36, use of this file is deprecated in favor of
 .IR /proc/[pid]/oom_score_adj .
 .IR /proc/[pid]/oom_score " (since Linux 2.6.11)"
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
 This file displays the current score that the kernel gives to
 this process for the purpose of selecting a process
 for the OOM-killer.
@@ -1403,7 +1405,16 @@ A higher score means that the process is more likely to be
 selected by the OOM-killer.
 The basis for this score is the amount of memory used by the process,
 with increases (+) or decreases (\-) for factors including:
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
+.IP * 2
+whether the process is privileged (\-);
+.\" More precisely, if it has CAP_SYS_ADMIN or (pre 2.6.36) CAP_SYS_RESOURCE
+Before kernel 2.6.36 the following factors were also used in the calculation of oom_score:
 .IP * 2
 whether the process creates a lot of children using
@@ -1413,10 +1424,7 @@ whether the process creates a lot of children using
 whether the process has been running a long time,
 or has used a lot of CPU time (\-);
 .IP *
-whether the process has a low nice value (i.e., > 0) (+);
-.IP *
-whether the process is privileged (\-); and
-.\" More precisely, if it has CAP_SYS_ADMIN or CAP_SYS_RESOURCE
+whether the process has a low nice value (i.e., > 0) (+); and
 .IP *
 whether the process is making direct hardware access (\-).
 .\" More precisely, if it has CAP_SYS_RAWIO



So, a week in Prague has come to its end. The Embedded Linux Conference Europe was this year co-located with Open Source Summit and offered a lot of interesting talks on various topics.

One of the hottest topics this year was our most beloved debugging function, printk(). What is so hard about printing? It turns out that printk() is quite deadlock-prone, and that is not an easy thing to work around in the current infrastructure of the kernel.

A common misconception is that printk() is a fast operation that simply writes the message to the global __log_buf variable. It is not.

A printk() may involve many different subsystems, different contexts or nesting, just to mention a few parts that need to be handled. For example:

  1. The output needs to go over some output medium (consoles)
     * The monitor
     * Frame buffers
     * UART / Serial console
     * Network console
     * Braille
     * ...
  2. It uses different locking mechanisms
     * The console_lock (described below)
     * The logbuf_lock spinlock
     * Consoles often have their own locks
  3. It wakes up waiting applications
     * syslogd
     * journald
     * ...

Besides that, printk() is expected to work in every context, whether it is process, softirq, IRQ or NMI context. With all these locking mechanisms involved, what happens if a printk() in process context is interrupted by an NMI, and the NMI also calls printk()? In other words, there are a lot of special cases that need to be handled.

How it works


Let's look back at how printing was handled in a prehistoric kernel.

SMP (Symmetric Multi Processing) systems became common in the late 1990s. Before that, everything was easy and everyone was happy. No NMIs. No races between multiple cores. Simple locking. No Facebook. As a response to SMP systems, Linux v2.1.80 introduced a spinlock in printk() to avoid race conditions between multiple cores. The solution was to serialize all prints to the console. If two CPUs called printk() at the same time, the second core had to wait for the first core to finish.

This does not scale well. In fact, it does not scale at all. What about a modern system with 100+ CPUs that all call printk() at the same time? Depending on the console, the printing may take milliseconds and you will surely end up with an unresponsive system.


Now we are doing things differently. The first core that grabs the console_lock is responsible for printing all messages in __log_buf. If another core calls printk() in the meantime, it puts its data into __log_buf, tries to grab the lock, which is busy, and then simply returns. As __log_buf keeps getting new data, the unlucky core that grabbed the console_lock may end up doing nothing but printing.

The good thing is that we only lock up a single core instead of all cores. The bad thing is that we lock up a single core.

The code


printk() is defined in kernel/printk/printk.c and does not look like much:

asmlinkage __visible int printk(const char *fmt, ...)
{
    va_list args;
    int r;

    va_start(args, fmt);
    r = vprintk_func(fmt, args);
    va_end(args);

    return r;
}

It simply calls vprintk_func() with its own arguments.


vprintk_func() is a function that forwards the arguments to different print functions depending on the current context:

__printf(1, 0) int vprintk_func(const char *fmt, va_list args)
{
    /* NMI context: use the per-CPU NMI buffer */
    if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK)
        return vprintk_nmi(fmt, args);

    /* "Recursive" context: use the per-CPU safe buffer */
    if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
        return vprintk_safe(fmt, args);

    /* Deferred context: log now, let the consoles print later */
    if (this_cpu_read(printk_context) & PRINTK_NMI_DEFERRED_CONTEXT_MASK)
        return vprintk_deferred(fmt, args);

    /* Normal context */
    return vprintk_default(fmt, args);
}

The different contexts we consider are:

Normal context

If we are in normal context, there is nothing special to consider; we go for vprintk_default() and just do our thing.

NMI context

If the CPU supports NMIs (Non-Maskable Interrupts; look for CONFIG_HAVE_NMI and CONFIG_PRINTK_NMI in your .config), we go for vprintk_nmi(). vprintk_nmi() does a safe copy to a per-CPU buffer, not the global __log_buf. Since NMIs by nature are not nested, there is only ever one writer running. However, an NMI is local to one CPU, and the buffer might get flushed from another CPU, so we still need to be careful.

"Recursive" context

If the printk() routine is interrupted and we end up in another call to printk from somewhere else, we go for the lock-less vprintk_safe() to prevent a recursion deadlock. vprintk_safe() is using a per-CPU buffer to store the message, just like NMI.

Deferred context

As already said, multiple locks are involved in the call chain of printk(). vprintk_deferred() uses the main logbuf_lock but avoids calling console drivers that might have their own locks. The actual printing is deferred to the klogd_work kernel thread.
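The order in which the contexts are tested matters, so here is a tiny user-space model of the dispatch. The MODEL_* bit values are only illustrative (the real masks are defined in kernel/printk/internal.h), and enum printk_path and model_vprintk_func() are names invented for this sketch.

```c
/* Illustrative bit layout; the real masks live in kernel/printk/internal.h */
#define MODEL_SAFE_CONTEXT_MASK         0x3fffffffu
#define MODEL_NMI_DEFERRED_CONTEXT_MASK 0x40000000u
#define MODEL_NMI_CONTEXT_MASK          0x80000000u

/* Hypothetical labels for the four print paths */
enum printk_path { PATH_DEFAULT, PATH_NMI, PATH_SAFE, PATH_DEFERRED };

/* Pick the print path for a given per-CPU context value */
enum printk_path model_vprintk_func(unsigned int printk_context)
{
    /* NMI wins over everything else */
    if (printk_context & MODEL_NMI_CONTEXT_MASK)
        return PATH_NMI;

    /* Any bit in the safe mask means we are inside a printk-safe section */
    if (printk_context & MODEL_SAFE_CONTEXT_MASK)
        return PATH_SAFE;

    if (printk_context & MODEL_NMI_DEFERRED_CONTEXT_MASK)
        return PATH_DEFERRED;

    return PATH_DEFAULT;
}
```

Note that a context value with both the NMI bit and a safe bit set still ends up on the NMI path, since that test comes first.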


vprintk_emit() is responsible for writing to __log_buf (it is not the only function that does so; cont_flush() also writes to __log_buf) and for printing the content to all consoles.

asmlinkage int vprintk_emit(int facility, int level,
                const char *dict, size_t dictlen,
                const char *fmt, va_list args)
{
    <<<<< Strip kernel syslog prefix >>>>>

    <<<<< log_output() does the actual printing to __log_buf >>>>>
    printed_len = log_output(facility, level, lflags, dict, dictlen, text, text_len);

    if (!in_sched) {
        /*
         * Try to acquire and then immediately release the console
         * semaphore.  The release will print out buffers and wake up
         * /dev/kmsg and syslog() users.
         */
        if (console_trylock())
            console_unlock();
    }

    return printed_len;
}

The function is quite straightforward. The only thing that looks a little bit strange is

if (console_trylock())
    console_unlock();

Really? Grab the console_lock and immediately unlock it? The thing is that all the magic happens in console_unlock().


The CPU that grabs the console_lock is responsible for printing to all registered consoles until all new data in __log_buf is printed. This is regardless of whether other CPUs keep filling the buffer with new data.

In the worst case, this CPU is doing nothing but printing and will never leave this function.

void console_unlock(void)
{
    <<<<< Endless loop? >>>>>
    for (;;) {
        <<<<< Go through all new messages >>>>>

        <<<<< Print to all consoles >>>>>
        call_console_drivers(ext_text, ext_len, text, len);
    }

    <<<<< Release the exclusive_console once it is used >>>>>
    console_locked = 0;

    <<<<< Wake up klogd >>>>>
    if (wake_klogd)
        wake_up_klogd();
}

The function loops until all new messages are printed. For each new message, a call to call_console_drivers() is made. The last thing we do is to wake up the klogd kernel thread, which signals all userspace applications that are waiting on klogctl(2).


call_console_drivers() asks all registered consoles to print out a message. The console_lock must be held when calling this function.

static void call_console_drivers(const char *ext_text, size_t ext_len,
                 const char *text, size_t len)
{
    struct console *con;

    trace_console_rcuidle(text, len);

    if (!console_drivers)
        return;

    for_each_console(con) {
        /* Skip consoles that cannot or should not print right now */
        if (exclusive_console && con != exclusive_console)
            continue;
        if (!(con->flags & CON_ENABLED))
            continue;
        if (!con->write)
            continue;
        if (!cpu_online(smp_processor_id()) &&
            !(con->flags & CON_ANYTIME))
            continue;
        if (con->flags & CON_EXTENDED)
            con->write(con, ext_text, ext_len);
        else
            con->write(con, text, len);
    }
}


As we see, there is a lot of logic involved in a simple call to printk(), and you should not be surprised if all your printing has an impact on your system's performance or timing. But how do we debug if printk() is a no-no? The answer is trace_printk().

This function writes (almost) directly to a trace buffer and is therefore a fairly fast operation. The trace buffer is exposed through tracefs, usually mounted at /sys/kernel/tracing.

As a bonus, the messages are merged with other output from ftrace when doing a function trace.

Other things that are good to know about __log_buf


The kernel log buffer is exported as a global symbol called __log_buf. If you have a system that deadlocks without any output on the console, and you can reboot the system without resetting RAM, then you can print the content of __log_buf from the bootloader.

Determine the physical address of __log_buf

[09:59:31]marcus@little:~/git/linux$ grep __log_buf System.map
c14cfba8 b __log_buf

The 0xc14cfba8 is the virtual address of __log_buf. This kernel is compiled for 32-bit ARM with CONFIG_VMSPLIT_3G set, so the kernel virtual address space starts at 0xc0000000. To get the physical address from the virtual one, subtract the offset (0xc14cfba8 - 0xc0000000) and you will end up with 0x014cfba8. Dump this address from your bootloader and you will see your kernel log.
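
The subtraction can be sketched in a few lines of C. This is only an illustration: lowmem_virt_to_phys() and its phys_offset parameter are names I made up, not a kernel API, and the sketch assumes the address lies in the linearly mapped part of the kernel address space. The calculation in the text corresponds to RAM starting at physical address 0; on a board where RAM starts somewhere else, that start address has to be added.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): translate a kernel virtual
 * address in the linear (lowmem) mapping to a physical address.
 *
 * page_offset: start of the kernel virtual address space
 *              (0xc0000000 with CONFIG_VMSPLIT_3G on 32-bit ARM)
 * phys_offset: physical address where RAM starts (0 in the text above)
 */
static uint32_t lowmem_virt_to_phys(uint32_t vaddr, uint32_t page_offset,
                                    uint32_t phys_offset)
{
    /* Only addresses at or above page_offset belong to the kernel mapping. */
    assert(vaddr >= page_offset);
    return vaddr - page_offset + phys_offset;
}
```

Calling lowmem_virt_to_phys(0xc14cfba8, 0xc0000000, 0) gives the 0x014cfba8 from the text.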


The size of __log_buf is set at compile-time with CONFIG_LOG_BUF_SHIFT. The value defines the size as a power of 2 and is usually set to 16 (64K).
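
To make the "power of 2" wording concrete, here is a tiny sketch of how a shift value turns into a buffer size. The function name log_buf_len() is my own and only mirrors the arithmetic; it is not how the kernel spells it.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a buffer whose size is configured as a power of two. */
static size_t log_buf_len(unsigned int shift)
{
    /* Keep the shift inside the width of size_t. */
    assert(shift < sizeof(size_t) * 8);
    return (size_t)1 << shift;
}
```

A shift of 16 gives 65536 bytes (64K), and every increment of the shift doubles the buffer.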

There is also CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT, which sets the size of the per-CPU buffer where messages printed from unsafe contexts are temporarily stored. Examples of unsafe contexts are NMI and printk recursion. The messages are copied to the main log buffer in a safe context to avoid a deadlock.

This buffer is rarely used but has to be there to avoid the nasty deadlocks. The CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT was introduced in v4.11 and is also expressed as a power of 2.

Memory management in the kernel


Memory management is among the most complex parts of the Linux kernel. There are so many critical parts, such as the page allocator, slab allocator, virtual memory handling, memory mapping, MMU, IOMMU and so on. All these parts have to work perfectly (or at least almost perfectly :-) ) because every system uses them whether it wants to or not. If there is a bug or a performance issue, you will notice it quite soon.

My goal is to produce a few posts on the topic and try to sort out the different parts and describe how they work and how they are connected. I will begin from the physical bottom and work my way up to how userspace allocates memory in their little blue world with pink clouds. (Everything is so easy on the user side)

struct page

A page is the smallest unit that matters in terms of virtual memory. This is because the MMU (Memory Management Unit, described in an upcoming post) only deals with pages. A typical size for a page is 4KB, at least for 32-bit architectures. Many 64-bit architectures, such as x86-64, also use 4KB pages, while some (Alpha and SPARC64, for example) use 8KB pages.

Every one of those physical pages is represented by a struct page that is defined in include/linux/mm_types.h. That is a lot of pages. Let us do a simple calculation: we have a 32-bit system with 512MB of physical memory, and this memory is divided into 131,072 4KB pages. Keep in mind that 512MB is not even that much memory on a modern system.

What I want to say is that this struct page should be kept as small as possible because it scales up a lot when physical memory increases.
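
The calculation above, and what it means for the size of struct page, can be sketched like this. The helper names are invented for the example, and the 32 bytes used below for sizeof(struct page) is only an assumed round number; the real size depends on the kernel version and configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of page frames needed to cover mem_bytes of physical memory. */
static uint64_t nr_page_frames(uint64_t mem_bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return mem_bytes / page_size;
}

/* Memory consumed by one struct page per frame. */
static uint64_t page_array_bytes(uint64_t nr_frames, uint64_t sizeof_struct_page)
{
    return nr_frames * sizeof_struct_page;
}
```

With 512MB of RAM and 4KB pages we get 131,072 frames. At an assumed 32 bytes per struct page that is 4MB of housekeeping, and every extra byte in the structure costs another 128KB.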

OK, so there is a struct page somewhere that got allocated for each physical page, which is a lot, but what does it do? It does a lot of housekeeping. Let's look at a few of the members that I think are the most interesting:

struct page {    /* simplified extract */
    unsigned long flags;
    union {
        unsigned long private;
#if ALLOC_SPLIT_PTLOCKS
        spinlock_t  *ptl;
#else
        spinlock_t  ptl;
#endif
    };
    void    *virtual;
    atomic_t    _count;
    pgoff_t    index;
};

flags is keeping track of the page status which could be dirty (need to be written to media), locked in memory (not allowed to be paged out), permissions and so on. See enum pageflags in include/linux/page-flags.h for more information.

private is not a defined field. May be used as a long or interpreted as a pointer. (Shared with ptl in a union!)

virtual is the virtual address of the page. In case that the page belongs to the high memory (memory that is not permanently mapped) this field will be NULL and require dynamic mapping.

_count is a simple reference counter to determine when the page is free for allocation.

index is the offset within a mapping.

ptl is an interesting one! I think it requires a special section in this post. (Shared with private in a union!)

Page Table Lock

PTL stands for Page Table Lock and is a per-page lock.

In the next part of these memory management posts I will describe the struct mm_struct, how PGD, PMD and PTE are related, but for now it's enough that you just have heard the words.

OK, there is one thing that is good to know. The struct mm_struct (also defined in mm_types.h) is a structure that represents a process's address space and contains all information related to the process memory. The structure has a pointer to virtual memory areas that refer to one or more struct page. This structure also has the member mm->page_table_lock, a spinlock that protects all page tables of the mm_struct. This was the original approach and is still used by several architectures.

However, this mm->page_table_lock is a little bit clumsy since it locks all page tables at once. This is no real problem on a single-CPU system without SMP, but nowadays that is not a very common scenario.

Instead, the split page table lock was introduced. It uses a separate lock per page table to allow concurrent access to page tables in the same mm_struct. Remember that the mm_struct is per process? That is why this only improves page-fault/page-access performance in multi-threaded applications.

When are split page table locks enabled? They are enabled at compile time if CONFIG_SPLIT_PTLOCK_CPUS (I have never seen any other value than 4 for this one) is less than or equal to NR_CPUS.

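Here are a few defines from the beginning of the mm_types.h header file:

#define USE_SPLIT_PTE_PTLOCKS   (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
#define USE_SPLIT_PMD_PTLOCKS   (USE_SPLIT_PTE_PTLOCKS && \
        IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
#define ALLOC_SPLIT_PTLOCKS     (SPINLOCK_SIZE > BITS_PER_LONG/8)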


ALLOC_SPLIT_PTLOCKS is a little bit clever. If the size of a spinlock is less than or equal to the size of a long, the spinlock is embedded in the struct page and can therefore save a cache line by avoiding indirect access. If a spinlock does not fit into a long, then page->ptl is used as a pointer to a dynamically allocated spinlock. As I said, this is a clever construction since it allows us to increase the size of a spinlock without any problem. Examples of when a spinlock does not fit are when using DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC, or when applying the PREEMPT_RT patchset.

The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in pgtable_pmd_page_ctor() for PMD tables. These functions (and the corresponding free functions) should be called in every place that allocates or frees page tables. This is already done in mainline, but I know there are evil hardware vendors out there that do not. For example, if you use their evil code and apply the PREEMPT_RT patchset (which increases the size of spinlock_t), you have to verify that their code behaves.

Also, pgtable_*page_ctor() can fail, and this must be handled properly.

Remember that page->ptl should never be accessed directly; use the appropriate helper functions for that.

Examples of such helper functions are:

  - pte_offset_map_lock()
  - pte_unmap_unlock()
  - pte_alloc_map_lock()
  - pte_lockptr()
  - pmd_lock()
  - pmd_lockptr()

MMAP memory between kernel and userspace


Letting the kernel allocate memory and letting userspace map it sounds like an easy task, and sure it is. There are just a few things that are good to know about page mapping.

The MMU (Memory Management Unit) uses page tables with entries for mapping between virtual and physical addresses. A page is the smallest unit that the MMU deals with. The size of a page is given by the PAGE_SIZE macro in asm/page.h and is typically 4k for most architectures.

There are a few more useful macros in asm/page.h:

  - PAGE_SHIFT: How many steps we should shift 1 to the left to get PAGE_SIZE
  - PAGE_SIZE: Size of a page, defined as (1 << PAGE_SHIFT)
  - PAGE_ALIGN(len): Rounds the length up to the closest multiple of PAGE_SIZE
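
The relationship between the three macros can be tried out in plain userspace C. The DEMO_ names below are stand-ins I made up so they do not clash with system headers, and the sketch assumes 4 KiB pages; the kernel's own definitions live in the architecture headers.

```c
#include <assert.h>

/* Userspace stand-ins for the kernel macros; 4 KiB pages are assumed. */
#define DEMO_PAGE_SHIFT      12
#define DEMO_PAGE_SIZE       (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK       (~(DEMO_PAGE_SIZE - 1))
/* Round len up to the next multiple of DEMO_PAGE_SIZE. */
#define DEMO_PAGE_ALIGN(len) (((len) + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK)

/* Number of whole pages needed to hold len bytes. */
static unsigned long pages_needed(unsigned long len)
{
    /* Rounding up would overflow for lengths near the top of the range. */
    assert(len <= ~0UL - (DEMO_PAGE_SIZE - 1));
    return DEMO_PAGE_ALIGN(len) >> DEMO_PAGE_SHIFT;
}
```

A length of 100 bytes is rounded up to 4096, and 4097 bytes is rounded up to 8192, that is two pages.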

How does mmap(2) work?

Every page table entry has a bit that tells us whether the entry is valid in supervisor mode (kernel mode) only. And sure, all memory allocated in kernel space will have this bit set. What the mmap(2) system call does is simply create a new page table entry, with a different virtual address, that points to the same physical memory page. The difference is that this supervisor bit is not set. This lets userspace access the memory as if it were a part of the application, and for now it is! The kernel is not involved in those accesses at all, so it is really fast.

Magic? Kind of. The magic is called remap_pfn_range(). What remap_pfn_range() essentially does is update the process's page table with these new entries.

Example, please

Allocate memory

As we already know, the smallest unit that the MMU handles is PAGE_SIZE, and mmap(2) only works with full pages. Even if you want to share only 100 bytes, a whole page frame will be remapped and must therefore be allocated in the kernel. The allocated memory must also be page aligned.


One way to allocate pages is with __get_free_pages():

unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)

gfp_mask is commonly set to GFP_KERNEL in process/kernel context and GFP_ATOMIC in interrupt context. The order is the base-2 logarithm of the number of pages to allocate, so 2^order pages are returned.

For example:
u8 *vbuf = (u8 *)__get_free_pages(GFP_KERNEL, get_order(size));

Allocated memory is freed with free_pages().
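
To see what the order means in practice, here is a small userspace sketch of the rounding that goes from a size in bytes to an allocation order. demo_get_order() and DEMO_PAGE_SHIFT are names made up for this example, and 4 KiB pages are assumed; it is not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12  /* assumes 4 KiB pages */

/*
 * Smallest order such that (1 << order) pages hold at least size bytes.
 */
static unsigned int demo_get_order(size_t size)
{
    unsigned int order = 0;
    size_t pages;

    assert(size > 0);
    /* Round up to whole pages, then up to the next power of two. */
    pages = (size - 1) / ((size_t)1 << DEMO_PAGE_SHIFT) + 1;
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

Note how three pages worth of data (12288 bytes) ends up as order 2, that is four pages, since only powers of two can be requested.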


A more common (and preferred) way to allocate virtually contiguous memory is with vmalloc(). vmalloc() will always allocate a whole set of pages, no matter what. This is exactly what we want!

Read about vmalloc() in kmalloc(9):

Allocated memory is freed with vfree().


If you need only one page, alloc_page() will give you that. If this is the case, instead of using remap_pfn_range(), vm_insert_page() will do the work for you. Notice that vm_insert_page() apparently only works on order-0 (single-page) allocations. So if you want to map N pages, you will have to call vm_insert_page() N times.

Now some code



/* page align */
priv->a_size = PAGE_ALIGN(priv->a_size);
priv->a_area =vmalloc(priv->a_size);


static int scan_mmap(struct file *file, struct vm_area_struct *vma)
{
        struct mmap_priv *priv = file->private_data;
        unsigned long start = vma->vm_start;
        size_t size = vma->vm_end - vma->vm_start;
        void *pos = (void *)priv->a_area;

        if (size > priv->a_size)
                return -EINVAL;

        /* vmalloc memory is not physically contiguous: map page by page */
        while (size > 0) {
                unsigned long pfn = vmalloc_to_pfn(pos);

                if (remap_pfn_range(vma, start, pfn, PAGE_SIZE, vma->vm_page_prot))
                        return -EAGAIN;
                start += PAGE_SIZE;
                pos += PAGE_SIZE;
                size -= PAGE_SIZE;
        }

        /* avoid swapping out this VMA (VM_RESERVED was removed in v3.7) */
        vma->vm_flags |= VM_RESERVED;
        return 0;
}

PID1 in containers


What is PID 1

The top-most process in a UNIX system has PID (Process ID) 1 and is usually the init process. The init process is the first userspace application started on a system and is started by the kernel at boot time. The kernel looks in a few predefined paths (and at the init kernel parameter). If no such application is found, the system will panic().

See init/main.c:kernel_init

if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    !try_to_run_init_process("/bin/sh"))
    return 0;
panic("No working init found.  Try passing init= option to kernel. "
      "See Linux Documentation/admin-guide/init.rst for guidance.");

All processes in UNIX have a parent/child relationship, which builds up a big relationship tree. Some resources and permissions are inherited from parent to child, such as UID and cgroup restrictions.

As in the real world, with parenthood come obligations. For example: what is usually the last line of your main() function? Hopefully something like


All processes exit with an exit code that tells us whether the operation was successful or not. Who is interested in this exit code anyway? In the real world, parents are interested in their children's results, and so it is here. The parent is responsible for wait(2)ing on its children to terminate, just to fetch their exit codes. But what if the parent dies before the child?

Let's go back to the init process. The init process has several tasks, and one is to adopt "orphaned" child processes, which would otherwise linger as zombies when they exit. Why? Because every process returns an exit code and will not terminate completely until someone listens to what it has to say. The init process simply wait(2)s for the exit code, throws it away and lets the child die. Sad but true, the child may not rest in peace otherwise. The operating system expects the init process to reap adopted children. Otherwise the children will remain in the system as zombies, taking up some kernel resources and consuming a slot in the kernel process table.
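
The parent's obligation can be sketched with fork(2) and waitpid(2). spawn_and_reap() is a hypothetical helper written for illustration, not something taken from a real init implementation: it creates a child that exits right away and then reaps it, which is the same thing an init process does for the children it has adopted.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits immediately with child_code, then reap it.
 * Returns the child's exit code, or -1 on error.
 */
static int spawn_and_reap(int child_code)
{
    int status;
    pid_t pid;

    /* Only the low 8 bits of an exit code reach the parent. */
    assert(child_code >= 0 && child_code <= 255);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(child_code);      /* child: stays a zombie until reaped */

    /* parent: waitpid() collects the status and releases the zombie */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

If the waitpid() call is left out, the child shows up in the process list in the Z (zombie) state for as long as the parent lives.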

PID 1 in containers

Containers are a concept that isolates processes in different namespaces. Examples of such namespaces are PID, user, network and filesystem. Creating a container is quite simple: just create a new process with clone(2) and provide the relevant flags to create new namespaces for the process.

The flags related to namespaces are listed in include/uapi/linux/sched.h:

#define CLONE_NEWCGROUP             0x02000000      // New cgroup namespace
#define CLONE_NEWUTS                0x04000000      // New utsname namespace
#define CLONE_NEWIPC                0x08000000      // New ipc namespace
#define CLONE_NEWUSER               0x10000000      // New user namespace
#define CLONE_NEWPID                0x20000000      // New pid namespace
#define CLONE_NEWNET                0x40000000      // New network namespace

All processes are running in a "container context" because a process always executes in a namespace. On a system "without containers", all processes still have one common namespace that they are all using.

When using CLONE_NEWPID, the kernel will create a new PID namespace and give the newly created process PID 1. As we already know, the PID 1 process has a very special task, namely to reap all orphaned children. This PID 1 process could be any application (make, bash, nginx, ftp-server or whatever) that is missing this essential adopt-and-slay mechanism. If the reaping is not handled, it will result in zombie processes. This was a real problem not long ago for Docker containers (google Docker and zombies to see what I mean). Nowadays we have the --init flag on docker run to tell the container to use tini (https://github.com/krallin/tini), a zombie-reaping init process, to run as PID 1.

When PID 1 dies

This is the reason why I'm writing this post. I was wondering who is killing PID 1 in a container, since we learned that PID 1 may not die under any circumstances. PID 1 in containers is obviously an exception to this golden rule, but how does the kernel differentiate between init processes in different PID namespaces?

Let's follow a process to its very last breath.

The call chain we will look at is the following: do_exit()->exit_notify()->forget_original_parent()->find_child_reaper().


kernel/exit.c:do_exit() is called when a process is going to be cleaned up from the system after it has exited or been terminated. The function collects the exit code, deletes timers, frees up resources and so on. Here is an extract of the function:


<<<<< Collect exit code >>>>>
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);


if (group_dead)

<<<<< Free up resources >>>>>
if (group_dead)



<<<<< Notify tasks in the same group >>>>>
exit_notify(tsk, group_dead);


exit_notify() notifies our "dead group" that we are going down. One important thing to notice is that almost all resources are freed at this point. Even if the process is going into a zombie state, the footprint is relatively small, but still, the zombie consumes a slot in the process table.

The size of the process table in Linux is defined by PID_MAX_LIMIT in include/linux/threads.h:

/*
 * A maximum of 4 million PIDs should be enough for a while.
 * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
 */
#define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
    (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

The process table is indeed quite big. But if you are running, for example, a webserver as PID 1 that fork(2)s on each HTTP request, all these forks will result in zombies and the number will escalate quite fast.
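
The sizeof(long) expression in that definition can be evaluated with a small sketch. demo_pid_max_limit() is a made-up name, CONFIG_BASE_SMALL is assumed to be off, and the 0x8000 used for PID_MAX_DEFAULT is the value as I remember it from the same header, so treat it as an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PID_MAX_DEFAULT 0x8000UL  /* assumed value, CONFIG_BASE_SMALL off */

/* Upper bound for PIDs as a function of sizeof(long) on the target. */
static unsigned long demo_pid_max_limit(size_t sizeof_long)
{
    assert(sizeof_long == 4 || sizeof_long == 8);
    return sizeof_long > 4 ? 4UL * 1024 * 1024 : DEMO_PID_MAX_DEFAULT;
}
```

On a 64-bit system this gives the roughly 4 million PIDs mentioned in the comment, while a 32-bit system is limited to 32768 under the assumption above.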


kernel/exit.c:exit_notify() is sending signals to all the closest relatives so that they know to properly mourn this process. In the beginning of this function, a call is made to forget_original_parent():

static void exit_notify(struct task_struct *tsk, int group_dead)
    bool autoreap;
    struct task_struct *p, *n;

  >>>>>  forget_original_parent(tsk, &dead);


This function simply does two things

  1. Make init (PID 1) inherit all the child processes
  2. Check to see if any process groups have become orphaned as a result of our exiting, and if they have any stopped jobs, send them a SIGHUP and then a SIGCONT.

find_child_reaper() will help us find a proper reaper:

>>>>> reaper = find_child_reaper(father);
if (list_empty(&father->children))


kernel/exit.c:find_child_reaper() checks whether a father is available. If a father (or other relative) is not available at all, we must be the PID 1 process.

This is the interesting part:

if (unlikely(pid_ns == &init_pid_ns)) {
    panic("Attempted to kill init! exitcode=0x%08x\n",
        father->signal->group_exit_code ?: father->exit_code);

init_pid_ns (declared in kernel/pid.c) refers to the PID namespace of our real init process. If the real init process exits, the whole system panics since it cannot continue without an init process. If it is not the real init, zap_pid_ns_processes() is called, and here we have the PID1-cannot-be-killed exception we are looking for! We continue following the call chain down to zap_pid_ns_processes().


The zap_pid_ns_processes() function is part of the PID namespace code and is located in kernel/pid_namespace.c. The function iterates through all tasks in the namespace and sends SIGKILL to each of them.

nr = next_pidmap(pid_ns, 1);
while (nr > 0) {


    task = pid_task(find_vpid(nr), PIDTYPE_PID);
    if (task && !__fatal_signal_pending(task))
    >>>>> send_sig_info(SIGKILL, SEND_SIG_FORCED, task);


    nr = next_pidmap(pid_ns, nr);



PID 1 in containers is handled differently from the real init process. This is obvious, but now we know where the code flow differs for PID 1 in different namespaces.

We also see that if the PID1 in a PID namespace dies, all the subprocesses will be terminated with SIGKILL. This behavior reflects the fact that the init process is essential for the correct operation of any PID namespace.

2.2" TFT display on Beaglebone


I recently bought a 2.2" TFT display on Ebay (come on, 7 bucks...) and wanted to use it with my BeagleBone. Luckily for me there was no Linux driver for the ILI9341 controller, so I just had to roll up my sleeves and get to work.

Boot up the BeagleBone

I haven't booted up my bone for a while and support for the board seems to have reached mainline in v3.8 (currently at v3.15), so the first step is just to get it to boot with a custom kernel.

Clone the vanilla kernel from kernel.org:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Use the omap2plus_defconfig as base:

make ARCH=arm omap2plus_defconfig

I will still use my old U-Boot version, which does not have support for device trees, so I have to make sure that CONFIG_ARM_APPENDED_DTB is enabled.


This simply tells the boot code to look for a device tree binary (DTB) appended to the zImage. Without this option, the kernel expects the address of a dtb in the r2 register (on ARM architectures), but that does not work on my ancient bootloader.

The next step is to compile the kernel. We are using U-Boot as bootloader, but we do not create a uImage yet since we have to append the dtb to the zImage before that:

make ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi-

Next, create the device tree blob. We are using arch/arm/boot/dts/am335x-bone.dts as source:

make ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi- am335x-bone.dtb

Now we are only two steps away from a booting kernel! First we need to append the dtb to the zImage, and then we need to create a U-Boot-friendly kernel image with mkimage:

cat arch/arm/boot/zImage arch/arm/boot/dts/am335x-bone.dtb > ./zImage_dtb
mkimage -A arm -O linux -T kernel -C none -a 0x80008000 -e 0x80008000 -n 'BeagleBone image' -d ./zImage_dtb uImage

Put the uImage on the uSD card and boot it up...

BeagleBone login:


Enable SPI

First of all, we need to set up the pinmux for the SPI bus. This is done with the pinctrl subsystem in the device tree include file (arch/arm/boot/dts/am335x-bone-common.dtsi).

Create the pins. For a more detailed explanation of the values, see the BeagleBone System Reference Manual.

spi1_pins: spi1_pins_s0 {
   pinctrl-single,pins = <
     0x190 0x33      /* mcasp0_aclkx.spi1_sclk, INPUT_PULLUP | MODE3 */
     0x194 0x33      /* mcasp0_fsx.spi1_d0, INPUT_PULLUP | MODE3 */
     0x198 0x13      /* mcasp0_axr0.spi1_d1, OUTPUT_PULLUP | MODE3 */
     0x19c 0x13      /* mcasp0_ahclkr.spi1_cs0, OUTPUT_PULLUP | MODE3 */
   >;
};

Then override the spi1 entry and create an instance of our device driver. The driver will have the name "ili9341-fb".

&spi1 {
 status = "okay";
 pinctrl-names = "default";
 pinctrl-0 = <&spi1_pins>;
 ili9341: ili9341@0 {
  compatible = "ili9341-fb";
  reg = <0>;
  spi-max-frequency = <16000000>;
  dc-gpio = <&gpio3 19 GPIO_ACTIVE_HIGH>;
 };
};

Create an entry in the Kbuild system

I always integrate the modules into the kbuild system as the first step. This is for several reasons:

  - I use one kernel for all of my projects, just different branches
  - It is simple to jump around with cscope/ctags
  - It gives you control when the kernel version and your driver follow each other
  - Out-of-tree modules are evil (they give you a tainted kernel and everyone will spit on you)

Those who don't know how to put a module into the kbuild system - get ready to be surprised how simple it is!

Every directory in the kernel structure contains at least two files, a Makefile and a Kconfig. The Makefile tells the make buildsystem which files to compile and the Kconfig file is interpreted by (menu|k|x|old|....)config.

Here is what's needed:

diff --git a/drivers/video/fbdev/Kconfig b/drivers/video/fbdev/Kconfig
index e1f4727..be4ec8f 100644
--- a/drivers/video/fbdev/Kconfig
+++ b/drivers/video/fbdev/Kconfig
@@ -163,6 +163,18 @@ config FB_DEFERRED_IO
        depends on FB
+config FB_ILI9341
+       tristate "ILI9341 TFT driver"
+       depends on FB
+       select FB_SYS_FILLRECT
+       select FB_SYS_COPYAREA
+       select FB_SYS_IMAGEBLIT
+       select FB_SYS_READ
+       select FB_DEFERRED_IO
+       ---help---
+       This enables functions for handling video modes using the ili9341 controller
 config FB_HECUBA
        depends on FB
diff --git a/drivers/video/fbdev/Makefile b/drivers/video/fbdev/Makefile
index 0284f2a..105166a 100644
--- a/drivers/video/fbdev/Makefile
+++ b/drivers/video/fbdev/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_FB_ATARI)            += atafb.o c2p_iplan2.o atafb_mfb.o \
                                      atafb_iplan2p2.o atafb_iplan2p4.o atafb_iplan2p8.o
 obj-$(CONFIG_FB_MAC)              += macfb.o
 obj-$(CONFIG_FB_HECUBA)           += hecubafb.o
+obj-$(CONFIG_FB_ILI9341)          += ili9341.o
 obj-$(CONFIG_FB_N411)             += n411.o
 obj-$(CONFIG_FB_HGA)              += hgafb.o
 obj-$(CONFIG_FB_XVR500)           += sunxvr500.o
diff --git a/drivers/video/fbdev/ili9341.c b/drivers/video/fbdev/ili9341.c

Deferred IO

Deferred IO is a way to delay and repurpose IO. It uses host memory as a buffer and the MMU page fault as a pretrigger for when to perform the device IO. You simply tell the kernel the minimum delay that should pass between the triggers, which allows you to do burst transfers to the device at a given framerate. The big benefit is that if userspace updates the framebuffer several times within this period, we will only write it once.
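
Since the trigger works on whole pages, the driver has to translate "this page was written" into "these display rows need to be sent". The sketch below shows that arithmetic in userspace C; it mirrors what the deferred IO callback in the driver code further down does, but page_to_rows(), struct row_range and the DEMO_ macros are names invented for the example, and 4 KiB pages are assumed.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12                      /* assumes 4 KiB pages */
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

struct row_range {
    unsigned long first;    /* first display row touched by the page */
    unsigned long last;     /* last display row touched by the page */
};

/* Map a framebuffer page index to the display rows it covers. */
static struct row_range page_to_rows(unsigned long page_index,
                                     unsigned long line_length,
                                     unsigned long yres)
{
    struct row_range r;
    unsigned long beg = page_index << DEMO_PAGE_SHIFT;
    unsigned long end = beg + DEMO_PAGE_SIZE - 1;

    assert(line_length != 0 && yres != 0);
    r.first = beg / line_length;
    r.last = end / line_length;
    if (r.last >= yres)         /* the last page may extend past the frame */
        r.last = yres - 1;
    return r;
}
```

For a 240x320 display with 16 bits per pixel a row is 480 bytes, so the first page covers rows 0 to 8 and the second page rows 8 to 17. A single touched page therefore costs about nine rows of SPI traffic instead of a full frame.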

The interface is _really_ simple. All you need to follow is these four steps (see Documentation/fb/deferred_io.txt):

  1. Setup your structure.

    static struct fb_deferred_io hecubafb_defio = {
     .delay  = HZ,
     .deferred_io = hecubafb_dpy_deferred_io,
    };

The delay is the minimum delay between when the page_mkwrite trigger occurs and when the deferred_io callback is called. The deferred_io callback is explained below.

  2. Setup your deferred IO callback.

    static void hecubafb_dpy_deferred_io(struct fb_info *info,
        struct list_head *pagelist)

The deferred_io callback is where you would perform all your IO to the display device. You receive the pagelist which is the list of pages that were written to during the delay. You must not modify this list. This callback is called from a workqueue.

  3. Call init:

    info->fbdefio = &hecubafb_defio;
    fb_deferred_io_init(info);

  4. Call cleanup:

    fb_deferred_io_cleanup(info);



The driver is quite straightforward and there was no really hard problem with the driver itself. However, I had problems getting a high framerate because the SPI communication took time. All SPI communication is asynchronous and all jobs are put on a queue before they get scheduled. This takes time. One obvious solution is to write bigger chunks with each transfer, and that is what I did.

But the problem was that when I increased the chunk size, the kernel panicked in the DMA transfers. After half an hour of code digging, the problem was traced to the SPI controller driver for the omap2 (drivers/spi/spi-omap2-mcspi.c). It defines DMA_MIN_BYTES, which is arbitrarily set to 160. The code then compares the data length to this constant to determine whether it should use DMA or not. It turns out that the DMA transfer code itself is broken.

A temporary solution is to increase DMA_MIN_BYTES to at least a full frame (240x320x2 bytes) until I have looked at the DMA code and submitted a fix :-)
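
To put numbers on the frame size and the time on the wire, here is a small sketch. The helper names are made up, and the transfer time is only a lower bound: it ignores command bytes, chip-select handling and the queueing latency described above.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes in one full frame. */
static uint32_t frame_bytes(uint32_t width, uint32_t height, uint32_t bpp)
{
    assert(bpp % 8 == 0);
    return width * height * (bpp / 8);
}

/* Lower bound, in microseconds, for clocking nbytes out over SPI at hz. */
static uint64_t spi_transfer_us(uint32_t nbytes, uint32_t hz)
{
    assert(hz != 0);
    return (uint64_t)nbytes * 8 * 1000000 / hz;
}
```

A 240x320 frame at 16 bits per pixel is 153600 bytes. With the 16 MHz spi-max-frequency from the device tree node above, that is at least 76800 microseconds per frame, so a full-screen update cannot go much beyond 13 frames per second no matter how the chunks are sized.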


Here is a shell started from Ubuntu

I have also tested starting up Qt and DirectFB applications. It all works like a charm.

Conclusion

The Deferred IO interface is really nice for such displays. I'm surprised that there is currently so few drivers using it.

(the not so cleaned up) Code:

/*
 * linux/drivers/video/ili9341.c -- FB driver for ili9341 controller
 * Copyright (C) 2014, Marcus Folkesson
 * This file is subject to the terms and conditions of the GNU General Public
 * License. See the file COPYING in the main directory of this archive for
 * more details.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/delay.h>
#include <linux/interrupt.h>
#include <linux/fb.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/uaccess.h>
#include <linux/spi/spi.h>
#include <video/ili9341.h>
#include <linux/regmap.h>
#include <linux/gpio.h>
#include <linux/of.h>
#include <linux/gpio.h>
#include <linux/of_gpio.h>
#include <linux/debugfs.h>

/* Display specific information */
#define SCREEN_WIDTH (240)
#define SCREEN_HIGHT (320)
#define SCREEN_BPP  (16)
#define ID "cbdff49d683b"
#define ID_SZ 12

static unsigned int chunk_size;

struct ili9341_priv {
 struct spi_device *spi;
 struct regmap *regmap;
 struct fb_info *info;
 u32 vsize;
 int dc;
 char *fbmem;
 struct dentry *dir;
};

static struct fb_fix_screeninfo ili9341_fix = {
 .id =  "ili9341",
 /*.visual = FB_VISUAL_MONO01,*/
 .xpanstep = 0,
 .ypanstep = 0,
 .ywrapstep = 0,
 .line_length = SCREEN_WIDTH*2,
 .accel = FB_ACCEL_NONE,
};

static struct fb_var_screeninfo ili9341_var = {
 .xres  = SCREEN_WIDTH,
 .yres  = SCREEN_HIGHT,
 .xres_virtual = SCREEN_WIDTH,
 .yres_virtual = SCREEN_HIGHT,
 .bits_per_pixel = SCREEN_BPP,
 .nonstd  = 1,
 .red = {
  .offset = 11,
  .length = 5,
 },
 .green = {
  .offset = 5,
  .length = 6,
 },
 .blue = {
  .offset = 0,
  .length = 5,
 },
 .transp = {
  .offset = 0,
  .length = 0,
 },
};

static const struct regmap_config ili9341_regmap_config = {
 .reg_bits = 8,
 .val_bits = 8,
 .can_multi_write = 1,
};

static void fill(struct ili9341_priv *priv);
static void fill_area(struct ili9341_priv *priv, int y1, int y2);
/* main ili9341 functions */
static void apollo_send_data(struct ili9341_priv *par, unsigned char data)
 /* set data */
static void apollo_send_command(struct ili9341_priv *par, unsigned char data)
static void ili9341_dpy_update(struct ili9341_priv *par)
static void ili9341_dpy_update_area(struct ili9341_priv *par, int y1, int y2 )
 fill_area(par, y1, y2);
/* this is called back from the deferred io workqueue */
static void ili9341_dpy_deferred_io(struct fb_info *info,
    struct list_head *pagelist)
{
 struct page *cur;
 struct fb_deferred_io *fbdefio = info->fbdefio;
 struct ili9341_priv *par = info->par;

 struct page *page;
 unsigned long beg, end;
 int y1, y2, miny, maxy;
 miny = INT_MAX;
 maxy = 0;
 /* stop here if list is empty */
 if (list_empty(pagelist)){
  dev_err(&par->spi->dev, "pagelist is empty");
  return;
 }
 /* find the first and last display row touched by the dirty pages */
 list_for_each_entry(page, pagelist, lru) {
  beg = page->index << PAGE_SHIFT;
  end = beg + PAGE_SIZE - 1;
  y1 = beg / (info->fix.line_length);
  y2 = end / (info->fix.line_length);
  if (y2 >= info->var.yres)
   y2 = info->var.yres - 1;
  if (miny > y1)
   miny = y1;
  if (maxy < y2)
   maxy = y2;
 }
 ili9341_dpy_update_area(info->par, miny, maxy);
 // dev_err(&par->spi->dev, ".");
}

static void ili9341_fillrect(struct fb_info *info,
       const struct fb_fillrect *rect)
{
 struct ili9341_priv *par = info->par;
 sys_fillrect(info, rect);
}

static void ili9341_copyarea(struct fb_info *info,
       const struct fb_copyarea *area)
{
 struct ili9341_priv *par = info->par;
 sys_copyarea(info, area);
}

static void ili9341_imageblit(struct fb_info *info,
    const struct fb_image *image)
{
 struct ili9341_priv *par = info->par;
 sys_imageblit(info, image);
}

/*
 * this is the slow path from userspace. they can seek and write to
 * the fb. it's inefficient to do anything less than a full screen draw
 */
static ssize_t ili9341_write(struct fb_info *info, const char __user *buf,
    size_t count, loff_t *ppos)
{
 struct ili9341_priv *par = info->par;
 unsigned long p = *ppos;
 void *dst;
 int err = 0;
 unsigned long total_size;
 if (info->state != FBINFO_STATE_RUNNING)
  return -EPERM;
 total_size = info->fix.smem_len;
 if (p > total_size)
  return -EFBIG;
 if (count > total_size) {
  err = -EFBIG;
  count = total_size;
 }
 if (count + p > total_size) {
  if (!err)
   err = -ENOSPC;
  count = total_size - p;
 }
 dst = (void __force *) (info->screen_base + p);
 if (copy_from_user(dst, buf, count))
  err = -EFAULT;
 if  (!err)
  *ppos += count;
 return (err) ? err : count;
}

static struct fb_ops ili9341_ops = {
 .owner   = THIS_MODULE,
 .fb_write  = ili9341_write,
 .fb_fillrect = ili9341_fillrect,
 .fb_copyarea = ili9341_copyarea,
 .fb_imageblit = ili9341_imageblit,
};

static struct fb_deferred_io ili9341_defio = {
 .delay  = HZ/60,
 .deferred_io = ili9341_dpy_deferred_io,
};

static void write_command(struct ili9341_priv *priv, u8 data)
{
 gpio_set_value(priv->dc, 0);  /* D/C low: command byte */
 spi_write(priv->spi, &data, 1);
 gpio_set_value(priv->dc, 1);
}

static void write_data(struct ili9341_priv *priv, u8 data)
{
 gpio_set_value(priv->dc, 1);  /* D/C high: data byte */
 spi_write(priv->spi, &data, 1);
}

static void write_data16(struct ili9341_priv *priv, u8 data)
{
 gpio_set_value(priv->dc, 1);
 spi_write(priv->spi, &data, 1);
}

static void init(struct ili9341_priv *priv)
{
 write_command(priv, 0xCB);
 write_data(priv, 0x39);
 write_data(priv, 0x2C);
 write_data(priv, 0x00);
 write_data(priv, 0x34);
 write_data(priv, 0x02);
 write_command(priv, 0xCF);
 write_data(priv, 0x00);
 write_data(priv, 0xC1);
 write_data(priv, 0x30);
 write_command(priv, 0xE8);
 write_data(priv, 0x85);
 write_data(priv, 0x00);
 write_data(priv, 0x78);
 write_command(priv, 0xEA);
 write_data(priv, 0x00);
 write_data(priv, 0x00);
 write_command(priv, 0xED);
 write_data(priv, 0x64);
 write_data(priv, 0x03);
 write_data(priv, 0x12);
 write_data(priv, 0x81);
 write_command(priv, 0xF7);
 write_data(priv, 0x20);
 write_command(priv, 0xC0);     // Power control
 write_data(priv, 0x23);    // VRH[5:0]
 write_command(priv, 0xC1);     // Power control
 write_data(priv, 0x10);    // SAP[2:0];BT[3:0]
 write_command(priv, 0xC5);     // VCM control
 write_data(priv, 0x3e);    // Contrast
 write_data(priv, 0x28);
 write_command(priv, 0xC7);     // VCM control2
 write_data(priv, 0x86);    // --
/* XXX: Hue?! */
 write_command(priv, 0x36);     // Memory Access Control
 write_data(priv, 0x48);    // alt. 0xC8; 0x48/0x68: portrait, 0x28/0xE8: landscape
 write_command(priv, 0x3A);
 write_data(priv, 0x55);
 write_command(priv, 0xB1);
 write_data(priv, 0x00);
 write_data(priv, 0x18);
 write_command(priv, 0xB6);     // Display Function Control
 write_data(priv, 0x08);
 write_data(priv, 0x82);
 write_data(priv, 0x27);

 write_command(priv, 0xF2);     // 3Gamma Function Disable
 write_data(priv, 0x00);
 write_command(priv, 0x26);     // Gamma curve selected
 write_data(priv, 0x01);
 write_command(priv, 0xE0);     // Set Gamma
 write_data(priv, 0x0F);
 write_data(priv, 0x31);
 write_data(priv, 0x2B);
 write_data(priv, 0x0C);
 write_data(priv, 0x0E);
 write_data(priv, 0x08);
 write_data(priv, 0x4E);
 write_data(priv, 0xF1);
 write_data(priv, 0x37);
 write_data(priv, 0x07);
 write_data(priv, 0x10);
 write_data(priv, 0x03);
 write_data(priv, 0x0E);
 write_data(priv, 0x09);
 write_data(priv, 0x00);
 write_command(priv, 0xE1);     // Set Gamma
 write_data(priv, 0x00);
 write_data(priv, 0x0E);
 write_data(priv, 0x14);
 write_data(priv, 0x03);
 write_data(priv, 0x11);
 write_data(priv, 0x07);
 write_data(priv, 0x31);
 write_data(priv, 0xC1);
 write_data(priv, 0x48);
 write_data(priv, 0x08);
 write_data(priv, 0x0F);
 write_data(priv, 0x0C);
 write_data(priv, 0x31);
 write_data(priv, 0x36);
 write_data(priv, 0x0F);
 write_command(priv, 0x11);     // Exit Sleep
 write_command(priv, 0x29);     // Display on
 write_command(priv, 0x2c);     // Memory write
}
/* Set the column (x) window; start and end are sent MSB first */
static void setCol(struct ili9341_priv *priv, u16 start, u16 end)
{
 u8 tmp;

 write_command(priv, 0x2a);
 tmp = (start & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (start & 0x00ff) >> 0;
 write_data(priv, tmp);

 tmp = (end & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (end & 0x00ff) >> 0;
 write_data(priv, tmp);
}

/* Set the page (y) window; start and end are sent MSB first */
static void setPage(struct ili9341_priv *priv, u16 start, u16 end)
{
 u8 tmp;

 write_command(priv, 0x2b);
 tmp = (start & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (start & 0x00ff) >> 0;
 write_data(priv, tmp);

 tmp = (end & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (end & 0x00ff) >> 0;
 write_data(priv, tmp);
}

static void setPos(struct ili9341_priv *priv, u16 x1, u16 x2, u16 y1, u16 y2)
{
 setPage(priv, y1, y2);
 setCol(priv, x1, x2);
}

static void fill_area(struct ili9341_priv *priv, int y1, int y2)
{
 int i;
 int len;
 int ret;
 int start = y1 * SCREEN_WIDTH * 2 + 1;
 int stop = y2 * SCREEN_WIDTH * 2 + 1;

 if (!chunk_size)
  chunk_size = 10;

 /* Never read past the end of the framebuffer memory */
 if (stop > priv->vsize)
  stop = priv->vsize;

 setCol(priv, 0, 239);
 setPage(priv, y1, y2);
 write_command(priv, 0x2c);

 for (i = start; i < stop; i += len) {
  /* Local length: a short last chunk must not shrink chunk_size */
  len = min_t(int, chunk_size, stop - i);
  ret = spi_write(priv->spi, &priv->fbmem[i], len);
  if (ret != 0)
   dev_err(&priv->spi->dev, "Error code: %i\n", ret);
 }
}
static void fill(struct ili9341_priv *priv)
{
 setCol(priv, 0, 239);
 setPage(priv, 0, 319);
 write_command(priv, 0x2c);

 fill_area(priv, 0, 319);
}
static ssize_t id_show(struct device *dev, struct device_attribute *attr,
   char *buf)
{
 sprintf(buf, "%s", ID);
 return ID_SZ;
}

static ssize_t id_store(struct device *dev, struct device_attribute *attr,
    const char *buf, size_t count)
{
 char kbuf[ID_SZ];

 if (count != ID_SZ)
  return -EINVAL;

 memcpy(kbuf, buf, ID_SZ);
 if (memcmp(kbuf, ID, ID_SZ) != 0)
  return -EINVAL;

 return count;
}

/* sysfs attributes must not be world-writable; 0666 fails the build check */
DEVICE_ATTR(id, 0644, id_show, id_store);

static struct attribute *ili9341_attrs[] = {
 &dev_attr_id.attr,
 NULL,
};
ATTRIBUTE_GROUPS(ili9341);

static int ili9341_probe(struct spi_device *spi)
{
 struct fb_info *info;
 int retval = -ENOMEM;
 struct ili9341_priv *priv;
 struct device_node *np = spi->dev.of_node;
 int ret;

 dev_dbg(&spi->dev, "Probing ili9341\n");
 priv = kzalloc(sizeof(struct ili9341_priv), GFP_KERNEL);
 if (!priv)
  return -ENOMEM;
 priv->spi = spi;

/* TODO: better fail handling... */
 priv->dc = of_get_named_gpio(np, "dc-gpio", 0);
 if (priv->dc == -EPROBE_DEFER) {
  kfree(priv);
  return -EPROBE_DEFER;
 }
 if (gpio_is_valid(priv->dc)) {
  ret = devm_gpio_request(&spi->dev, priv->dc, "tft dc");
  if (ret)
   dev_err(&spi->dev, "could not request dc\n");
  ret = gpio_direction_output(priv->dc, 1);
  if (ret)
   dev_err(&spi->dev, "could not set DC to output\n");
 } else {
  dev_err(&spi->dev, "DC gpio is not valid\n");
 }

 dev_dbg(&spi->dev, "Initialize regmap\n");
 priv->regmap = devm_regmap_init_spi(spi, &ili9341_regmap_config);
 if (IS_ERR(priv->regmap)) {
  retval = PTR_ERR(priv->regmap);
  goto err_regmap;
 }
 dev_dbg(&spi->dev, "regmap OK\n");
 priv->fbmem = vzalloc(priv->vsize);
 if (!priv->fbmem)
  goto err_videomem_alloc;

 retval = sysfs_create_group(&spi->dev.kobj, *ili9341_groups);
 if (retval)
  goto err_sysfs;

 dev_dbg(&spi->dev, "Allocate framebuffer\n");
 info = framebuffer_alloc(sizeof(struct fb_info), &spi->dev);
 if (!info) {
  retval = -ENOMEM;
  goto err_fballoc;
 }

 info->par = priv;
 priv->info = info;
 info->screen_base = priv->fbmem;
 info->fbops = &ili9341_ops;
 info->var = ili9341_var;
 info->fix = ili9341_fix;
 info->fix.smem_len = priv->vsize;
 /* We are virtual as we only exist in memory */
 info->flags = FBINFO_FLAG_DEFAULT | FBINFO_VIRTFB;
 info->fbdefio = &ili9341_defio;
 fb_deferred_io_init(info);
 retval = register_framebuffer(info);
 if (retval < 0)
  goto err_fbreg;
 spi_set_drvdata(spi, info);
 fb_info(info, "ILI9341 frame buffer device, using %dK of video memory\n",
  priv->vsize >> 10);

 priv->dir = debugfs_create_dir("ili9341-fb", NULL);
 debugfs_create_u32("chunk_size", 0666, priv->dir, &chunk_size);

 return 0;

/* Unwind in reverse order of setup */
err_fbreg:
 fb_deferred_io_cleanup(info);
 framebuffer_release(info);
err_fballoc:
 sysfs_remove_group(&spi->dev.kobj, *ili9341_groups);
err_sysfs:
 vfree(priv->fbmem);
err_videomem_alloc:
err_regmap:
 kfree(priv);
 return retval;
}

static int ili9341_remove(struct spi_device *spi)
{
 struct fb_info *info = spi_get_drvdata(spi);

 if (info) {
  struct ili9341_priv *priv = info->par;

  /* Release everything probe() set up */
  debugfs_remove_recursive(priv->dir);
  fb_deferred_io_cleanup(info);
  unregister_framebuffer(info);
  sysfs_remove_group(&spi->dev.kobj, *ili9341_groups);
  framebuffer_release(info);
  vfree(priv->fbmem);
  kfree(priv);
 }
 return 0;
}

static const struct spi_device_id ili9341_ids[] = {
 {"ili9341-fb", 0},
 { }
};
MODULE_DEVICE_TABLE(spi, ili9341_ids);

static struct spi_driver  ili9341_driver = {
 .probe = ili9341_probe,
 .remove = ili9341_remove,
 .id_table = ili9341_ids,
 .driver = {
  .owner = THIS_MODULE,
  .name = "ili9341-fb",
 },
};
MODULE_DESCRIPTION("fbdev driver for ili9341 controller");
MODULE_AUTHOR("Marcus Folkesson <marcus.folkesson@gmail.com>");