ath10k QCA6584 and Wireless network stack

ATH10K is the mac80211 wireless driver for Qualcomm Atheros QCA988x family of chips, and I’m currently working [1] with the QCA6584 chip which is an automotive graded radio chip with PHY support for the abgn+ac modes.
The connection interface to the chip is SDIO which is hardly supported for now, but my friend and kernelhacker, Erik Strömdahl [2] , has got his hands dirty and is currently working on it.
There has been some progress, the chip now able to scan, connect, send and receive data. There is still some issues with the link speed but that is coming.

He is also the reason for why I got interested in the network part of the kernel which is quite… big.

Even only the wireless networking subsystem is quite big, and the first you meet when you start to dig is a bunch of terms thrown up in your face.
I will try to briefly describe a few of these terms that is fundamental for wireless communication.

In this post will discuss the right side of the this figure:

IEEE 802.11

We will see 802.11 a lot of times, so the first thing is to know where these numbers comes from.
IEEE 802.11 is a set of specifications for implementation of wireless networking over several frequency bands. The specifications cover layer 1 (Physical) and layer 2 (Data link) of the OSI model [3].

The Linux kernel MAC subsystem register ieee80211 compliant hardware device with

int ieee80211_register_hw(struct ieee80211_hw *hw)

found in …/net/mac80211/main.c

The Management Layer (MLME)

One more thing that we need to cover is the management layer, since all other layers somehow depend on it.

There are three components in the 802.11 management architecture:
– The Physical Layer Management Entity (PLME)
– The System Management Entity (SME)
– The MAC Layer Management Entity (MLME)

The Management layer assist you in several ways. For instance, it handle things such as scanning, authentication, beacons, associations and much more.


Scanning is simply looking for other 802.11 compliant devices in the air. There are two types of scanning; passive and active.

Passive scanning

When performing a passive scanning, the radio is listening passively for beacons, without transmitting packages, as it moves from channel to channel and records all devices that it receives beacons from.
Higher frequency bands in the ieee802.11a standard does not allow to transmit anything unless you have heard an Access Point (AP) beacon. Passive scanning is therefore the only way to be aware of the surroundings.

Active scanning

Active scanning on the other hand, is transmitting Probe Request (IEEE80211_STYPE_PROBE_REQ) management packets. This type of scanning is also walking from channel to channel, sending these probe requests management packet for each channel.

These requests is handled by ieee80211_send_probe_req() in …/net/mac80211/util.c:

void ieee80211_send_probe_req(struct ieee80211_sub_if_data *sdata,
                  const u8 *src, const u8 *dst,
                  const u8 *ssid, size_t ssid_len,
                  const u8 *ie, size_t ie_len,
                  u32 ratemask, bool directed, u32 tx_flags,
                  struct ieee80211_channel *channel, bool scan)


The authentication procedure sends a management frame of a authentication type (IEEE80211_STYPE_AUTH). There is not only one type of authentication but plenty of them. The ieee80211 specification does only specify one mandatory authentication type; the Open-system authentication (WLAN_AUTH_OPEN). Another common authentication type is Shared key authentication (WLAN_AUTH_SHARED_KEY).

These management frames is handled by ieee80211_send_auth() in …/net/mac80211/util.c:

void ieee80211_send_auth(struct ieee80211_sub_if_data *sdata,
             u16 transaction, u16 auth_alg, u16 status,
             const u8 *extra, size_t extra_len, const u8 *da,
             const u8 *bssid, const u8 *key, u8 key_len, u8 key_idx,
             u32 tx_flags)

Open system authentication

This is the most simple type of authentication, all clients that request authentication will be authenticated. No security is involved at all.

Shared key authentication

In this type of authentication the client and AP is using a shared key, also known as Wired Equivalent Privacy (WEP) key.


The association is started when the station sends management frames of the type IEEE80211_STYPE_ASSOC_REQ. In the kernel code this is handled by ieee80211_send_assoc() in …/net/mac80211/mlme.c

static void ieee80211_send_assoc(struct ieee80211_sub_if_data *sdata)


When the station is roaming, i.e. moving between APs within an ESS (Extended Service Set), it also sends a reassociation request to a new AP of the type IEEE802_STYPE_REASSOC_REQ. Association and reassociation has so much in common that it is both handled by ieee80211_send_assoc().

MAC (Medium Access Control)

All ieee80211 devices needs to implement the Management Layer (MLME), but the implementation could be in device hardware or software.
These types of devices are divided into Full MAC device (hardware implementation) and Soft MAC device (software implementation).
Most devices today are soft MAC devices.

The MAC layer can be further broken down into two pieces: Upper MAC and Lower MAC. The upper part of the MAC handle the management aspect (all that we covered in the MLME section above), and the lower part handle the time critical operations such as ACK:ing received packets.

Linux does only handle the upper part of MAC, the lower part is operated in device hardware.
What we can see in the figure is that the MAC layer is separating data packets from configuration/management packets.
The data packets is forwarded to the network device and will travel the same path through the network layer as data packets from all other type of network devices.

The Linux wireless subsystem consists of two major parts, where this, mac80211, is one of them.
cfg80211 is the other major part.


cfg80211 is a configuration management service for mac80211 compliant devices.
Both Full MAC and Soft MAC devices needs to implement operations to be compatible with the cfg80211 configuration interface in order to let userspace application to configure the device.

The configuration may be done with on of two interfaces, wext and nl80211.

Wireless Extension, WEXT (Legacy)

This is the legacy and ugly way to configure wireless devices. It is still supported only for backward compatibility reasons.
Users of this configuration interface are wireless-tools (iwconfig, iwlist).


nl80211 on the other hand, is a new netlink interface intended to replace the Wireless Extension (wext) interface.
Users of this interface is typically iw and wpa_supplicant.


The whole network stack of the Linux kernel is really complex and optimized for high throughput with low latencies. In this post we only covered what support for wireless devices has complemented the stack with, which is mainly the mac80211 layer for handle all device management, and cfg80211 layer to configure the MAC layer.
Packets to wireless devices is divided into data packets and configuration/managment packets.
The data packets follow the same path as for all network devices, and the management packets goes to the cfg80211 layer. and git send-email

Many with me prefer email as communication channel, especially for patches.
Github, Gerrit and all other "nice" and "userfriendly" tools that tries to "help" you to manage your submissions does not simply fit my workflow.

As you may already know, all patches to the Linux kernel is by email. scripts/ (see [1] for more info about the process) is a handy tool that takes a patch as input and gives back a bunch of emails addresses.
These email addresses is usually passed to git send-email [2] for submission.

I have used various scripts to make the output from to fit git send-email, but was not completely satisfied until I found the –to-cmd and –cc-cmd parameters to git send-email:

  Specify a command to execute once per patch file which should generate patch file specific "To:" entries. Output of this command must be single email address per line. Default is the value of sendemail.tocmd configuration value.
  Specify a command to execute once per patch file which should generate patch file specific "Cc:" entries. Output of this command must be single email address per line. Default is the value of sendemail.ccCmd configuration value.

I’m very pleased with these parameters. All I have to to is to put these extra lines into my ~/.gitconfig (or use git config):

    tocmd ="`pwd`/scripts/ --nogit --nogit-fallback --norolestats --nol"
    cccmd ="`pwd`/scripts/ --nogit --nogit-fallback --norolestats --nom"

To submit a patch, I just type:

git send-email --identity=linux ./0001-my-fancy-patch.patch

and let –to and –cc to be populated automatically.


When the system is running out of memory, the Out-Of-Memory (OOM) killer picks a process to kill based on the current memory footprint.
In case of OOM, we will calculate a badness score between 0 (never kill) and 1000 for each process in the system. The process with the highest score will be killed. A score of 0 is reserved for unkillable tasks such as the global init process (see [1]) or kernel threads (processes with PF_KTHREAD flag set).

The current score of a given process is exposed in procfs, see /proc/[pid]/oom_score, and may be adjusted by setting /proc/[pid]/oom_score_adj.
The value of oom_score_adj is added to the score before it is used to determine which task to kill. The value may be set between OOM_SCORE_ADJ_MIN (-1000) and OOM_SCORE_DJ_MAX (+1000).
This is useful if you want to guarantee that a process never is selected by the OOM killer.

The calculation is simple (nowadays), if a task is using all its allowed memory, the badness score will be calculated to 1000. If it is using half of its allowed memory, the badness score is calculated to 500 and so on.
By setting oom_score_adj to -1000, the badness score sums up to <=0 and the task will never be killed by OOM.

There is one more thing that affects the calculation; if the process is running with the capability CAP_SYS_ADMIN, it gets a 3% discount, but that is simply it.

The old implementation

Before v2.6.36, the calculation of badness score tried to be smarter, besides looking for the total memory usage (task->mm->total_vm), it also considered:
– Whether the process creates a lot of children
– Whether the process has been running for a long time, or has used a lot of CPU time
– Whether the process has a low nice value
– Whether the process is privileged (CAP_SYS_ADMIN or CAP_SYS_RESOURCE set)
– Whether the process is making direct hardware access

At first glance, all these criteria looks valid, but if you think about it a bit, there is a lot of pitfalls here which makes the selection not so fair.
For example: A process that creates a lot of children and consumes some memory could be a leaky webserver. Another process that fits into the description is your session manager for your desktop environment which naturally creates a lot of child processes.

The new implementation

This heuristic selection has evolved over time, instead of looking on mm->total_vm for each task, the task’s RSS (resident set size, [2]) and swap space is used instead.
RSS and Swap space gives a better indication of the amount that we will be able to free if we chose this task.
The drawback with using mm->total_vm is that it includes overcommitted memory ( see [3] for more information ) which is pages that the process has claimed but has not been physically allocated.

The process is now only counted as privileged if CAP_SYS_ADMIN is set, not CAP_SYS_RESOURCE as before.

The code

The whole implementation of OOM killer is located in mm/oom_kill.c.
The function oom_badness() will be called for each task in the system and returns the calculated badness score.

Let’s go through the function.

unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
              const nodemask_t *nodemask, unsigned long totalpages)
    long points;
    long adj;

    if (oom_unkillable_task(p, memcg, nodemask))
        return 0;

Looking for unkillable tasks such as the global init process.

p = find_lock_task_mm(p);
if (!p)
    return 0;

adj = (long)p->signal->oom_score_adj;
if (adj == OOM_SCORE_ADJ_MIN ||
        test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
        in_vfork(p)) {
    return 0;

If proc/[pid]/oom_score_adj is set to OOM_SCORE_ADJ_MIN (-1000), do not even consider this task

points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
    atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);

Calculate a score based on RSS, pagetables and used swap space

if (has_capability_noaudit(p, CAP_SYS_ADMIN))
    points -= (points * 3) / 100;

If it is root process, give it a 3% discount. We are no mean people after all

adj *= totalpages / 1000;
points += adj;

Normalize and add the oom_score_adj value

return points > 0 ? points : 1;

At last, never return 0 for an eligible task as it is reserved for non killable tasks



The OOM logic is quite straightforward and seems to have been stable for a long time (v2.6.36 was released in october 2010).
The reason why I was looking at the code was that I did not think the behavior I saw when experimenting corresponds to what was written in the man page for oom_score.
It turned out that the manpage was not updated when the new calculation was introduced back in 2010.

I have updated the manpage and it is available in v4.14 of the Linux manpage project [4].

commit 5753354a3af20c8b361ec3d53caf68f7217edf48
Author: Marcus Folkesson <>
Date:   Fri Nov 17 13:09:44 2017 +0100

    proc.5: Update description of /proc/<pid>/oom_score

    After Linux 2.6.36, the heuristic calculation of oom_score
    has changed to only consider used memory and CAP_SYS_ADMIN.

    See kernel commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10.

    Signed-off-by: Marcus Folkesson <>
    Signed-off-by: Michael Kerrisk <>

diff --git a/man5/proc.5 b/man5/proc.5
index 82d4a0646..4e44b8fba 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -1395,7 +1395,9 @@ Since Linux 2.6.36, use of this file is deprecated in favor of
 .IR /proc/[pid]/oom_score_adj .
 .IR /proc/[pid]/oom_score " (since Linux 2.6.11)"
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
 This file displays the current score that the kernel gives to
 this process for the purpose of selecting a process
 for the OOM-killer.
@@ -1403,7 +1405,16 @@ A higher score means that the process is more likely to be
 selected by the OOM-killer.
 The basis for this score is the amount of memory used by the process,
 with increases (+) or decreases (\-) for factors including:
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
+.IP * 2
+whether the process is privileged (\-);
+.\" More precisely, if it has CAP_SYS_ADMIN or (pre 2.6.36) CAP_SYS_RESOURCE
+Before kernel 2.6.36 the following factors were also used in the calculation of oom_score:
 .IP * 2
 whether the process creates a lot of children using
@@ -1413,10 +1424,7 @@ whether the process creates a lot of children using
 whether the process has been running a long time,
 or has used a lot of CPU time (\-);
 .IP *
-whether the process has a low nice value (i.e., > 0) (+);
-.IP *
-whether the process is privileged (\-); and
-.\" More precisely, if it has CAP_SYS_ADMIN or CAP_SYS_RESOURCE
+whether the process has a low nice value (i.e., > 0) (+); and
 .IP *
 whether the process is making direct hardware access (\-).
 .\" More precisely, if it has CAP_SYS_RAWIO


So, a week in Prague has come to its end. The Embedded Linux Conference Europe was this year co-located with Open Source Summit and offered a lot of interesting talks on various topics.

One of the hottest topics this year was about our most beloved debugging function – prink().
What is so hard with printing? It turns out that printk is quite deadlock-prone and that is not an easy thing to work around in the current infrastructure of the kernel.

A common misconception is that printk() is a fast operation that simply writes the message to the global __log_buf variable. It is not.

A printk() may involve many different subsystems, different contexts or nesting, just to mention a few parts that needs to be handled.
For example:

  1. The output needs to go over some output medium (consoles)
    * The monitor
    * Frame buffers
    * UART / Serial console
    * Network console
    * Braille
    * …
  2. Uses different locking mechanismes
    * The console_lock (described below)
    * The logbuf_lock spinlock
    * Consoles often have their own locks
  3. Wake up waiting applications
    * syslogd
    * journald
    * …

Besides that, printk() is expected to work in every context, whether it is process, softirq, IRQ or NMI context.
With all these locking mechanisms involved, what happens if a printk in process context is interrupted by an NMI, and the NMI also calls printk?
In other words, there is a lot of special cases that needs to be handled.

How it works


Lets look back on how the printing was handled in a pre-history kernel.

SMP (Symmetric Multi Processing) SoCs became common in the late 1990s. Before that, everything was easy and everyone was happy. No NMIs. No races between multiple cores. Simple locking. No Facebook.
As a response to SMP systems, Linux v2.1.80 introduced a spin_lock to printk to avoid race conditions between multiple cores.
The solution we came up with was to serialize all prints to the console. If two CPUs called printk() at the same time, the second core has to wait for the first core to finish.

This does not scale well. In fact, it does not scale at all. What about a modern system with 100+ CPUs that all calls printk at the same time? Depending on the console, the printing may take milliseconds and you will surely end up with an unresponsive system.


Now we are doing things differently.
The first core that grabs the console_lock is responsible to print all messages in the __log_buf. If another core is calling printk() in meanwhile, it puts its data into __log_buf , tries to grab the lock which is busy, and then simple returns.
As __log_buf continues getting new data, the unlucky core that grabbed the console_lock may end up doing nothing but printing.

The good thing is that we only locks up a single core instead of all cores.
The bad thing is that we locks up a single core.

The code


printk() is defined in kernel/printk/printk.c and does not look much to the world

asmlinkage __visible int printk(const char *fmt, ...)
    va_list args;
    int r;

    va_start(args, fmt);
    r = vprintk_func(fmt, args);

    return r;

It simple calls vprintk_function with its own arguments.


vprintk_func() is a function that forward the arguments to different print-functions depending on the current context

__printf(1, 0) int vprintk_func(const char *fmt, va_list args)
    if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK)
        return vprintk_nmi(fmt, args);

    if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
        return vprintk_safe(fmt, args);

    if (this_cpu_read(printk_context) & PRINTK_NMI_DEFERRED_CONTEXT_MASK)
        return vprintk_deferred(fmt, args);

    return vprintk_default(fmt, args);

The different contexts we consider are:

Normal context

If we are on normal context, there is nothing to consider at all, go for the vprintk_default() and just do our thing.

NMI context

In the case that the CPU supports NMIs (Non-Maskable Interrupts, (look for CONFIG_HAVE_NMI and CONFIG_PRINTK_NMI in your .config ), we go for vprintk_nmi(). vprintk_nmi() do a safe copy to a per-CPU buffer, not the global __log_buf.
Since NMIs are not nested by its nature, there is always only one write running. However, NMIs is only for the local CPU, and the buffer might get flushed from another CPU, so we still need to be careful.

"Recursive" context

If the printk() routine is interrupted and we end up in another call to printk from somewhere else, we go for the lock-less vprintk_safe() to prevent a recursion deadlock. vprintk_safe() is using a per-CPU buffer to store the message, just like NMI.

Deferred context

As already said, multiple locks is involved in the call chain of printk(). vprintk_deferred() is using the main logbuf_lock but avoid calling console drivers that might have their own locks. The actual printing is deferred to klogd_work kernel thread.


vprintk_emit() is responsible to write to __log_buf, (but not the only function, cont_flush() also write to __log_buf) and print out the content to all consoles.

asmlinkage int vprintk_emit(int facility, int level,
                const char *dict, size_t dictlen,
                const char *fmt, va_list args)


    <<<<< Strip kernel syslog prefix >>>>>


    <<<<< log_output() does the actual printing to __log_buf >>>>>
    printed_len = log_output(facility, level, lflags, dict, dictlen, text, text_len);


    if (!in_sched) {
         * Try to acquire and then immediately release the console
         * semaphore.  The release will print out buffers and wake up
         * /dev/kmsg and syslog() users.
        if (console_trylock())

    return printed_len;

The function is quite straight forward. The only thing that looks a little bit strange is

if (console_trylock())

Really? Grab the console_lock and immediately unlock it?
The thing is that all magic happens in console_unlock().


The CPU that is grabbing the console_lock is responsible to print to all registered consoles until all new data in __log_buf is printed. This regardless if other CPUs keeps filling the buffer with new data.

In the worst case, this CPU is doing nothing but printing and will never leave this function.

void console_unlock(void)

    <<<<< Endless loop? >>>>>
    for (;;) {

        <<<<< Go through all new messages >>>>>


        <<<<< Print to all consoles >>>><
        call_console_drivers(ext_text, ext_len, text, len);



    <<<<<  Release the exclusive_console once it is used >>>>>
    console_locked = 0;


    <<<<< Wake up klogd >>>>>
    if (wake_klogd)

The function is looping until all new messages is printed. For each new message, a call to call_console_drivers() is made.
The last thing that we do is waking up the klogd kernel thread that will signal to all userspace application that is waiting on klogctl(2).


call_console_drivers() is asking all registered consoles to print out a message. The console_lock must be held when calling this function.

static void call_console_drivers(const char *ext_text, size_t ext_len,
                 const char *text, size_t len)
    struct console *con;

    trace_console_rcuidle(text, len);

    if (!console_drivers)

    for_each_console(con) {
        if (exclusive_console && con != exclusive_console)
        if (!(con->flags & CON_ENABLED))
        if (!con->write)
        if (!cpu_online(smp_processor_id()) &&
            !(con->flags & CON_ANYTIME))
        if (con->flags & CON_EXTENDED)
            con->write(con, ext_text, ext_len);
            con->write(con, text, len);


As we see, there is a lot of logic involved in a simple call to printk() and you should not be surprised if all your printing has impact on your systems performance or timing.
But how do we debug if printk() is a no-no? The answer is trace_printk().

This function write (almost) directly to a trace buffer and is therefore a fairly fast operation.
The trace buffer is exposed from tracefs, usually mounted at /sys/kernel/tracing.

As a bonus, the messages is merged with other output from ftrace when doing a function trace.

Other things that is good to know about __log_buf


The kernel log buffer is exported as a global symbol called __log_buf. If you have an systems that deadlocks without any output on the console and you may reboot the system without resetting RAM, then you may print the content of __log_buf from the bootloader.

Determine the physical address of __log_buf

[09:59:31]marcus@little:~/git/linux$ grep __log_buf
c14cfba8 b __log_buf

The 0xc14cfba8 is the virtual address of __log_buf.
This kernel is compiled for a 32bit ARM with the CONFIG_VMSPLIT_3G set, so the kernel virtual address space start at 0xc0000000. To get the physical address out of the virtual, subtract the offset (0xc14cfba8 – 0xc0000000) and you will end up with 0x014cfba8. Dump this address from your bootloader and you will see your kernel log.


The size of __log_buf is set at compile-time with CONFIG_LOG_BUF_SHIFT. The value defines the size as a power of 2 and is usually set to 16 (64K).

There is also a CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT that is the per-CPU buffer where messages printed from unsafe context are temporary stored. Examples on unsafe context would be NMI and printk recursions. The messages are copied to the main log buffer in a safe context to avoid a deadlock.

This buffer is rarely used but has to be there to avoid the nasty deadlocks.
The CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT was introduced in v4.11 and is also expressed as a power of 2.

PID 1 in containers

What is PID 1

The top-most process in a UNIX system has PID (Process ID) 1 and is usually the init process.
The Init process is the first userspace application started on a system and is started by the kernel at boottime.
The kernel is looking in a few predefined paths (and the init kernel parameter). If no such application is found, the system will panic().

See init/main.c:kernel_init

if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    return 0;
panic("No working init found.  Try passing init= option to kernel. "
      "See Linux Documentation/admin-guide/init.rst for guidance.");

All processes in UNIX has a parent/child relationship which builds up a big relationship-tree.
Some resources and permissions are inherited from parent to child such as UID and cgroup restrictions.

As in the real world, with parenthood comes obligations.
For example: What is usually the last line of your main()-function? Hopfully something like


All processes exits with an exit code that tells us if the operation was sucessful or not.
Who is interested in this exit code anyway?
In the real world, the parents are interested in their children’s result, and so even here. The parent is responsible to wait(2) on their children to terminate just to fetch its exit code.
But what if the parent died before the child?

Lets go back to the init process.
The init process has several tasks, and one is to adopt "orphaned" (called zombie) child processes.
Why? Because all processes will return an exit code and will not terminate completely until someone is listen for what they have to say.
The init process is simply wait(2):ing on the exit code, throw it away and let the child die. Sad but true, but the child may not rest i peace otherwise.
The operating system expects the init process to reap adopted children. Otherwise the children will exist in the system as a zombie and taking up some kernel resources and consume a slot in the kernel process table.

PID 1 in containers

Containers is a concept that isolate processes in different namespaces. Example of such namespaces are PID, users, networking and filesystem.
To create a container is quite simple, just create a new process with clone(2) and provide relevant flags to create new namespaces for the process.

The flags related to namespaces are listed in include/uapi/linux/sched.h:

#define CLONE_NEWPID                0x20000000      // New pid namespace
#define CLONE_NEWCGROUP             0x02000000      // New cgroup namespace
#define CLONE_NEWUTS                0x04000000      // New utsname namespace
#define CLONE_NEWIPC                0x08000000      // New ipc namespace
#define CLONE_NEWUSER               0x10000000      // New user namespace
#define CLONE_NEWPID                0x20000000      // New pid namespace
#define CLONE_NEWNET                0x40000000      // New network namespace

All processes is running in a "container-context" because the processes allways executes in a namespace.
On a system "without containers", all processes still have one common namespace that all processes is using.

When using CLONE_NEWPID, the kernel will create a new PID namespace and let the newly created process has the PID 1.
As we already know, the PID 1 process has a very special task, namely to kill all orphaned children.
This PID 1 process could be any application (make, bash, nginx, ftp-server or whatever) that is missing this essential adopt-and-slay-mechanism.
If the reaping is not handled, it will result in zombie-processes. This was a real problem not long time ago for Docker containers (google Docker and zombies to see what I mean).
Nowadays we have the –init flag on docker run to tell the container to use tini (, a zombie-reaping init process to run with PID 1.

When PID 1 dies

This is the reason to why I’m writing this post. I was wondering who is killing PID 1 in a container since we learned that a PID 1 may not die under any circumstances.
PID 1 in cointainers is obviosly an exception from this golden rule, but how does the kernel differentiate between init processes in different PID namespaces?

Lets follow a process to its very last breath.

The call chain we will look at is the following:


kernel/exit.c:do_exit() is called when a process is going to be cleaned up from the system after it has exited or being terminated.
The function is collecting the exit code, delete timers, free up resources and so on.
Here is an extract of the function:


<<<<< Collect exit code >>>>>
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);


if (group_dead)

<<<<< Free up resources >>>>>
if (group_dead)



<<<<< Notify tasks in the same group >>>>>
exit_notify(tsk, group_dead);


exit_notify() is to notifing our "dead group" that we are going down.
One important thing to notice is that almost all resources are freed at this point.
Even if the process is going into a zombie state, the footprint is relative small, but still, the zombie consumes a slot in the process table.

The size of the process table in Linux and defined by PID_MAX_LIMIT in include/linux/threads.h:

\* A maximum of 4 million PIDs should be enough for a while.
\* [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
(sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

The process table is indeed quite big. But if you are running for example a webserver as PID 1 that is fork(2):ing on each HTTP request. All these forks will result in a zombie and the number will escalate quite fast.


kernel/exit.c:exit_notify() is sending signals to all the closest relatives so that they know to properly mourn this process.
In the beginning of this function, a call is made to forget_original_parent():

static void exit_notify(struct task_struct *tsk, int group_dead)
    bool autoreap;
    struct task_struct *p, *n;

  >>>>>  forget_original_parent(tsk, &dead);


This function simply does two things

  1. Make init (PID 1) inherit all the child processes
  2. Check to see if any process groups have become orphaned as a result of our exiting, and if they have any stopped jobs, send them a SIGHUP and then a SIGCONT.

find_child_reaper() will help us find a proper reaper:

>>>>> reaper = find_child_reaper(father);
if (list_empty(&father->children))


kernel/exit.c:find_child_reaper() is looking if a father is available.
If a father (or other relative) is not available at all, we must be the PID 1 process.

This is the interesting part:

if (unlikely(pid_ns == &init_pid_ns)) {
    panic("Attempted to kill init! exitcode=0x%08x\n",
        father->signal->group_exit_code ?: father->exit_code);

init_pid_ns refers (declared in kernel/pid.c) to our real init process.
If the real init process exits, panic the whole system since it cannot continue without an init process.
If it is not, call zap_pid_ns_processes(), here we have our PID1-cannot-be-killed-exception we are looking for!
We contiue following the call chain down to zap_pid_ns_processes().


zap_pid_ns_processes function is part of the PID namespace and is located in kernel/pid_namespace.c
The function iterates through all tasks in the same group and send signal SIGKILL to each of them.

nr = next_pidmap(pid_ns, 1);
while (nr > 0) {


    task = pid_task(find_vpid(nr), PIDTYPE_PID);
    if (task && !__fatal_signal_pending(task))
    >>>>> send_sig_info(SIGKILL, SEND_SIG_FORCED, task);


    nr = next_pidmap(pid_ns, nr);



The PID 1 in containers is handled in a seperate way than the real init process.
This is obvious, but now we know where the codeflow differ for PID 1 in different namespaces.

We also see that if the PID1 in a PID namespace dies, all the subprocesses will be terminated with SIGKILL.
This behavior reflects the fact that the init process is essential for the correct operation of any PID namespace.

2.2″ TFT and BeagleBone

2.2" TFT display on Beaglebone

I recently bought a 2.2" TFT display on Ebay (come on, 7 bucks…) and was up to use it with my BeagleBone. Luckily for me there was no Linux driver for the ILI9341 controller so it is just to roll up my sleeves and get to work.

Boot up the BeagleBone

I haven’t booted up my bone for a while and support for the board seems to have reached the mainline in v3.8 (currently at v3.15), so the first step is just to get it boot with a custom kernel.

Clone the vanilla kernel from

git clone git://

Use the omap2plus_defconfig as base:

make ARCH=arm omap2plus_defconfig

I will still use my old U-boot version, which does not have support for devicetrees, so I have to make sure that


This simply tells the boot code to look for a device tree binary (DTB) appended to the zImage. Without this option, the kernel expects the address of a dtb in the r2 register (on ARM architectures), but that does not work on my ancient bootloader.

Next step is to compile the kernel. We are using U-Boot as bootloader, but we do not create an uImage since we have to append the dtb to the zImage before that.:

make ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi-

Next, create the device tree blob. We are using the arch/arm/dts/am335x-bone.dts as source.:

make ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi- am33x-bone.dtb

Now we are only two steps behind a booting kernel! First we need to append the dtb to the zImage, and then we need to create an U-boot-friendly kernel image with mkimage.:

cat arch/arm/boot/zImage arch/arm/boot/dts/am335x-bone.dtb > ./zImage_dtb
mkimage -A arm -O linux -T kernel -C none -a 0x80008000 -e 0x80008000 -n 'BeagleBone image' -d ./zImage_dtb uImage

Put the uImage on the uSD-card and boot it up. ..

BeagleBone login:


Enable SPI

First of all, we need to setup the pinmux for the spi-bus. This is done with the pinctrl subsystem in the devicetree interface file (arch/arm/boot/dts/am335x-bone-common.dtsi).

Create the pins. For more detailed explaination of the values, see the BeagleBone System Reference Manual.

spi1_pins: spi1_pins_s0 {
   pinctrl-single,pins = <
     0x190 0x33      /* mcasp0_aclkx.spi1_sclk, INPUT_PULLUP | MODE3 */
     0x194 0x33      /* mcasp0_fsx.spi1_d0, INPUT_PULLUP | MODE3 */
     0x198 0x13      /* mcasp0_axr0.spi1_d1, OUTPUT_PULLUP | MODE3 */
     0x19c 0x13      /* mcasp0_ahclkr.spi1_cs0, OUTPUT_PULLUP | MODE3 */

Then override the spi1 entry and create an instance of our device driver. The driver will have the name "ili9341-fb".

 status = "okay";
 pinctrl-names = "default";
 pinctrl-0 = <&spi1_pins>;
 ili9341: ili9341@0 {
  compatible = "ili9341-fb";
  reg = <0>;
  spi-max-frequency = <16000000>;
  dc-gpio = <&gpio3 19 GPIO_ACTIVE_HIGH>;

Create an entry in the Kbuild system

I always integrate the modules into the kbuild system as the first step. This for several reasons:
– I use one kernel for all of my projects, just different branches
– It is simple to jump around with cscope/ctags
– It gives you control when the kernel version and your driver follow eachother
– Out-of-tree modules is evil (gives you a tainted kernel and everyone will spit on you)

Those who don’t know how to put a module into the kbuild system – get ready to be surprised how simple it is!

Every directory in the kernel structure contains at least two files, a Makefile and a Kconfig. The Makefile tells the make buildsystem which files to compile and the Kconfig file is interpreted by (menu|k|x|old|….)config.

Here is what’s needed:

diff --git a/drivers/video/fbdev/Kconfig b/drivers/video/fbdev/Kconfig
index e1f4727..be4ec8f 100644
--- a/drivers/video/fbdev/Kconfig
+++ b/drivers/video/fbdev/Kconfig
@@ -163,6 +163,18 @@ config FB_DEFERRED_IO
        depends on FB
+config FB_ILI9341
+       tristate "ILI9341 TFT driver"
+       depends on FB
+       select FB_SYS_FILLRECT
+       select FB_SYS_COPYAREA
+       select FB_SYS_IMAGEBLIT
+       select FB_SYS_READ
+       select FB_DEFERRED_IO
+       ---help---
+       This enables functions for handling video modes using the ili9341 controller
 config FB_HECUBA
        depends on FB
diff --git a/drivers/video/fbdev/Makefile b/drivers/video/fbdev/Makefile
index 0284f2a..105166a 100644
--- a/drivers/video/fbdev/Makefile
+++ b/drivers/video/fbdev/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_FB_ATARI)            += atafb.o c2p_iplan2.o atafb_mfb.o
                                      atafb_iplan2p2.o atafb_iplan2p4.o atafb_iplan2p8.o
 obj-$(CONFIG_FB_MAC)              += macfb.o
 obj-$(CONFIG_FB_HECUBA)           += hecubafb.o
+obj-$(CONFIG_FB_ILI9341)          += ili9341.o
 obj-$(CONFIG_FB_N411)             += n411.o
 obj-$(CONFIG_FB_HGA)              += hgafb.o
 obj-$(CONFIG_FB_XVR500)           += sunxvr500.o
diff --git a/drivers/video/fbdev/ili9341.c b/drivers/video/fbdev/ili9341.c

Deferred IO

Deferred IO is a way to delay and repurpose IO. It uses host memory as a buffer and the MMU pagefault as a pretrigger for when to perform the device IO.
You simple tell the kernel the minimum delay between the triggers should occours, this allows you to do burst transfers to the device at a given framerate. This has the big benefit that if the userspace updates the framebuffer several times in this period, we will only write it once.

The interface is _really_ simple. All you need to follow is these four steps (see Documentation/fb/deferred_io.txt):

  1. Setup your structure.

    static struct fb_deferred_io hecubafb_defio = {
     .delay  = HZ,
     .deferred_io = hecubafb_dpy_deferred_io,

The delay is the minimum delay between when the page_mkwrite trigger occurs
and when the deferred_io callback is called. The deferred_io callback is
explained below.

  1. Setup your deferred IO callback.

    static void hecubafb_dpy_deferred_io(struct fb_info *info,
        struct list_head *pagelist)

The deferred_io callback is where you would perform all your IO to the display
device. You receive the pagelist which is the list of pages that were written
to during the delay. You must not modify this list. This callback is called
from a workqueue.

  1. Call init:

    info->fbdefio = &hecubafb_defio;
  2. Call cleanup:



The driver is quite straight forward and there was no really hard problem with the driver itself. However, I had problem to get a high framerate because the SPI communication took time. All SPI communication is asynchronious and all jobs is stacked on a queue before it gets scheduled. This takes time. One obvious solution is to write bigger chunks with each transfer, and that is what I did.

But the problem was that when I increased the chunk size, the kernel got panic with the DMA transfers.
After an half a hour of code-digging, the problem is derived to the spi-controller for the omap2 (drivers/spi/spi-omap2-mcspi.c). It defines the DMA_MIN_BYTES which is arbitrarily set to 160. The code then compare the data length to this constant and determine if it should use DMA or not. It shows up that the DMA-transfer-code itself is broken.

A temporary solution is to increase the DMA_MIN_BYTES to at least a full frame (240x320x2) bytes until I have looked at the DMA code and submitted a fix 🙂


Here is a shell started from Ubuntu

I have also tested to startup Qt and directfb applications. It all works like a charm.

The Deferred IO interface is really nice for such displays. I’m surprised that there is currently so few drivers using it.

(the not so cleaned up) Code:

 * linux/drivers/video/ili9341.c -- FB driver for ili9341 controller
 * Copyright (C) 2014, Marcus Folkesson
 * This file is subject to the terms and conditions of the GNU General Public
 * License. See the file COPYING in the main directory of this archive for
 * more details.
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/delay.h>
#include <linux/interrupt.h>
#include <linux/fb.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/uaccess.h>
#include <linux/spi/spi.h>
#include <video/ili9341.h>
#include <linux/regmap.h>
#include <linux/gpio.h>
#include <linux/of.h>
#include <linux/gpio.h>
#include <linux/of_gpio.h>
#include <linux/debugfs.h>

/* Display specific information */
#define SCREEN_WIDTH (240)
#define SCREEN_HIGHT (320)
#define SCREEN_BPP  (16)
#define ID "cbdff49d683b"
#define ID_SZ 12

static unsigned int chunk_size;

struct ili9341_priv {
 struct spi_device *spi;
 struct regmap *regmap;
 struct fb_info *info;
 u32 vsize;
 int dc;
 char *fbmem;
 struct dentry *dir;
static struct fb_fix_screeninfo ili9341_fix = {
 .id =  "ili9341",
 /*.visual = FB_VISUAL_MONO01,*/
 .xpanstep = 0,
 .ypanstep = 0,
 .ywrapstep = 0,
 .line_length = SCREEN_WIDTH*2,
 .accel = FB_ACCEL_NONE,
static struct fb_var_screeninfo ili9341_var = {
 .xres  = SCREEN_WIDTH,
 .yres  = SCREEN_HIGHT,
 .xres_virtual = SCREEN_WIDTH,
 .yres_virtual = SCREEN_HIGHT,
 .bits_per_pixel = SCREEN_BPP,
 .nonstd  = 1,
 .red = {
  .offset = 11,
  .length = 5,
 .green = {
  .offset = 5,
  .length = 6,
 .blue = {
  .offset = 0,
  .length = 5,
 .transp = {
  .offset = 0,
  .length = 0,

static const struct regmap_config ili9341_regmap_config = {
 .reg_bits = 8,
 .val_bits = 8,
 .can_multi_write = 1,
static void fill(struct ili9341_priv *priv);
static void fill_area(struct ili9341_priv *priv, int y1, int y2);
/* main ili9341 functions */
static void apollo_send_data(struct ili9341_priv *par, unsigned char data)
 /* set data */
static void apollo_send_command(struct ili9341_priv *par, unsigned char data)
static void ili9341_dpy_update(struct ili9341_priv *par)
static void ili9341_dpy_update_area(struct ili9341_priv *par, int y1, int y2 )
 fill_area(par, y1, y2);
/* this is called back from the deferred io workqueue */
static void ili9341_dpy_deferred_io(struct fb_info *info,
    struct list_head *pagelist)
 struct page *cur;
 struct fb_deferred_io *fbdefio = info->fbdefio;
 struct ili9341_priv *par = info->par;

 struct page *page;
 unsigned long beg, end;
 int y1, y2, miny, maxy;
 miny = INT_MAX;
 maxy = 0;
 /* stop here if list is empty */
 if (list_empty(pagelist)){
  dev_err(&par->spi->dev, "pagelist is empty");
 list_for_each_entry(page, pagelist, lru) {
  beg = page->index << PAGE_SHIFT;
  end = beg + PAGE_SIZE - 1;
  y1 = beg / (info->fix.line_length);
  y2 = end / (info->fix.line_length);
  if (y2 >= info->var.yres)
   y2 = info->var.yres - 1;
  if (miny > y1)
   miny = y1;
  if (maxy < y2)
   maxy = y2;
 ili9341_dpy_update_area(info->par, miny, maxy);
 // dev_err(&par->spi->dev, ".");
static void ili9341_fillrect(struct fb_info *info,
       const struct fb_fillrect *rect)
 struct ili9341_priv *par = info->par;
 sys_fillrect(info, rect);
static void ili9341_copyarea(struct fb_info *info,
       const struct fb_copyarea *area)
 struct ili9341_priv *par = info->par;
 sys_copyarea(info, area);
static void ili9341_imageblit(struct fb_info *info,
    const struct fb_image *image)
 struct ili9341_priv *par = info->par;
 sys_imageblit(info, image);
 * this is the slow path from userspace. they can seek and write to
 * the fb. it's inefficient to do anything less than a full screen draw
static ssize_t ili9341_write(struct fb_info *info, const char __user *buf,
    size_t count, loff_t *ppos)
 struct ili9341_priv *par = info->par;
 unsigned long p = *ppos;
 void *dst;
 int err = 0;
 unsigned long total_size;
 if (info->state != FBINFO_STATE_RUNNING)
  return -EPERM;
 total_size = info->fix.smem_len;
 if (p > total_size)
  return -EFBIG;
 if (count > total_size) {
  err = -EFBIG;
  count = total_size;
 if (count + p > total_size) {
  if (!err)
   err = -ENOSPC;
  count = total_size - p;
 dst = (void __force *) (info->screen_base + p);
 if (copy_from_user(dst, buf, count))
  err = -EFAULT;
 if  (!err)
  *ppos += count;
 return (err) ? err : count;
static struct fb_ops ili9341_ops = {
 .owner   = THIS_MODULE,
 .fb_write  = ili9341_write,
 .fb_fillrect = ili9341_fillrect,
 .fb_copyarea = ili9341_copyarea,
 .fb_imageblit = ili9341_imageblit,
static struct fb_deferred_io ili9341_defio = {
 .delay  = HZ/60,
 .deferred_io = ili9341_dpy_deferred_io,

static void write_command(struct ili9341_priv *priv, u8 data)
 gpio_set_value(priv->dc, 0);
 spi_write(priv->spi, &data, 1);
 gpio_set_value(priv->dc, 1);
static void write_data(struct ili9341_priv *priv, u8 data)
 gpio_set_value(priv->dc, 1);
 spi_write(priv->spi, &data, 1);
static void write_data16(struct ili9341_priv *priv, u8 data)
 gpio_set_value(priv->dc, 1);
 spi_write(priv->spi, &data, 1);
static void init(struct ili9341_priv *priv)
 write_command(priv, 0xCB);
 write_data(priv, 0x39);
 write_data(priv, 0x2C);
 write_data(priv, 0x00);
 write_data(priv, 0x34);
 write_data(priv, 0x02);
 write_command(priv, 0xCF);
 write_data(priv, 0x00);
 write_data(priv, 0XC1);
 write_data(priv, 0X30);
 write_command(priv, 0xE8);
 write_data(priv, 0x85);
 write_data(priv, 0x00);
 write_data(priv, 0x78);
 write_command(priv, 0xEA);
 write_data(priv, 0x00);
 write_data(priv, 0x00);
 write_command(priv, 0xED);
 write_data(priv, 0x64);
 write_data(priv, 0x03);
 write_data(priv, 0X12);
 write_data(priv, 0X81);
 write_command(priv, 0xF7);
 write_data(priv, 0x20);
 write_command(priv, 0xC0);     //Power control
 write_data(priv, 0x23);    //VRH[5:0]
 write_command(priv, 0xC1);     //Power control
 write_data(priv, 0x10);    //SAP[2:0];BT[3:0]
 write_command(priv, 0xC5);     //VCM control
 write_data(priv, 0x3e);    //Contrast
 write_data(priv, 0x28);
 write_command(priv, 0xC7);     //VCM control2
 write_data(priv, 0x86);    //--
/* XXX: Hue?! */
 write_command(priv, 0x36);     // Memory Access Control
 write_data(priv, 0x48);   //C8    //48 68绔栧睆//28 E8 妯睆
 write_command(priv, 0x3A);
 write_data(priv, 0x55);
 write_command(priv, 0xB1);
 write_data(priv, 0x00);
 write_data(priv, 0x18);
 write_command(priv, 0xB6);     // Display Function Control
 write_data(priv, 0x08);
 write_data(priv, 0x82);
 write_data(priv, 0x27);

 write_command(priv, 0xF2);     // 3Gamma Function Disable
 write_data(priv, 0x00);
 write_command(priv, 0x26);     //Gamma curve selected
 write_data(priv, 0x01);
 write_command(priv, 0xE0);     //Set Gamma
 write_data(priv, 0x0F);
 write_data(priv, 0x31);
 write_data(priv, 0x2B);
 write_data(priv, 0x0C);
 write_data(priv, 0x0E);
 write_data(priv, 0x08);
 write_data(priv, 0x4E);
 write_data(priv, 0xF1);
 write_data(priv, 0x37);
 write_data(priv, 0x07);
 write_data(priv, 0x10);
 write_data(priv, 0x03);
 write_data(priv, 0x0E);
 write_data(priv, 0x09);
 write_data(priv, 0x00);
 write_command(priv, 0XE1);     //Set Gamma
 write_data(priv, 0x00);
 write_data(priv, 0x0E);
 write_data(priv, 0x14);
 write_data(priv, 0x03);
 write_data(priv, 0x11);
 write_data(priv, 0x07);
 write_data(priv, 0x31);
 write_data(priv, 0xC1);
 write_data(priv, 0x48);
 write_data(priv, 0x08);
 write_data(priv, 0x0F);
 write_data(priv, 0x0C);
 write_data(priv, 0x31);
 write_data(priv, 0x36);
 write_data(priv, 0x0F);
 write_command(priv, 0x11);     //Exit Sleep
 write_command(priv, 0x29);    //Display on
 write_command(priv, 0x2c);
static void setCol(struct ili9341_priv *priv, u16 start, u16 end)
 u8 tmp;
 write_command(priv, 0x2a);
 tmp = (start & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (start & 0x00ff) >> 0;
 write_data(priv, tmp);

 tmp = (end & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (end & 0x00ff) >> 0;
 write_data(priv, tmp);
static void setPage(struct ili9341_priv *priv, u16 start, u16 end)
 u8 tmp;
 write_command(priv, 0x2b);
 tmp = (start & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (start & 0x00ff) >> 0;
 write_data(priv, tmp);

 tmp = (end & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (end & 0x00ff) >> 0;
 write_data(priv, tmp);
static void setPos(struct ili9341_priv *priv, u16 x1, u16 x2, u16 y1, u16 y2)
 setPage(priv, y1, y2);
 setCol(priv, x1, x2);

static void fill_area(struct ili9341_priv *priv, int y1, int y2)
 int i = 0;
 char val = 0xaa;
 char *p = priv->fbmem;
 int ret;
 int start =y1*SCREEN_WIDTH*2 + 1;
 int stop = y2*SCREEN_WIDTH*2+1;
 int range = stop - start;

 if (!chunk_size)
  chunk_size = 10;
 if (start + range > priv->vsize)
  range = priv->vsize - start;
 setCol(priv, 0, 239);
 setPage(priv, y1, y2);
 write_command(priv, 0x2c);

 for(i = start; i < stop; i += chunk_size)
  if ( i + chunk_size > stop )
   chunk_size = stop - i;
  ret = spi_write(priv->spi, &priv->fbmem[i], chunk_size);
  if (ret != 0)
   dev_err(&priv->spi->dev, "Error code: %in", ret);
static void fill(struct ili9341_priv *priv)
 int i = 0;
 char val = 0xaa;
 char *p = priv->fbmem;
 setCol(priv, 0, 239);
 setPage(priv, 0, 319);
 write_command(priv, 0x2c);

 fill_area(priv, 0, 319);
static ssize_t id_show(struct device *dev, struct device_attribute *attr,
   char *buf)
 sprintf(buf, "%s", ID);
 return ID_SZ;
static ssize_t id_store(struct device *dev, struct device_attribute *attr,
    const char *buf, size_t count)
 char kbuf[ID_SZ];
 if (count != ID_SZ)
  return -EINVAL;
 memcpy(kbuf, buf, ID_SZ);
 if (memcmp(kbuf, ID, ID_SZ) != 0)
  return -EINVAL;
 return count;
DEVICE_ATTR(id, 0666, id_show, id_store);
static struct attribute *ili9341_attrs[] = {

static int ili9341_probe(struct spi_device *spi)
 struct fb_info *info;
 int retval = -ENOMEM;
 struct ili9341_priv *priv;
 struct device_node *np = spi->dev.of_node;
 int ret;
 dev_err(&spi->dev, "Hello from I!n");
 priv = kzalloc(sizeof(struct ili9341_priv), GFP_KERNEL);
  return -ENOMEM;
 priv->spi = spi;

/* TODO: better fail handling... */
 priv->dc = of_get_named_gpio(np, "dc-gpio", 0);
 if (priv->dc  == -EPROBE_DEFER)
  return -EPROBE_DEFER;
 if (gpio_is_valid(priv->dc)) {
  ret = devm_gpio_request(&spi->dev, priv->dc, "tft dc");
  if (ret)
   dev_err(&spi->dev, "could not request dcn");
  ret = gpio_direction_output(priv->dc, 1);
  if (ret)
   dev_err(&spi->dev, "could not set DC to output");
  dev_err(&spi->dev, "DC gpio is not valid");

 dev_err(&spi->dev, "Initialize regmap");
 priv->regmap = devm_regmap_init_spi(spi, &ili9341_regmap_config);
 if (IS_ERR(priv->regmap))
  goto err_regmap;
 dev_err(&spi->dev, "regmap OK");
 priv->fbmem = vzalloc(priv->vsize);
 if (!priv->fbmem)
  goto err_videomem_alloc;

 retval = sysfs_create_group(&spi->dev.kobj, *ili9341_groups);
 if (retval)

 dev_err(&spi->dev, "Allocate framebuffer");
 info = framebuffer_alloc(sizeof(struct fb_info), &spi->dev);
 if (!info)
  goto err_fballoc;

 info->par = priv;
 priv->info = info;
 info->screen_base = priv->fbmem;
 info->fbops = &ili9341_ops;
 info->var = ili9341_var;
 info->fix = ili9341_fix;
 info->fix.smem_len = priv->vsize;
 /* We are virtual as we only exists in memory */
 info->fbdefio = &ili9341_defio;
 retval = register_framebuffer(info);
 if (retval < 0)
  goto err_fbreg;
 spi_set_drvdata(spi, info);
 fb_info(info, "Hecuba frame buffer device, using %dK of video memoryn",
  priv->vsize >> 10);

 priv->dir = debugfs_create_dir("ili9341-fb", NULL);
 debugfs_create_u32("chunk_size", 0666, priv->dir, &chunk_size);

 return 0;
 return retval;
static int ili9341_remove(struct spi_device *spi)
 struct fb_info *info = spi_get_drvdata(spi);
 if (info) {
  struct ili9341_priv *priv = info->par;
 return 0;

static const struct spi_device_id ili9341_ids[] = {
 {"ili9341-fb", 0},
MODULE_DEVICE_TABLE(spi, ili9341_ids);
static struct spi_driver  ili9341_driver = {
 .probe = ili9341_probe,
 .remove = ili9341_remove,
 .id_table = ili9341_ids,
 .driver = {
  .owner = THIS_MODULE,
  .name = "ili9341-fb",
MODULE_DESCRIPTION("fbdev driver for ili9341 controller");
MODULE_AUTHOR("Marcus Folkesson <>");

Take control of your Buffalo Linkstation NAS

Take control of your Buffalo Linkstation NAS

I finally bought a NAS for all of my super-important stuff.
It became a Buffalo Linkstation LS200, most because of the price ($300 for 4TB). It supports all of the standard protocols such as FTP, SAMBA, ATP and so on.

However, it would be really useful to use some sane protocols like sftp so you could use rsync for your backup scripts.

Bring a big coffee mug and let the hacking begin….

I knew that the NAS was based on the ARM architecture and supports a whole set of high level protocols, so one qualified guess is that there lives a little penguin in the box.

Lets start with download the latest firmware from the Buffalo webpage.
When the firmware is unzipped we have these files:

marcus@tuxie:~/shared/buffalo$ ls -al
total 780176
drwxr-xr-x  4 marcus marcus      4096 Jun 13 00:06 .
drwxr-xr-x 18 marcus marcus      4096 Jun 12 23:27 ..
-rw-r--r--  1 marcus marcus 190409979 Jun 13 00:06 hddrootfs.img
drwxr-xr-x  2 marcus marcus      4096 Apr  8 13:25 img
-rw-r--r--  1 marcus marcus  12602325 Apr  1 15:25 initrd.img
-rw-r--r--  1 marcus marcus       656 Apr  1 15:25 linkstation_version.ini
-rw-r--r--  1 marcus marcus       198 Apr  1 15:25 linkstation_version.txt
-rw-r--r--  1 marcus marcus 205568610 Jun 12 23:13
-rw-r--r--  1 marcus marcus    350104 Apr  1 15:25 LSUpdater.exe
-rw-r--r--  1 marcus marcus       327 Apr  1 15:25 LSUpdater.ini
-rw-r--r--  1 marcus marcus    674181 Apr  1 15:25 u-boot.img
-rw-r--r--  1 marcus marcus   2861933 Apr  1 15:25 uImage.img
-rw-r--r--  1 marcus marcus      4880 Apr  8 13:23 update.html

Ok, u-boot.img, uImage.img, initrd.img and hddrootfs.img tells us that I have a black little penguin cage
in front of me.
First of all, find out what kind of file these *.img files really are.:

System Message: WARNING/2 (<stdin>, line 34); backlink

Inline emphasis start-string without end-string.

marcus@tuxie:~/shared/buffalo$ file ./hddrootfs.img
./hddrootfs.img: Zip archive data, at least v2.0 to extract

Really? It is just an zip-file. Lets extract it then.:

marcus@tuxie:~/shared/buffalo$ unzip hddrootfs.img
Archive:  hddrootfs.img
[hddrootfs.img] hddrootfs.buffalo.updated password:

Of course, it is protected with a password. I will let my old friend John the Ripper take a look at it (I guess I had a great luck, the brute force attack only took 2.5 hours).
The password for the file is: aAhvlM1Yp7_2VSm6BhgkmTOrCN1JyE0C5Q6cB3oBB

marcus@tuxie:~/shared/buffalo$ unzip hddrootfs.img
Archive:  hddrootfs.img
[hddrootfs.img] hddrootfs.buffalo.updated password:
  inflating: hddrootfs.buffalo.updated

Terrific. We got a hddrootfs.buffalo.updated file. What is it anyway?:

marcus@tuxie:~/shared/buffalo$ file hddrootfs.buffalo.updated
hddrootfs.buffalo.updated: gzip compressed data, was "rootfs.tar", from Unix, last modified: Tue Apr  1 08:24:05 2014, max compression

It is just a gzip compressed tar archive, couldn’t be better! Extract it.:

marcus@tuxie:~/shared/buffalo$ mkdir rootfs
marcus@tuxie:~/shared/buffalo$ tar -xz --numeric-owner -f hddrootfs.buffalo.updated  -C ./rootfs/
marcus@tuxie:~/shared/buffalo$ ls -1l rootfs/
total 80
drwxr-xr-x  2 root root 4096 Apr  1 08:23 bin
-rwxr-xr-x  1 root root 1140 Feb  3 07:51
drwxr-xr-x  2 root root 4096 Apr  1 08:23 debugtool
drwxr-xr-x  5 root root 4096 Apr  1 08:23 dev
drwxr-xr-x 33 root root 4096 Jun 12 23:17 etc
drwxr-xr-x  4 root root 4096 Apr  1 08:23 home
drwxr-xr-x  9 root root 4096 Apr  1 08:23 lib
drwxr-xr-x  3 root root 4096 Feb  3 07:51 mnt
drwxr-xr-x  2 root root 4096 Apr  1 07:35 opt
-rwxr-xr-x  1 root root 2741 Feb  3 07:51
drwxr-xr-x  2 root root 4096 Apr  1 07:35 proc
drwxr-xr-x  3 root root 4096 Apr  1 08:23 root
drwxr-xr-x  3 root root 4096 Apr  1 08:22 run
drwxr-xr-x  2 root root 4096 Apr  1 08:23 sbin
drwxr-xr-x  2 root root 4096 Apr  1 07:35 sys
-rwxr-xr-x  1 root root 3751 Feb  3 07:51
drwxrwxrwt  3 root root 4096 Apr  1 08:23 tmp
drwxr-xr-x 11 root root 4096 Apr  1 08:23 usr
drwxr-xr-x  9 root root 4096 Apr  1 08:22 var
drwxrwxrwx  6 root root 4096 Apr  1 07:53 www

Here we go!

Modify the root filesystem

First of all, I really would like to have SSH access to the box, and I found that there is a SSH daemon in here (/usr/bin/sshd), but why is it not activated?
Take a look in one of the scripts that seems to be related to ssh:

marcus@tuxie:~/shared/buffalo/rootfs$ head -n 20 etc/init.d/
[ -f /etc/nas_feature ] && . /etc/nas_feature
SSHD=`which sshd`
if [ "${SSHD}" = "" -o ! -x ${SSHD} ] ; then
 echo "sshd is not supported on this platform!!!"
if [ "${SUPPORT_SFTP}" = "0" ] ; then
        echo "Not support sftp on this model." > /dev/console
        exit 0
umask 000

What about the second if-statement? SUPPORT_SFTP? And what is the /etc/nas_feature file? It does not exist in the package. Is it auto generated at boot?
Anyway, I remove the second statement, it seems evil.
So, if this starts up the ssh daemon, we would like to login as root, uncomment PermitRootLogin in sshd_config:

marcus@tuxie:~/shared/buffalo/rootfs$ sed -i 's/#PermitRootLogin/ PermitRootLogin/' etc/sshd_config

Then copy your public rsa-key to /root/.ssh/authorized_keys. If you don’t have a key, generate it with:

marcus@tuxie:~/shared/buffalo/rootfs$ ssh-keygen

Copy the key to target:

marcus@tuxie:~/shared/buffalo/rootfs$ mkdir ./root/.ssh
marcus@tuxie:~/shared/buffalo/rootfs$ cat ~/.ssh/ > ./root/.ssh/authorized_keys

This will let you to login without give any password.

Lets try to re-pack the whole thing.:

marcus@tuxie:~/shared/buffalo/rootfs$ mv ../hddrootfs.buffalo.updated{,.old}
marcus@tuxie:~/shared/buffalo/rootfs$ tar -czf ../hddrootfs.buffalo.updated *
marcus@tuxie:~/shared/buffalo/rootfs$ cd ..
marcus@tuxie:~/shared/buffalo$ zip -e hddrootfs.img hddrootfs.buffalo.updated

I encrypt the file with the same password as before, I dare not think about what happens if I don’t.

Time to update firmware

I execute the LSUpdater.exe from a virtual Windows machine and hold my thumbs…
The update process takes about 8 minutes and is a real pain, would it brick my NAS..?

After a while the power LED is indicating that the NAS is up and running. Wow.
Quick! Do a portscan!:

marcus@tuxie:~/shared/buffalo$ sudo nmap -sS
[sudo] password for marcus:
Starting Nmap 5.21 ( ) at 2014-06-13 11:59 CEST
Nmap scan report for nas (
Host is up (0.00049s latency).
Not shown: 990 closed ports
21/tcp    open  ftp
22/tcp    open  ssh
80/tcp    open  http
139/tcp   open  netbios-ssn
443/tcp   open  https
445/tcp   open  microsoft-ds
548/tcp   open  afp
873/tcp   open  rsync
8873/tcp  open  unknown
22939/tcp open  unknown
MAC Address: DC:FB:02:EB:06:A8 (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.27 seconds

And there it is! The SSH daemon is running on port 22.:

marcus@tuxie:~$ ssh admin@
admin@'s password:
[admin@LS220D6A8 ~]$ ls /
bin/        debugtool/  home/       mnt/        proc/       sbin/       tmp/        www/
boot/       dev/        lib/        opt/        root/       sys/        usr/*  etc/        lost+found/* run/*    var/
[admin@LS220D6A8 ~]$

It is just beautiful!

Wait, what about the /etc/nas_features file?

[admin@LS220D6A8 ~]$ cat /etc/nas_feature

It seems that the files is generated. It also has the SUPPORT_SFTP config that we saw in

What about the kernel

In the current vanilla kernel, there is devicetrees that seems to be for the Buffalo linkstation.:

marcus@tuxie:~/marcus/linux/linux$ grep -i buffalo arch/arm/boot/dts/*.dts
arch/arm/boot/dts/kirkwood-lschlv2.dts: model = "Buffalo Linkstation LS-CHLv2";
arch/arm/boot/dts/kirkwood-lschlv2.dts: compatible = "buffalo,lschlv2", "buffalo,lsxl", "marvell,kirkwood-88f6281", "marvell,kirkwood";
arch/arm/boot/dts/kirkwood-lsxhl.dts:   model = "Buffalo Linkstation LS-XHL";
arch/arm/boot/dts/kirkwood-lsxhl.dts:   compatible = "buffalo,lsxhl", "buffalo,lsxl", "marvell,kirkwood-88f6281", "marvell,kirkwood";

The device tree seems to match the current CPU:

[admin@LS220D6A8 ~]$ cat /proc/cpuinfo
Processor : Marvell PJ4Bv7 Processor rev 1 (v7l)
BogoMIPS : 795.44
Features : swp half thumb fastmult vfp edsp vfpv3 vfpv3d16 tls
CPU implementer : 0x56
CPU architecture: 7
CPU variant : 0x1
CPU part : 0x581
CPU revision : 1
Hardware : Marvell Armada-370
Revision : 0000
Serial  : 0000000000000000

Also, the kernel is not tainted indicating that there is no out-of-tree modules.:

[admin@LS220D6A8 ~]$ cat /proc/sys/kernel/tainted

It could therefor be possible to compile a custom kernel with support for more USB-devices that may be plugged into the NAS.

Other tips

The LSUpdater.exe will refuse to update if the same version of software is already on the target. This means that you cannot upload the same firmware again… unless you change the version!

Together with the uploader application, there is a linkstation_version.ini file that contains information about each of the *.img.
The first thing I tried was just to increase the VERSION by one. This makes the LSUpdater.exe move a little forward, It stops complain about the same version, instead it complains about that this firmware is already on the target.
However, I needed to change the timestamp of each binary (increased the day by one), then it updates the firmware without any problem.

System Message: WARNING/2 (<stdin>, line 435); backlink

Inline emphasis start-string without end-string.

marcus@tuxie:~/shared/buffalo$ cat linkstation_version.ini
KERNEL=2014/04/01 14:33:46
INITRD=2014/04/01 14:35:10
ROOTFS=2014/04/01 15:23:41
FILE_BOOT = u-boot.img
FILE_KERNEL = uImage.img
FILE_INITRD = initrd.img
FILE_ROOTFS = hddrootfs.img

KERNEL=2014/04/01 14:33:46
KERNEL=2014/04/01 14:33:46
KERNEL=2014/04/01 14:33:46

High resolution timers

High resolution timers

Nearly all systems has some kind of Programmable Interrupt Timer (PIT) or High Precision Event Timer (HPET) that is programmed to periodically interrupt the operating system (if not configured with CONFIG_NO_HZl). The kernel performs several tasks in every of these ticks, such as timekeeping, calculate statistics for the currently running process, schedule a new process and so on.
The interrupt occurs at regular intervals – exactly HZ times per second. HZ is architecture specific and defined in asm-arch/param.h.

Jiffies is a central concept when talking about time management in the Linux kernel. A jiffy is simple the time between the ticks. More exactly 1/HZ seconds.
HZ has a typical value of 250 on IA-32/AMD64 architectures, and 100 on smaller systems such as ARM.

Most of the time management in the Linux-kernel is based on jiffies, even the timer_list (also known as low-resolution-timers).

High resolution timers (hrtimers) in the Linux kernel is timers that do not use a time specification based on jiffies, but employ nanosecond time stamps. In fact, the low resolution timers are implemented on top of the high-resolution mechanism, but that is another story.
Components of the hrtimer framework that are not universally applicable (not used by the low-resolution timers) is selected by CONFIG_HIGH_RES_TIMERS in the kernel configuration.

Setting up a timer

The usage of the hrtimers is really simple.

  1. Initialize a struct hrtimer with

    hrtimer_init(struct hrtimer *timer, clockid_t which_clock, enum hrtimer_mode mode);

timer is a pointer to the instance of the struct hrtimer.
clock is the clock to bind the timer to, often CLOCK_MONOTONIC or CLOCK_REALTIME.
mode specifies if the timer is working with absolute or relative time values. Two constants are available: HRTIMER_MODE_ABS and HRTIMER_MODE_REL.

Set a callback function with:

mytimer.function = my_callback;

Where my_callback is declared as:

enum hrtimer_restart my_callback(struct hrtimer *timer)

Start the timer with hrtimer_start:

struct ktime_t delay = ktime_set(5,0);
hrtimer_start(&mytimer, delay, CLOCK_MONOTONIC);

ktime_set initialize delay with 5 seconds and 0 nanoseconds.

Wait. The callback function will be called after 5s!

A full example

struct hrtimer mytimer;
ktime_t delay = ktime_set(5, 0);

enum hrtimer_restart my_callback(struct hrtimer *timer)
    printk("Hello from timer!n");

void ....()
    hrtimer_init(&mytimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    mytimer.function = my_callback;

    hrtimer_start(&mytimer, delay, CLOCK_MONOTONIC);

Further reading

There is more functions related to the hrtimers. See incude/linux/hrtimer.h for a full list.
Other useful functions are:

int hrtimer_cancel(struct hrtimer *timer)
int hrtimer_try_to_cancel(struct hrtimer *timer)
int hrtimer_restart(struct hrtimer *timer)

Modules with parameters

Modules with parameters

Everybody knows that modules can take parameters, either via /sys/modules/<module>/parameters or via cmdline to the kernel, but how are these parameters created?

Parameters without callbacks

The Linux kernel provides the module_param() macro. The syntax is:

module_param(name, type, perm)

Which will simply create the module parameter and expose it as an entry in /sys/modules/<module>/parameters.

Code example

int debug_flag;
module_param(debug_flag, bool, S_IRUSR | S_IWUSR | S_IRGRP)
MODULE_PARM_DESC(debug_flag, "Set to 1 if debug should be enabled, 0 otherwise");

MODULE_PARM_DESC() is a short description of the parameter. Modinfo will read the description and present it for you.

Parameters with callbacks

Sometimes it may be useful to actually notify the driver that the value of a parameter has changed, which not the regular module_param() macro does.

module_param_cb is the way to go. The macro takes two callbacks functions, set and get, that is called when the user (or kernel if in cmdline) interact with the parameters. This is done by passing a struct kernel_param_ops to the macro. The syntax is:

module_param_cb(name, ops, arg, perm)
The module_param_cb is not heavily used in the kernel if we look in the drivers:
[06:40:35]marcus@tuxie:~/kernel$ git grep module_param_cb drivers/
drivers/acpi/sysfs.c:module_param_cb(debug_layer, &param_ops_debug_layer, &acpi_dbg_layer, 0644);
drivers/acpi/sysfs.c:module_param_cb(debug_level, &param_ops_debug_level, &acpi_dbg_level, 0644);
drivers/char/ipmi/ipmi_watchdog.c:module_param_cb(action, &param_ops_str, action_op, 0644);
drivers/char/ipmi/ipmi_watchdog.c:module_param_cb(preaction, &param_ops_str, preaction_op, 0644);
drivers/char/ipmi/ipmi_watchdog.c:module_param_cb(preop, &param_ops_str, preop_op, 0644);

In fact, there are just 5 entries, don’t ask me why, I think the macro is terrific.
The interface is really simple, just fill the kernel_param_ops struct and pass it to the module_param_cb macro.
I think the code is quite self-explained, so I just post an example taken from drivers/acpi/sysfs.c.

Code example

static int param_get_debug_level(char *buffer, const struct kernel_param *kp)
 int result = 0;
 int i;
 result = sprintf(buffer, "%-25stHex        SETn", "Description");
 for (i = 0; i < ARRAY_SIZE(acpi_debug_levels); i++) {
  result += sprintf(buffer + result, "%-25st0x%08lX [%c]n",
      (acpi_dbg_level & acpi_debug_levels[i].value)
      ? '*' : ' ');
 result +=
     sprintf(buffer + result, "--ndebug_level = 0x%08X (* = enabled)n",
 return result;

static struct kernel_param_ops param_ops_debug_level = {
 .set = param_set_uint,
 .get = param_get_debug_level,
module_param_cb(debug_level, &param_ops_debug_level, &acpi_dbg_level, 0644);

There is also a set of standard set/get-functions (the code above use param_set_uint for example).
These are called param_(set|get)_XXX where XXX is byte, short, int, long and so on.

Take a look in include/linux/moduleparam.h for further reading!

Linux memory overcommit

Linux memory overcommit

Linux is generous in terms of memory, it will almost never fail on requests from malloc(3) with friends. What does this mean in practice and how may it be a potential issue?

In short, overcommit memory means that the system will give the application so much memory it is asking for, even if the physical memory is not available. How may this work?
Well, the requested memory comes with one small restriction; the application is given as much memory it demands if it not going to use it. Seriously?
Yes, and it is pretty clever too.

The main purpose is to optimize memory handling by avoid swapping out memory as much as possible. The application does not _really_ need the memory before it is going to use it anyway (if it’s going to be used it at all), and it is not unlikely that an other application has freed memory before the allocated memory is used. A swap has been avoided.

Now, think about an embedded system without a swap area and with a limited amount of memory. Is memory overcommit still a good thing? It could be. It could also be a treacherous, unpredictable demon that haunts seemingly random devices.

In a case that the application uses a library that allocates tons of memory but never going to use it all, memory overcommit is pretty good because the application may not even start without it. A weird example? Not at all, let me just say three words; Qt with QML.

On the other hand, if the application really intend to use the memory we have a problem
It is even worse if the application only use the memory under specific circumstances that is hard to track down.

If a system is running out of memory, the unforgiving Out Of Memory- (OOM-) killer will terminate a (almost) random application in desperation.
This randomness makes it a little bit tricky. The victim may be your SSH server, your logging server, your application or whatever stands in the way of the OOM-killer.

The Linux kernel supports the following overcommit handling modes (refer to Documentation/vm/overcommit-accounting)

In practice

The overcommit policy is set via the sysctl vm.overcommit_memory or by writing to /proc/sys/vm/overcommit_memory.

For example:

echo 2 > /proc/sys/vm/overcommit_memory