So, a week in Prague has come to its end. The Embedded Linux Conference Europe was this year co-located with Open Source Summit and offered a lot of interesting talks on various topics.

One of the hottest topics this year was about our most beloved debugging function – prink().
What is so hard with printing? It turns out that printk is quite deadlock-prone and that is not an easy thing to work around in the current infrastructure of the kernel.

A common misconception is that printk() is a fast operation that simply writes the message to the global __log_buf variable. It is not.

A printk() may involve many different subsystems, different contexts or nesting, just to mention a few parts that needs to be handled.
For example:

  1. The output needs to go over some output medium (consoles)
    * The monitor
    * Frame buffers
    * UART / Serial console
    * Network console
    * Braille
    * …
  2. Uses different locking mechanismes
    * The console_lock (described below)
    * The logbuf_lock spinlock
    * Consoles often have their own locks
  3. Wake up waiting applications
    * syslogd
    * journald
    * …

Besides that, printk() is expected to work in every context, whether it is process, softirq, IRQ or NMI context.
With all these locking mechanisms involved, what happens if a printk in process context is interrupted by an NMI, and the NMI also calls printk?
In other words, there is a lot of special cases that needs to be handled.

How it works


Lets look back on how the printing was handled in a pre-history kernel.

SMP (Symmetric Multi Processing) SoCs became common in the late 1990s. Before that, everything was easy and everyone was happy. No NMIs. No races between multiple cores. Simple locking. No Facebook.
As a response to SMP systems, Linux v2.1.80 introduced a spin_lock to printk to avoid race conditions between multiple cores.
The solution we came up with was to serialize all prints to the console. If two CPUs called printk() at the same time, the second core has to wait for the first core to finish.

This does not scale well. In fact, it does not scale at all. What about a modern system with 100+ CPUs that all calls printk at the same time? Depending on the console, the printing may take milliseconds and you will surely end up with an unresponsive system.


Now we are doing things differently.
The first core that grabs the console_lock is responsible to print all messages in the __log_buf. If another core is calling printk() in meanwhile, it puts its data into __log_buf , tries to grab the lock which is busy, and then simple returns.
As __log_buf continues getting new data, the unlucky core that grabbed the console_lock may end up doing nothing but printing.

The good thing is that we only locks up a single core instead of all cores.
The bad thing is that we locks up a single core.

The code


printk() is defined in kernel/printk/printk.c and does not look much to the world

asmlinkage __visible int printk(const char *fmt, ...)
    va_list args;
    int r;

    va_start(args, fmt);
    r = vprintk_func(fmt, args);

    return r;

It simple calls vprintk_function with its own arguments.


vprintk_func() is a function that forward the arguments to different print-functions depending on the current context

__printf(1, 0) int vprintk_func(const char *fmt, va_list args)
    if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK)
        return vprintk_nmi(fmt, args);

    if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
        return vprintk_safe(fmt, args);

    if (this_cpu_read(printk_context) & PRINTK_NMI_DEFERRED_CONTEXT_MASK)
        return vprintk_deferred(fmt, args);

    return vprintk_default(fmt, args);

The different contexts we consider are:

Normal context

If we are on normal context, there is nothing to consider at all, go for the vprintk_default() and just do our thing.

NMI context

In the case that the CPU supports NMIs (Non-Maskable Interrupts, (look for CONFIG_HAVE_NMI and CONFIG_PRINTK_NMI in your .config ), we go for vprintk_nmi(). vprintk_nmi() do a safe copy to a per-CPU buffer, not the global __log_buf.
Since NMIs are not nested by its nature, there is always only one write running. However, NMIs is only for the local CPU, and the buffer might get flushed from another CPU, so we still need to be careful.

"Recursive" context

If the printk() routine is interrupted and we end up in another call to printk from somewhere else, we go for the lock-less vprintk_safe() to prevent a recursion deadlock. vprintk_safe() is using a per-CPU buffer to store the message, just like NMI.

Deferred context

As already said, multiple locks is involved in the call chain of printk(). vprintk_deferred() is using the main logbuf_lock but avoid calling console drivers that might have their own locks. The actual printing is deferred to klogd_work kernel thread.


vprintk_emit() is responsible to write to __log_buf, (but not the only function, cont_flush() also write to __log_buf) and print out the content to all consoles.

asmlinkage int vprintk_emit(int facility, int level,
                const char *dict, size_t dictlen,
                const char *fmt, va_list args)


    <<<<< Strip kernel syslog prefix >>>>>


    <<<<< log_output() does the actual printing to __log_buf >>>>>
    printed_len = log_output(facility, level, lflags, dict, dictlen, text, text_len);


    if (!in_sched) {
         * Try to acquire and then immediately release the console
         * semaphore.  The release will print out buffers and wake up
         * /dev/kmsg and syslog() users.
        if (console_trylock())

    return printed_len;

The function is quite straight forward. The only thing that looks a little bit strange is

if (console_trylock())

Really? Grab the console_lock and immediately unlock it?
The thing is that all magic happens in console_unlock().


The CPU that is grabbing the console_lock is responsible to print to all registered consoles until all new data in __log_buf is printed. This regardless if other CPUs keeps filling the buffer with new data.

In the worst case, this CPU is doing nothing but printing and will never leave this function.

void console_unlock(void)

    <<<<< Endless loop? >>>>>
    for (;;) {

        <<<<< Go through all new messages >>>>>


        <<<<< Print to all consoles >>>><
        call_console_drivers(ext_text, ext_len, text, len);



    <<<<<  Release the exclusive_console once it is used >>>>>
    console_locked = 0;


    <<<<< Wake up klogd >>>>>
    if (wake_klogd)

The function is looping until all new messages is printed. For each new message, a call to call_console_drivers() is made.
The last thing that we do is waking up the klogd kernel thread that will signal to all userspace application that is waiting on klogctl(2).


call_console_drivers() is asking all registered consoles to print out a message. The console_lock must be held when calling this function.

static void call_console_drivers(const char *ext_text, size_t ext_len,
                 const char *text, size_t len)
    struct console *con;

    trace_console_rcuidle(text, len);

    if (!console_drivers)

    for_each_console(con) {
        if (exclusive_console && con != exclusive_console)
        if (!(con->flags & CON_ENABLED))
        if (!con->write)
        if (!cpu_online(smp_processor_id()) &&
            !(con->flags & CON_ANYTIME))
        if (con->flags & CON_EXTENDED)
            con->write(con, ext_text, ext_len);
            con->write(con, text, len);


As we see, there is a lot of logic involved in a simple call to printk() and you should not be surprised if all your printing has impact on your systems performance or timing.
But how do we debug if printk() is a no-no? The answer is trace_printk().

This function write (almost) directly to a trace buffer and is therefore a fairly fast operation.
The trace buffer is exposed from tracefs, usually mounted at /sys/kernel/tracing.

As a bonus, the messages is merged with other output from ftrace when doing a function trace.

Other things that is good to know about __log_buf


The kernel log buffer is exported as a global symbol called __log_buf. If you have an systems that deadlocks without any output on the console and you may reboot the system without resetting RAM, then you may print the content of __log_buf from the bootloader.

Determine the physical address of __log_buf

[09:59:31]marcus@little:~/git/linux$ grep __log_buf
c14cfba8 b __log_buf

The 0xc14cfba8 is the virtual address of __log_buf.
This kernel is compiled for a 32bit ARM with the CONFIG_VMSPLIT_3G set, so the kernel virtual address space start at 0xc0000000. To get the physical address out of the virtual, subtract the offset (0xc14cfba8 – 0xc0000000) and you will end up with 0x014cfba8. Dump this address from your bootloader and you will see your kernel log.


The size of __log_buf is set at compile-time with CONFIG_LOG_BUF_SHIFT. The value defines the size as a power of 2 and is usually set to 16 (64K).

There is also a CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT that is the per-CPU buffer where messages printed from unsafe context are temporary stored. Examples on unsafe context would be NMI and printk recursions. The messages are copied to the main log buffer in a safe context to avoid a deadlock.

This buffer is rarely used but has to be there to avoid the nasty deadlocks.
The CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT was introduced in v4.11 and is also expressed as a power of 2.

libostree and $OSTREE_REPO environment path

libostree is a great tool to handle incremental or full updates for an Linux filesystem.
But virtually all commands of ostree requires the –repo argument to override the default system repository path. This is really annoying after a while so my first attempt to get rid of this was to create an alias

alias ost='ostree --repo=/tmp/repo'

It works but is not good. To solve it once and for all I was about to implement support for getting the repo path from an environment variable.

When digging around in the code, I found that the support is already there but completely undocumented.

It is therefor already possible to

export OSTREE_REPO=/tmp/repo
ostree <cmd>

So the patch for this time is simply a line that mention the option to use $OSTREE_REPO environment variable to point out a directory as system repository path.

commit 075e676eb63a3abaf647c789e22d8d1afe3a1dd5
Author: Marcus Folkesson <>
Date:   Tue Oct 17 21:03:23 2017 +0200

docs: mention the $OSTREE_REPO environment variable

$OSTREE_REPO may be set to override the default location
of the repository.


Signed-off-by: Marcus Folkesson <>

Closes: #1282
Approved by: cgwalters


If /ostree/repo is not the location of your system repository, it is really preferable to set the $OSTREE_REPO environment variable instead of provide the –repo to every single command.

FIT vs legacy image format

U-Boot supports several image formats when booting a kernel. However, a Linux system usually need multiple files for booting.
Such files may be the kernel itself, an initrd and a device tree blob.

A typical embedded Linux system have all these files in at least two-three different configurations. It is not uncommon to have

  • Default configuration
  • Rescue configuration
  • Development configuration
  • Production configuration

Only these four configurations may involve twelve different files. Or maybe the devicetree is shared between two configurations. Or maybe the initrd is… Or maybe the….. you got the point. It has a farily good chance to end up quite messy.

This is a problem with the old legacy formats and is somehow addressed in the FIT (Flattened Image Tree) format.

Lets first look at the old formats before we get into FIT.


The most well known format for the Linux kernel is the zImage. The zImage contains a small header followed by a self-extracting code and finally the payload itself.
U-Boot support this format by the bootz command.

Image layout

zImage format
Decompressing code
Compressed Data


When talking about U-Boot, the zImage is usually encapsulated in a file called uImage created with the mkimage utility. Beside image data, the uImage also contains information such as OS type, loader information, compression type and so on.
Both data and header is checksumed with CRC32.

The uImage format also supports multiple images. The zImage, initrd and devicetree blob may therefor be included in one single monolithic uImage.
Drawbacks with this monolith is that it has no flexible indexing, no hash integrity and no support for security at all.

Image layout

uImage format
Header checksum
Data size
Data load address
Entry point address
Data CRC
Image type
Compression type
Image name
Image data


The FIT (Flattened Image Tree) has been around for a while but is for unknown reason rarely used in real systems, based on my experience. FIT is a tree structure like Device Tree (before they change format to YAML: and handles images of various types.

These images is then used in something called configurations. One image may be a used in several configurations.

This format has many benifits compared to the other:

  • Better solutions for multi component images
    • Multiple kernels (productions, feature, debug, rescue…)
    • Multiple devicetrees
    • Multiple initrds
  • Better hash integrity of images. Supports different hash algorithms like
    • SHA1
    • SHA256
    • MD5
  • Support signed images
    • Only boot verified images
    • Detect malware

Image Tree Source

The file describing the structure is called .its (Image Tree Source).
Here is an example of a .its file


/ {
    description = "Marcus FIT test";
    #address-cells = <1>;

    images {
        kernel@1 {
            description = "My default kenel";
            data = /incbin/("./zImage");
            type = "kernel";
            arch = "arm";
            os = "linux";
            compression = "none";
            load = <0x83800000>;
            entry = <0x83800000>;
            hash@1 {
                algo = "md5";

        kernel@2 {
            description = "Rescue image";
            data = /incbin/("./zImage");
            type = "kernel";
            arch = "arm";
            os = "linux";
            compression = "none";
            load = <0x83800000>;
            entry = <0x83800000>;
            hash@1 {
                algo = "crc32";

        fdt@1 {
            description = "FDT for my cool board";
            data = /incbin/("./devicetree.dtb");
            type = "flat_dt";
            arch = "arm";
            compression = "none";
            hash@1 {
                algo = "crc32";


    configurations {
        default = "config@1";

        config@1 {
            description = "Default configuration";
            kernel = "kernel@1";
            fdt = "fdt@1";

        config@2 {
            description = "Rescue configuration";
            kernel = "kernel@2";
            fdt = "fdt@1";


This .its file has two kernel images (default and rescue) and one FDT image.
Note that the default kernel is hashed with md5 and the other with crc32. It is also possible to use several hash-functions per image.

It also specifies two configuration, Default and Rescue. The two configurations is using different kernels (well, it is the same kernel in this case since both is pointing to ./zImage..) but share the same FDT. It is easy to build up new configurations on demand.

The .its file is then passed to mkimage to generate an itb (Image Tree Blob):

mkimage -f kernel.its kernel.itb

which is bootable from U-Boot.

Look at ./doc/uImage.FIT/ in the U-Boot source code for more examples on how a .its-file can look like.
It also contains examples on signed images which is worth a look.

Boot from U-Boot

To boot from U-Boot, use the bootm command and specify the physical address for the FIT image and the configuration you want to boot.

Example on booting config@1 from the .its above:

bootm 0x80800000#config@1

Full example with bootlog:

U-Boot 2015.04 (Oct 05 2017 - 14:25:09)

CPU:   Freescale i.MX6UL rev1.0 at 396 MHz
CPU:   Temperature 42 C
Reset cause: POR
Board: Marcus Cool board
fuse address == 21bc400
serialnr-low : e1fe012a
serialnr-high : 243211d4
       Watchdog enabled
I2C:   ready
DRAM:  512 MiB
Using default environment

In:    serial
Out:   serial
Err:   serial
Net:   FEC1
Boot from USB for mfgtools
Use default environment for                              mfgtools
Run bootcmd_mfg: run mfgtool_args;bootz ${loadaddr} - ${fdt_addr};
Hit any key to stop autoboot:  0
=> bootm 0x80800000#config@1
## Loading kernel from FIT Image at 80800000 ...
   Using 'config@1' configuration
   Trying 'kernel@1' kernel subimage
     Description:  My cool kernel
     Type:         Kernel Image
     Compression:  uncompressed
     Data Start:   0x808000d8
     Data Size:    7175328 Bytes = 6.8 MiB
     Architecture: ARM
     OS:           Linux
     Load Address: 0x83800000
     Entry Point:  0x83800000
     Hash algo:    crc32
     Hash value:   f236d022
   Verifying Hash Integrity ... crc32+ OK
## Loading fdt from FIT Image at 80800000 ...
   Using 'config@1' configuration
   Trying 'fdt@1' fdt subimage
     Description:  FDT for my cool board
     Type:         Flat Device Tree
     Compression:  uncompressed
     Data Start:   0x80ed7e4c
     Data Size:    27122 Bytes = 26.5 KiB
     Architecture: ARM
     Hash algo:    crc32
     Hash value:   1837f127
   Verifying Hash Integrity ... crc32+ OK
   Booting using the fdt blob at 0x80ed7e4c
   Loading Kernel Image ... OK
   Loading Device Tree to 9ef86000, end 9ef8f9f1 ... OK

Starting kernel ...

One thing to think about

Since this FIT image tends to grow in size, it is a good idea to set CONFIG_SYS_BOOTM_LEN in the U-Boot configuration.

        Normally compressed uImages are limited to an
        uncompressed size of 8 MBytes. If this is not enough,
        you can define CONFIG_SYS_BOOTM_LEN in your board config file
        to adjust this setting to your needs.


I advocate to use the FIT format because it solves many problems that can get really awkward. Just the bonus to program one image in production instead of many is a great benifit. The configurations is well defined and there is no chance that you will end up booting one kernel image with an incompatible devicetree or whatever.

Use FIT images should be part of the bring-up-board activity, not the last thing todo before closing a project. The "Use seperate kernel/dts/initrds for now and then move on to FIT when the platform is stable" does not work. It simple not gonna happend, and it is a mistake.

config utility for Buildroot

I’m using the ./scripts/config script in the Linux kernel tree a lot. The script is used to manipulate a .config file from the command line which is quite nice to be able to do.

I use it mostly to enable configurations from a script or as a part of automated tests.

Buildroot is also using KBuild as its configuration system so I adapted this script and submitted a patch. It is now part of the mainline.

commit f2c5c7a263d9d61faba80544e0f6ac6da03716ee
Author: Marcus Folkesson <>
Date:   Tue Sep 26 23:15:02 2017 +0200

utils/config: new script to manipulate .config files on the command line

Default prefix is set to `BR2_` but may be overidden by setting

Example usage:

./utils/config --package --enable GNUPG

Check state of config option `BR2_PACKAGE_GNUPG`:
./utils/config --package --state GNUPG

./utils/config --package --disable GNUPG

Set `BR2_TARGET_GENERIC_ISSUE` to "Welcome":
./utils/config --set-str  TARGET_GENERIC_ISSUE "Welcome"

Copied from the Linux kernel (4.13-rc6) source code and adapted to
Thanks to Andi Kleen who is the original author of this script.

Signed-off-by: Marcus Folkesson <>
[Arnout: merge the two tr invications]
Signed-off-by: Arnout Vandecappelle (Essensium/Mind) <>

In addition to all commands found in he Linux kernel version, I also addes the –package config as a shortcut to operate on packages.

PID 1 in containers

What is PID 1

The top-most process in a UNIX system has PID (Process ID) 1 and is usually the init process.
The Init process is the first userspace application started on a system and is started by the kernel at boottime.
The kernel is looking in a few predefined paths (and the init kernel parameter). If no such application is found, the system will panic().

See init/main.c:kernel_init

if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    return 0;
panic("No working init found.  Try passing init= option to kernel. "
      "See Linux Documentation/admin-guide/init.rst for guidance.");

All processes in UNIX has a parent/child relationship which builds up a big relationship-tree.
Some resources and permissions are inherited from parent to child such as UID and cgroup restrictions.

As in the real world, with parenthood comes obligations.
For example: What is usually the last line of your main()-function? Hopfully something like


All processes exits with an exit code that tells us if the operation was sucessful or not.
Who is interested in this exit code anyway?
In the real world, the parents are interested in their children’s result, and so even here. The parent is responsible to wait(2) on their children to terminate just to fetch its exit code.
But what if the parent died before the child?

Lets go back to the init process.
The init process has several tasks, and one is to adopt "orphaned" (called zombie) child processes.
Why? Because all processes will return an exit code and will not terminate completely until someone is listen for what they have to say.
The init process is simply wait(2):ing on the exit code, throw it away and let the child die. Sad but true, but the child may not rest i peace otherwise.
The operating system expects the init process to reap adopted children. Otherwise the children will exist in the system as a zombie and taking up some kernel resources and consume a slot in the kernel process table.

PID 1 in containers

Containers is a concept that isolate processes in different namespaces. Example of such namespaces are PID, users, networking and filesystem.
To create a container is quite simple, just create a new process with clone(2) and provide relevant flags to create new namespaces for the process.

The flags related to namespaces are listed in include/uapi/linux/sched.h:

#define CLONE_NEWPID                0x20000000      // New pid namespace
#define CLONE_NEWCGROUP             0x02000000      // New cgroup namespace
#define CLONE_NEWUTS                0x04000000      // New utsname namespace
#define CLONE_NEWIPC                0x08000000      // New ipc namespace
#define CLONE_NEWUSER               0x10000000      // New user namespace
#define CLONE_NEWPID                0x20000000      // New pid namespace
#define CLONE_NEWNET                0x40000000      // New network namespace

All processes is running in a "container-context" because the processes allways executes in a namespace.
On a system "without containers", all processes still have one common namespace that all processes is using.

When using CLONE_NEWPID, the kernel will create a new PID namespace and let the newly created process has the PID 1.
As we already know, the PID 1 process has a very special task, namely to kill all orphaned children.
This PID 1 process could be any application (make, bash, nginx, ftp-server or whatever) that is missing this essential adopt-and-slay-mechanism.
If the reaping is not handled, it will result in zombie-processes. This was a real problem not long time ago for Docker containers (google Docker and zombies to see what I mean).
Nowadays we have the –init flag on docker run to tell the container to use tini (, a zombie-reaping init process to run with PID 1.

When PID 1 dies

This is the reason to why I’m writing this post. I was wondering who is killing PID 1 in a container since we learned that a PID 1 may not die under any circumstances.
PID 1 in cointainers is obviosly an exception from this golden rule, but how does the kernel differentiate between init processes in different PID namespaces?

Lets follow a process to its very last breath.

The call chain we will look at is the following:


kernel/exit.c:do_exit() is called when a process is going to be cleaned up from the system after it has exited or being terminated.
The function is collecting the exit code, delete timers, free up resources and so on.
Here is an extract of the function:


<<<<< Collect exit code >>>>>
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);


if (group_dead)

<<<<< Free up resources >>>>>
if (group_dead)



<<<<< Notify tasks in the same group >>>>>
exit_notify(tsk, group_dead);


exit_notify() is to notifing our "dead group" that we are going down.
One important thing to notice is that almost all resources are freed at this point.
Even if the process is going into a zombie state, the footprint is relative small, but still, the zombie consumes a slot in the process table.

The size of the process table in Linux and defined by PID_MAX_LIMIT in include/linux/threads.h:

\* A maximum of 4 million PIDs should be enough for a while.
\* [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
(sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

The process table is indeed quite big. But if you are running for example a webserver as PID 1 that is fork(2):ing on each HTTP request. All these forks will result in a zombie and the number will escalate quite fast.


kernel/exit.c:exit_notify() is sending signals to all the closest relatives so that they know to properly mourn this process.
In the beginning of this function, a call is made to forget_original_parent():

static void exit_notify(struct task_struct *tsk, int group_dead)
    bool autoreap;
    struct task_struct *p, *n;

  >>>>>  forget_original_parent(tsk, &dead);


This function simply does two things

  1. Make init (PID 1) inherit all the child processes
  2. Check to see if any process groups have become orphaned as a result of our exiting, and if they have any stopped jobs, send them a SIGHUP and then a SIGCONT.

find_child_reaper() will help us find a proper reaper:

>>>>> reaper = find_child_reaper(father);
if (list_empty(&father->children))


kernel/exit.c:find_child_reaper() is looking if a father is available.
If a father (or other relative) is not available at all, we must be the PID 1 process.

This is the interesting part:

if (unlikely(pid_ns == &init_pid_ns)) {
    panic("Attempted to kill init! exitcode=0x%08x\n",
        father->signal->group_exit_code ?: father->exit_code);

init_pid_ns refers (declared in kernel/pid.c) to our real init process.
If the real init process exits, panic the whole system since it cannot continue without an init process.
If it is not, call zap_pid_ns_processes(), here we have our PID1-cannot-be-killed-exception we are looking for!
We contiue following the call chain down to zap_pid_ns_processes().


zap_pid_ns_processes function is part of the PID namespace and is located in kernel/pid_namespace.c
The function iterates through all tasks in the same group and send signal SIGKILL to each of them.

nr = next_pidmap(pid_ns, 1);
while (nr > 0) {


    task = pid_task(find_vpid(nr), PIDTYPE_PID);
    if (task && !__fatal_signal_pending(task))
    >>>>> send_sig_info(SIGKILL, SEND_SIG_FORCED, task);


    nr = next_pidmap(pid_ns, nr);



The PID 1 in containers is handled in a seperate way than the real init process.
This is obvious, but now we know where the codeflow differ for PID 1 in different namespaces.

We also see that if the PID1 in a PID namespace dies, all the subprocesses will be terminated with SIGKILL.
This behavior reflects the fact that the init process is essential for the correct operation of any PID namespace.

Memory management in the kernel

Memory management in the kernel

Memory management is among the most complex parts in the Linux kernel. There is so many critical parts such as page allocator, slab allocator, virtual memory handling, memory mapping, MMU, IOMMU and so on. All these parts has to work perfect (or at least allmost perfect 🙂 ) because all system use them either they want to or not.
If there is a bug or performance issue you will be noticed quite soon.

My goal is to produce a few posts on the topic and try to sort out the different parts and describe how they work and the connection between.
I will begin from the physical bottom and work myself up to how userspace allocates memory in their little blue world with pink clouds. (Everything is so easy on the user side)

struct page

A page is the smallest unit that matters in terms of virtual memory. This is because the MMU (Memory Management Unit, described in a upcoming post) is only dealing with those pages. A typical size for a page is 4KB, at least for 32bit architectures. The most 64-bit architectures uses 8KB pages.

Every one of those physical pages is represented by a struct page that is defined in include/linux/mm_types.h. That is a lot of pages. If we do a simple calculation:
We have a 32-bit system that has 512MB of physical memory, this memory is divided into 131,072 4KB pages. Think of that 512MB is not even so much memory on a modern system today.

What I want to say is that this struct page should be kept as small as possible because it scales up a lot when physical memory increases.

Ok, so there is a struct page somewhere that got allocated for each physical page, which is a lot, but what does it do?
It does a lot of housekeeping, lets look at a few set of members that I think is most interresting:

struct page {
    unsigned long flags;
    unsigned long private;
    void    *virtual;
    atomic_t    _count;
    pgoff_t    index;
    spinlock_t  *ptl;
    spinlock_t  ptl;

flags is keeping track of the page status which could be dirty (need to be written to media), locked in memory (not allowed to be paged out), permissions and so on. See enum pageflags in include/linux/page-flags.h for more information.

private is not a defined field. May be used as a long or interpreted as a pointer. (Shared with ptl in a union!)

virtual is the virtual address of the page. In case that the page belongs to the high memory (memory that is not permanently mapped) this field will be NULL and require dynamic mapping.

_count is a simple reference counter to determine when the page is free for allocation.

index is the offset within a mapping.

ptl is a interresting one! I think it requires a special section in this post. (Shared with private in a union!)

Page Table Lock

PTL stands for Page Table Lock and is a per-page lock.

In the next part of these memory management posts I will describe the struct mm_struct, how PGD, PMD and PTE are related, but for now it’s enough that you just have heard the words.

Ok, there is one thing that is good to know. The struct mm_struct (also defined in mm_types.h) is a structure that represent a process’s address space and contains all information related to the process memory. The structure has a pointer to virtual memory areas that refers to one or more struct page.
This structure also has the member mm->page_table_lock that is a spinlock that protects all page tables of the mm_struct. This was the original approach and is still used by several architectures.

However, this mm->page_table_lock is a little bit clumsy since it lock all pages at once. This is no real problem on a single-cpu without SMP system. But nowdays that is not a very common scenario.

Instead, the split page table lock was introduced and has a separate per-table lock to allow concurrency access to pages in the same mm_struct. Remember that the mm_struct is per process? So this increases page-fault/page-access performance in multi-threaded applications only.

When is split page table locks enabed?
It is enabled in compile-time if CONFIG_SPLIT_PTLOCK_CPUS (I have never seen another value but 4 on this one) is less or equal to NR_CPUS.

Here is a few defines int the beginning of the mm_types.h header file:


The ALLOC_SPLIT_PTLOCKS is a little bit clever. If the sizeof a spinlock is less or equal to the size of a long, the spinlock is embedded in the struct page and can therefor save a cache line by avoiding indirect access.
If a spinlock does not fit into a long, then the page->ptl is used as a pointer that points to a dynamic allocated spinlock. As I said, this is a clever construction since it allow us to increase the size of a spinlock and there is no problem. Exemple when sizeof spinlock does not fit is when using DEBUG_SPINLOCK, DEBUG_LOCK_ALLOC or applying the PREEMPT_RT patchset.

The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in pgtable_pmd_page_ctor for PMD tables. These function (and the corresponding free-functions) should be called in *every place that allocated/freeing pages. This is already done in mainline, but I know there is evil hardware vendors out there that do not. For example, if you use their evil code and apply the preempt_rt patchset (that increases the sizeof spinlock_t), you have to verify that thier code behaves.

Also, pgtable_*page_ctor() can fail, this must be handled properly.

Remember that the page-ptl should never be accessed directly, use appropriate helper functions for that.

Example on such helper functions is..


MMAP memory between kernel and userspace

MMAP memory between kernel and userspace

Let kernel allocate memory and let userspace map is sounds like an easy task, and sure it is.
There are just a few things that is good to know about page mapping.

The MMU (Memory Management Unit) contains page tables with entries for mapping between virtual and physical addresses. These pages is the smallest unit that the MMU deals with.
The size of a page is given by the PAGE_SIZE macro in asm/page.h ans is typically 4k for most architectures.

There is a few more useful macros in asm/page.h:

PAGE_SHIFT: How many steps we should shift to left to get a PAGE_SIZE
PAGE_SIZE: Size of a page, defined as (1 << PAGE_SHIFT).
PAGE_ALIGN(len): Will round up the length to the closest alignment of PAGE_SIZE.

How does mmap(2) work?

Every page table entry has a bit that tells us if the entry is valid in supervisor mode (kernel mode) only. And sure, all memory allocated in kernel space will have this bit set.
What the mmap(2) system call do is simply creating a new page table entry with a different virtual address that points to the same physical memory page. The difference is that this supervisor-bit is not set.
This let userspace access the memory as if it was a part of the application, for now it is!
The kernel is not involved in those accesses at all, so it is really fast.

Magic? Kind of.
The magic is called remap_pfn_range().
What remap_pfn_range() do is just essentially to update the process’s specific page table with these new entries.

Example, please

Allocate memory

As we already know, the smallest unit that the MMU handle is the size of PAGE_SIZE and the mmap(2) only works with full pages. Even if you just want to share only 100 bytes, a whole page frame will be remapped and must therefor be allocated in the kernel.
The allocated memory must also be page aligned.


One way to allocate pages is with __get_free_pages().:

unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)

gft_mask is commonly set to GFP_KERNEL in process/kernel context and GFP_ATOMIC in interrupt context. The order is the number of pages to allocate expressed in 2^order.

For example::

u8 *vbuf = __get_free_pages(GFP_KERNEL, size >> PAGE_SHIFT);

System Message: WARNING/2 (<stdin>, line 48); backlink

Inline emphasis start-string without end-string.

Allocated memory is freed with __free_pages().


A more common (and preferred) way to allocate virtual continuous memory is with vmalloc().
vmalloc() will allways allocate whole set of pages, no matter what. This is exactly what we want!

Read about vmalloc() in kmalloc(9):

Allocated memory is freed with vfree().


If you need only one page, alloc_page() will give you that.
If this is the case, insead of using remap_pfn_range(), vm_insert_page() will do the work you for you.
Notice that vm_insert_page() apparently only works on order-0 (single-page) allocation. So if you want to allocate N pages, you will hace to call vm_insert_page() N times.

Now some code



/* page align */
priv->a_size = PAGE_ALIGN(priv->a_size);
priv->a_area =vmalloc(priv->a_size);


static int scan_mmap (struct file *file, struct vm_area_struct *vma)
struct mmap_priv *priv = file->private_data;
unsigned long start = vma->vm_start;
unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
unsigned long page;
size_t size = vma->vm_end - vma->vm_start;
if (size > priv->a_size)
           return -EINVAL;
page = vmalloc_to_pfn((void *)priv->a_area);
if (remap_pfn_range(vma, start, page, priv->a_size, PAGE_SHARED))
           return -EAGAIN;
vma->vm_flags |= VM_RESERVED; /* avoid to swap out this VMA */
return 0;

2.2″ TFT and BeagleBone

2.2" TFT display on Beaglebone

I recently bought a 2.2" TFT display on Ebay (come on, 7 bucks…) and was up to use it with my BeagleBone. Luckily for me there was no Linux driver for the ILI9341 controller so it is just to roll up my sleeves and get to work.

Boot up the BeagleBone

I haven’t booted up my bone for a while and support for the board seems to have reached the mainline in v3.8 (currently at v3.15), so the first step is just to get it boot with a custom kernel.

Clone the vanilla kernel from

git clone git://

Use the omap2plus_defconfig as base:

make ARCH=arm omap2plus_defconfig

I will still use my old U-boot version, which does not have support for devicetrees, so I have to make sure that


This simply tells the boot code to look for a device tree binary (DTB) appended to the zImage. Without this option, the kernel expects the address of a dtb in the r2 register (on ARM architectures), but that does not work on my ancient bootloader.

Next step is to compile the kernel. We are using U-Boot as bootloader, but we do not create an uImage since we have to append the dtb to the zImage before that.:

make ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi-

Next, create the device tree blob. We are using the arch/arm/dts/am335x-bone.dts as source.:

make ARCH=arm CROSS_COMPILE=arm-none-linux-gnueabi- am33x-bone.dtb

Now we are only two steps behind a booting kernel! First we need to append the dtb to the zImage, and then we need to create an U-boot-friendly kernel image with mkimage.:

cat arch/arm/boot/zImage arch/arm/boot/dts/am335x-bone.dtb > ./zImage_dtb
mkimage -A arm -O linux -T kernel -C none -a 0x80008000 -e 0x80008000 -n 'BeagleBone image' -d ./zImage_dtb uImage

Put the uImage on the uSD-card and boot it up. ..

BeagleBone login:


Enable SPI

First of all, we need to setup the pinmux for the spi-bus. This is done with the pinctrl subsystem in the devicetree interface file (arch/arm/boot/dts/am335x-bone-common.dtsi).

Create the pins. For more detailed explaination of the values, see the BeagleBone System Reference Manual.

spi1_pins: spi1_pins_s0 {
   pinctrl-single,pins = <
     0x190 0x33      /* mcasp0_aclkx.spi1_sclk, INPUT_PULLUP | MODE3 */
     0x194 0x33      /* mcasp0_fsx.spi1_d0, INPUT_PULLUP | MODE3 */
     0x198 0x13      /* mcasp0_axr0.spi1_d1, OUTPUT_PULLUP | MODE3 */
     0x19c 0x13      /* mcasp0_ahclkr.spi1_cs0, OUTPUT_PULLUP | MODE3 */

Then override the spi1 entry and create an instance of our device driver. The driver will have the name "ili9341-fb".

 status = "okay";
 pinctrl-names = "default";
 pinctrl-0 = <&spi1_pins>;
 ili9341: ili9341@0 {
  compatible = "ili9341-fb";
  reg = <0>;
  spi-max-frequency = <16000000>;
  dc-gpio = <&gpio3 19 GPIO_ACTIVE_HIGH>;

Create an entry in the Kbuild system

I always integrate the modules into the kbuild system as the first step. This for several reasons:
– I use one kernel for all of my projects, just different branches
– It is simple to jump around with cscope/ctags
– It gives you control when the kernel version and your driver follow eachother
– Out-of-tree modules is evil (gives you a tainted kernel and everyone will spit on you)

Those who don’t know how to put a module into the kbuild system – get ready to be surprised how simple it is!

Every directory in the kernel structure contains at least two files, a Makefile and a Kconfig. The Makefile tells the make buildsystem which files to compile and the Kconfig file is interpreted by (menu|k|x|old|….)config.

Here is what’s needed:

diff --git a/drivers/video/fbdev/Kconfig b/drivers/video/fbdev/Kconfig
index e1f4727..be4ec8f 100644
--- a/drivers/video/fbdev/Kconfig
+++ b/drivers/video/fbdev/Kconfig
@@ -163,6 +163,18 @@ config FB_DEFERRED_IO
        depends on FB
+config FB_ILI9341
+       tristate "ILI9341 TFT driver"
+       depends on FB
+       select FB_SYS_FILLRECT
+       select FB_SYS_COPYAREA
+       select FB_SYS_IMAGEBLIT
+       select FB_SYS_READ
+       select FB_DEFERRED_IO
+       ---help---
+       This enables functions for handling video modes using the ili9341 controller
 config FB_HECUBA
        depends on FB
diff --git a/drivers/video/fbdev/Makefile b/drivers/video/fbdev/Makefile
index 0284f2a..105166a 100644
--- a/drivers/video/fbdev/Makefile
+++ b/drivers/video/fbdev/Makefile
@@ -60,6 +60,7 @@ obj-$(CONFIG_FB_ATARI)            += atafb.o c2p_iplan2.o atafb_mfb.o
                                      atafb_iplan2p2.o atafb_iplan2p4.o atafb_iplan2p8.o
 obj-$(CONFIG_FB_MAC)              += macfb.o
 obj-$(CONFIG_FB_HECUBA)           += hecubafb.o
+obj-$(CONFIG_FB_ILI9341)          += ili9341.o
 obj-$(CONFIG_FB_N411)             += n411.o
 obj-$(CONFIG_FB_HGA)              += hgafb.o
 obj-$(CONFIG_FB_XVR500)           += sunxvr500.o
diff --git a/drivers/video/fbdev/ili9341.c b/drivers/video/fbdev/ili9341.c

Deferred IO

Deferred IO is a way to delay and repurpose IO. It uses host memory as a buffer and the MMU pagefault as a pretrigger for when to perform the device IO.
You simple tell the kernel the minimum delay between the triggers should occours, this allows you to do burst transfers to the device at a given framerate. This has the big benefit that if the userspace updates the framebuffer several times in this period, we will only write it once.

The interface is _really_ simple. All you need to follow is these four steps (see Documentation/fb/deferred_io.txt):

  1. Setup your structure.

    static struct fb_deferred_io hecubafb_defio = {
     .delay  = HZ,
     .deferred_io = hecubafb_dpy_deferred_io,

The delay is the minimum delay between when the page_mkwrite trigger occurs
and when the deferred_io callback is called. The deferred_io callback is
explained below.

  1. Setup your deferred IO callback.

    static void hecubafb_dpy_deferred_io(struct fb_info *info,
        struct list_head *pagelist)

The deferred_io callback is where you would perform all your IO to the display
device. You receive the pagelist which is the list of pages that were written
to during the delay. You must not modify this list. This callback is called
from a workqueue.

  1. Call init:

    info->fbdefio = &hecubafb_defio;
  2. Call cleanup:



The driver is quite straight forward and there was no really hard problem with the driver itself. However, I had problem to get a high framerate because the SPI communication took time. All SPI communication is asynchronious and all jobs is stacked on a queue before it gets scheduled. This takes time. One obvious solution is to write bigger chunks with each transfer, and that is what I did.

But the problem was that when I increased the chunk size, the kernel got panic with the DMA transfers.
After an half a hour of code-digging, the problem is derived to the spi-controller for the omap2 (drivers/spi/spi-omap2-mcspi.c). It defines the DMA_MIN_BYTES which is arbitrarily set to 160. The code then compare the data length to this constant and determine if it should use DMA or not. It shows up that the DMA-transfer-code itself is broken.

A temporary solution is to increase the DMA_MIN_BYTES to at least a full frame (240x320x2) bytes until I have looked at the DMA code and submitted a fix 🙂


Here is a shell started from Ubuntu

I have also tested to startup Qt and directfb applications. It all works like a charm.

The Deferred IO interface is really nice for such displays. I’m surprised that there is currently so few drivers using it.

(the not so cleaned up) Code:

 * linux/drivers/video/ili9341.c -- FB driver for ili9341 controller
 * Copyright (C) 2014, Marcus Folkesson
 * This file is subject to the terms and conditions of the GNU General Public
 * License. See the file COPYING in the main directory of this archive for
 * more details.
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/string.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/delay.h>
#include <linux/interrupt.h>
#include <linux/fb.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/uaccess.h>
#include <linux/spi/spi.h>
#include <video/ili9341.h>
#include <linux/regmap.h>
#include <linux/gpio.h>
#include <linux/of.h>
#include <linux/gpio.h>
#include <linux/of_gpio.h>
#include <linux/debugfs.h>

/* Display specific information */
#define SCREEN_WIDTH (240)
#define SCREEN_HIGHT (320)
#define SCREEN_BPP  (16)
#define ID "cbdff49d683b"
#define ID_SZ 12

static unsigned int chunk_size;

struct ili9341_priv {
 struct spi_device *spi;
 struct regmap *regmap;
 struct fb_info *info;
 u32 vsize;
 int dc;
 char *fbmem;
 struct dentry *dir;
static struct fb_fix_screeninfo ili9341_fix = {
 .id =  "ili9341",
 /*.visual = FB_VISUAL_MONO01,*/
 .xpanstep = 0,
 .ypanstep = 0,
 .ywrapstep = 0,
 .line_length = SCREEN_WIDTH*2,
 .accel = FB_ACCEL_NONE,
static struct fb_var_screeninfo ili9341_var = {
 .xres  = SCREEN_WIDTH,
 .yres  = SCREEN_HIGHT,
 .xres_virtual = SCREEN_WIDTH,
 .yres_virtual = SCREEN_HIGHT,
 .bits_per_pixel = SCREEN_BPP,
 .nonstd  = 1,
 .red = {
  .offset = 11,
  .length = 5,
 .green = {
  .offset = 5,
  .length = 6,
 .blue = {
  .offset = 0,
  .length = 5,
 .transp = {
  .offset = 0,
  .length = 0,

static const struct regmap_config ili9341_regmap_config = {
 .reg_bits = 8,
 .val_bits = 8,
 .can_multi_write = 1,
static void fill(struct ili9341_priv *priv);
static void fill_area(struct ili9341_priv *priv, int y1, int y2);
/* main ili9341 functions */
static void apollo_send_data(struct ili9341_priv *par, unsigned char data)
 /* set data */
static void apollo_send_command(struct ili9341_priv *par, unsigned char data)
static void ili9341_dpy_update(struct ili9341_priv *par)
static void ili9341_dpy_update_area(struct ili9341_priv *par, int y1, int y2 )
 fill_area(par, y1, y2);
/* this is called back from the deferred io workqueue */
static void ili9341_dpy_deferred_io(struct fb_info *info,
    struct list_head *pagelist)
 struct page *cur;
 struct fb_deferred_io *fbdefio = info->fbdefio;
 struct ili9341_priv *par = info->par;

 struct page *page;
 unsigned long beg, end;
 int y1, y2, miny, maxy;
 miny = INT_MAX;
 maxy = 0;
 /* stop here if list is empty */
 if (list_empty(pagelist)){
  dev_err(&par->spi->dev, "pagelist is empty");
 list_for_each_entry(page, pagelist, lru) {
  beg = page->index << PAGE_SHIFT;
  end = beg + PAGE_SIZE - 1;
  y1 = beg / (info->fix.line_length);
  y2 = end / (info->fix.line_length);
  if (y2 >= info->var.yres)
   y2 = info->var.yres - 1;
  if (miny > y1)
   miny = y1;
  if (maxy < y2)
   maxy = y2;
 ili9341_dpy_update_area(info->par, miny, maxy);
 // dev_err(&par->spi->dev, ".");
static void ili9341_fillrect(struct fb_info *info,
       const struct fb_fillrect *rect)
 struct ili9341_priv *par = info->par;
 sys_fillrect(info, rect);
static void ili9341_copyarea(struct fb_info *info,
       const struct fb_copyarea *area)
 struct ili9341_priv *par = info->par;
 sys_copyarea(info, area);
static void ili9341_imageblit(struct fb_info *info,
    const struct fb_image *image)
 struct ili9341_priv *par = info->par;
 sys_imageblit(info, image);
 * this is the slow path from userspace. they can seek and write to
 * the fb. it's inefficient to do anything less than a full screen draw
static ssize_t ili9341_write(struct fb_info *info, const char __user *buf,
    size_t count, loff_t *ppos)
 struct ili9341_priv *par = info->par;
 unsigned long p = *ppos;
 void *dst;
 int err = 0;
 unsigned long total_size;
 if (info->state != FBINFO_STATE_RUNNING)
  return -EPERM;
 total_size = info->fix.smem_len;
 if (p > total_size)
  return -EFBIG;
 if (count > total_size) {
  err = -EFBIG;
  count = total_size;
 if (count + p > total_size) {
  if (!err)
   err = -ENOSPC;
  count = total_size - p;
 dst = (void __force *) (info->screen_base + p);
 if (copy_from_user(dst, buf, count))
  err = -EFAULT;
 if  (!err)
  *ppos += count;
 return (err) ? err : count;
static struct fb_ops ili9341_ops = {
 .owner   = THIS_MODULE,
 .fb_write  = ili9341_write,
 .fb_fillrect = ili9341_fillrect,
 .fb_copyarea = ili9341_copyarea,
 .fb_imageblit = ili9341_imageblit,
static struct fb_deferred_io ili9341_defio = {
 .delay  = HZ/60,
 .deferred_io = ili9341_dpy_deferred_io,

static void write_command(struct ili9341_priv *priv, u8 data)
 gpio_set_value(priv->dc, 0);
 spi_write(priv->spi, &data, 1);
 gpio_set_value(priv->dc, 1);
static void write_data(struct ili9341_priv *priv, u8 data)
 gpio_set_value(priv->dc, 1);
 spi_write(priv->spi, &data, 1);
static void write_data16(struct ili9341_priv *priv, u8 data)
 gpio_set_value(priv->dc, 1);
 spi_write(priv->spi, &data, 1);
static void init(struct ili9341_priv *priv)
 write_command(priv, 0xCB);
 write_data(priv, 0x39);
 write_data(priv, 0x2C);
 write_data(priv, 0x00);
 write_data(priv, 0x34);
 write_data(priv, 0x02);
 write_command(priv, 0xCF);
 write_data(priv, 0x00);
 write_data(priv, 0XC1);
 write_data(priv, 0X30);
 write_command(priv, 0xE8);
 write_data(priv, 0x85);
 write_data(priv, 0x00);
 write_data(priv, 0x78);
 write_command(priv, 0xEA);
 write_data(priv, 0x00);
 write_data(priv, 0x00);
 write_command(priv, 0xED);
 write_data(priv, 0x64);
 write_data(priv, 0x03);
 write_data(priv, 0X12);
 write_data(priv, 0X81);
 write_command(priv, 0xF7);
 write_data(priv, 0x20);
 write_command(priv, 0xC0);     //Power control
 write_data(priv, 0x23);    //VRH[5:0]
 write_command(priv, 0xC1);     //Power control
 write_data(priv, 0x10);    //SAP[2:0];BT[3:0]
 write_command(priv, 0xC5);     //VCM control
 write_data(priv, 0x3e);    //Contrast
 write_data(priv, 0x28);
 write_command(priv, 0xC7);     //VCM control2
 write_data(priv, 0x86);    //--
/* XXX: Hue?! */
 write_command(priv, 0x36);     // Memory Access Control
 write_data(priv, 0x48);   //C8    //48 68绔栧睆//28 E8 妯睆
 write_command(priv, 0x3A);
 write_data(priv, 0x55);
 write_command(priv, 0xB1);
 write_data(priv, 0x00);
 write_data(priv, 0x18);
 write_command(priv, 0xB6);     // Display Function Control
 write_data(priv, 0x08);
 write_data(priv, 0x82);
 write_data(priv, 0x27);

 write_command(priv, 0xF2);     // 3Gamma Function Disable
 write_data(priv, 0x00);
 write_command(priv, 0x26);     //Gamma curve selected
 write_data(priv, 0x01);
 write_command(priv, 0xE0);     //Set Gamma
 write_data(priv, 0x0F);
 write_data(priv, 0x31);
 write_data(priv, 0x2B);
 write_data(priv, 0x0C);
 write_data(priv, 0x0E);
 write_data(priv, 0x08);
 write_data(priv, 0x4E);
 write_data(priv, 0xF1);
 write_data(priv, 0x37);
 write_data(priv, 0x07);
 write_data(priv, 0x10);
 write_data(priv, 0x03);
 write_data(priv, 0x0E);
 write_data(priv, 0x09);
 write_data(priv, 0x00);
 write_command(priv, 0XE1);     //Set Gamma
 write_data(priv, 0x00);
 write_data(priv, 0x0E);
 write_data(priv, 0x14);
 write_data(priv, 0x03);
 write_data(priv, 0x11);
 write_data(priv, 0x07);
 write_data(priv, 0x31);
 write_data(priv, 0xC1);
 write_data(priv, 0x48);
 write_data(priv, 0x08);
 write_data(priv, 0x0F);
 write_data(priv, 0x0C);
 write_data(priv, 0x31);
 write_data(priv, 0x36);
 write_data(priv, 0x0F);
 write_command(priv, 0x11);     //Exit Sleep
 write_command(priv, 0x29);    //Display on
 write_command(priv, 0x2c);
static void setCol(struct ili9341_priv *priv, u16 start, u16 end)
 u8 tmp;
 write_command(priv, 0x2a);
 tmp = (start & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (start & 0x00ff) >> 0;
 write_data(priv, tmp);

 tmp = (end & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (end & 0x00ff) >> 0;
 write_data(priv, tmp);
static void setPage(struct ili9341_priv *priv, u16 start, u16 end)
 u8 tmp;
 write_command(priv, 0x2b);
 tmp = (start & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (start & 0x00ff) >> 0;
 write_data(priv, tmp);

 tmp = (end & 0xff00) >> 8;
 write_data(priv, tmp);
 tmp = (end & 0x00ff) >> 0;
 write_data(priv, tmp);
static void setPos(struct ili9341_priv *priv, u16 x1, u16 x2, u16 y1, u16 y2)
 setPage(priv, y1, y2);
 setCol(priv, x1, x2);

static void fill_area(struct ili9341_priv *priv, int y1, int y2)
 int i = 0;
 char val = 0xaa;
 char *p = priv->fbmem;
 int ret;
 int start =y1*SCREEN_WIDTH*2 + 1;
 int stop = y2*SCREEN_WIDTH*2+1;
 int range = stop - start;

 if (!chunk_size)
  chunk_size = 10;
 if (start + range > priv->vsize)
  range = priv->vsize - start;
 setCol(priv, 0, 239);
 setPage(priv, y1, y2);
 write_command(priv, 0x2c);

 for(i = start; i < stop; i += chunk_size)
  if ( i + chunk_size > stop )
   chunk_size = stop - i;
  ret = spi_write(priv->spi, &priv->fbmem[i], chunk_size);
  if (ret != 0)
   dev_err(&priv->spi->dev, "Error code: %in", ret);
static void fill(struct ili9341_priv *priv)
 int i = 0;
 char val = 0xaa;
 char *p = priv->fbmem;
 setCol(priv, 0, 239);
 setPage(priv, 0, 319);
 write_command(priv, 0x2c);

 fill_area(priv, 0, 319);
static ssize_t id_show(struct device *dev, struct device_attribute *attr,
   char *buf)
 sprintf(buf, "%s", ID);
 return ID_SZ;
static ssize_t id_store(struct device *dev, struct device_attribute *attr,
    const char *buf, size_t count)
 char kbuf[ID_SZ];
 if (count != ID_SZ)
  return -EINVAL;
 memcpy(kbuf, buf, ID_SZ);
 if (memcmp(kbuf, ID, ID_SZ) != 0)
  return -EINVAL;
 return count;
DEVICE_ATTR(id, 0666, id_show, id_store);
static struct attribute *ili9341_attrs[] = {

static int ili9341_probe(struct spi_device *spi)
 struct fb_info *info;
 int retval = -ENOMEM;
 struct ili9341_priv *priv;
 struct device_node *np = spi->dev.of_node;
 int ret;
 dev_err(&spi->dev, "Hello from I!n");
 priv = kzalloc(sizeof(struct ili9341_priv), GFP_KERNEL);
  return -ENOMEM;
 priv->spi = spi;

/* TODO: better fail handling... */
 priv->dc = of_get_named_gpio(np, "dc-gpio", 0);
 if (priv->dc  == -EPROBE_DEFER)
  return -EPROBE_DEFER;
 if (gpio_is_valid(priv->dc)) {
  ret = devm_gpio_request(&spi->dev, priv->dc, "tft dc");
  if (ret)
   dev_err(&spi->dev, "could not request dcn");
  ret = gpio_direction_output(priv->dc, 1);
  if (ret)
   dev_err(&spi->dev, "could not set DC to output");
  dev_err(&spi->dev, "DC gpio is not valid");

 dev_err(&spi->dev, "Initialize regmap");
 priv->regmap = devm_regmap_init_spi(spi, &ili9341_regmap_config);
 if (IS_ERR(priv->regmap))
  goto err_regmap;
 dev_err(&spi->dev, "regmap OK");
 priv->fbmem = vzalloc(priv->vsize);
 if (!priv->fbmem)
  goto err_videomem_alloc;

 retval = sysfs_create_group(&spi->dev.kobj, *ili9341_groups);
 if (retval)

 dev_err(&spi->dev, "Allocate framebuffer");
 info = framebuffer_alloc(sizeof(struct fb_info), &spi->dev);
 if (!info)
  goto err_fballoc;

 info->par = priv;
 priv->info = info;
 info->screen_base = priv->fbmem;
 info->fbops = &ili9341_ops;
 info->var = ili9341_var;
 info->fix = ili9341_fix;
 info->fix.smem_len = priv->vsize;
 /* We are virtual as we only exists in memory */
 info->fbdefio = &ili9341_defio;
 retval = register_framebuffer(info);
 if (retval < 0)
  goto err_fbreg;
 spi_set_drvdata(spi, info);
 fb_info(info, "Hecuba frame buffer device, using %dK of video memoryn",
  priv->vsize >> 10);

 priv->dir = debugfs_create_dir("ili9341-fb", NULL);
 debugfs_create_u32("chunk_size", 0666, priv->dir, &chunk_size);

 return 0;
 return retval;
static int ili9341_remove(struct spi_device *spi)
 struct fb_info *info = spi_get_drvdata(spi);
 if (info) {
  struct ili9341_priv *priv = info->par;
 return 0;

static const struct spi_device_id ili9341_ids[] = {
 {"ili9341-fb", 0},
MODULE_DEVICE_TABLE(spi, ili9341_ids);
static struct spi_driver  ili9341_driver = {
 .probe = ili9341_probe,
 .remove = ili9341_remove,
 .id_table = ili9341_ids,
 .driver = {
  .owner = THIS_MODULE,
  .name = "ili9341-fb",
MODULE_DESCRIPTION("fbdev driver for ili9341 controller");
MODULE_AUTHOR("Marcus Folkesson <>");

Take control of your Buffalo Linkstation NAS

Take control of your Buffalo Linkstation NAS

I finally bought a NAS for all of my super-important stuff.
It became a Buffalo Linkstation LS200, most because of the price ($300 for 4TB). It supports all of the standard protocols such as FTP, SAMBA, ATP and so on.

However, it would be really useful to use some sane protocols like sftp so you could use rsync for your backup scripts.

Bring a big coffee mug and let the hacking begin….

I knew that the NAS was based on the ARM architecture and supports a whole set of high level protocols, so one qualified guess is that there lives a little penguin in the box.

Lets start with download the latest firmware from the Buffalo webpage.
When the firmware is unzipped we have these files:

marcus@tuxie:~/shared/buffalo$ ls -al
total 780176
drwxr-xr-x  4 marcus marcus      4096 Jun 13 00:06 .
drwxr-xr-x 18 marcus marcus      4096 Jun 12 23:27 ..
-rw-r--r--  1 marcus marcus 190409979 Jun 13 00:06 hddrootfs.img
drwxr-xr-x  2 marcus marcus      4096 Apr  8 13:25 img
-rw-r--r--  1 marcus marcus  12602325 Apr  1 15:25 initrd.img
-rw-r--r--  1 marcus marcus       656 Apr  1 15:25 linkstation_version.ini
-rw-r--r--  1 marcus marcus       198 Apr  1 15:25 linkstation_version.txt
-rw-r--r--  1 marcus marcus 205568610 Jun 12 23:13
-rw-r--r--  1 marcus marcus    350104 Apr  1 15:25 LSUpdater.exe
-rw-r--r--  1 marcus marcus       327 Apr  1 15:25 LSUpdater.ini
-rw-r--r--  1 marcus marcus    674181 Apr  1 15:25 u-boot.img
-rw-r--r--  1 marcus marcus   2861933 Apr  1 15:25 uImage.img
-rw-r--r--  1 marcus marcus      4880 Apr  8 13:23 update.html

Ok, u-boot.img, uImage.img, initrd.img and hddrootfs.img tells us that I have a black little penguin cage
in front of me.
First of all, find out what kind of file these *.img files really are.:

System Message: WARNING/2 (<stdin>, line 34); backlink

Inline emphasis start-string without end-string.

marcus@tuxie:~/shared/buffalo$ file ./hddrootfs.img
./hddrootfs.img: Zip archive data, at least v2.0 to extract

Really? It is just an zip-file. Lets extract it then.:

marcus@tuxie:~/shared/buffalo$ unzip hddrootfs.img
Archive:  hddrootfs.img
[hddrootfs.img] hddrootfs.buffalo.updated password:

Of course, it is protected with a password. I will let my old friend John the Ripper take a look at it (I guess I had a great luck, the brute force attack only took 2.5 hours).
The password for the file is: aAhvlM1Yp7_2VSm6BhgkmTOrCN1JyE0C5Q6cB3oBB

marcus@tuxie:~/shared/buffalo$ unzip hddrootfs.img
Archive:  hddrootfs.img
[hddrootfs.img] hddrootfs.buffalo.updated password:
  inflating: hddrootfs.buffalo.updated

Terrific. We got a hddrootfs.buffalo.updated file. What is it anyway?:

marcus@tuxie:~/shared/buffalo$ file hddrootfs.buffalo.updated
hddrootfs.buffalo.updated: gzip compressed data, was "rootfs.tar", from Unix, last modified: Tue Apr  1 08:24:05 2014, max compression

It is just a gzip compressed tar archive, couldn’t be better! Extract it.:

marcus@tuxie:~/shared/buffalo$ mkdir rootfs
marcus@tuxie:~/shared/buffalo$ tar -xz --numeric-owner -f hddrootfs.buffalo.updated  -C ./rootfs/
marcus@tuxie:~/shared/buffalo$ ls -1l rootfs/
total 80
drwxr-xr-x  2 root root 4096 Apr  1 08:23 bin
-rwxr-xr-x  1 root root 1140 Feb  3 07:51
drwxr-xr-x  2 root root 4096 Apr  1 08:23 debugtool
drwxr-xr-x  5 root root 4096 Apr  1 08:23 dev
drwxr-xr-x 33 root root 4096 Jun 12 23:17 etc
drwxr-xr-x  4 root root 4096 Apr  1 08:23 home
drwxr-xr-x  9 root root 4096 Apr  1 08:23 lib
drwxr-xr-x  3 root root 4096 Feb  3 07:51 mnt
drwxr-xr-x  2 root root 4096 Apr  1 07:35 opt
-rwxr-xr-x  1 root root 2741 Feb  3 07:51
drwxr-xr-x  2 root root 4096 Apr  1 07:35 proc
drwxr-xr-x  3 root root 4096 Apr  1 08:23 root
drwxr-xr-x  3 root root 4096 Apr  1 08:22 run
drwxr-xr-x  2 root root 4096 Apr  1 08:23 sbin
drwxr-xr-x  2 root root 4096 Apr  1 07:35 sys
-rwxr-xr-x  1 root root 3751 Feb  3 07:51
drwxrwxrwt  3 root root 4096 Apr  1 08:23 tmp
drwxr-xr-x 11 root root 4096 Apr  1 08:23 usr
drwxr-xr-x  9 root root 4096 Apr  1 08:22 var
drwxrwxrwx  6 root root 4096 Apr  1 07:53 www

Here we go!

Modify the root filesystem

First of all, I really would like to have SSH access to the box, and I found that there is a SSH daemon in here (/usr/bin/sshd), but why is it not activated?
Take a look in one of the scripts that seems to be related to ssh:

marcus@tuxie:~/shared/buffalo/rootfs$ head -n 20 etc/init.d/
[ -f /etc/nas_feature ] && . /etc/nas_feature
SSHD=`which sshd`
if [ "${SSHD}" = "" -o ! -x ${SSHD} ] ; then
 echo "sshd is not supported on this platform!!!"
if [ "${SUPPORT_SFTP}" = "0" ] ; then
        echo "Not support sftp on this model." > /dev/console
        exit 0
umask 000

What about the second if-statement? SUPPORT_SFTP? And what is the /etc/nas_feature file? It does not exist in the package. Is it auto generated at boot?
Anyway, I remove the second statement, it seems evil.
So, if this starts up the ssh daemon, we would like to login as root, uncomment PermitRootLogin in sshd_config:

marcus@tuxie:~/shared/buffalo/rootfs$ sed -i 's/#PermitRootLogin/ PermitRootLogin/' etc/sshd_config

Then copy your public rsa-key to /root/.ssh/authorized_keys. If you don’t have a key, generate it with:

marcus@tuxie:~/shared/buffalo/rootfs$ ssh-keygen

Copy the key to target:

marcus@tuxie:~/shared/buffalo/rootfs$ mkdir ./root/.ssh
marcus@tuxie:~/shared/buffalo/rootfs$ cat ~/.ssh/ > ./root/.ssh/authorized_keys

This will let you to login without give any password.

Lets try to re-pack the whole thing.:

marcus@tuxie:~/shared/buffalo/rootfs$ mv ../hddrootfs.buffalo.updated{,.old}
marcus@tuxie:~/shared/buffalo/rootfs$ tar -czf ../hddrootfs.buffalo.updated *
marcus@tuxie:~/shared/buffalo/rootfs$ cd ..
marcus@tuxie:~/shared/buffalo$ zip -e hddrootfs.img hddrootfs.buffalo.updated

I encrypt the file with the same password as before, I dare not think about what happens if I don’t.

Time to update firmware

I execute the LSUpdater.exe from a virtual Windows machine and hold my thumbs…
The update process takes about 8 minutes and is a real pain, would it brick my NAS..?

After a while the power LED is indicating that the NAS is up and running. Wow.
Quick! Do a portscan!:

marcus@tuxie:~/shared/buffalo$ sudo nmap -sS
[sudo] password for marcus:
Starting Nmap 5.21 ( ) at 2014-06-13 11:59 CEST
Nmap scan report for nas (
Host is up (0.00049s latency).
Not shown: 990 closed ports
21/tcp    open  ftp
22/tcp    open  ssh
80/tcp    open  http
139/tcp   open  netbios-ssn
443/tcp   open  https
445/tcp   open  microsoft-ds
548/tcp   open  afp
873/tcp   open  rsync
8873/tcp  open  unknown
22939/tcp open  unknown
MAC Address: DC:FB:02:EB:06:A8 (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.27 seconds

And there it is! The SSH daemon is running on port 22.:

marcus@tuxie:~$ ssh admin@
admin@'s password:
[admin@LS220D6A8 ~]$ ls /
bin/        debugtool/  home/       mnt/        proc/       sbin/       tmp/        www/
boot/       dev/        lib/        opt/        root/       sys/        usr/*  etc/        lost+found/* run/*    var/
[admin@LS220D6A8 ~]$

It is just beautiful!

Wait, what about the /etc/nas_features file?

[admin@LS220D6A8 ~]$ cat /etc/nas_feature

It seems that the files is generated. It also has the SUPPORT_SFTP config that we saw in

What about the kernel

In the current vanilla kernel, there is devicetrees that seems to be for the Buffalo linkstation.:

marcus@tuxie:~/marcus/linux/linux$ grep -i buffalo arch/arm/boot/dts/*.dts
arch/arm/boot/dts/kirkwood-lschlv2.dts: model = "Buffalo Linkstation LS-CHLv2";
arch/arm/boot/dts/kirkwood-lschlv2.dts: compatible = "buffalo,lschlv2", "buffalo,lsxl", "marvell,kirkwood-88f6281", "marvell,kirkwood";
arch/arm/boot/dts/kirkwood-lsxhl.dts:   model = "Buffalo Linkstation LS-XHL";
arch/arm/boot/dts/kirkwood-lsxhl.dts:   compatible = "buffalo,lsxhl", "buffalo,lsxl", "marvell,kirkwood-88f6281", "marvell,kirkwood";

The device tree seems to match the current CPU:

[admin@LS220D6A8 ~]$ cat /proc/cpuinfo
Processor : Marvell PJ4Bv7 Processor rev 1 (v7l)
BogoMIPS : 795.44
Features : swp half thumb fastmult vfp edsp vfpv3 vfpv3d16 tls
CPU implementer : 0x56
CPU architecture: 7
CPU variant : 0x1
CPU part : 0x581
CPU revision : 1
Hardware : Marvell Armada-370
Revision : 0000
Serial  : 0000000000000000

Also, the kernel is not tainted indicating that there is no out-of-tree modules.:

[admin@LS220D6A8 ~]$ cat /proc/sys/kernel/tainted

It could therefor be possible to compile a custom kernel with support for more USB-devices that may be plugged into the NAS.

Other tips

The LSUpdater.exe will refuse to update if the same version of software is already on the target. This means that you cannot upload the same firmware again… unless you change the version!

Together with the uploader application, there is a linkstation_version.ini file that contains information about each of the *.img.
The first thing I tried was just to increase the VERSION by one. This makes the LSUpdater.exe move a little forward, It stops complain about the same version, instead it complains about that this firmware is already on the target.
However, I needed to change the timestamp of each binary (increased the day by one), then it updates the firmware without any problem.

System Message: WARNING/2 (<stdin>, line 435); backlink

Inline emphasis start-string without end-string.

marcus@tuxie:~/shared/buffalo$ cat linkstation_version.ini
KERNEL=2014/04/01 14:33:46
INITRD=2014/04/01 14:35:10
ROOTFS=2014/04/01 15:23:41
FILE_BOOT = u-boot.img
FILE_KERNEL = uImage.img
FILE_INITRD = initrd.img
FILE_ROOTFS = hddrootfs.img

KERNEL=2014/04/01 14:33:46
KERNEL=2014/04/01 14:33:46
KERNEL=2014/04/01 14:33:46

Magic SysRq

Magic SysRq

Every kernel-hacker should knows about the magic sysrq already, so this post is kind of unnecessary.
To enable the magic, make sure CONFIG_MAGIC_SYSRQ is set in your Kernel hacking tab.

I use this feature… a lot. Mostly for set loglevel and reboot the system, but it is also very useful when debugging.
So, how to use it?

As everybody is using GNU Screen (what else) as their TTY terminal, the keyboard combination is:
ctrl+A B. And here the magic begins!
This combination simply sends a SysRq keycode to the target system.

To get information about available commands, press ctrl+A B h. h as in help.:

[669510.910125] SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
And here we go! My favorites are:
0-9, set debug level
b, reboot the system (ingrained...)
p, show a register dump ( useful if the system hangs)
g, switch console to KGDB (I love KGDB)
1, Show all timers (good when looking for power-thieves)

The best part is that Magic SysRq works in allmost every situation, even if the system is frozen.

For further details see the kernel source file Documentation/sysrq.txt