The dynamic linker [1] in a Linux system is using several environment variables to customize it's behavior. The most commonly used is probably LD_LIBRARY_PATH which is a list of directories where it search for libraries at execution time. Another variable I use quite often is LD_TRACE_LOADED_OBJECTS to let the program list its dynamic dependencies, just like ldd(1).

For example, consider the following output

$ LD_TRACE_LOADED_OBJECTS=1 /bin/bash (0x00007ffece29e000) => /usr/lib/ (0x00007fc9b82d1000) => /usr/lib/ (0x00007fc9b80cd000) => /usr/lib/ (0x00007fc9b7d15000) => /usr/lib/ (0x00007fc9b7add000)
    /lib64/ (0x00007fc9b851f000) => /usr/lib/ (0x00007fc9b78b1000)


LD_PRELOAD is a list of additional shared objects that should be loaded before all other dynamic dependencies. When the loader is resolving symbols, it sequentially walk through the list of dynamic shared objects and takes the first match. This makes it possible to overide functions in other shared objects and change the behavior of the application completely.

Consider the following example

$ LD_PRELOAD=/usr/lib/ LD_TRACE_LOADED_OBJECTS=1 /bin/bash (0x00007ffc73f61000)
    /usr/lib/ (0x00007f131c234000) => /usr/lib/ (0x00007f131bfe6000) => /usr/lib/ (0x00007f131bde2000) => /usr/lib/ (0x00007f131ba2a000) => /usr/lib/ (0x00007f131b7f2000)
    /lib64/ (0x00007f131c439000) => /usr/lib/ (0x00007f131b5c6000)

Here we have preloaded libSegFault and it is listed in second place. In the first place we have which is a Virtual Dynamic Shared Object provided by the Linux kernel. The VDSO deserves it's own separate blog post, it is a cool feature that maps kernel code into the a process's context as a .text segment in a virtual library. is part of glibc [2] and comes with your toolchain. The library is for debugging purpose and is activated by preload it at runtime. It does not actually overrides functions but register signal handlers in a constructor (yes, you can execute code before main) for specified signals. By default only SIGEGV (see signal(7)) is registered. These registered handlers print a backtrace for the applicaton when the signal is delivered. See its implementation in [3].

Set the environment variable SEGFAULT_SIGNALS to explicit select signals you want to register a handler for.


This is an useful feature for debug purpose. The best part is that you don't have to recompile your code.

libSegFault in action

Our application

Consider the following in real life application taken directly from the local nuclear power plant:

void handle_uranium(char *rod)
    *rod = 0xAB;

void start_reactor()
    char *rod = 0x00;

int main()

The symptom

We are seeing a segmentation fault when operate on a particular uranium rod, but we don't know why.

Use libSegFault

Start the application with libSegFault preloaded and examine the dump:

$ LD_PRELOAD=/usr/lib/ ./powerplant
*** Segmentation fault
Register dump:

 RAX: 0000000000000000   RBX: 0000000000000000   RCX: 0000000000000000
 RDX: 00007ffdf6aba5a8   RSI: 00007ffdf6aba598   RDI: 0000000000000000
 RBP: 00007ffdf6aba480   R8 : 000055d2ad5e16b0   R9 : 00007f98534729d0
 R10: 0000000000000008   R11: 0000000000000246   R12: 000055d2ad5e14f0
 R13: 00007ffdf6aba590   R14: 0000000000000000   R15: 0000000000000000
 RSP: 00007ffdf6aba480

 RIP: 000055d2ad5e1606   EFLAGS: 00010206

 CS: 0033   FS: 0000   GS: 0000

 Trap: 0000000e   Error: 00000006   OldMask: 00000000   CR2: 00000000

 FPUCW: 0000037f   FPUSW: 00000000   TAG: 00000000
 RIP: 00000000   RDP: 00000000

 ST(0) 0000 0000000000000000   ST(1) 0000 0000000000000000
 ST(2) 0000 0000000000000000   ST(3) 0000 0000000000000000
 ST(4) 0000 0000000000000000   ST(5) 0000 0000000000000000
 ST(6) 0000 0000000000000000   ST(7) 0000 0000000000000000
 mxcsr: 1f80
 XMM0:  00000000000000000000000000000000 XMM1:  00000000000000000000000000000000
 XMM2:  00000000000000000000000000000000 XMM3:  00000000000000000000000000000000
 XMM4:  00000000000000000000000000000000 XMM5:  00000000000000000000000000000000
 XMM6:  00000000000000000000000000000000 XMM7:  00000000000000000000000000000000
 XMM8:  00000000000000000000000000000000 XMM9:  00000000000000000000000000000000
 XMM10: 00000000000000000000000000000000 XMM11: 00000000000000000000000000000000
 XMM12: 00000000000000000000000000000000 XMM13: 00000000000000000000000000000000
 XMM14: 00000000000000000000000000000000 XMM15: 00000000000000000000000000000000


Memory map:

55d2ad5e1000-55d2ad5e2000 r-xp 00000000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant
55d2ad7e1000-55d2ad7e2000 r--p 00000000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant
55d2ad7e2000-55d2ad7e3000 rw-p 00001000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant
55d2ada9c000-55d2adabd000 rw-p 00000000 00:00 0                          [heap]
7f9852c8f000-7f9852ca5000 r-xp 00000000 00:13 13977863                   /usr/lib/
7f9852ca5000-7f9852ea4000 ---p 00016000 00:13 13977863                   /usr/lib/
7f9852ea4000-7f9852ea5000 r--p 00015000 00:13 13977863                   /usr/lib/
7f9852ea5000-7f9852ea6000 rw-p 00016000 00:13 13977863                   /usr/lib/
7f9852ea6000-7f9853054000 r-xp 00000000 00:13 13975885                   /usr/lib/
7f9853054000-7f9853254000 ---p 001ae000 00:13 13975885                   /usr/lib/
7f9853254000-7f9853258000 r--p 001ae000 00:13 13975885                   /usr/lib/
7f9853258000-7f985325a000 rw-p 001b2000 00:13 13975885                   /usr/lib/
7f985325a000-7f985325e000 rw-p 00000000 00:00 0
7f985325e000-7f9853262000 r-xp 00000000 00:13 13975827                   /usr/lib/
7f9853262000-7f9853461000 ---p 00004000 00:13 13975827                   /usr/lib/
7f9853461000-7f9853462000 r--p 00003000 00:13 13975827                   /usr/lib/
7f9853462000-7f9853463000 rw-p 00004000 00:13 13975827                   /usr/lib/
7f9853463000-7f9853488000 r-xp 00000000 00:13 13975886                   /usr/lib/
7f9853649000-7f985364c000 rw-p 00000000 00:00 0
7f9853685000-7f9853687000 rw-p 00000000 00:00 0
7f9853687000-7f9853688000 r--p 00024000 00:13 13975886                   /usr/lib/
7f9853688000-7f9853689000 rw-p 00025000 00:13 13975886                   /usr/lib/
7f9853689000-7f985368a000 rw-p 00000000 00:00 0
7ffdf6a9b000-7ffdf6abc000 rw-p 00000000 00:00 0                          [stack]
7ffdf6bc7000-7ffdf6bc9000 r--p 00000000 00:00 0                          [vvar]
7ffdf6bc9000-7ffdf6bcb000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

At a first glance, the information may feel overwelming, but lets go through the most importat lines.

The backtrace lists the call chain when the the signal was delivered to the application. The first entry is on top of the stack


Here we can see that the last executed instruction is at address 0x55d2ad5e1606. The tricky part is that the address is not absolute in the application, but virtual for the whole process. In other words, we need to calculate the address to an offset within the application's .text segment. If we look at the Memory map we see three entries for the powerplant application:

55d2ad5e1000-55d2ad5e2000 r-xp 00000000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant
55d2ad7e1000-55d2ad7e2000 r--p 00000000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant
55d2ad7e2000-55d2ad7e3000 rw-p 00001000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant

Why three? Most ELF files (application or library) has at least three memory mapped sections: - .text, The executable code - .rodata, read only data - .data, read/write data

With help of the permissions it is possible to figure out which mapping correspond to each section.

The last mapping has rw- as permissions and is probably our .data section as it allows both write and read. The middle mapping has r-- and is a read only mapping - probably our .rodata section. The first mapping has r-x which is read-only and executable. This must be our .text section!

Now we can take the address from our backtrace and subtract with the offset address for our .text section: 0x55d2ad5e1606 - 0x55d2ad5e1000 = 0x606

Use addr2line to get the corresponding line our source code

$ addr2line -e ./powerplant -a 0x606

If we go back to the source code, we see that line 3 in main.c is

*rod = 0xAB;

Here we have it. Nothing more to say.

Conclusion has been a great help over a long time. The biggest benefit is that you don't have to recompile your application when you want to use the feature. However, you cannot get the line number from addr2line if the application is not compiled with debug symbols, but often it is not that hard to figure out the context out from a dissassembly of your application.

Lund Linux Conference 2018

Lund Linux Conference 2018

It is just two weeks from now til the Lund Linux Conference (LLC) [1] begins! LLC is a two-day conference with the same layout as the bigger Linux conferences - just smaller, but just as nice.

There will be talks about PCIe, The serial device bus, security in cars and a few more topics. My highlights this year is to hear about the XDP (eXpress Data Path) [2] to get really fast packet processing with standard Linux. For the last six months, XDP has had great progress and is a technically cool feature.

Here we are back in 2017:




When the system is running out of memory, the Out-Of-Memory (OOM) killer picks a process to kill based on the current memory footprint. In case of OOM, we will calculate a badness score between 0 (never kill) and 1000 for each process in the system. The process with the highest score will be killed. A score of 0 is reserved for unkillable tasks such as the global init process (see [1]) or kernel threads (processes with PF_KTHREAD flag set).


The current score of a given process is exposed in procfs, see /proc/[pid]/oom_score, and may be adjusted by setting /proc/[pid]/oom_score_adj. The value of oom_score_adj is added to the score before it is used to determine which task to kill. The value may be set between OOM_SCORE_ADJ_MIN (-1000) and OOM_SCORE_DJ_MAX (+1000). This is useful if you want to guarantee that a process never is selected by the OOM killer.

The calculation is simple (nowadays), if a task is using all its allowed memory, the badness score will be calculated to 1000. If it is using half of its allowed memory, the badness score is calculated to 500 and so on. By setting oom_score_adj to -1000, the badness score sums up to <=0 and the task will never be killed by OOM.

There is one more thing that affects the calculation; if the process is running with the capability CAP_SYS_ADMIN, it gets a 3% discount, but that is simply it.

The old implementation

Before v2.6.36, the calculation of badness score tried to be smarter, besides looking for the total memory usage (task->mm->total_vm), it also considered: - Whether the process creates a lot of children - Whether the process has been running for a long time, or has used a lot of CPU time - Whether the process has a low nice value - Whether the process is privileged (CAP_SYS_ADMIN or CAP_SYS_RESOURCE set) - Whether the process is making direct hardware access

At first glance, all these criteria looks valid, but if you think about it a bit, there is a lot of pitfalls here which makes the selection not so fair. For example: A process that creates a lot of children and consumes some memory could be a leaky webserver. Another process that fits into the description is your session manager for your desktop environment which naturally creates a lot of child processes.

The new implementation

This heuristic selection has evolved over time, instead of looking on mm->total_vm for each task, the task's RSS (resident set size, [2]) and swap space is used instead. RSS and Swap space gives a better indication of the amount that we will be able to free if we chose this task. The drawback with using mm->total_vm is that it includes overcommitted memory ( see [3] for more information ) which is pages that the process has claimed but has not been physically allocated.

The process is now only counted as privileged if CAP_SYS_ADMIN is set, not CAP_SYS_RESOURCE as before.

The code

The whole implementation of OOM killer is located in mm/oom_kill.c. The function oom_badness() will be called for each task in the system and returns the calculated badness score.

Let's go through the function.

unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
              const nodemask_t *nodemask, unsigned long totalpages)
    long points;
    long adj;

    if (oom_unkillable_task(p, memcg, nodemask))
        return 0;

Looking for unkillable tasks such as the global init process.

p = find_lock_task_mm(p);
if (!p)
    return 0;

adj = (long)p->signal->oom_score_adj;
if (adj == OOM_SCORE_ADJ_MIN ||
        test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
        in_vfork(p)) {
    return 0;

If proc/[pid]/oom_score_adj is set to OOM_SCORE_ADJ_MIN (-1000), do not even consider this task

points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
    atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);

Calculate a score based on RSS, pagetables and used swap space

if (has_capability_noaudit(p, CAP_SYS_ADMIN))
    points -= (points * 3) / 100;

If it is root process, give it a 3% discount. We are no mean people after all

adj *= totalpages / 1000;
points += adj;

Normalize and add the oom_score_adj value

return points > 0 ? points : 1;

At last, never return 0 for an eligible task as it is reserved for non killable tasks



The OOM logic is quite straightforward and seems to have been stable for a long time (v2.6.36 was released in october 2010). The reason why I was looking at the code was that I did not think the behavior I saw when experimenting corresponds to what was written in the man page for oom_score. It turned out that the manpage was not updated when the new calculation was introduced back in 2010.

I have updated the manpage and it is available in v4.14 of the Linux manpage project [4].

commit 5753354a3af20c8b361ec3d53caf68f7217edf48
Author: Marcus Folkesson <>
Date:   Fri Nov 17 13:09:44 2017 +0100

    proc.5: Update description of /proc/<pid>/oom_score

    After Linux 2.6.36, the heuristic calculation of oom_score
    has changed to only consider used memory and CAP_SYS_ADMIN.

    See kernel commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10.

    Signed-off-by: Marcus Folkesson <>
    Signed-off-by: Michael Kerrisk <>

diff --git a/man5/proc.5 b/man5/proc.5
index 82d4a0646..4e44b8fba 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -1395,7 +1395,9 @@ Since Linux 2.6.36, use of this file is deprecated in favor of
 .IR /proc/[pid]/oom_score_adj .
 .IR /proc/[pid]/oom_score " (since Linux 2.6.11)"
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
 This file displays the current score that the kernel gives to
 this process for the purpose of selecting a process
 for the OOM-killer.
@@ -1403,7 +1405,16 @@ A higher score means that the process is more likely to be
 selected by the OOM-killer.
 The basis for this score is the amount of memory used by the process,
 with increases (+) or decreases (\-) for factors including:
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
+.IP * 2
+whether the process is privileged (\-);
+.\" More precisely, if it has CAP_SYS_ADMIN or (pre 2.6.36) CAP_SYS_RESOURCE
+Before kernel 2.6.36 the following factors were also used in the calculation of oom_score:
 .IP * 2
 whether the process creates a lot of children using
@@ -1413,10 +1424,7 @@ whether the process creates a lot of children using
 whether the process has been running a long time,
 or has used a lot of CPU time (\-);
 .IP *
-whether the process has a low nice value (i.e., > 0) (+);
-.IP *
-whether the process is privileged (\-); and
-.\" More precisely, if it has CAP_SYS_ADMIN or CAP_SYS_RESOURCE
+whether the process has a low nice value (i.e., > 0) (+); and
 .IP *
 whether the process is making direct hardware access (\-).
 .\" More precisely, if it has CAP_SYS_RAWIO

Embedded Linux course in Linköping

Embedded Linux course in Linköping

I tech in our Embedded Linux course on a regular basis, this time in Linköping. It's a fun course with interesting labs where you will write your own linux device driver for a custom board.

The course itself is quite moderate, but with talented participants, we easily slip over to more interesting things like memory management, the MTD subsystem, how ftrace works internally and how to use it, different contexts, introduce perf and much more.

This time was no exception.


FIT vs legacy image format

FIT vs legacy image format

U-Boot supports several image formats when booting a kernel. However, a Linux system usually need multiple files for booting. Such files may be the kernel itself, an initrd and a device tree blob.

A typical embedded Linux system have all these files in at least two-three different configurations. It is not uncommon to have

  • Default configuration
  • Rescue configuration
  • Development configuration
  • Production configuration
  • ...

Only these four configurations may involve twelve different files. Or maybe the devicetree is shared between two configurations. Or maybe the initrd is... Or maybe the..... you got the point. It has a farily good chance to end up quite messy.

This is a problem with the old legacy formats and is somehow addressed in the FIT (Flattened Image Tree) format.

Lets first look at the old formats before we get into FIT.


The most well known format for the Linux kernel is the zImage. The zImage contains a small header followed by a self-extracting code and finally the payload itself. U-Boot support this format by the bootz command.

Image layout

zImage format
Decompressing code
Compressed Data


When talking about U-Boot, the zImage is usually encapsulated in a file called uImage created with the mkimage utility. Beside image data, the uImage also contains information such as OS type, loader information, compression type and so on. Both data and header is checksumed with CRC32.

The uImage format also supports multiple images. The zImage, initrd and devicetree blob may therefor be included in one single monolithic uImage. Drawbacks with this monolith is that it has no flexible indexing, no hash integrity and no support for security at all.

Image layout

uImage format
Header checksum
Data size
Data load address
Entry point address
Data CRC
Image type
Compression type
Image name
Image data


The FIT (Flattened Image Tree) has been around for a while but is for unknown reason rarely used in real systems, based on my experience. FIT is a tree structure like Device Tree (before they change format to YAML: and handles images of various types.

These images is then used in something called configurations. One image may be a used in several configurations.

This format has many benifits compared to the other:

  • Better solutions for multi component images
    • Multiple kernels (productions, feature, debug, rescue...)
    • Multiple devicetrees
    • Multiple initrds
  • Better hash integrity of images. Supports different hash algorithms like
    • SHA1
    • SHA256
    • MD5
  • Support signed images
    • Only boot verified images
    • Detect malware

Image Tree Source

The file describing the structure is called .its (Image Tree Source). Here is an example of a .its file


/ {
    description = "Marcus FIT test";
    #address-cells = <1>;

    images {
        kernel@1 {
            description = "My default kenel";
            data = /incbin/("./zImage");
            type = "kernel";
            arch = "arm";
            os = "linux";
            compression = "none";
            load = <0x83800000>;
            entry = <0x83800000>;
            hash@1 {
                algo = "md5";

        kernel@2 {
            description = "Rescue image";
            data = /incbin/("./zImage");
            type = "kernel";
            arch = "arm";
            os = "linux";
            compression = "none";
            load = <0x83800000>;
            entry = <0x83800000>;
            hash@1 {
                algo = "crc32";

        fdt@1 {
            description = "FDT for my cool board";
            data = /incbin/("./devicetree.dtb");
            type = "flat_dt";
            arch = "arm";
            compression = "none";
            hash@1 {
                algo = "crc32";


    configurations {
        default = "config@1";

        config@1 {
            description = "Default configuration";
            kernel = "kernel@1";
            fdt = "fdt@1";

        config@2 {
            description = "Rescue configuration";
            kernel = "kernel@2";
            fdt = "fdt@1";


This .its file has two kernel images (default and rescue) and one FDT image. Note that the default kernel is hashed with md5 and the other with crc32. It is also possible to use several hash-functions per image.

It also specifies two configuration, Default and Rescue. The two configurations is using different kernels (well, it is the same kernel in this case since both is pointing to ./zImage..) but share the same FDT. It is easy to build up new configurations on demand.

The .its file is then passed to mkimage to generate an itb (Image Tree Blob):

mkimage -f kernel.its kernel.itb

which is bootable from U-Boot.

Look at ./doc/uImage.FIT/ in the U-Boot source code for more examples on how a .its-file can look like. It also contains examples on signed images which is worth a look.

Boot from U-Boot

To boot from U-Boot, use the bootm command and specify the physical address for the FIT image and the configuration you want to boot.

Example on booting config@1 from the .its above:

bootm 0x80800000#config@1

Full example with bootlog:

U-Boot 2015.04 (Oct 05 2017 - 14:25:09)

CPU:   Freescale i.MX6UL rev1.0 at 396 MHz
CPU:   Temperature 42 C
Reset cause: POR
Board: Marcus Cool board
fuse address == 21bc400
serialnr-low : e1fe012a
serialnr-high : 243211d4
       Watchdog enabled
I2C:   ready
DRAM:  512 MiB
Using default environment

In:    serial
Out:   serial
Err:   serial
Net:   FEC1
Boot from USB for mfgtools
Use default environment for                              mfgtools
Run bootcmd_mfg: run mfgtool_args;bootz ${loadaddr} - ${fdt_addr};
Hit any key to stop autoboot:  0
=> bootm 0x80800000#config@1
## Loading kernel from FIT Image at 80800000 ...
   Using 'config@1' configuration
   Trying 'kernel@1' kernel subimage
     Description:  My cool kernel
     Type:         Kernel Image
     Compression:  uncompressed
     Data Start:   0x808000d8
     Data Size:    7175328 Bytes = 6.8 MiB
     Architecture: ARM
     OS:           Linux
     Load Address: 0x83800000
     Entry Point:  0x83800000
     Hash algo:    crc32
     Hash value:   f236d022
   Verifying Hash Integrity ... crc32+ OK
## Loading fdt from FIT Image at 80800000 ...
   Using 'config@1' configuration
   Trying 'fdt@1' fdt subimage
     Description:  FDT for my cool board
     Type:         Flat Device Tree
     Compression:  uncompressed
     Data Start:   0x80ed7e4c
     Data Size:    27122 Bytes = 26.5 KiB
     Architecture: ARM
     Hash algo:    crc32
     Hash value:   1837f127
   Verifying Hash Integrity ... crc32+ OK
   Booting using the fdt blob at 0x80ed7e4c
   Loading Kernel Image ... OK
   Loading Device Tree to 9ef86000, end 9ef8f9f1 ... OK

Starting kernel ...

One thing to think about

Since this FIT image tends to grow in size, it is a good idea to set CONFIG_SYS_BOOTM_LEN in the U-Boot configuration.

        Normally compressed uImages are limited to an
        uncompressed size of 8 MBytes. If this is not enough,
        you can define CONFIG_SYS_BOOTM_LEN in your board config file
        to adjust this setting to your needs.


I advocate to use the FIT format because it solves many problems that can get really awkward. Just the bonus to program one image in production instead of many is a great benifit. The configurations is well defined and there is no chance that you will end up booting one kernel image with an incompatible devicetree or whatever.

Use FIT images should be part of the bring-up-board activity, not the last thing todo before closing a project. The "Use seperate kernel/dts/initrds for now and then move on to FIT when the platform is stable" does not work. It simple not gonna happend, and it is a mistake.

Magic SysRq

Magic SysRq

Every kernel-hacker should knows about the magic sysrq already, so this post is kind of unnecessary. To enable the magic, make sure CONFIG_MAGIC_SYSRQ is set in your Kernel hacking tab.

I use this feature... a lot. Mostly for set loglevel and reboot the system, but it is also very useful when debugging. So, how to use it?

As everybody is using GNU Screen (what else) as their TTY terminal, the keyboard combination is: ctrl+A B. And here the magic begins! This combination simply sends a SysRq keycode to the target system.

To get information about available commands, press ctrl+A B h. h as in help.:

[669510.910125] SysRq : HELP : loglevel(0-9) reBoot Crash terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) thaw-filesystems(J) saK show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
And here we go! My favorites are:
0-9, set debug level
b, reboot the system (ingrained...)
p, show a register dump ( useful if the system hangs)
g, switch console to KGDB (I love KGDB)
1, Show all timers (good when looking for power-thieves)

The best part is that Magic SysRq works in allmost every situation, even if the system is frozen.

For further details see the kernel source file Documentation/sysrq.txt

Take control of your Buffalo Linkstation NAS

Take control of your Buffalo Linkstation NAS

I finally bought a NAS for all of my super-important stuff. It became a Buffalo Linkstation LS200, most because of the price ($300 for 4TB). It supports all of the standard protocols such as FTP, SAMBA, ATP and so on.

However, it would be really useful to use some sane protocols like sftp so you could use rsync for your backup scripts.

Bring a big coffee mug and let the hacking begin....

I knew that the NAS was based on the ARM architecture and supports a whole set of high level protocols, so one qualified guess is that there lives a little penguin in the box.

Lets start with download the latest firmware from the Buffalo webpage. When the firmware is unzipped we have these files:

marcus@tuxie:~/shared/buffalo\$ ls -al
total 780176
drwxr-xr-x  4 marcus marcus      4096 Jun 13 00:06 .
drwxr-xr-x 18 marcus marcus      4096 Jun 12 23:27 ..
-rw-r--r--  1 marcus marcus 190409979 Jun 13 00:06 hddrootfs.img
drwxr-xr-x  2 marcus marcus      4096 Apr  8 13:25 img
-rw-r--r--  1 marcus marcus  12602325 Apr  1 15:25 initrd.img
-rw-r--r--  1 marcus marcus       656 Apr  1 15:25 linkstation_version.ini
-rw-r--r--  1 marcus marcus       198 Apr  1 15:25 linkstation_version.txt
-rw-r--r--  1 marcus marcus 205568610 Jun 12 23:13
-rw-r--r--  1 marcus marcus    350104 Apr  1 15:25 LSUpdater.exe
-rw-r--r--  1 marcus marcus       327 Apr  1 15:25 LSUpdater.ini
-rw-r--r--  1 marcus marcus    674181 Apr  1 15:25 u-boot.img
-rw-r--r--  1 marcus marcus   2861933 Apr  1 15:25 uImage.img
-rw-r--r--  1 marcus marcus      4880 Apr  8 13:23 update.html

Ok, u-boot.img, uImage.img, initrd.img and hddrootfs.img tells us that I have a black little penguin cage in front of me. First of all, find out what kind of file these *.img files really are.:

marcus@tuxie:~/shared/buffalo$ file ./hddrootfs.img
./hddrootfs.img: Zip archive data, at least v2.0 to extract

Really? It is just an zip-file. Lets extract it then.:

marcus@tuxie:~/shared/buffalo$ unzip hddrootfs.img
Archive:  hddrootfs.img
[hddrootfs.img] hddrootfs.buffalo.updated password:

Of course, it is protected with a password. I will let my old friend John the Ripper take a look at it (I guess I had a great luck, the brute force attack only took 2.5 hours). The password for the file is: aAhvlM1Yp7_2VSm6BhgkmTOrCN1JyE0C5Q6cB3oBB

marcus@tuxie:~/shared/buffalo$ unzip hddrootfs.img
Archive:  hddrootfs.img
[hddrootfs.img] hddrootfs.buffalo.updated password:
  inflating: hddrootfs.buffalo.updated

Terrific. We got a hddrootfs.buffalo.updated file. What is it anyway?:

marcus@tuxie:~/shared/buffalo$ file hddrootfs.buffalo.updated
hddrootfs.buffalo.updated: gzip compressed data, was "rootfs.tar", from Unix, last modified: Tue Apr  1 08:24:05 2014, max compression

It is just a gzip compressed tar archive, couldn't be better! Extract it.:

marcus@tuxie:~/shared/buffalo$ mkdir rootfs
marcus@tuxie:~/shared/buffalo$ tar -xz --numeric-owner -f hddrootfs.buffalo.updated  -C ./rootfs/
marcus@tuxie:~/shared/buffalo$ ls -1l rootfs/
total 80
drwxr-xr-x  2 root root 4096 Apr  1 08:23 bin
-rwxr-xr-x  1 root root 1140 Feb  3 07:51
drwxr-xr-x  2 root root 4096 Apr  1 08:23 debugtool
drwxr-xr-x  5 root root 4096 Apr  1 08:23 dev
drwxr-xr-x 33 root root 4096 Jun 12 23:17 etc
drwxr-xr-x  4 root root 4096 Apr  1 08:23 home
drwxr-xr-x  9 root root 4096 Apr  1 08:23 lib
drwxr-xr-x  3 root root 4096 Feb  3 07:51 mnt
drwxr-xr-x  2 root root 4096 Apr  1 07:35 opt
-rwxr-xr-x  1 root root 2741 Feb  3 07:51
drwxr-xr-x  2 root root 4096 Apr  1 07:35 proc
drwxr-xr-x  3 root root 4096 Apr  1 08:23 root
drwxr-xr-x  3 root root 4096 Apr  1 08:22 run
drwxr-xr-x  2 root root 4096 Apr  1 08:23 sbin
drwxr-xr-x  2 root root 4096 Apr  1 07:35 sys
-rwxr-xr-x  1 root root 3751 Feb  3 07:51
drwxrwxrwt  3 root root 4096 Apr  1 08:23 tmp
drwxr-xr-x 11 root root 4096 Apr  1 08:23 usr
drwxr-xr-x  9 root root 4096 Apr  1 08:22 var
drwxrwxrwx  6 root root 4096 Apr  1 07:53 www

Here we go!

Modify the root filesystem

First of all, I really would like to have SSH access to the box, and I found that there is a SSH daemon in here (/usr/bin/sshd), but why is it not activated? Take a look in one of the scripts that seems to be related to ssh:

marcus@tuxie:~/shared/buffalo/rootfs$ head -n 20 etc/init.d/
[ -f /etc/nas_feature ] && . /etc/nas_feature
SSHD=`which sshd`
if [ "${SSHD}" = "" -o ! -x ${SSHD} ] ; then
 echo "sshd is not supported on this platform!!!"
if [ "${SUPPORT_SFTP}" = "0" ] ; then
        echo "Not support sftp on this model." > /dev/console
        exit 0
umask 000

What about the second if-statement? SUPPORT_SFTP? And what is the /etc/nas_feature file? It does not exist in the package. Is it auto generated at boot? Anyway, I remove the second statement, it seems evil. So, if this starts up the ssh daemon, we would like to login as root, uncomment PermitRootLogin in sshd_config:

marcus@tuxie:~/shared/buffalo/rootfs$ sed -i 's/#PermitRootLogin/ PermitRootLogin/' etc/sshd_config

Then copy your public rsa-key to /root/.ssh/authorized_keys. If you don't have a key, generate it with:

marcus@tuxie:~/shared/buffalo/rootfs$ ssh-keygen

Copy the key to target:

marcus@tuxie:~/shared/buffalo/rootfs$ mkdir ./root/.ssh
marcus@tuxie:~/shared/buffalo/rootfs$ cat ~/.ssh/ > ./root/.ssh/authorized_keys

This will let you to login without give any password.

Lets try to re-pack the whole thing.:

marcus@tuxie:~/shared/buffalo/rootfs$ mv ../hddrootfs.buffalo.updated{,.old}
marcus@tuxie:~/shared/buffalo/rootfs$ tar -czf ../hddrootfs.buffalo.updated *
marcus@tuxie:~/shared/buffalo/rootfs$ cd ..
marcus@tuxie:~/shared/buffalo$ zip -e hddrootfs.img hddrootfs.buffalo.updated

I encrypt the file with the same password as before, I dare not think about what happens if I don't.

Time to update firmware

I execute the LSUpdater.exe from a virtual Windows machine and hold my thumbs... The update process takes about 8 minutes and is a real pain, would it brick my NAS..?

After a while the power LED is indicating that the NAS is up and running. Wow. Quick! Do a portscan!:

marcus@tuxie:~/shared/buffalo$ sudo nmap -sS
[sudo] password for marcus:
Starting Nmap 5.21 ( ) at 2014-06-13 11:59 CEST
Nmap scan report for nas (
Host is up (0.00049s latency).
Not shown: 990 closed ports
21/tcp    open  ftp
22/tcp    open  ssh
80/tcp    open  http
139/tcp   open  netbios-ssn
443/tcp   open  https
445/tcp   open  microsoft-ds
548/tcp   open  afp
873/tcp   open  rsync
8873/tcp  open  unknown
22939/tcp open  unknown
MAC Address: DC:FB:02:EB:06:A8 (Unknown)
Nmap done: 1 IP address (1 host up) scanned in 0.27 seconds

And there it is! The SSH daemon is running on port 22.:

marcus@tuxie:~$ ssh admin@
admin@'s password:
[admin@LS220D6A8 ~]$ ls /
bin/        debugtool/  home/       mnt/        proc/       sbin/       tmp/        www/
boot/       dev/        lib/        opt/        root/       sys/        usr/*  etc/        lost+found/* run/*    var/
[admin@LS220D6A8 ~]$

It is just beautiful!

Wait, what about the /etc/nas_features file?

[admin@LS220D6A8 ~]$ cat /etc/nas_feature

It seems that the files is generated. It also has the SUPPORT_SFTP config that we saw in

What about the kernel

In the current vanilla kernel, there is devicetrees that seems to be for the Buffalo linkstation.:

marcus@tuxie:~/marcus/linux/linux$ grep -i buffalo arch/arm/boot/dts/*.dts
arch/arm/boot/dts/kirkwood-lschlv2.dts: model = "Buffalo Linkstation LS-CHLv2";
arch/arm/boot/dts/kirkwood-lschlv2.dts: compatible = "buffalo,lschlv2", "buffalo,lsxl", "marvell,kirkwood-88f6281", "marvell,kirkwood";
arch/arm/boot/dts/kirkwood-lsxhl.dts:   model = "Buffalo Linkstation LS-XHL";
arch/arm/boot/dts/kirkwood-lsxhl.dts:   compatible = "buffalo,lsxhl", "buffalo,lsxl", "marvell,kirkwood-88f6281", "marvell,kirkwood";

The device tree seems to match the current CPU:

[admin@LS220D6A8 ~]$ cat /proc/cpuinfo
Processor : Marvell PJ4Bv7 Processor rev 1 (v7l)
BogoMIPS : 795.44
Features : swp half thumb fastmult vfp edsp vfpv3 vfpv3d16 tls
CPU implementer : 0x56
CPU architecture: 7
CPU variant : 0x1
CPU part : 0x581
CPU revision : 1
Hardware : Marvell Armada-370
Revision : 0000
Serial  : 0000000000000000

Also, the kernel is not tainted indicating that there is no out-of-tree modules.:

[admin@LS220D6A8 ~]$ cat /proc/sys/kernel/tainted

It could therefor be possible to compile a custom kernel with support for more USB-devices that may be plugged into the NAS.

Other tips

The LSUpdater.exe will refuse to update if the same version of software is already on the target. This means that you cannot upload the same firmware again... unless you change the version!

Together with the uploader application, there is a linkstation_version.ini file that contains information about each of the *.img. The first thing I tried was just to increase the VERSION by one. This makes the LSUpdater.exe move a little forward, It stops complain about the same version, instead it complains about that this firmware is already on the target. However, I needed to change the timestamp of each binary (increased the day by one), then it updates the firmware without any problem.

marcus@tuxie:~/shared/buffalo$ cat linkstation_version.ini
KERNEL=2014/04/01 14:33:46
INITRD=2014/04/01 14:35:10
ROOTFS=2014/04/01 15:23:41
FILE_BOOT = u-boot.img
FILE_KERNEL = uImage.img
FILE_INITRD = initrd.img
FILE_ROOTFS = hddrootfs.img

KERNEL=2014/04/01 14:33:46
KERNEL=2014/04/01 14:33:46
KERNEL=2014/04/01 14:33:46

NAT with Linux

NAT with Linux

To share an internet connection may sometimes be very practical when working with embedded devices. The network may have restrictions/authentications that stops you from plug in your device into the network of the big company you are working for.

But what about creating your own network and use your computer as NAT (Network Address Translation)? I was surprised how easy it is to set up your Linux host as a NAT, it is just a few command lines.

OK, here is the setup on Host: - eth0 has ip address and is connected to the company network - eth1 has ip address ans is connected to the target

Setup on Target: - eth0 has ip address ans is connected to host

First of all, we need to setup a default gateway on our target, do this as you allways do - with route.:

Target$ route add default gw eth0

Next, we need to create a post-routing rule in the to the NAT table that masquerades all traffic to the eth0 interface. iptables is your friend:

Host$ sudo iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

Thats it! Well, allmost. We just need to enable ip-forwarding.:

Host$ echo 1 | sudo tee /proc/sys/net/ipv4/ip_forward

-ENOENT, but believe me, it's there

-ENOENT, but believe me, it's there

Almost every ELF-file in a Linux environment is dynamically linked, and the operating system has to locate all dynamic libraries in order to execute the file. To its help, it has the runtime dynamic linker, whose only job is to interpret the ELF file format, load the shared objects with unresolved references, and at last, execute and pass over the control to the ELF file.

The expected path to the dynamic linker is hardcoded into the ELF itself, so if the linker is missing, it leads to an obvious but maybe confusing error.

If I want to run an executable that has its path to a dynamic linker that simple does not exists, we got this:

root@cgtqmx6:/# ./01_SimpleTriangle -sh: ./01_SimpleTriangle: No such file or directory

Hue? But the file does exist! OK, lets find out which system calls is made.:

root@cgtqmx6:/# ./01_SimpleTriangle -sh: ./01_SimpleTriangle: No such file or directoryroot@cgtqmx6:/# strace ./01_SimpleTriangle execve("./01_SimpleTriangle", ["./01_SimpleTriangle"], [/* 15 vars */]) = -1 ENOENT (No such file or directory)write(2, "strace: exec: No such fil40strace: exec: No such file or directory) = 40exit_group(1)                           = ?+++ exited with 1 +++

Not much valuable information. Strace calls execve that says the file does not exist. We are talking about the file, but which file is it? As told before, the runtime dynamic linker is always executed first to load libraries for unresolved symbols. So one qualified guess is that it is the dynamic linker that is missing.

Readelf is a handy tool to read out information from an ELF file such as sections, which architecture it is compiled for and so on. It also shows us the expected path to the dynamic linker.:

root@cgtqmx6:/# readelf -l 01_SimpleTriangle Elf file type is EXEC (Executable file)Entry point 0x9288There are 8 program headers, starting at offset 52
Program Headers:  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align  EXIDX          0x0022ec 0x0000a2ec 0x0000a2ec 0x00088 0x00088 R   0x4  PHDR           0x000034 0x00008034 0x00008034 0x00100 0x00100 R E 0x4  INTERP         0x000134 0x00008134 0x00008134 0x00013 0x00013 R   0x1    >>>>>>  [Requesting program interpreter: /lib/] <<<<<<<  LOAD           0x000000 0x00008000 0x00008000 0x02378 0x02378 R E 0x8000         0x002378 0x00012378 0x00012378 0x002a0 0x00358 RW  0x8000  DYNAMIC        0x002384 0x00012384 0x00012384 0x00118 0x00118 RW  0x4  NOTE           0x000148 0x00008148 0x00008148 0x00020 0x00020 R   0x4  GNU_STACK      0x000000 0x00000000 0x00000000 0x00000 0x00000 RW  0x4
 Section to Segment mapping:  Segment Sections...   00     .ARM.exidx    01        02     .interp    03     .interp .note.ABI-tag .hash .dynsym .dynstr .gnu.version .gnu.version_r .rel.dyn .rel.plt .init .plt .text .fini .rodata .ARM.extab .ARM.exidx .eh_frame    04     .init_array .fini_array .jcr .dynamic .got .data .bss    05     .dynamic    06     .note.ABI-tag    07

Here we go! The expected path is /lib/ Our system is currently using /lib/, so lets create a symbolic link for now.:

ln -s /lib/ /lib/

And finally, the program is now executing!:

root@cgtqmx6:/# ./01_SimpleTriangle root@cgtqmx6:/#

Anyway, who decide which runtime dynamic linker to use? It is decided in the linker-stage, usually by ld or gcc and may vary between different toolchains,

Modules with parameters

Modules with parameters

Everybody knows that modules can take parameters, either via /sys/modules/<module>/parameters or via cmdline to the kernel, but how are these parameters created?

Parameters without callbacks

The Linux kernel provides the module_param() macro. The syntax is:

module_param(name, type, perm)

Which will simply create the module parameter and expose it as an entry in /sys/modules/<module>/parameters.

Code example

int debug_flag;
module_param(debug_flag, bool, S_IRUSR | S_IWUSR | S_IRGRP)
MODULE_PARM_DESC(debug_flag, "Set to 1 if debug should be enabled, 0 otherwise");

MODULE_PARM_DESC() is a short description of the parameter. Modinfo will read the description and present it for you.

Parameters with callbacks

Sometimes it may be useful to actually notify the driver that the value of a parameter has changed, which not the regular module_param() macro does.

module_param_cb is the way to go. The macro takes two callbacks functions, set and get, that is called when the user (or kernel if in cmdline) interact with the parameters. This is done by passing a struct kernel_param_ops to the macro. The syntax is:

module_param_cb(name, ops, arg, perm)
The module_param_cb is not heavily used in the kernel if we look in the drivers:
[06:40:35]marcus@tuxie:~/kernel$ git grep module_param_cb drivers/
drivers/acpi/sysfs.c:module_param_cb(debug_layer, &param_ops_debug_layer, &acpi_dbg_layer, 0644);
drivers/acpi/sysfs.c:module_param_cb(debug_level, &param_ops_debug_level, &acpi_dbg_level, 0644);
drivers/char/ipmi/ipmi_watchdog.c:module_param_cb(action, &param_ops_str, action_op, 0644);
drivers/char/ipmi/ipmi_watchdog.c:module_param_cb(preaction, &param_ops_str, preaction_op, 0644);
drivers/char/ipmi/ipmi_watchdog.c:module_param_cb(preop, &param_ops_str, preop_op, 0644);

In fact, there are just 5 entries, don't ask me why, I think the macro is terrific. The interface is really simple, just fill the kernel_param_ops struct and pass it to the module_param_cb macro. I think the code is quite self-explained, so I just post an example taken from drivers/acpi/sysfs.c.

Code example

static int param_get_debug_level(char *buffer, const struct kernel_param *kp)
 int result = 0;
 int i;
 result = sprintf(buffer, "%-25s\tHex        SET\n", "Description");
 for (i = 0; i < ARRAY_SIZE(acpi_debug_levels); i++) {
  result += sprintf(buffer + result, "%-25s\t0x%08lX [%c]\n",
      (acpi_dbg_level & acpi_debug_levels[i].value)
      ? '*' : ' ');
 result +=
     sprintf(buffer + result, "--\ndebug_level = 0x%08X (* = enabled)\n",
 return result;

static struct kernel_param_ops param_ops_debug_level = {
 .set = param_set_uint,
 .get = param_get_debug_level,
module_param_cb(debug_level, &param_ops_debug_level, &acpi_dbg_level, 0644);

There is also a set of standard set/get-functions (the code above use param_set_uint for example). These are called param_(set|get)_XXX where XXX is byte, short, int, long and so on.

Take a look in include/linux/moduleparam.h for further reading!