BEER2RST – my first attempt with golang

I’m using Beersmith [1], a (non-free…) application, to create my beer recipes. It is written in Java and runs well on Linux. One of the biggest benefits of using Beersmith is that my brewing house [2] takes exported recipes as input and sets up mash schemes automatically – really convenient.

I brew beer several times a month and always take notes on each brewing session, both methods and results. The goal is to get reproducible results each time, or to make small improvements.
Instead of having all these notes in separate locations, it would be nice to collect all of my brewing as blog posts.

With that said, this blog will probably evolve to contain non-technical posts as well in the near future.

In these posts, I also want to publish my recipes somehow.
Beersmith is able to export recipes in a non-overcomplicated XML format that is quite straightforward to parse.
All posts that I write are in reStructuredText [3] format, so I need to create a tool that reads the XML and exports the recipes in reStructuredText format.

First glance at Golang

http://www.marcusfolkesson.se/wp-content/uploads/2018/07/golang.png

To be honest, I’m not really a fan of high-level programming languages as I can’t take them seriously. But I guess it is time to learn something new.
What I have tested with Go so far is rather impressive. For example, I cannot imagine a simpler way to parse XML (I’m used to libxml2).

I also like the gofmt [4] tool that keeps the code properly formatted. Every language should have such a tool.
It is also easy to cross-compile the application for different architectures by specifying $GOOS and $GOARCH. I have only tested with ARM, and it just works.

Imports

Most languages have a way to declare what functionality should be imported and made available to your file.
Go uses import with the following syntax:

import (
    "encoding/xml"
    "flag"
    "fmt"
    "io/ioutil"
    "os"
    "strings"
)

What makes it really interesting is when you do this

import (
    "github.com/marcusfolkesson/tablewriter"
)

I.e. point to a repository.
You only need to download the repository to your $GOPATH location and then it is usable:

go get github.com/marcusfolkesson/tablewriter
I needed to create tables to print the recipes properly. I found tablewriter [5], which prints ASCII tables.
Unfortunately, it does not create tables in reStructuredText format, so I had to fork the project and implement support for that.
Hopefully the changes will make it back to the original project; there is a pending pull request for that.

Unmarshal XML

I really liked how easy it was to unmarshal XML. Consider the following snippet from an exported recipe:

<HOP>
 <NAME>Northern Brewer</NAME>
 <VERSION>1</VERSION>
 <ORIGIN>Germany</ORIGIN>
 <ALPHA>8.5000000</ALPHA>
 <AMOUNT>0.0038459</AMOUNT>
 <USE>Boil</USE>
 <TIME>60.0000000</TIME>
 <NOTES>Also called Hallertauer Northern Brewers
Used for: Bittering and finishing both ales and lagers of all kinds
Aroma: Fine, dry, clean bittering hop.  Unique flavor.
Substitutes: Hallertauer Mittelfrueh, Hallertauer
Examples: Anchor Steam, Old Peculiar, </NOTES>
 <TYPE>Both</TYPE>
 <FORM>Pellet</FORM>
 <BETA>4.0000000</BETA>
 <HSI>35.0000000</HSI>
 <DISPLAY_AMOUNT>3.85 g</DISPLAY_AMOUNT>
 <INVENTORY>0.00 g</INVENTORY>
 <DISPLAY_TIME>60.0 min</DISPLAY_TIME>
</HOP>

The first step is to create a structure to hold the values:

type Hop struct {
    Name   string  `xml:"NAME"`
    Origin string  `xml:"ORIGIN"`
    Alpha  float64 `xml:"ALPHA"`
    Amount float64 `xml:"AMOUNT"`
    Use    string  `xml:"USE"`
    Time   float64 `xml:"TIME"`
    Notes  string  `xml:"NOTES"`
}

There is no need to create fields for all tags, just the ones you are interested in.

Later on, read the file and unmarshal the XML

var hops Hop // fields are filled in from the matching XML tags

content, err := ioutil.ReadFile("beer.xml")
if err != nil {
        panic(err)
}

err = xml.Unmarshal(content, &hops)
if err != nil {
        panic(err)
}

The structure will now be populated with values from the XML file. Magical, isn’t it?

Conclusion

I have only tested Golang for approximately one working day, and I like it.
I used to use Python for all kinds of fast prototyping, but I think I will consider Golang next time.

What I really like in comparison with Python:

  • The fact that it compiles to a single binary is really nice, especially when you cross-compile to a different architecture.
  • No need for an interpreter
  • Static types! Dynamically typed languages make my brain hurt

The result can be found on my GitHub [6] account.

Lund Linux Conference 2018

It is just two weeks from now until the Lund Linux Conference (LLC) [1] begins!
LLC is a two-day conference with the same layout as the bigger Linux conferences – just smaller, but just as nice.

There will be talks about PCIe, the serial device bus, security in cars, and a few more topics.
My highlight this year is hearing about XDP (eXpress Data Path) [2], which enables really fast packet processing with standard Linux.
XDP has made great progress over the last six months, and it is a technically cool feature.

Here we are back in 2017:

http://www.marcusfolkesson.se/wp-content/uploads/2018/04/lund-linuxcon-2018_half.jpg

Linux driver for PhoenixRC adapter

Update: Michael Larabel on Phoronix has written a post [3] about this driver. Go ahead and read it as well!

A few years ago I used to build multirotors, mostly quadcopters and tricopters. It is a fun hobby; both building and flying are incredibly satisfying.
The first multirotors I built were nicely made with CNC-cut parts. They looked really nice and robust.
However, with more than 100 crashes under the belt, the last ones were made out of sticks and a food box. Easy to repair and just as fun to fly.

This hobby requires practice, and even if the most fun way to practice is by flying, it is really time consuming.
A more time-efficient way to practice is with a simulator, so I bought PhoenixRC [1], a flight simulator. It comes with a USB adapter that lets you connect and fly with your own RC controller.
I did not run the simulator much, though. PhoenixRC is Windows software, and there was no Linux driver for the adapter. The only instance of Windows I had was on a separate disk lying on the shelf, and switching disks in your laptop each time you want to fly is simply not going to happen.

On New Year’s Eve 2017, my wife became ill and I got some time of my own.
Far down in a box I found the adapter and plugged it into my Linux computer. Still no driver.

Reverse engineering the adapter

The reverse engineering was quite simple. It turns out that the adapter has just one configuration, one interface and one endpoint of in-interrupt type.
This simply means that it has a unidirectional communication path, initiated by sending an interrupt URB (USB Request Block) to the device.
If you are not familiar with what configurations, interfaces and endpoints are in USB terms, please refer to the USB specification.

The data from the URB was only 8 bytes. After some testing with my RC controller, I got the following mapping between the data and the channels on the controller:

data[0] = channel 1
data[1] = ? (Possibly a switch)
data[2] = channel 2
data[3] = channel 3
data[4] = channel 4
data[5] = channel 5
data[6] = channel 6
data[7] = channel 7
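Decoded in userspace C, the mapping above could be sketched like this (a hypothetical helper for illustration, not the driver code; byte 1 is skipped since its meaning is unknown):

```c
#include <stdint.h>

#define NUM_CHANNELS 7

/* Map the 8-byte interrupt report to the 7 observed channels.
 * data[1] is skipped – possibly a switch, meaning unknown. */
void parse_report(const uint8_t data[8], uint8_t channels[NUM_CHANNELS])
{
        channels[0] = data[0];             /* channel 1 */
        for (int i = 1; i < NUM_CHANNELS; i++)
                channels[i] = data[i + 1]; /* channels 2..7 */
}
```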

So I created a device driver that registered an input device with the following events:

Channel   Event
1         ABS_X
2         ABS_Y
3         ABS_RX
4         ABS_RY
5         ABS_RUDDER
6         ABS_THROTTLE
7         ABS_MISC

Using a simulator

Heli-X [2] is an excellent cross-platform flight simulator that runs perfectly on Linux.
I have now spent several hours with a Goblin 700 helicopter, and it is just as fun as I remembered.

http://www.marcusfolkesson.se/wp-content/uploads/2018/03/phoenixrc.jpg

Available in Linux kernel 4.17

Of course, all code is submitted to the Linux kernel and should be merged in v4.17.

References

[1] http://www.phoenix-sim.com/
[2] http://www.heli-x.info/cms/
[3] https://www.phoronix.com/scan.php?page=news_item&px=Phoenix-RC-Flight-Adapter-Linux


get_maintainer.pl and git send-email

Like many others, I prefer email as a communication channel, especially for patches.
GitHub, Gerrit and all the other "nice" and "user-friendly" tools that try to "help" you manage your submissions simply do not fit my workflow.

As you may already know, all patches to the Linux kernel are submitted by email. scripts/get_maintainer.pl (see [1] for more info about the process) is a handy tool that takes a patch as input and gives back a bunch of email addresses.
These email addresses are usually passed to git send-email [2] for submission.

I have used various scripts to make the output from get_maintainer.pl fit git send-email, but I was not completely satisfied until I found the --to-cmd and --cc-cmd parameters to git send-email:

--to-cmd=<command>
  Specify a command to execute once per patch file which should generate patch file specific "To:" entries. Output of this command must be single email address per line. Default is the value of sendemail.tocmd configuration value.
--cc-cmd=<command>
  Specify a command to execute once per patch file which should generate patch file specific "Cc:" entries. Output of this command must be single email address per line. Default is the value of sendemail.ccCmd configuration value.

I’m very pleased with these parameters. All I have to do is put these extra lines into my ~/.gitconfig (or use git config):

[sendemail.linux]
    tocmd ="`pwd`/scripts/get_maintainer.pl --nogit --nogit-fallback --norolestats --nol"
    cccmd ="`pwd`/scripts/get_maintainer.pl --nogit --nogit-fallback --norolestats --nom"

To submit a patch, I just type:

git send-email --identity=linux ./0001-my-fancy-patch.patch

and let --to and --cc be populated automatically.

libSegFault.so

The dynamic linker [1] in a Linux system uses several environment variables to customize its behavior. The most commonly used is probably LD_LIBRARY_PATH, which is a list of directories to search for libraries at execution time.
Another variable I use quite often is LD_TRACE_LOADED_OBJECTS, which makes the program list its dynamic dependencies, just like ldd(1).

For example, consider the following output:

$ LD_TRACE_LOADED_OBJECTS=1 /bin/bash
    linux-vdso.so.1 (0x00007ffece29e000)
    libreadline.so.7 => /usr/lib/libreadline.so.7 (0x00007fc9b82d1000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007fc9b80cd000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007fc9b7d15000)
    libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x00007fc9b7add000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fc9b851f000)
    libtinfo.so.6 => /usr/lib/libtinfo.so.6 (0x00007fc9b78b1000)

LD_PRELOAD

LD_PRELOAD is a list of additional shared objects that should be loaded before all other dynamic dependencies. When the loader resolves symbols, it sequentially walks through the list of dynamic shared objects and takes the first match. This makes it possible to override functions in other shared objects and change the behavior of the application completely.
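To see the effect, here is a minimal (hypothetical) interposer that overrides rand() from libc; any dynamically linked program started with this object in LD_PRELOAD gets the preloaded version instead:

```c
/* override.c – a minimal LD_PRELOAD interposer (illustrative sketch).
 * Build: gcc -shared -fPIC -o liboverride.so override.c */

/* This definition shadows rand() from libc: when resolving the symbol,
 * the loader takes the first match in the preload list. */
int rand(void)
{
        return 42; /* not very random */
}
```

Run it with e.g. `LD_PRELOAD=$PWD/liboverride.so ./myprog`, and every call to rand() in myprog returns 42.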

Consider the following example

$ LD_PRELOAD=/usr/lib/libSegFault.so LD_TRACE_LOADED_OBJECTS=1 /bin/bash
    linux-vdso.so.1 (0x00007ffc73f61000)
    /usr/lib/libSegFault.so (0x00007f131c234000)
    libreadline.so.7 => /usr/lib/libreadline.so.7 (0x00007f131bfe6000)
    libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f131bde2000)
    libc.so.6 => /usr/lib/libc.so.6 (0x00007f131ba2a000)
    libncursesw.so.6 => /usr/lib/libncursesw.so.6 (0x00007f131b7f2000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f131c439000)
    libtinfo.so.6 => /usr/lib/libtinfo.so.6 (0x00007f131b5c6000)

Here we have preloaded libSegFault, and it is listed in second place. In first place we have linux-vdso.so.1, which is a Virtual Dynamic Shared Object provided by the Linux kernel.
The VDSO deserves its own separate blog post; it is a cool feature that maps kernel code into a process’s context as a .text segment in a virtual library.

libSegFault.so

libSegFault.so is part of glibc [2] and comes with your toolchain. The library is for debugging purposes and is activated by preloading it at runtime. It does not actually override any functions; instead it registers signal handlers in a constructor (yes, you can execute code before main) for the specified signals. By default, only SIGSEGV (see signal(7)) is registered. The registered handlers print a backtrace for the application when the signal is delivered. See the implementation in [3].
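The constructor mechanism itself is plain C. A minimal sketch (hypothetical names, just to show code running before main()):

```c
/* A function marked with the constructor attribute runs when the object
 * is loaded, before main() – the same mechanism libSegFault uses to
 * install its signal handlers. */
static int handlers_installed;

__attribute__((constructor))
static void install_handlers(void)
{
        /* libSegFault would call sigaction() here; we just set a flag
         * to show that this ran before any ordinary code. */
        handlers_installed = 1;
}
```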

Set the environment variable SEGFAULT_SIGNALS to explicitly select the signals you want to register handlers for.

http://www.marcusfolkesson.se/wp-content/uploads/2017/12/libsegfault.png

This is a useful feature for debugging. The best part is that you don’t have to recompile your code.

libSegFault in action

Our application

Consider the following real-life application, taken directly from the local nuclear power plant:

void handle_uranium(char *rod)
{
    *rod = 0xAB;
}

void start_reactor()
{
    char *rod = 0x00;
    handle_uranium(rod);
}

int main()
{
    start_reactor();
}

The symptom

We are seeing a segmentation fault when operating on a particular uranium rod, but we don’t know why.

Use libSegFault

Start the application with libSegFault preloaded and examine the dump:

$ LD_PRELOAD=/usr/lib/libSegFault.so ./powerplant
*** Segmentation fault
Register dump:

 RAX: 0000000000000000   RBX: 0000000000000000   RCX: 0000000000000000
 RDX: 00007ffdf6aba5a8   RSI: 00007ffdf6aba598   RDI: 0000000000000000
 RBP: 00007ffdf6aba480   R8 : 000055d2ad5e16b0   R9 : 00007f98534729d0
 R10: 0000000000000008   R11: 0000000000000246   R12: 000055d2ad5e14f0
 R13: 00007ffdf6aba590   R14: 0000000000000000   R15: 0000000000000000
 RSP: 00007ffdf6aba480

 RIP: 000055d2ad5e1606   EFLAGS: 00010206

 CS: 0033   FS: 0000   GS: 0000

 Trap: 0000000e   Error: 00000006   OldMask: 00000000   CR2: 00000000

 FPUCW: 0000037f   FPUSW: 00000000   TAG: 00000000
 RIP: 00000000   RDP: 00000000

 ST(0) 0000 0000000000000000   ST(1) 0000 0000000000000000
 ST(2) 0000 0000000000000000   ST(3) 0000 0000000000000000
 ST(4) 0000 0000000000000000   ST(5) 0000 0000000000000000
 ST(6) 0000 0000000000000000   ST(7) 0000 0000000000000000
 mxcsr: 1f80
 XMM0:  00000000000000000000000000000000 XMM1:  00000000000000000000000000000000
 XMM2:  00000000000000000000000000000000 XMM3:  00000000000000000000000000000000
 XMM4:  00000000000000000000000000000000 XMM5:  00000000000000000000000000000000
 XMM6:  00000000000000000000000000000000 XMM7:  00000000000000000000000000000000
 XMM8:  00000000000000000000000000000000 XMM9:  00000000000000000000000000000000
 XMM10: 00000000000000000000000000000000 XMM11: 00000000000000000000000000000000
 XMM12: 00000000000000000000000000000000 XMM13: 00000000000000000000000000000000
 XMM14: 00000000000000000000000000000000 XMM15: 00000000000000000000000000000000

Backtrace:
./powerplant(+0x606)[0x55d2ad5e1606]
./powerplant(+0x628)[0x55d2ad5e1628]
./powerplant(+0x639)[0x55d2ad5e1639]
/usr/lib/libc.so.6(__libc_start_main+0xea)[0x7f9852ec6f6a]
./powerplant(+0x51a)[0x55d2ad5e151a]

Memory map:

55d2ad5e1000-55d2ad5e2000 r-xp 00000000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant
55d2ad7e1000-55d2ad7e2000 r--p 00000000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant
55d2ad7e2000-55d2ad7e3000 rw-p 00001000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant
55d2ada9c000-55d2adabd000 rw-p 00000000 00:00 0                          [heap]
7f9852c8f000-7f9852ca5000 r-xp 00000000 00:13 13977863                   /usr/lib/libgcc_s.so.1
7f9852ca5000-7f9852ea4000 ---p 00016000 00:13 13977863                   /usr/lib/libgcc_s.so.1
7f9852ea4000-7f9852ea5000 r--p 00015000 00:13 13977863                   /usr/lib/libgcc_s.so.1
7f9852ea5000-7f9852ea6000 rw-p 00016000 00:13 13977863                   /usr/lib/libgcc_s.so.1
7f9852ea6000-7f9853054000 r-xp 00000000 00:13 13975885                   /usr/lib/libc-2.26.so
7f9853054000-7f9853254000 ---p 001ae000 00:13 13975885                   /usr/lib/libc-2.26.so
7f9853254000-7f9853258000 r--p 001ae000 00:13 13975885                   /usr/lib/libc-2.26.so
7f9853258000-7f985325a000 rw-p 001b2000 00:13 13975885                   /usr/lib/libc-2.26.so
7f985325a000-7f985325e000 rw-p 00000000 00:00 0
7f985325e000-7f9853262000 r-xp 00000000 00:13 13975827                   /usr/lib/libSegFault.so
7f9853262000-7f9853461000 ---p 00004000 00:13 13975827                   /usr/lib/libSegFault.so
7f9853461000-7f9853462000 r--p 00003000 00:13 13975827                   /usr/lib/libSegFault.so
7f9853462000-7f9853463000 rw-p 00004000 00:13 13975827                   /usr/lib/libSegFault.so
7f9853463000-7f9853488000 r-xp 00000000 00:13 13975886                   /usr/lib/ld-2.26.so
7f9853649000-7f985364c000 rw-p 00000000 00:00 0
7f9853685000-7f9853687000 rw-p 00000000 00:00 0
7f9853687000-7f9853688000 r--p 00024000 00:13 13975886                   /usr/lib/ld-2.26.so
7f9853688000-7f9853689000 rw-p 00025000 00:13 13975886                   /usr/lib/ld-2.26.so
7f9853689000-7f985368a000 rw-p 00000000 00:00 0
7ffdf6a9b000-7ffdf6abc000 rw-p 00000000 00:00 0                          [stack]
7ffdf6bc7000-7ffdf6bc9000 r--p 00000000 00:00 0                          [vvar]
7ffdf6bc9000-7ffdf6bcb000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

At first glance, the information may feel overwhelming, but let’s go through the most important lines.

The backtrace lists the call chain at the moment the signal was delivered to the application. The first entry is at the top of the stack:

Backtrace:
./powerplant(+0x606)[0x55d2ad5e1606]
./powerplant(+0x628)[0x55d2ad5e1628]
./powerplant(+0x639)[0x55d2ad5e1639]
/usr/lib/libc.so.6(__libc_start_main+0xea)[0x7f9852ec6f6a]
./powerplant(+0x51a)[0x55d2ad5e151a]

Here we can see that the last executed instruction is at address 0x55d2ad5e1606. The tricky part is that this is not an offset within the application, but a virtual address in the process.
In other words, we need to translate the address into an offset within the application’s .text segment.
If we look at the memory map, we see three entries for the powerplant application:

55d2ad5e1000-55d2ad5e2000 r-xp 00000000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant
55d2ad7e1000-55d2ad7e2000 r--p 00000000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant
55d2ad7e2000-55d2ad7e3000 rw-p 00001000 00:13 14897704                   /home/marcus/tmp/segfault/powerplant

Why three?
Most ELF files (applications or libraries) have at least three memory-mapped sections:
– .text, the executable code
– .rodata, read-only data
– .data, read/write data

With the help of the permissions, it is possible to figure out which mapping corresponds to which section.

The last mapping has rw- permissions and is probably our .data section, as it allows both reading and writing.
The middle mapping has r-- and is read-only – probably our .rodata section.
The first mapping has r-x, which is read-only and executable. This must be our .text section!

Now we can take the address from our backtrace and subtract the base address of the .text mapping:
0x55d2ad5e1606 - 0x55d2ad5e1000 = 0x606
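The translation can be captured in a tiny helper (hypothetical, just to make the arithmetic explicit):

```c
#include <stdint.h>

/* Translate a runtime address from the backtrace to an offset within
 * the .text section, given the mapping's base address from the memory map. */
uint64_t text_offset(uint64_t runtime_addr, uint64_t text_base)
{
        return runtime_addr - text_base;
}
```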

Use addr2line to get the corresponding line in our source code:

$ addr2line -e ./powerplant -a 0x606
    0x0000000000000606
    /home/marcus/tmp/segfault/main.c:3

If we go back to the source code, we see that line 3 in main.c is

*rod = 0xAB;

Here we have it. Nothing more to say.

Conclusion

libSegFault.so has been a great help to me for a long time. The biggest benefit is that you don’t have to recompile your application to use it. However, addr2line cannot give you the line number if the application is not compiled with debug symbols, but it is often not that hard to figure out the context from a disassembly of your application.

OOM-killer

When the system is running out of memory, the Out-Of-Memory (OOM) killer picks a process to kill based on its current memory footprint.
On OOM, the kernel calculates a badness score between 0 (never kill) and 1000 for each process in the system, and the process with the highest score gets killed. A score of 0 is reserved for unkillable tasks such as the global init process (see [1]) or kernel threads (processes with the PF_KTHREAD flag set).

http://www.marcusfolkesson.se/wp-content/uploads/2017/12/oomkiller.jpg

The current score of a given process is exposed in procfs, see /proc/[pid]/oom_score, and may be adjusted by setting /proc/[pid]/oom_score_adj.
The value of oom_score_adj is added to the score before it is used to determine which task to kill. The value may be set between OOM_SCORE_ADJ_MIN (-1000) and OOM_SCORE_ADJ_MAX (+1000).
This is useful if you want to guarantee that a process is never selected by the OOM killer.
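For a quick look at the current score, a small sketch (hypothetical helper name) that reads the value from procfs:

```c
#include <stdio.h>

/* Read the kernel's current badness score for the calling process
 * from /proc/self/oom_score. Returns the score, or -1 on error. */
int read_oom_score(void)
{
        FILE *f = fopen("/proc/self/oom_score", "r");
        int score = -1;

        if (!f)
                return -1;
        if (fscanf(f, "%d", &score) != 1)
                score = -1;
        fclose(f);
        return score;
}
```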

The calculation is simple (nowadays): if a task is using all of its allowed memory, the badness score is 1000. If it is using half of its allowed memory, the badness score is 500, and so on.
By setting oom_score_adj to -1000, the badness score sums up to <= 0 and the task will never be killed by the OOM killer.

There is one more thing that affects the calculation: if the process is running with the capability CAP_SYS_ADMIN, it gets a 3% discount. But that is simply it.
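The arithmetic described above can be sketched in userspace C (a hypothetical model for illustration, not the kernel’s oom_badness() itself):

```c
#include <stdbool.h>

/* Model of the badness calculation: memory usage scaled to 0..1000,
 * a 3% discount for CAP_SYS_ADMIN, plus the oom_score_adj value. */
long badness(unsigned long used_pages, unsigned long allowed_pages,
             long oom_score_adj, bool has_cap_sys_admin)
{
        if (oom_score_adj == -1000)
                return 0;               /* never select this task */

        /* fraction of allowed memory in use, scaled to 0..1000 */
        long points = (long)(used_pages * 1000 / allowed_pages);

        if (has_cap_sys_admin)
                points -= points * 3 / 100;

        points += oom_score_adj;        /* -1000 .. +1000 */

        return points > 0 ? points : 1; /* 0 is reserved for unkillable tasks */
}
```

Using all allowed memory gives 1000, half gives 500, and oom_score_adj shifts the result as described above.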

The old implementation

Before v2.6.36, the calculation of the badness score tried to be smarter; besides looking at the total memory usage (task->mm->total_vm), it also considered:
– Whether the process creates a lot of children
– Whether the process has been running for a long time, or has used a lot of CPU time
– Whether the process has a low nice value
– Whether the process is privileged (CAP_SYS_ADMIN or CAP_SYS_RESOURCE set)
– Whether the process is making direct hardware access

At first glance, all these criteria look valid, but if you think about it a bit, there are a lot of pitfalls here that make the selection not so fair.
For example: a process that creates a lot of children and consumes some memory could be a leaky web server. But another process that fits that description is the session manager for your desktop environment, which naturally creates a lot of child processes.

The new implementation

This heuristic selection has evolved over time. Instead of looking at mm->total_vm for each task, the task’s RSS (resident set size, [2]) and swap space are now used.
RSS and swap space give a better indication of the amount of memory we will be able to free if we choose this task.
The drawback of using mm->total_vm is that it includes overcommitted memory (see [3] for more information), i.e. pages that the process has claimed but that have not been physically allocated.

A process is now only counted as privileged if it has CAP_SYS_ADMIN set, not CAP_SYS_RESOURCE as before.

The code

The whole implementation of the OOM killer is located in mm/oom_kill.c.
The function oom_badness() is called for each task in the system and returns the calculated badness score.

Let’s go through the function.

unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
              const nodemask_t *nodemask, unsigned long totalpages)
{
    long points;
    long adj;

    if (oom_unkillable_task(p, memcg, nodemask))
        return 0;

Looking for unkillable tasks such as the global init process.

p = find_lock_task_mm(p);
if (!p)
    return 0;

adj = (long)p->signal->oom_score_adj;
if (adj == OOM_SCORE_ADJ_MIN ||
        test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
        in_vfork(p)) {
    task_unlock(p);
    return 0;
}

If /proc/[pid]/oom_score_adj is set to OOM_SCORE_ADJ_MIN (-1000), do not even consider this task.

points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
    atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
task_unlock(p);

Calculate a score based on RSS, pagetables and used swap space

if (has_capability_noaudit(p, CAP_SYS_ADMIN))
    points -= (points * 3) / 100;

If it is a privileged process, give it a 3% discount. We are not mean people, after all.

adj *= totalpages / 1000;
points += adj;

Normalize and add the oom_score_adj value

return points > 0 ? points : 1;

Finally, never return 0 for an eligible task, as 0 is reserved for unkillable tasks.

}

Conclusion

The OOM logic is quite straightforward and seems to have been stable for a long time (v2.6.36 was released in October 2010).
The reason I was looking at the code was that the behavior I saw when experimenting did not correspond to what was written in the man page for oom_score.
It turned out that the man page was not updated when the new calculation was introduced back in 2010.

I have updated the man page, and the fix is available in v4.14 of the Linux man-pages project [4].

commit 5753354a3af20c8b361ec3d53caf68f7217edf48
Author: Marcus Folkesson <marcus.folkesson@gmail.com>
Date:   Fri Nov 17 13:09:44 2017 +0100

    proc.5: Update description of /proc/<pid>/oom_score

    After Linux 2.6.36, the heuristic calculation of oom_score
    has changed to only consider used memory and CAP_SYS_ADMIN.

    See kernel commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10.

    Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>
    Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>

diff --git a/man5/proc.5 b/man5/proc.5
index 82d4a0646..4e44b8fba 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -1395,7 +1395,9 @@ Since Linux 2.6.36, use of this file is deprecated in favor of
 .IR /proc/[pid]/oom_score_adj .
 .TP
 .IR /proc/[pid]/oom_score " (since Linux 2.6.11)"
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
 This file displays the current score that the kernel gives to
 this process for the purpose of selecting a process
 for the OOM-killer.
@@ -1403,7 +1405,16 @@ A higher score means that the process is more likely to be
 selected by the OOM-killer.
 The basis for this score is the amount of memory used by the process,
 with increases (+) or decreases (\-) for factors including:
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
+.RS
+.IP * 2
+whether the process is privileged (\-);
+.\" More precisely, if it has CAP_SYS_ADMIN or (pre 2.6.36) CAP_SYS_RESOURCE
+.RE
+.IP
+Before kernel 2.6.36 the following factors were also used in the calculation of oom_score:
 .RS
 .IP * 2
 whether the process creates a lot of children using
@@ -1413,10 +1424,7 @@ whether the process creates a lot of children using
 whether the process has been running a long time,
 or has used a lot of CPU time (\-);
 .IP *
-whether the process has a low nice value (i.e., > 0) (+);
-.IP *
-whether the process is privileged (\-); and
-.\" More precisely, if it has CAP_SYS_ADMIN or CAP_SYS_RESOURCE
+whether the process has a low nice value (i.e., > 0) (+); and
 .IP *
 whether the process is making direct hardware access (\-).
 .\" More precisely, if it has CAP_SYS_RAWIO

Embedded Linux course in Linköping

I teach our Embedded Linux course on a regular basis, this time in Linköping. It’s a fun course with interesting labs where you write your own Linux device driver for a custom board.

The course itself is quite moderate, but with talented participants we easily slip over into more interesting things like memory management, the MTD subsystem, how ftrace works internally and how to use it, different contexts, an introduction to perf, and much more.

This time was no exception.

http://www.marcusfolkesson.se/wp-content/uploads/2017/11/embedded-linux-course.jpg

libostree and $OSTREE_REPO environment path

libostree is a great tool for handling incremental or full updates of a Linux filesystem.
But virtually all ostree commands require the --repo argument to override the default system repository path. This gets really annoying after a while, so my first attempt to get rid of it was to create an alias:

alias ost='ostree --repo=/tmp/repo'

It works, but it is not a good solution. To solve it once and for all, I set out to implement support for reading the repository path from an environment variable.

When digging around in the code, I found that the support is already there but completely undocumented.

It is therefore already possible to

export OSTREE_REPO=/tmp/repo
ostree <cmd>

So the patch this time is simply a line in the documentation that mentions the option to use the $OSTREE_REPO environment variable to point out a directory as the system repository path.

commit 075e676eb63a3abaf647c789e22d8d1afe3a1dd5
Author: Marcus Folkesson <marcus.folkesson@gmail.com>
Date:   Tue Oct 17 21:03:23 2017 +0200

docs: mention the $OSTREE_REPO environment variable

$OSTREE_REPO may be set to override the default location
of the repository.

Link: https://mail.gnome.org/archives/ostree-list/2017-October/msg00003.html

Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>

Closes: #1282
Approved by: cgwalters

Conclusion

If /ostree/repo is not the location of your system repository, it is really preferable to set the $OSTREE_REPO environment variable instead of providing --repo to every single command.

FIT vs legacy image format

U-Boot supports several image formats when booting a kernel. However, a Linux system usually needs multiple files for booting.
Such files may be the kernel itself, an initrd and a device tree blob.

A typical embedded Linux system has all these files in at least two or three different configurations. It is not uncommon to have:

  • Default configuration
  • Rescue configuration
  • Development configuration
  • Production configuration

These four configurations alone may involve twelve different files. Or maybe the devicetree is shared between two configurations. Or maybe the initrd is… Or maybe the….. you get the point. It has a fairly good chance of ending up quite messy.

This is a problem with the old legacy formats, and it is addressed by the FIT (Flattened Image Tree) format.

Let’s first look at the old formats before we get into FIT.

zImage

The most well-known format for the Linux kernel is the zImage. The zImage contains a small header, followed by self-extracting code and finally the payload itself.
U-Boot supports this format with the bootz command.

Image layout

zImage format:

  • Header
  • Decompressing code
  • Compressed data

uImage

When talking about U-Boot, the zImage is usually encapsulated in a file called uImage, created with the mkimage utility. Besides image data, the uImage also contains information such as OS type, loader information, compression type and so on.
Both the data and the header are checksummed with CRC32.

The uImage format also supports multiple images. The zImage, initrd and devicetree blob may therefore be included in a single monolithic uImage.
The drawbacks of this monolith are that it has no flexible indexing, no strong hash integrity and no support for security at all.

Image layout

uImage format:

  • Header
    • Header checksum
    • Data size
    • Data load address
    • Entry point address
    • Data CRC
    • OS, CPU
    • Image type
    • Compression type
    • Image name
  • Image data

FIT

The FIT (Flattened Image Tree) has been around for a while but is, for some unknown reason, rarely used in real systems, based on my experience. FIT is a tree structure like the Device Tree (before they changed the format to YAML: https://lwn.net/Articles/730217/) and handles images of various types.

These images are then used in something called configurations. One image may be used in several configurations.

This format has many benefits compared to the others:

  • Better solutions for multi-component images
    • Multiple kernels (production, feature, debug, rescue…)
    • Multiple devicetrees
    • Multiple initrds
  • Better hash integrity of images, with support for different hash algorithms like
    • SHA1
    • SHA256
    • MD5
  • Support for signed images
    • Only boot verified images
    • Detect malware

Image Tree Source

The file describing the structure is called a .its (Image Tree Source) file.
Here is an example of a .its file:

/dts-v1/;

/ {
    description = "Marcus FIT test";
    #address-cells = <1>;

    images {
        kernel@1 {
            description = "My default kernel";
            data = /incbin/("./zImage");
            type = "kernel";
            arch = "arm";
            os = "linux";
            compression = "none";
            load = <0x83800000>;
            entry = <0x83800000>;
            hash@1 {
                algo = "md5";
            };
        };

        kernel@2 {
            description = "Rescue image";
            data = /incbin/("./zImage");
            type = "kernel";
            arch = "arm";
            os = "linux";
            compression = "none";
            load = <0x83800000>;
            entry = <0x83800000>;
            hash@1 {
                algo = "crc32";
            };
        };

        fdt@1 {
            description = "FDT for my cool board";
            data = /incbin/("./devicetree.dtb");
            type = "flat_dt";
            arch = "arm";
            compression = "none";
            hash@1 {
                algo = "crc32";
            };
        };


    };

    configurations {
        default = "config@1";

        config@1 {
            description = "Default configuration";
            kernel = "kernel@1";
            fdt = "fdt@1";
        };

        config@2 {
            description = "Rescue configuration";
            kernel = "kernel@2";
            fdt = "fdt@1";
        };

    };
};

This .its file has two kernel images (default and rescue) and one FDT image.
Note that the default kernel is hashed with md5 and the other with crc32. It is also possible to use several hash functions per image.

It also specifies two configurations, Default and Rescue. The two configurations use different kernel images (well, it is the same kernel in this case since both point to ./zImage..) but share the same FDT. It is easy to build up new configurations on demand.

The .its file is then passed to mkimage to generate an itb (Image Tree Blob):

mkimage -f kernel.its kernel.itb

which is bootable from U-Boot.

Look at ./doc/uImage.FIT/ in the U-Boot source code for more examples of what a .its file can look like.
It also contains examples of signed images, which are worth a look.
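For a taste of what signing looks like, a configuration can carry a signature node. The sketch below is a hypothetical fragment (the key name and algorithm choice are examples, see doc/uImage.FIT/signature.txt in U-Boot for the details):

```
config@1 {
    description = "Default configuration";
    kernel = "kernel@1";
    fdt = "fdt@1";
    signature@1 {
        algo = "sha1,rsa2048";
        key-name-hint = "dev";          /* hypothetical key name */
        sign-images = "kernel", "fdt";
    };
};
```

U-Boot will then refuse to boot the configuration unless the signature verifies against the public key compiled into the boot loader.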

Boot from U-Boot

To boot from U-Boot, use the bootm command and specify the physical address for the FIT image and the configuration you want to boot.

Example of booting config@1 from the .its file above:

bootm 0x80800000#config@1

Full example with bootlog:

U-Boot 2015.04 (Oct 05 2017 - 14:25:09)

CPU:   Freescale i.MX6UL rev1.0 at 396 MHz
CPU:   Temperature 42 C
Reset cause: POR
Board: Marcus Cool board
fuse address == 21bc400
serialnr-low : e1fe012a
serialnr-high : 243211d4
       Watchdog enabled
I2C:   ready
DRAM:  512 MiB
Using default environment

In:    serial
Out:   serial
Err:   serial
Net:   FEC1
Boot from USB for mfgtools
Use default environment for                              mfgtools
Run bootcmd_mfg: run mfgtool_args;bootz ${loadaddr} - ${fdt_addr};
Hit any key to stop autoboot:  0
=> bootm 0x80800000#config@1
## Loading kernel from FIT Image at 80800000 ...
   Using 'config@1' configuration
   Trying 'kernel@1' kernel subimage
     Description:  My cool kernel
     Type:         Kernel Image
     Compression:  uncompressed
     Data Start:   0x808000d8
     Data Size:    7175328 Bytes = 6.8 MiB
     Architecture: ARM
     OS:           Linux
     Load Address: 0x83800000
     Entry Point:  0x83800000
     Hash algo:    crc32
     Hash value:   f236d022
   Verifying Hash Integrity ... crc32+ OK
## Loading fdt from FIT Image at 80800000 ...
   Using 'config@1' configuration
   Trying 'fdt@1' fdt subimage
     Description:  FDT for my cool board
     Type:         Flat Device Tree
     Compression:  uncompressed
     Data Start:   0x80ed7e4c
     Data Size:    27122 Bytes = 26.5 KiB
     Architecture: ARM
     Hash algo:    crc32
     Hash value:   1837f127
   Verifying Hash Integrity ... crc32+ OK
   Booting using the fdt blob at 0x80ed7e4c
   Loading Kernel Image ... OK
   Loading Device Tree to 9ef86000, end 9ef8f9f1 ... OK

Starting kernel ...

One thing to think about

Since FIT images tend to grow in size, it is a good idea to set CONFIG_SYS_BOOTM_LEN in the U-Boot configuration.

- CONFIG_SYS_BOOTM_LEN:
        Normally compressed uImages are limited to an
        uncompressed size of 8 MBytes. If this is not enough,
        you can define CONFIG_SYS_BOOTM_LEN in your board config file
        to adjust this setting to your needs.
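In practice that means a define in the board configuration header; a sketch (path and value are examples, size it for your largest image):

```
/* include/configs/<your-board>.h -- hypothetical board header */
#define CONFIG_SYS_BOOTM_LEN (64 << 20) /* allow uncompressed images up to 64 MiB */
```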

Conclusion

I advocate using the FIT format because it solves many problems that can otherwise get really awkward. Just being able to program one image in production instead of many is a great benefit. The configurations are well defined and there is no chance that you will end up booting a kernel image with an incompatible devicetree or whatever.

Using FIT images should be part of the board bring-up activity, not the last thing to do before closing a project. The "use separate kernel/dts/initrd files for now and move on to FIT when the platform is stable" approach does not work. It is simply not going to happen, and it is a mistake.

Memory management in the kernel

Memory management is among the most complex parts of the Linux kernel. There are so many critical parts, such as the page allocator, the slab allocator, virtual memory handling, memory mapping, the MMU, the IOMMU and so on. All these parts have to work perfectly (or at least almost perfectly 🙂 ) because all systems use them whether they want to or not.
If there is a bug or performance issue, it will be noticed quite soon.

My goal is to produce a few posts on the topic, sort out the different parts, and describe how they work and the connections between them.
I will begin at the physical bottom and work my way up to how userspace allocates memory in its little blue world with pink clouds. (Everything is so easy on the user side.)

struct page

A page is the smallest unit that matters in terms of virtual memory. This is because the MMU (Memory Management Unit, described in an upcoming post) only deals in pages. A typical page size is 4KB, at least on 32-bit architectures. Some 64-bit architectures use 8KB pages.

Every one of those physical pages is represented by a struct page, defined in include/linux/mm_types.h. That is a lot of page structures. Let's do a simple calculation:
on a 32-bit system with 512MB of physical memory, that memory is divided into 131,072 4KB pages. And 512MB is not even much memory on a modern system today.

What I want to say is that struct page should be kept as small as possible, because its cost scales up quickly as physical memory increases.

OK, so there is a struct page allocated for each physical page, which is a lot of structures, but what do they do?
A lot of housekeeping; let's look at a few of the members that I think are most interesting:

struct page {
    unsigned long flags;    /* page status bits, see enum pageflags */
    unsigned long private;  /* opaque data; shares a union with ptl */
    void    *virtual;       /* virtual address, NULL for highmem    */
    atomic_t    _count;     /* reference counter                    */
    pgoff_t    index;       /* offset within a mapping              */
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
    spinlock_t  *ptl;       /* split page table lock (allocated)    */
#else
    spinlock_t  ptl;        /* split page table lock (embedded)     */
#endif
#endif
};

flags keeps track of the page status, which could be dirty (needs to be written back to media), locked in memory (not allowed to be paged out), permissions and so on. See enum pageflags in include/linux/page-flags.h for more information.

private is not a well-defined field; it may be used as a long or interpreted as a pointer. (It shares a union with ptl!)

virtual is the virtual address of the page. If the page belongs to high memory (memory that is not permanently mapped), this field is NULL and the page requires dynamic mapping.

_count is a simple reference counter to determine when the page is free for allocation.

index is the offset within a mapping.

ptl is an interesting one! I think it deserves its own section in this post. (It shares a union with private!)

Page Table Lock

PTL stands for Page Table Lock and is a per page-table lock.

In the next part of these memory management posts I will describe struct mm_struct and how the PGD, PMD and PTE are related, but for now it's enough that you have just heard the words.

OK, there is one thing that is good to know. struct mm_struct (also defined in mm_types.h) is a structure that represents a process's address space and contains all information related to the process memory. The structure has pointers to virtual memory areas, which in turn refer to one or more struct page.
This structure also has the member mm->page_table_lock, a spinlock that protects all page tables of the mm_struct. This was the original approach and is still used by several architectures.

However, mm->page_table_lock is a little bit clumsy since it locks all page tables at once. This is no real problem on a single-CPU system without SMP, but nowadays that is not a very common scenario.

Instead, the split page table lock was introduced, with a separate per-table lock that allows concurrent access to page tables in the same mm_struct. Remember that the mm_struct is per process? So this only increases page-fault/page-access performance in multi-threaded applications.

When are split page table locks enabled?
They are enabled at compile time if CONFIG_SPLIT_PTLOCK_CPUS (I have never seen a value other than 4 for this one) is less than or equal to NR_CPUS.

Here are a few defines from the beginning of the mm_types.h header file:

#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
    IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
#define ALLOC_SPLIT_PTLOCKS (SPINLOCK_SIZE > BITS_PER_LONG/8)

ALLOC_SPLIT_PTLOCKS is a little bit clever. If the size of a spinlock is less than or equal to the size of a long, the spinlock is embedded in struct page, which saves a cache line by avoiding an indirect access.
If a spinlock does not fit into a long, page->ptl is instead used as a pointer to a dynamically allocated spinlock. As I said, this is a clever construction since it allows the size of a spinlock to grow without any problem. Examples of when the spinlock does not fit are when using DEBUG_SPINLOCK, DEBUG_LOCK_ALLOC or when applying the PREEMPT_RT patchset.

The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in pgtable_pmd_page_ctor() for PMD tables. These functions (and the corresponding free functions) should be called in every place that allocates/frees page tables. This is already done in mainline, but I know there are evil hardware vendors out there that do not do this. For example, if you use their evil code and apply the PREEMPT_RT patchset (which increases the size of spinlock_t), you have to verify that their code behaves.

Also, pgtable_*page_ctor() can fail; this must be handled properly.

Remember that page->ptl should never be accessed directly; use the appropriate helper functions instead.

Examples of such helper functions are:

pte_offset_map_lock()
pte_unmap_unlock()
pte_alloc_map_lock()
pte_lockptr()
pmd_lock()
pmd_lockptr()
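A context-only sketch (kernel code, not runnable in userspace) of the canonical pattern these helpers enable; mm, pmd and addr are assumed to be valid in the surrounding code:

```c
/* Kernel-context sketch: walk to a PTE while holding its split lock. */
spinlock_t *ptl;
pte_t *pte;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl); /* map the table and take its lock */
/* ... read or modify *pte while the lock is held ... */
pte_unmap_unlock(pte, ptl);                     /* release the lock and unmap */
```

The helper picks the right lock for you, whether it is the per-table split lock or the fallback mm->page_table_lock, which is exactly why direct access to page->ptl is a bad idea.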