BEER2RST - my first attempt with golang


I'm using Beersmith [1], a (non-free...) piece of software, to create my beer recipes. It is written in Java and runs well on Linux. One of the biggest benefits of using Beersmith is that my brewing house [2] takes exported recipes as input and sets up mash schemes automatically - really comfortable.

I brew beer several times a month and always take notes on each brew, both methods and results. The goal is to get reproducible results each time, or to make small improvements. Instead of having all these notes in separate locations, it would be nice to collect all of my brewing as blog posts.

With that said, this blog will probably evolve to contain non-technical posts as well in the near future.

In these posts, I also want to publish my recipes somehow. Beersmith is able to export recipes in a non-overcomplicated XML format that is quite straightforward to parse. All posts that I write are in reStructuredText [3] format, so I need to create a tool that reads the XML and exports the recipes in reStructuredText format.

First glance at Golang

/media/golang.png

To be honest, I'm not really a fan of high-level programming languages as I can't take them seriously. But I guess it is time to learn something new. What I have tested with Go so far is rather impressive. For example, I cannot imagine simpler XML parsing (I'm used to libxml2).

I also like the gofmt [4] tool that keeps the code properly formatted. Every language should have such a tool. It is also easy to cross-compile the application to different architectures by specifying $GOOS and $GOARCH (e.g. GOOS=linux GOARCH=arm go build). I have only tested with ARM and it just works.

Imports

Most languages have a way to declare what functionality should be imported and made usable in your file. Go uses import with the following syntax:

import (
    "encoding/xml"
    "flag"
    "fmt"
    "io/ioutil"
    "os"
    "strings"
)

What makes it really interesting is when you do this:

import (
    "github.com/marcusfolkesson/tablewriter"
)

I.e. point to a repository. You only need to download the repository to your $GOPATH location and then it is usable:

go get github.com/marcusfolkesson/tablewriter

Note: I need to create tables to print the recipes properly. I found tablewriter [5], which prints ASCII tables. Unfortunately, it does not create tables in reStructuredText format, so I had to fork the project and implement support for that. Hopefully the changes will make it back to the original project; there is a pending pull request for that.
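
For reference, this is roughly the target output - a reStructuredText grid table (a hand-written illustration, not actual output from the tool):

+-----------------+-------+------+--------+
| Name            | Alpha | Use  | Amount |
+=================+=======+======+========+
| Northern Brewer | 8.5   | Boil | 3.85 g |
+-----------------+-------+------+--------+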

Unmarshal XML

I really liked how easy it was to unmarshal XML. Consider the following snippet from an exported recipe:

<HOP>
 <NAME>Northern Brewer</NAME>
 <VERSION>1</VERSION>
 <ORIGIN>Germany</ORIGIN>
 <ALPHA>8.5000000</ALPHA>
 <AMOUNT>0.0038459</AMOUNT>
 <USE>Boil</USE>
 <TIME>60.0000000</TIME>
 <NOTES>Also called Hallertauer Northern Brewers
Used for: Bittering and finishing both ales and lagers of all kinds
Aroma: Fine, dry, clean bittering hop.  Unique flavor.
Substitutes: Hallertauer Mittelfrueh, Hallertauer
Examples: Anchor Steam, Old Peculiar, </NOTES>
 <TYPE>Both</TYPE>
 <FORM>Pellet</FORM>
 <BETA>4.0000000</BETA>
 <HSI>35.0000000</HSI>
 <DISPLAY_AMOUNT>3.85 g</DISPLAY_AMOUNT>
 <INVENTORY>0.00 g</INVENTORY>
 <DISPLAY_TIME>60.0 min</DISPLAY_TIME>
</HOP>

The first step is to create a structure to hold the values:

type Hop struct {
    Name   string  `xml:"NAME"`
    Origin string  `xml:"ORIGIN"`
    Alpha  float64 `xml:"ALPHA"`
    Amount float64 `xml:"AMOUNT"`
    Use    string  `xml:"USE"`
    Time   float64 `xml:"TIME"`
    Notes  string  `xml:"NOTES"`
}

There is no need to create variables for all tags, just the ones you are interested in.

Later on, read the file and unmarshal the XML:

// The <HOP> entries need an enclosing type to unmarshal into.
// Note: the XML tag here is an assumption - adjust it to match
// the element that wraps the hops in your export.
type HopList struct {
    Hops []Hop `xml:"HOP"`
}

var hops HopList

content, err := ioutil.ReadFile("beer.xml")
if err != nil {
        panic(err)
}

err = xml.Unmarshal(content, &hops)
if err != nil {
        panic(err)
}

The structure will now be populated with values from the XML file. Magical, isn't it?

Conclusion

I have only tested Golang for approximately one working day, and I like it. I used to use Python for all kinds of fast prototyping, but I think I will consider Golang next time.

The things I really like in comparison with Python are:

  • The fact that it compiles to a single binary is really nice, especially when you cross-compile to a different architecture.
  • No need for an interpreter.
  • Static types! Dynamically typed languages make my brain hurt.

The result can be found on my GitHub [6] account.

Lund Linux Conference 2018


It is just two weeks from now until the Lund Linux Conference (LLC) [1] begins! LLC is a two-day conference with the same layout as the bigger Linux conferences - just smaller, but just as nice.

There will be talks about PCIe, the serial device bus, security in cars and a few more topics. My highlight this year is to hear about XDP (eXpress Data Path) [2], which enables really fast packet processing with standard Linux. XDP has made great progress over the last six months and is a technically cool feature.

Here we are, back in 2017:

/media/lund-linuxcon-2018.jpg

ath10k QCA6584 and Wireless network stack


ATH10K is the mac80211 wireless driver for the Qualcomm Atheros QCA988x family of chips. I'm currently working [1] with the QCA6584 chip, which is an automotive-graded radio chip with PHY support for the abgn+ac modes. The connection interface to the chip is SDIO, which is hardly supported for now, but my friend and kernel hacker Erik Strömdahl [2] has got his hands dirty and is currently working on it. There has been some progress: the chip is now able to scan, connect, and send and receive data. There are still some issues with the link speed, but that is coming.

He is also the reason why I got interested in the network part of the kernel, which is quite... big.

Even the wireless networking subsystem alone is quite big, and the first thing you meet when you start to dig is a bunch of terms thrown in your face. I will try to briefly describe a few of these terms that are fundamental for wireless communication.

In this post, we will discuss the right side of this figure:

/media/wireless-stack.png

IEEE 802.11

We will see 802.11 many times, so the first thing to know is where these numbers come from. IEEE 802.11 is a set of specifications for implementing wireless networking over several frequency bands. The specifications cover layer 1 (Physical) and layer 2 (Data link) of the OSI model [3].

The Linux kernel MAC subsystem registers ieee80211-compliant hardware devices with

int ieee80211_register_hw(struct ieee80211_hw *hw)

found in .../net/mac80211/main.c
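
To get a feeling for the API, here is a minimal, hypothetical sketch of how a soft-MAC driver could allocate and register an ieee80211 device. All names are made up for illustration, and a real driver must also provide the mandatory callbacks (.tx, .start, .stop and friends) in its ieee80211_ops:

#include <linux/device.h>
#include <net/mac80211.h>

struct my_priv {
    /* driver-private data */
};

/* Callbacks omitted for brevity */
static const struct ieee80211_ops my_ops = {
};

static int my_probe(struct device *dev)
{
    struct ieee80211_hw *hw;
    int err;

    /* Allocate the ieee80211_hw struct plus room for our private data */
    hw = ieee80211_alloc_hw(sizeof(struct my_priv), &my_ops);
    if (!hw)
        return -ENOMEM;

    SET_IEEE80211_DEV(hw, dev);

    /* ... set up supported bands, queues and capabilities here ... */

    err = ieee80211_register_hw(hw);
    if (err) {
        ieee80211_free_hw(hw);
        return err;
    }

    return 0;
}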

The Management Layer (MLME)

One more thing that we need to cover is the management layer, since all other layers somehow depend on it.

There are three components in the 802.11 management architecture:

  • The Physical Layer Management Entity (PLME)
  • The System Management Entity (SME)
  • The MAC Layer Management Entity (MLME)

The management layer assists you in several ways. For instance, it handles things such as scanning, authentication, beacons, associations and much more.

Scanning

Scanning is simply looking for other 802.11-compliant devices in the air. There are two types of scanning: passive and active.

Passive scanning

When performing a passive scan, the radio listens passively for beacons, without transmitting packets, as it moves from channel to channel and records all devices that it receives beacons from. Higher frequency bands in the ieee802.11a standard do not allow transmitting anything unless you have heard an Access Point (AP) beacon, so passive scanning is the only way to be aware of the surroundings there.

Active scanning

Active scanning, on the other hand, transmits Probe Request (IEEE80211_STYPE_PROBE_REQ) management packets. This type of scanning also walks from channel to channel, sending these probe request management packets on each channel.

These requests are handled by ieee80211_send_probe_req() in .../net/mac80211/util.c:

void ieee80211_send_probe_req(struct ieee80211_sub_if_data *sdata,
                  const u8 *src, const u8 *dst,
                  const u8 *ssid, size_t ssid_len,
                  const u8 *ie, size_t ie_len,
                  u32 ratemask, bool directed, u32 tx_flags,
                  struct ieee80211_channel *channel, bool scan)

Authentication

The authentication procedure sends a management frame of the authentication type (IEEE80211_STYPE_AUTH). There is not just one type of authentication but plenty of them, yet the ieee80211 specification only specifies one mandatory authentication type: Open System authentication (WLAN_AUTH_OPEN). Another common authentication type is Shared Key authentication (WLAN_AUTH_SHARED_KEY).

These management frames are handled by ieee80211_send_auth() in .../net/mac80211/util.c:

void ieee80211_send_auth(struct ieee80211_sub_if_data *sdata,
             u16 transaction, u16 auth_alg, u16 status,
             const u8 *extra, size_t extra_len, const u8 *da,
             const u8 *bssid, const u8 *key, u8 key_len, u8 key_idx,
             u32 tx_flags)

Open system authentication

This is the simplest type of authentication: all clients that request authentication will be authenticated. No security is involved at all.

Shared key authentication

In this type of authentication, the client and the AP use a shared key, also known as a Wired Equivalent Privacy (WEP) key.

Association

The association starts when the station sends management frames of the type IEEE80211_STYPE_ASSOC_REQ. In the kernel code, this is handled by ieee80211_send_assoc() in .../net/mac80211/mlme.c:

static void ieee80211_send_assoc(struct ieee80211_sub_if_data *sdata)

Reassociation

When the station is roaming, i.e. moving between APs within an ESS (Extended Service Set), it sends a reassociation request to the new AP, of the type IEEE80211_STYPE_REASSOC_REQ. Association and reassociation have so much in common that both are handled by ieee80211_send_assoc().

MAC (Medium Access Control)

All ieee80211 devices need to implement the Management Layer (MLME), but the implementation can live in either device hardware or software. Devices are divided accordingly into Full MAC devices (hardware implementation) and Soft MAC devices (software implementation). Most devices today are Soft MAC devices.

The MAC layer can be further broken down into two pieces: Upper MAC and Lower MAC. The upper part of the MAC handles the management aspects (all that we covered in the MLME section above), and the lower part handles the time-critical operations such as ACK:ing received packets.

Linux only handles the upper part of the MAC; the lower part is operated in device hardware. What we can see in the figure is that the MAC layer separates data packets from configuration/management packets. The data packets are forwarded to the network device and travel the same path through the network layer as data packets from all other types of network devices.

The Linux wireless subsystem consists of two major parts: mac80211, which we just covered, and cfg80211, the other major part.

CFG80211

cfg80211 is a configuration management service for mac80211-compliant devices. Both Full MAC and Soft MAC devices need to implement operations compatible with the cfg80211 configuration interface in order to let userspace applications configure the device.

The configuration may be done with one of two interfaces: wext and nl80211.

Wireless Extension, WEXT (Legacy)

This is the legacy and ugly way to configure wireless devices. It is only still supported for backward compatibility reasons. Users of this configuration interface are the wireless-tools (iwconfig, iwlist).

nl80211

nl80211, on the other hand, is a newer netlink interface intended to replace the Wireless Extension (wext) interface. Users of this interface are typically iw (e.g. iw dev wlan0 scan) and wpa_supplicant.

Conclusion

The whole network stack of the Linux kernel is really complex and optimized for high throughput with low latencies. In this post, we only covered what wireless device support adds to the stack, which is mainly the mac80211 layer that handles all device management, and the cfg80211 layer that configures the MAC layer. Packets to wireless devices are divided into data packets and configuration/management packets. The data packets follow the same path as for all network devices, while the management packets go to the cfg80211 layer.

Linux driver for PhoenixRC adapter


Update: Michael Larabel on Phoronix has written a post [3] about this driver. Go ahead and read it as well!

A few years ago I used to build multirotors, mostly quadcopters and tricopters. It is a fun hobby; both building and flying are incredibly satisfying. The first multirotors I built were nicely made with CNC-cut details. They looked really nice and robust. However, with more than 100 crashes under the belt, the last ones were made out of sticks and a food box. Easy to repair and just as fun to fly.

This hobby requires practice, and even if the most fun way to practice is by flying, it is really time-consuming. A more time-efficient way to practice is in a simulator, so I bought PhoenixRC [1], a flight simulator. It comes with a USB adapter that lets you connect and fly with your own RC controller. I did not run the simulator much, though; PhoenixRC is a Windows application and there was no Linux driver for the adapter. The only instance of Windows I had was on a separate disk lying on a shelf, and swapping disks in your laptop each time you want to fly is simply not going to happen.

This New Year's Eve (2017), my wife became ill and I got some time to myself. Far down in a box I found the adapter and plugged it into my Linux computer. Still no driver.

Reverse engineering the adapter

The reverse engineering was quite simple. It turns out that the adapter has only 1 configuration, 1 interface and 1 endpoint of the in-interrupt type. This simply means that it has a unidirectional communication path, initiated by sending an interrupt URB (USB Request Block) to the device. If you are not familiar with what configurations, interfaces and endpoints are in terms of USB, please google the USB specification.

The data in the URB was only 8 bytes. After some testing with my RC controller, I got the following mapping between the data and the channels on the controller:

data[0] = channel 1
data[1] = ? (Possibly a switch)
data[2] = channel 2
data[3] = channel 3
data[4] = channel 4
data[5] = channel 5
data[6] = channel 6
data[7] = channel 7

So I created a device driver that registers an input device with the following events (a registration sketch follows the table):

Channel Event
1 ABS_X
2 ABS_Y
3 ABS_RX
4 ABS_RY
5 ABS_RUDDER
6 ABS_THROTTLE
7 ABS_MISC
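
As an illustration, here is a minimal sketch of what that input registration could look like. The function and variable names are made up for the example - this is not the actual driver code:

#include <linux/input.h>
#include <linux/kernel.h>
#include <linux/usb/input.h>

static const unsigned int pxrc_abs[] = {
    ABS_X, ABS_Y, ABS_RX, ABS_RY,
    ABS_RUDDER, ABS_THROTTLE, ABS_MISC,
};

static int pxrc_register_input(struct usb_interface *intf)
{
    struct input_dev *input;
    int i;

    input = devm_input_allocate_device(&intf->dev);
    if (!input)
        return -ENOMEM;

    input->name = "PhoenixRC Flight Controller Adapter";

    /* Each channel reports one unsigned byte (0-255) per URB */
    for (i = 0; i < ARRAY_SIZE(pxrc_abs); i++)
        input_set_abs_params(input, pxrc_abs[i], 0, 255, 0, 0);

    return input_register_device(input);
}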

Using a simulator

Heli-X [2] is an excellent cross-platform flight simulator that runs perfectly on Linux. I have now spent several hours with a Goblin 700 helicopter and it is just as fun as I remembered.

/media/phoenixrc.jpg

Available in Linux kernel 4.17

Of course, all code is submitted to the Linux kernel and should be merged in v4.17.

Patch changelogs with git-notes


Git notes [1] is a neat feature that has been around since v1.6.6. Notes are a kind of metadata that belongs to a certain commit but is stored separately (as a different git object) from the commit itself. The fact that it is a separate git object is important: it keeps the commit hash intact, and as a bonus, notes have their own diff log.

Basic usage

git-notes supports the add, append, copy, edit, list, prune, remove and show subcommands. These subcommands are rather self-explanatory, so I will not describe them any further.

add and edit fit almost all my needs, so I will stick to them.

First, create a git repository:

mkdir gitnotes
cd gitnotes/
git init
echo testfile > test
git add test
git commit --signoff -m "Initial commit"
echo add-more-stuff >> test
git commit --signoff -a -m "Feature X"

We now have two commits:

commit bce22657cb5fd8353d43537e6b64422a8ce1107d (HEAD -> master)
Author: Marcus Folkesson <marcus.folkesson@gmail.com>
Date:   Sun Feb 11 21:56:17 2018 +0100

    Feature X

    Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>

commit 93726c3f6bf5c95faf969f040966b15d58d6fd0b
Author: Marcus Folkesson <marcus.folkesson@gmail.com>
Date:   Sun Feb 11 21:55:21 2018 +0100

    Initial commit

    Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>

Add a note

Adding a note is simple:

git notes add -m "v2: update locking handling for Feature X"

We now have a note appended to our commit, as git log HEAD -1 shows:

commit bce22657cb5fd8353d43537e6b64422a8ce1107d (HEAD -> master)
Author: Marcus Folkesson <marcus.folkesson@gmail.com>
Date:   Sun Feb 11 21:56:17 2018 +0100

    Feature X

    Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>

Notes:
    v2: update locking handling for Feature X

Note that the SHA, bce22657cb5fd8353d43537e6b64422a8ce1107d, has not changed.

Edit a note

Editing a note is as simple as adding one:

git notes edit HEAD

When to use notes?

The first field of application is obvious: keeping notes about individual commits. The kind of notes I usually keep are comments from log reviews.

For the last year I have been using notes for a more practical purpose: patch changelog history, which is an incredibly powerful but undocumented (try to google it yourself) area of use for notes.

Patch format and changelogs

When talking about patches, the --- marker line serves the essential purpose of showing patch handling tools where the commit message ends. One good use for the additional comments after the --- marker is patch changelogs, which describe what has changed between, say, the v1 and v2 versions of a patch.

Rebase and amend

When working with patches, it is not unusual that it takes a few iterations until the patchset is in a ready-to-merge state. The way to rework patches is with interactive rebase and amending.

A note is linked to a certain git object, and this is a problem with rebase and amend since they replace that git object. In other words: if you amend your commit, the notes will disappear... unless you configure git to rewrite the note upon amend and rebase.

From git-notes(1):

notes.rewrite.<command>

When rewriting commits with <command> (currently amend or rebase) and this variable is set to true, Git automatically copies your notes from the original to the rewritten commit. Defaults to true, but see notes.rewriteRef below.

notes.rewriteRef

When copying notes during a rewrite, specifies the (fully qualified) ref whose notes should be copied. The ref may be a glob, in which case notes in all matching refs will be copied. You may also specify this configuration several times.

Does not have a default value; you must configure this variable to enable note rewriting. Set it to refs/notes/commits to enable rewriting for the default commit notes.

This setting can be overridden with the GIT_NOTES_REWRITE_REF environment variable, which must be a colon separated list of refs or globs.

The manpage simply tells you to add these lines to your ~/.gitconfig or .git/config to make amend and rebase rewrite the notes:

[notes "rewrite"]
    amend = true
    rebase = true
[notes]
    rewriteRef = refs/notes/commits


Generate patches

The best part is that git format-patch --notes generates patches with the notes appended after the --- marker.

For example: git format-patch --notes -1 -v2 will generate v2-0001-Feature-X.patch:

From bce22657cb5fd8353d43537e6b64422a8ce1107d Mon Sep 17 00:00:00 2001
From: Marcus Folkesson <marcus.folkesson@gmail.com>
Date: Mon, 19 Feb 2018 21:42:45 +0100
Subject: [PATCH v2] Feature X

Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>
---

Notes:
    v2: update locking handling for Feature X

 test | 1 +
 1 file changed, 1 insertion(+)

diff --git a/test b/test
index 2691857..066c37d 100644
--- a/test
+++ b/test
@@ -1 +1,2 @@
 testfile
+add-more-stuff
--
2.15.1

This makes the patch workflow really smooth, instead of manually tracking and updating your patch changelogs.

My typical workflow

/media/git-patch-workflow.png

Other tricks

Note logs

Even notes have a log (all objects in refs/XXX do), and it can be useful to see how your notes have evolved over time:

git log -p notes/commits

Push and fetch notes

The notes themselves are stored as git objects under .git/refs/notes. A thing to keep in mind is that refs/notes is not pushed by default, but that is not a problem - my opinion is that notes should stay local anyway.

If you really want to push your notes, you can do that with

git push <remote> refs/notes/*

and fetch them with

git fetch origin refs/notes/*:refs/notes/*

Buildroot: out-of-tree builds


Like the Linux kernel, Buildroot supports out-of-tree builds in a very similar way. This can be very useful when you have multiple configurations in the same Buildroot repository that you want to build without interference.

Consider the following example:

cd buildroot/
mkdir -p ../outputs/device{1,2}

make O=../outputs/device1 menuconfig
make O=../outputs/device1

make O=../outputs/device2 menuconfig
make O=../outputs/device2

Each output has its own .config, so you may change the build configurations independently.

The big benefit compared with git worktree [1] or other multiple-instances-of-the-same-repository approaches is that the dl directory is shared (without the need to specify BR2_DL_DIR [2]) among all build directories.

The feature is documented in the Buildroot manual [3].

get_maintainers and git send-email


Like many others, I prefer email as the communication channel, especially for patches. GitHub, Gerrit and all the other "nice" and "user-friendly" tools that try to "help" you manage your submissions simply do not fit my workflow.

As you may already know, all patches to the Linux kernel are submitted by email. scripts/get_maintainer.pl (see [1] for more info about the process) is a handy tool that takes a patch as input and gives back a bunch of email addresses. These email addresses are usually passed to git send-email [2] for submission.

I have used various scripts to massage the output from get_maintainer.pl to fit git send-email, but I was not completely satisfied until I found the --to-cmd and --cc-cmd parameters of git send-email:

--to-cmd=<command>
 Specify a command to execute once per patch file which should generate patch file specific "To:" entries. Output of this command must be single email address per line. Default is the value of sendemail.tocmd configuration value.
--cc-cmd=<command>
 Specify a command to execute once per patch file which should generate patch file specific "Cc:" entries. Output of this command must be single email address per line. Default is the value of sendemail.ccCmd configuration value.

I'm very pleased with these parameters. All I have to do is put these extra lines into my ~/.gitconfig (or use git config):

[sendemail.linux]
    tocmd ="`pwd`/scripts/get_maintainer.pl --nogit --nogit-fallback --norolestats --nol"
    cccmd ="`pwd`/scripts/get_maintainer.pl --nogit --nogit-fallback --norolestats --nom"

To submit a patch, I just type:

git send-email --identity=linux ./0001-my-fancy-patch.patch

and let --to and --cc be populated automatically.

OOM-killer


When the system is running out of memory, the Out-Of-Memory (OOM) killer picks a process to kill based on the current memory footprint. In case of OOM, a badness score between 0 (never kill) and 1000 is calculated for each process in the system. The process with the highest score gets killed. A score of 0 is reserved for unkillable tasks such as the global init process (see [1]) or kernel threads (processes with the PF_KTHREAD flag set).

/media/oomkiller.jpg

The current score of a given process is exposed in procfs, see /proc/[pid]/oom_score, and may be adjusted by setting /proc/[pid]/oom_score_adj. The value of oom_score_adj is added to the score before it is used to determine which task to kill. The value may be set between OOM_SCORE_ADJ_MIN (-1000) and OOM_SCORE_ADJ_MAX (+1000). This is useful if you want to guarantee that a process is never selected by the OOM killer.
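
As a small illustration, here is a sketch of how a process could exempt itself from the OOM killer. Note that lowering the value requires root privileges (or CAP_SYS_RESOURCE):

#include <stdio.h>

int main(void)
{
    /* Write OOM_SCORE_ADJ_MIN (-1000) to our own oom_score_adj;
     * the badness score will then never exceed 0 and the OOM
     * killer will leave us alone. */
    FILE *f = fopen("/proc/self/oom_score_adj", "w");

    if (!f) {
        perror("fopen");
        return 1;
    }

    fprintf(f, "-1000\n");
    fclose(f);

    /* ... do the memory-critical work here ... */
    return 0;
}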

The calculation is simple (nowadays): if a task is using all of its allowed memory, the badness score is calculated to 1000. If it is using half of its allowed memory, the badness score is calculated to 500, and so on. By setting oom_score_adj to -1000, the badness score sums up to <= 0 and the task will never be killed by the OOM killer.

There is one more thing that affects the calculation: if the process is running with the capability CAP_SYS_ADMIN, it gets a 3% discount. But that is simply it.

The old implementation

Before v2.6.36, the calculation of the badness score tried to be smarter. Besides looking at the total memory usage (task->mm->total_vm), it also considered:

  • Whether the process creates a lot of children
  • Whether the process has been running for a long time, or has used a lot of CPU time
  • Whether the process has a low nice value
  • Whether the process is privileged (CAP_SYS_ADMIN or CAP_SYS_RESOURCE set)
  • Whether the process is making direct hardware access

At first glance, all these criteria look valid, but if you think about it a bit, there are a lot of pitfalls here that make the selection unfair. For example: a process that creates a lot of children and consumes some memory could be a leaky web server, but another process that fits the same description is the session manager for your desktop environment, which naturally creates a lot of child processes.

The new implementation

The heuristic selection has evolved over time. Instead of looking at mm->total_vm for each task, the task's RSS (resident set size, [2]) and swap space are used. RSS and swap space give a better indication of the amount of memory we will be able to free if we choose this task. The drawback of using mm->total_vm is that it includes overcommitted memory (see [3] for more information), i.e. pages that the process has claimed but that have not been physically allocated.

A process is now only counted as privileged if it has CAP_SYS_ADMIN set, not CAP_SYS_RESOURCE as before.

The code

The whole implementation of the OOM killer is located in mm/oom_kill.c. The function oom_badness() is called for each task in the system and returns the calculated badness score.

Let's go through the function.

unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
              const nodemask_t *nodemask, unsigned long totalpages)
{
    long points;
    long adj;

    if (oom_unkillable_task(p, memcg, nodemask))
        return 0;

First, filter out unkillable tasks, such as the global init process.

p = find_lock_task_mm(p);
if (!p)
    return 0;

adj = (long)p->signal->oom_score_adj;
if (adj == OOM_SCORE_ADJ_MIN ||
        test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
        in_vfork(p)) {
    task_unlock(p);
    return 0;
}

If /proc/[pid]/oom_score_adj is set to OOM_SCORE_ADJ_MIN (-1000), do not even consider this task.

points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
    atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
task_unlock(p);

Calculate a score based on RSS, page tables and used swap space.

if (has_capability_noaudit(p, CAP_SYS_ADMIN))
    points -= (points * 3) / 100;

If it is a root process, give it a 3% discount. We are no mean people, after all.

adj *= totalpages / 1000;
points += adj;

Normalize and add the oom_score_adj value

return points > 0 ? points : 1;

Finally, never return 0 for an eligible task, as 0 is reserved for unkillable tasks.

}

Conclusion

The OOM logic is quite straightforward and seems to have been stable for a long time (v2.6.36 was released in October 2010). The reason I was looking at the code is that the behavior I saw when experimenting did not correspond to what was written in the man page for oom_score. It turned out that the manpage had not been updated when the new calculation was introduced back in 2010.

I have updated the manpage and it is available in v4.14 of the Linux man-pages project [4].

commit 5753354a3af20c8b361ec3d53caf68f7217edf48
Author: Marcus Folkesson <marcus.folkesson@gmail.com>
Date:   Fri Nov 17 13:09:44 2017 +0100

    proc.5: Update description of /proc/<pid>/oom_score

    After Linux 2.6.36, the heuristic calculation of oom_score
    has changed to only consider used memory and CAP_SYS_ADMIN.

    See kernel commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10.

    Signed-off-by: Marcus Folkesson <marcus.folkesson@gmail.com>
    Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com>

diff --git a/man5/proc.5 b/man5/proc.5
index 82d4a0646..4e44b8fba 100644
--- a/man5/proc.5
+++ b/man5/proc.5
@@ -1395,7 +1395,9 @@ Since Linux 2.6.36, use of this file is deprecated in favor of
 .IR /proc/[pid]/oom_score_adj .
 .TP
 .IR /proc/[pid]/oom_score " (since Linux 2.6.11)"
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
 This file displays the current score that the kernel gives to
 this process for the purpose of selecting a process
 for the OOM-killer.
@@ -1403,7 +1405,16 @@ A higher score means that the process is more likely to be
 selected by the OOM-killer.
 The basis for this score is the amount of memory used by the process,
 with increases (+) or decreases (\-) for factors including:
-.\" See mm/oom_kill.c::badness() in the 2.6.25 sources
+.\" See mm/oom_kill.c::badness() in pre 2.6.36 sources
+.\" See mm/oom_kill.c::oom_badness() after 2.6.36
+.\" commit a63d83f427fbce97a6cea0db2e64b0eb8435cd10
+.RS
+.IP * 2
+whether the process is privileged (\-);
+.\" More precisely, if it has CAP_SYS_ADMIN or (pre 2.6.36) CAP_SYS_RESOURCE
+.RE
+.IP
+Before kernel 2.6.36 the following factors were also used in the calculation of oom_score:
 .RS
 .IP * 2
 whether the process creates a lot of children using
@@ -1413,10 +1424,7 @@ whether the process creates a lot of children using
 whether the process has been running a long time,
 or has used a lot of CPU time (\-);
 .IP *
-whether the process has a low nice value (i.e., > 0) (+);
-.IP *
-whether the process is privileged (\-); and
-.\" More precisely, if it has CAP_SYS_ADMIN or CAP_SYS_RESOURCE
+whether the process has a low nice value (i.e., > 0) (+); and
 .IP *
 whether the process is making direct hardware access (\-).
 .\" More precisely, if it has CAP_SYS_RAWIO

Embedded Linux course in Linköping


I teach our Embedded Linux course on a regular basis, this time in Linköping. It's a fun course with interesting labs where you write your own Linux device driver for a custom board.

The course itself is quite moderate, but with talented participants we easily slip over into more interesting things like memory management, the MTD subsystem, how ftrace works internally and how to use it, different contexts, an introduction to perf and much more.

This time was no exception.

/media/embedded-linux-course.jpg

printk()


So, a week in Prague has come to an end. The Embedded Linux Conference Europe was co-located with the Open Source Summit this year and offered a lot of interesting talks on various topics.

One of the hottest topics this year was our most beloved debugging function - printk(). What is so hard about printing? It turns out that printk() is quite deadlock-prone, and that is not an easy thing to work around in the current infrastructure of the kernel.

A common misconception is that printk() is a fast operation that simply writes the message to the global __log_buf variable. It is not.

A printk() may involve many different subsystems, different contexts or nesting, just to mention a few of the parts that need to be handled. For example:

  1. The output needs to go over some output medium (consoles):
       • The monitor
       • Frame buffers
       • UART / serial console
       • Network console
       • Braille
       • ...
  2. Different locking mechanisms are used:
       • The console_lock (described below)
       • The logbuf_lock spinlock
       • Consoles often have their own locks
  3. Waiting applications need to be woken up:
       • syslogd
       • journald
       • ...

Besides that, printk() is expected to work in every context, whether process, softirq, IRQ or NMI context. With all these locking mechanisms involved, what happens if a printk() in process context is interrupted by an NMI, and the NMI also calls printk()? In other words, there are a lot of special cases that need to be handled.

How it works

Historical

Let's look back at how printing was handled in a prehistoric kernel.

SMP (Symmetric Multi-Processing) systems became common in the late 1990s. Before that, everything was easy and everyone was happy. No NMIs. No races between multiple cores. Simple locking. No Facebook. As a response to SMP systems, Linux v2.1.80 introduced a spin_lock in printk() to avoid race conditions between multiple cores. The solution was to serialize all prints to the console: if two CPUs called printk() at the same time, the second core had to wait for the first core to finish.

This does not scale well. In fact, it does not scale at all. What about a modern system with 100+ CPUs that all call printk() at the same time? Depending on the console, the printing may take milliseconds, and you will surely end up with an unresponsive system.

Nowadays

Now we do things differently. The first core that grabs the console_lock is responsible for printing all messages in __log_buf. If another core calls printk() in the meanwhile, it puts its data into __log_buf, tries to grab the lock, which is busy, and then simply returns. As __log_buf keeps getting new data, the unlucky core that grabbed the console_lock may end up doing nothing but printing.

The good thing is that we only lock up a single core instead of all cores. The bad thing is that we lock up a single core.

The code

printk()

printk() is defined in kernel/printk/printk.c and does not look like much:

asmlinkage __visible int printk(const char *fmt, ...)
{
    va_list args;
    int r;

    va_start(args, fmt);
    r = vprintk_func(fmt, args);
    va_end(args);

    return r;
}

It simply calls vprintk_func() with its own arguments.

vprintk_func()

vprintk_func() forwards the arguments to different print functions depending on the current context:

__printf(1, 0) int vprintk_func(const char *fmt, va_list args)
{
    if (this_cpu_read(printk_context) & PRINTK_NMI_CONTEXT_MASK)
        return vprintk_nmi(fmt, args);

    if (this_cpu_read(printk_context) & PRINTK_SAFE_CONTEXT_MASK)
        return vprintk_safe(fmt, args);

    if (this_cpu_read(printk_context) & PRINTK_NMI_DEFERRED_CONTEXT_MASK)
        return vprintk_deferred(fmt, args);

    return vprintk_default(fmt, args);
}

The different contexts we consider are:

Normal context

If we are in normal context, there is nothing special to consider at all; go for vprintk_default() and just do our thing.

NMI context

In the case that the CPU supports NMIs (Non-Maskable Interrupts; look for CONFIG_HAVE_NMI and CONFIG_PRINTK_NMI in your .config), we go for vprintk_nmi(). vprintk_nmi() does a safe copy to a per-CPU buffer, not to the global __log_buf. Since NMIs do not nest by nature, there is always only one writer running. However, NMIs are local to the CPU, and the buffer might get flushed from another CPU, so we still need to be careful.

"Recursive" context

If the printk() routine is interrupted and we end up in another call to printk() from somewhere else, we go for the lock-less vprintk_safe() to prevent a recursive deadlock. vprintk_safe() uses a per-CPU buffer to store the message, just like the NMI case.

Deferred context

As already said, multiple locks are involved in the call chain of printk(). vprintk_deferred() uses the main logbuf_lock but avoids calling console drivers that might have locks of their own. The actual printing is deferred to the klogd_work kernel thread.

vprintk_emit()

vprintk_emit() is responsible for writing to __log_buf (though it is not the only function that does; cont_flush() also writes to __log_buf) and for printing out the content to all consoles.

asmlinkage int vprintk_emit(int facility, int level,
                const char *dict, size_t dictlen,
                const char *fmt, va_list args)
{

    ...

    <<<<< Strip kernel syslog prefix >>>>>

    ...

    <<<<< log_output() does the actual printing to __log_buf >>>>>
    printed_len = log_output(facility, level, lflags, dict, dictlen, text, text_len);

    ...

    if (!in_sched) {
        /*
         * Try to acquire and then immediately release the console
         * semaphore.  The release will print out buffers and wake up
         * /dev/kmsg and syslog() users.
         */
        if (console_trylock())
            console_unlock();
    }

    return printed_len;
}

The function is quite straightforward. The only thing that looks a little bit strange is

if (console_trylock())
    console_unlock();

Really? Grab the console_lock and immediately release it? The thing is that all the magic happens in console_unlock().

console_unlock()

The CPU that grabs the console_lock is responsible for printing to all registered consoles until all new data in __log_buf has been printed - regardless of whether other CPUs keep filling the buffer with new data.

In the worst case, this CPU is doing nothing but printing and will never leave this function.

void console_unlock(void)
{
    ...


    <<<<< Endless loop? >>>>>
    for (;;) {

        <<<<< Go through all new messages >>>>>

        ...

        <<<<< Print to all consoles >>>>>
        call_console_drivers(ext_text, ext_len, text, len);

        ...
    }

    ...

    <<<<<  Release the exclusive_console once it is used >>>>>
    console_locked = 0;

    ...

    <<<<< Wake up klogd >>>>>
    if (wake_klogd)
        wake_up_klogd();
}

The function loops until all new messages have been printed. For each new message, a call to call_console_drivers() is made. The last thing we do is wake up the klogd kernel thread, which will signal all userspace applications waiting in klogctl(2).

call_console_drivers()

call_console_drivers() asks all registered consoles to print out a message. The console_lock must be held when calling this function.

static void call_console_drivers(const char *ext_text, size_t ext_len,
                 const char *text, size_t len)
{
    struct console *con;

    trace_console_rcuidle(text, len);

    if (!console_drivers)
        return;

    for_each_console(con) {
        if (exclusive_console && con != exclusive_console)
            continue;
        if (!(con->flags & CON_ENABLED))
            continue;
        if (!con->write)
            continue;
        if (!cpu_online(smp_processor_id()) &&
            !(con->flags & CON_ANYTIME))
            continue;
        if (con->flags & CON_EXTENDED)
            con->write(con, ext_text, ext_len);
        else
            con->write(con, text, len);
    }
}

trace_printk()

As we have seen, there is a lot of logic involved in a simple call to printk(), and you should not be surprised if all your printing has an impact on your system's performance or timing. But how do we debug when printk() is a no-no? The answer is trace_printk().

This function writes (almost) directly to a trace buffer and is therefore a fairly fast operation. The trace buffer is exposed through tracefs, usually mounted at /sys/kernel/tracing.

As a bonus, the messages are merged with the other output from ftrace when doing a function trace.
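
As a hypothetical example, timing a code path with trace_printk() could look like this (the function and the measured work are made up for illustration):

#include <linux/kernel.h>
#include <linux/ktime.h>

static void my_hot_path(void)
{
    ktime_t start = ktime_get();

    /* ... the time-sensitive work we want to measure ... */

    /* This ends up in the trace buffer (/sys/kernel/tracing/trace)
     * instead of going through the console machinery. */
    trace_printk("hot path took %lld ns\n",
                 ktime_to_ns(ktime_sub(ktime_get(), start)));
}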

Other things that are good to know about __log_buf

__log_buf

The kernel log buffer is exported as a global symbol called __log_buf. If you have a system that deadlocks without any output on the console, and you can reboot the system without resetting RAM, then you may print the content of __log_buf from the bootloader.

Determine the physical address of __log_buf:

[09:59:31]marcus@little:~/git/linux$ grep __log_buf System.map
c14cfba8 b __log_buf

0xc14cfba8 is the virtual address of __log_buf. This kernel is compiled for 32-bit ARM with CONFIG_VMSPLIT_3G set, so the kernel virtual address space starts at 0xc0000000. To get the physical address, subtract the offset (0xc14cfba8 - 0xc0000000) and you end up with 0x014cfba8. Dump this address from your bootloader (e.g. with md.b 0x014cfba8 0x100 in U-Boot) and you will see your kernel log.

sizeof(__log_buf)

The size of __log_buf is set at compile time with CONFIG_LOG_BUF_SHIFT. The value defines the size as a power of 2 and is usually set to 16 (64KB).

There is also CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT, which sizes the per-CPU buffers where messages printed from an unsafe context are temporarily stored. Examples of unsafe contexts are NMI and printk() recursion. The messages are copied to the main log buffer in a safe context to avoid deadlocks.

These buffers are rarely used, but they have to be there to avoid nasty deadlocks. CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT was introduced in v4.11 and is also expressed as a power of 2.