Changing the root of your Linux filesystem

Changing the root of your Linux filesystem

After my previous post [1], I've got a few questions from people about the difference between chroot, pivot_root and switch_root, they all seems to do the same thing, right? Almost.

Lets shed some light on this topic.

Rootfs

First of all, we need to specify what we mean when we say rootfs.

rootfs is a special instance of a ram filesystem (usually tpmfs) that is created in an early boot stage. You cannot unmount rootfs for pretty much the same reason you can't kill the init process - you need it to have a functional system.

The kernel build process always creates a gzipped gpio (newc format) archive and ink it into the resulting kernel binary. This is our rootfs! The empty archive consuming ~130 bytes, and that is something we have to live with.

The rootfs is empty - unless you populate it. You may set the CONFIG_INITRAMFS_SOURCE config option to include files and directories to your rootfs.

When the system boots, it searches for an executable /init in the rootfs, and executes if it exists. If it does not exist the system will continue and try to mount whatever is specified on the root= kernel command line.

One important thing to remember is that the system mounts the root= partition on / (which is basically rootfs).

initramfs vs inird

So, now we when we know what rootfs is and that rootfs is basically an initramfs (empty or not).

Now we need to distinguish between initramfs and initrd. At a first glance, it may seems like they're the same, but they are not.

An initrd is basically a complete image of a filesystem that is mounted on a ramdev, a RAM-based block-device. As it's a filesystem, you will need to have support for that filesystem static compiled (=y) into your kernel.

An initramfs, on the other hand, is not a filesystem. It's a cpio archive which is unpacked into a tmpfs.

As we already has spoken about, this archive is embedded into the kernel itself and as a side-effect, the initramfs could be loaded earlier in the kernel boot process and keep a smaller footprint as the kernel can adapt the size of the tmpfs to what is actually loaded.

The both are also fundamental different when it comes how they're treated as a root filesystem. As an initrd is an actual filesystem unpacked into RAM, the root filesystem will be that ramdisk. When changing root filesystem form an initrd to a "real" root device, you have to switch the root device.

The initramfs on the other hand, is some kind of interim filesystem with no underlaying device, so the kernel doesn't need to switch any devices - just "jump" to one.

Jump into a new root filesystem

So, when it's time to "jump" to a new root filesystem, that could be done in several ways.

chroot

chroot runs a command with a specified root directory. Nothing more, nothing less.

The basic syntax of the chroot command is:

chroot option newroot [command [args]]

pivot_root

pivot_root moves the root file system of the current process to the directory put_old and makes new_root the new root file system. It's often used in combination with chroot.

From the manpage [2]:

The typical use of pivot_root() is during system startup,
when the system mounts a temporary root file
system (e.g., an initrd), then mounts the real root file system,
and eventually turns the latter into the current root
of all relevant processes or threads.

The basic syntax of the pivot_root command is:

pivot_root new_root put_old

Example on how to use pivot_root:

mount /dev/hda1 /new-root
cd /new-root
pivot_root . old-root
exec chroot . sh <dev/console >dev/console 2>&1
umount /old-root

pivot_root should be unused for initrd only, it does not work for initramfs.

switch_root

switch_root [3] switches to another filesystem as the root of the mount tree. Pretty much the same thing as pivot_root do, but with a few differences.

switch_root moves already mounted /proc, /dev, /sys and /run to newroot and makes newroot the new root filesystem and starts init process.

Another important difference is that switch_root removes recursively all files and directory in the current root filesystem!

The basic syntax of the switch_root command is:

switch_root newroot init [arg...]

Also, you are not allowed to umount the old mount (as you do with pivot_root) as this is your rootfs (remember?).

Summary

To summarize;

If you use initrd - use pivot_root combined with chroot.

if you use initramfs - use switch_root.

It's technically possible to only use chroot, but it will not free resources and you could end up with some restrictions( see unshare(7) for example).

Also, keep in mind that you must invoke either command via exec, otherwise the new process will not inherit PID 1.

chroot and user namespaces

chroot and user namespaces

When playing around with libcamera [1] and br2-readonly-rootfs-overlay [2] I found something.. well.. unexpected. At least at first glance.

What happened was that I encountered this error:

 $ libcamera-still
Preview window unavailable
[0:02:54.785102683] [517]  INFO Camera camera_manager.cpp:299 libcamera v0.0.0+67319-2023.02-22-gd530afad-dirty (2024-02-20T16:56:34+01:00)
[0:02:54.885731084] [518] ERROR Process process.cpp:312 Failed to unshare execution context: Operation not permitted

Failed to unshare execution context: Operation not permitted... what?

I know that libcamera executes proprietary IPAs (Image Processing Algorithms) as black boxes, and that the execution is isolated in their own namespace. But.. not permitted..?

Lets look into the code for libcamera (src/libcamera/process.cpp) [3]:

int Process::isolate()
{
	int ret = unshare(CLONE_NEWUSER | CLONE_NEWNET);
	if (ret) {
		ret = -errno;
		LOG(Process, Error) << "Failed to unshare execution context: "
				    << strerror(-ret);
		return ret;
	}

	return 0;
}

Libcamera does indeed create a user and network namespace for the execution. In Linux, new namespaces is created with unshare(2).

The unshare(2) library call is used to dissassociate parts of its execution context by creating new namespaces. As you can see in the manpage [4], there are a few scenarios that can give us -EPERM as return value:

EPERM  The calling process did not have the required privileges for this operation.

EPERM  CLONE_NEWUSER  was  specified in flags, but either the effective user ID or the effective group ID of the caller does not have a mapping in
       the parent namespace (see user_namespaces(7)).

EPERM (since Linux 3.9)
       CLONE_NEWUSER was specified in flags and the caller is in a chroot environment (i.e., the caller's root directory does not match  the  root
       directory of the mount namespace in which it resides).

The last one caught my eyes. I was not aware of this, even if it makes very much sense to have it this way.

Unfortunately, the last step of the init-script in the initramfs for br2-readonly-rootfs-overlay is to do chroot(1) to the actual root filesystem. I did initially choose to use chroot because it works with both an initrd and initramfs root filesystem, and the code was intended to be generic.

pivot_root and switch_root can (and should) be used with advantage, but it requires you to now which type of filesystem you are working with.

As the br2-readonly-rootfs-overlay is now only designed to be used with initramfs, I switched to use switch_root and the problem was solved.

For the sake of fun, lets follow the code down to where the error occour.

Code digging

When the application makes a call to unshare(2), which simply is a wrapper in the libc implementation for the system call down to the kernel, it ends up with a syscall(2).

The system call on the kernel side is defined as follow:

SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
{
	return ksys_unshare(unshare_flags);
}

It only calls ksys_unshare() which interpret the unshare_flags and "unshares" namespaces depending on flags. The flag for unsharing a user namespace is CLONE_NEWNS.

int ksys_unshare(unsigned long unshare_flags)
{
	struct fs_struct *fs, *new_fs = NULL;
	struct files_struct *new_fd = NULL;
	struct cred *new_cred = NULL;
	struct nsproxy *new_nsproxy = NULL;
	int do_sysvsem = 0;
	int err;

	/*
	 * If unsharing a user namespace must also unshare the thread group
	 * and unshare the filesystem root and working directories.
	 */
	if (unshare_flags & CLONE_NEWUSER)
		unshare_flags |= CLONE_THREAD | CLONE_FS;
	/*
	 * If unsharing vm, must also unshare signal handlers.
	 */
	if (unshare_flags & CLONE_VM)
		unshare_flags |= CLONE_SIGHAND;
	/*
	 * If unsharing a signal handlers, must also unshare the signal queues.
	 */
	if (unshare_flags & CLONE_SIGHAND)
		unshare_flags |= CLONE_THREAD;
	/*
	 * If unsharing namespace, must also unshare filesystem information.
	 */
	if (unshare_flags & CLONE_NEWNS)
		unshare_flags |= CLONE_FS;

	err = check_unshare_flags(unshare_flags);
	if (err)
		goto bad_unshare_out;
	/*
	 * CLONE_NEWIPC must also detach from the undolist: after switching
	 * to a new ipc namespace, the semaphore arrays from the old
	 * namespace are unreachable.
	 */
	if (unshare_flags & (CLONE_NEWIPC|CLONE_SYSVSEM))
		do_sysvsem = 1;
	err = unshare_fs(unshare_flags, &new_fs);
	if (err)
		goto bad_unshare_out;
	err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd);
	if (err)
		goto bad_unshare_cleanup_fs;
	err = unshare_userns(unshare_flags, &new_cred);
	....

unshare_userns() prepares credentials and then creates the user namespace by calling create_user_ns():

int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
{
	struct cred *cred;
	int err = -ENOMEM;

	if (!(unshare_flags & CLONE_NEWUSER))
		return 0;

	cred = prepare_creds();
	if (cred) {
		err = create_user_ns(cred);
		if (err)
			put_cred(cred);
		else
			*new_cred = cred;
	}

	return err;
}

Before create_user_ns() creates the actual user namespace, it makes various sanity checks, one of those are if(current_chrooted())

int create_user_ns(struct cred *new)
{
	struct user_namespace *ns, *parent_ns = new->user_ns;
	kuid_t owner = new->euid;
	kgid_t group = new->egid;
	struct ucounts *ucounts;
	int ret, i;

	ret = -ENOSPC;
	if (parent_ns->level > 32)
		goto fail;

	ucounts = inc_user_namespaces(parent_ns, owner);
	if (!ucounts)
		goto fail;

	/*
	 * Verify that we can not violate the policy of which files
	 * may be accessed that is specified by the root directory,
	 * by verifying that the root directory is at the root of the
	 * mount namespace which allows all files to be accessed.
	 */
	ret = -EPERM;
	if (current_chrooted())
		goto fail_dec;
	
	....

This is how we check if the if the current process is chrooted or not:

bool current_chrooted(void)
{
	/* Does the current process have a non-standard root */
	struct path ns_root;
	struct path fs_root;
	bool chrooted;

	/* Find the namespace root */
	ns_root.mnt = &current->nsproxy->mnt_ns->root->mnt;
	ns_root.dentry = ns_root.mnt->mnt_root;
	path_get(&ns_root);
	while (d_mountpoint(ns_root.dentry) && follow_down_one(&ns_root))
		;

	get_fs_root(current->fs, &fs_root);

	chrooted = !path_equal(&fs_root, &ns_root);

	path_put(&fs_root);
	path_put(&ns_root);

	return chrooted;
}

Summary

Do not create user namespaces for chroot:ed processes. It does not work for good reasons.

Streamline your kernel config

Streamline your kernel config

When working with embedded systems, it's not that uncommon that you need to compile your own kernel for your hardware platform.

The configuration file you use is probably based on some default configuration that you borrowed from the kernel tree, a vendor tree, a SoM-vendor or simply from your collegue.

As the configuration is somehow generic, it probably contains tons of kernel modules that is not needed for your application. Walking through the menuconfig and tidy up among all modules could be a lot of work.

What you want is to only enable those modules that are currently in use and disable all other. Maybe you have used make localmodconfig or make localyesconfig during compilation of your host kernel, but did you know that those targets are just as applicable for embedded systems?

streamline_config.pl

Both localmodconfig and localyesconfig make use of ./scripts/kconfig/streamline_config.pl [1] which usually execute lsmod, look into your .config and remove all modules that are currently not loaded.

Instead of execute the default lsmod binary, you could set the LSMOD environment variable to either point to an executable file, OR a text file containing the output from lsmod.

Get your current config

In order to streamline your configuration file, you will of course need to know the configuration. Hopefully you already have it, otherwise it could be extracted from your running kernel if CONFIG_IKCONFIG is set.

Either directly from the target:

 $ zcat /proc/config.gz > /tmp/.config

Or by helper scripts in the kernel source

 $ ./scripts/extract-ikconfig ./vmlinux > /tmp/.config

and make it available to the host.

Get a list of loaded modules

Save the output from lsmod (run on target) to a file:

 $ lsmod > /tmp/mymodules

And make it available to the host.

Streamline

Time to streamline. Assuming you now have the configuration file stored as .config in the root of your kernel tree, you may now run

 $ make LSMOD=/path/to/mymodules localmodconfig
To update the current config and disable all modules not specified in /path/to/mymodules.

Or by

 $ make LSMOD=/path/to/mymodules localyesconfig
To update the current config converting all modules specified in /path/to/mymodules to core.

Power button and embedded Linux

Power button and embedded Linux

Not all embedded Linux systems has a power button, but for consumer electronic devices it could be a good thing to be able to turn it off. But how does it work in practice?

Physical button

It all starts with a physical button.

At the board level, the button is usually connected to a General-Purpose-Input-Output (GPIO) pin on the processor. It doesn't have to be directly connected to the processor though, all that matters is that the button press can somehow be mapped to the Linux input subsystem.

The input system reports certain key codes to a process in user space that listen for those codes and performs the corresponding action.

All possible keycodes is listed in linux-event-codes.h [1]. The ones that are relevant to us are:

  • KEY_POWER
  • KEY_SLEEP
  • KEY_SUSPEND
  • KEY_RESTART

Map the button to the input system

To map a certain GPIO to the input system we make use of the gpio-keys kernel driver [2]. Make sure you have the CONFIG_KEYBOARD_GPIO option selected in your kernel configuration.

The mapping is described in the devicetree. The key (pun intended) is to set the right key-code on the linux,code property.

See example below:

gpio-keys {
        compatible = "gpio-keys";
        pinctrl-names = "default";
        pinctrl-0 = <&button_pins>;
        status = "okay";

        key-power {
                label = "Power Button";
                gpios = <&gpio 7 GPIO_ACTIVE_LOW>;
                linux,code = <KEY_POWER>;
                wakeup-source;
        };
        key-restart {
                label = "Restart Button";
                gpios = <&gpio 8 GPIO_ACTIVE_LOW>;
                linux,code = <KEY_RESTART>;
                wakeup-source;
        };
        key-suspend {
                label = "Suspend Button";
                gpios = <&gpio 9 GPIO_ACTIVE_LOW>;
                linux,code = <KEY_SUSPEND>;
                wakeup-source;
        };
};

Hopefully you will now have the input device registered when you boot up:

 $ dmesg | grep input
 [    6.886635] input: gpio-keys as /devices/platform/gpio-keys/input/input0

systemd-logind

We now have an input device that reports key-events when we press the button. The next step is to have something that read these reports and act on them on the userspace side.

On systems with systemd, we have the systemd-logind service. This service is responsible for a lot of things including:

  • Keeping track of users and sessions, their processes and their idle state. This is implemented by allocating a systemd slice unit for each user below user.slice, and a scope unit below it for each concurrent session of a user. Also, a per-user service manager is started as system service instance of user@.service for each logged in user.
  • Generating and managing session IDs. If auditing is available and an audit session ID is already set for a session, then this ID is reused as the session ID. Otherwise, an independent session counter is used.
  • Providing polkit[1]-based access for users for operations such as system shutdown or sleep
  • Implementing a shutdown/sleep inhibition logic for applications
  • Handling of power/sleep hardware keys
  • Multi-seat management
  • Session switch management
  • Device access management for users
  • Automatic spawning of text logins (gettys) on virtual console activation and user runtime directory management
  • Scheduled shutdown
  • Sending "wall" messages

Is is of course possible to have your own listener, but as we have systemd-logind and ... I sort of started to like systemd, we will go that way.

Configure systemd-logind

systemd-logind has its configuration file [3] in /etc/systemd/logind.conf where it's possible to set an action for each event. Possible actions are:

Action Description
ignore systemd-login will never handle these keys
poweroff System will poweroff
reboot System will reboot
halt System will halt
kexec System will start kexec
suspend System will suspend
hibernate System will hybernate
hybrid-sleep  
suspend-then-hibernate  
lock All running sessions will be screen-locked
factory-reset  

The events that are configurable are:

Configuration item Listen on Default action
HandlePowerKey KEY_POWER poweroff
HandlePowerKeyLongPress KEY_POWER ignore
HandleRebootKey KEY_RESTART reboot
HandleRebootKeyLongPress KEY_RESTART poweroff
HandleSuspendKey KEY_SUSPEND suspend
HandleSuspendKeyLongPress KEY_SUSPEND hibernate
HandleHibernateKey KEY_SLEEP hibernate
HandleHibernateKeyLongPress KEY_SLEEP ignore

It's not obvious how long a long press is, I could not find anything about it in the documentation. However, the code [4] tells us that it's 5 seconds.

#define LONG_PRESS_DURATION (5 * USEC_PER_SEC)

power-switch tag

The logind service will only watch input devices with the power-switch udev tag for keycodes, so there has to be a udev-rule that sets such a tag. The default rule that comes with systemd, 70-power-switch.rules, looks as follow:

ACTION=="remove", GOTO="power_switch_end"

SUBSYSTEM=="input", KERNEL=="event*", ENV{ID_INPUT_SWITCH}=="1", TAG+="power-switch"
SUBSYSTEM=="input", KERNEL=="event*", ENV{ID_INPUT_KEY}=="1", TAG+="power-switch"

LABEL="power_switch_end"

All input devices of the type ID_INPUT_SWITCH or ID_INPUT_KEY will then be monitored.

We can confirm that the tag is set for our devices we created earlier:

 $ udevadm info /dev/input/event0
P: /devices/platform/gpio-keys/input/input0/event0
M: event0
R: 0
U: input
D: c 13:64
N: input/event0
L: 0
S: input/by-path/platform-gpio-keys-event
E: DEVPATH=/devices/platform/gpio-keys/input/input0/event0
E: DEVNAME=/dev/input/event0
E: MAJOR=13
E: MINOR=64
E: SUBSYSTEM=input
E: USEC_INITIALIZED=7790298
E: ID_INPUT=1
E: ID_INPUT_KEY=1
E: ID_PATH=platform-gpio-keys
E: ID_PATH_TAG=platform-gpio-keys
E: DEVLINKS=/dev/input/by-path/platform-gpio-keys-event
E: TAGS=:power-switch:
E: CURRENT_TAGS=:power-switch:

Summary

The signal path from a physical power button up to a service handler in userspace is not that complicated. Systemd makes handling the signals and what behavior they should have both smooth and flexible.

br2-readonly-rootfs-overlay

br2-readonly-rootfs-overlay

This is a Buildroot external module that could be used as a reference design when building your own system with an overlayed root filesystem. It's created as an external module to make it easy to adapt for your to your own application.

The goal is to achieve the same functionality I have in meta-readonly-rootfs-overlay [1] but for Buildroot.

Why does this exists?

Having a read-only root file system is useful for many scenarios:

  • Separate user specific changes from system configuration, and being able to find differences
  • Allow factory reset, by deleting the user specific changes
  • Have a fallback image in case the user specific changes made the root file system no longer bootable.

Because some data on the root file system changes on first boot or while the system is running, just mounting the complete root file system as read-only breaks many applications. There are different solutions to this problem:

  • Symlinking/Bind mounting files and directories that could potentially change while the system is running to a writable partition
  • Instead of having a read-only root files system, mounting a writable overlay root file system, that uses a read-only file system as its base and writes changed data to another writable partition.

To implement the first solution, the developer needs to analyse which file needs to change and then create symlinks for them. When doing factory reset, the developer needs to overwrite every file that is linked with the factory configuration, to avoid dangling symlinks/binds. While this is more work on the developer side, it might increase the security, because only files that are symlinked/bind-mounted can be changed.

This br2-external provides the second solution. Here no investigation of writable files are needed and factory reset can be done by just deleting all files or formatting the writable volume.

How does it work?

This external module makes use of Dracut [2] to create a rootfs.cpio that is embedded into the Linux kernel as an initramfs.

The initramfs mounts the base root filsystem as read-only and the read-write filesystem as the upper layer in an overlay filesystem structure.

Dependencies

The setup requires the following kernel configuration (added as fragment file in board/qemu-x86_64/linux.fragment):

CONFIG_OVERLAY_FS=y
CONFIG_BLK_DEV_INITRD=y

Test it out

This module will build a x86_64 target that is prepared to be emulated with qemu.

Clone this repository

git clone https://github.com/marcusfolkesson/br2-readonly-rootfs-overlay.git

Build

Run build.sh to make a full build.

The script is simple, it just update the Buildroot submodule and start a build:

#!/bin/bash

# Update submodules
git submodule init
git submodule update

# Build buildroot
cd ./buildroot
make BR2_EXTERNAL=../ readonly-rootfs-overlay_defconfig
make

Artifacts

The following artifacts are generated in ./buildroot/output/images/ :

  • bzImage - The Linux kernel
  • start-qemu.sh - script that will start qemu-system-x86_64 and emulate the whole setup
  • sdcard.img - The disk image containing two partitions, one for the read-only rootfs and one for the writable upper filesystem
  • rootfs.cpio - The initramfs file system that is embedded into the kernel
  • rootfs.ext2 - The root filesystem image
  • overlay.ext4 - Empty filsystem image used for the writable layer

Emulate in QEMU

An script to emulate the whole thing is generated in the output directory. Execute the ./buildroot/output/images/start-qemu.sh script to start the emulator.

Once the system has booted, you are able to login as root:

Welcome to Buildroot
buildroot login: root

And as you can see, the root filesystem is overlayed as it should:

$ mount
...
/dev/vda1 on /media/rfs/ro type ext2 (ro,noatime,nodiratime)
/dev/vda2 on /media/rfs/rw type ext4 (rw,noatime,errors=remount-ro)
overlay on / type overlay (rw,relatime,lowerdir=/media/rfs/ro,upperdir=/media/rfs/rw/upperdir,workdir=/media/rfs/rw/work)
...

Kernel command line parameters

These examples are not meant to be complete. They just contain parameters that are used by the initscript of this repository. Some additional paramters might be necessary.

Example without initramfs

root=/dev/sda1 rootrw=/dev/sda2 init=/init

This cmd line starts /sbin/init with /dev/sda1 partition as the read-only rootfs and the /dev/sda2 partition as the read-write persistent state. When using this init script without an initrd, init=/init has to be set.

root=/dev/sda1 rootrw=/dev/sda2 init=/init rootinit=/bin/sh

The same as before but it now starts /bin/sh instead of /sbin/init

Details

All kernel parameters that is used to configure br2-readonly-rootfs-overlay:

  • root - specifies the read-only root file system device. If this is not specified, the current rootfs is used.
  • `rootfstype if support for the read-only file system is not build into the kernel, you can specify the required module name here. It will also be used in the mount command.
  • rootoptions specifies the mount options of the read-only file system. Defaults to noatime,nodiratime.
  • rootinit if the init parameter was used to specify this init script, rootinit can be used to overwrite the default (/sbin/init).
  • rootrw specifies the read-write file system device. If this is not specified, tmpfs is used.
  • rootrwfstype if support for the read-write file system is not build into the kernel, you can specify the required module name here. It will also be used in the mount command.
  • rootrwoptions specifies the mount options of the read-write file system. Defaults to rw,noatime,mode=755.
  • rootrwreset set to yes if you want to delete all the files in the read-write file system prior to building the overlay root files system.

Expose network namespace created by Docker

Expose network namespace created by Docker

Disclaimer: this is probably *not* the best way for doing this, but it's pretty good for educational purposes.

During a debug session I wanted to connect an application to a service tha ran in a docker container. This was for test-purposes only, so hackish and fast are the keywords.

First of all, I'm not a Docker expert, but I've a pretty good understanding on Linux internals, namespaces and how things works on a Linux system. So I started to use the tools I had.

Create a container to work with

This is not the container I wanted to debug, but to have something to demonstrate the concept:

FROM ubuntu

MAINTAINER Marcus Folkesson <marcus.folkesson@gmail.com>

RUN apt-get update
RUN apt-get update
RUN apt-get install -y socat

RUN ["/bin/sh"]

Create and start the container

docker build -t netns .
docker run -ti netns /bin/bash

Inside the container, create a TCP-server with socat:

root@c8a2438ad58e:/# socat - TCP-L:1234

Lets play around

I used to list network namespaces with ip netns list, but the command gave me no outputs even while the docker container was running.

That was unexpected. I wonder where ip actually look for namespaces. To find out I used strace and looked for the openat system call:

$ strace -e openat  ip netns list
...
openat(AT_FDCWD, "/var/run/netns", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 5
...

Ok, ip netns list does probably look for files representing network namespaces in /var/run/netns.

Lets try to create an entry (dockerns) on that location:

$ sudo touch /var/run/netns/dockerns
$ ip netns  list
Error: Peer netns reference is invalid.
Error: Peer netns reference is invalid.
dockerns

Good. It tries to dereference the namespace!

All namespaces for a certain PID is exposed in procfs. For example, here are the namespaces that my bash session belongs to:

$ ls -al /proc/self/ns/
total 0
dr-x--x--x 2 marcus marcus 0 15 dec 12.35 .
dr-xr-xr-x 9 marcus marcus 0 15 dec 12.35 ..
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 net -> 'net:[4026531840]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 time -> 'time:[4026531834]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 user -> 'user:[4026531837]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 uts -> 'uts:[4026531838]'

Now we only should do the same for the container.

First, find the ID of the running container with docker ps:

$ docker ps
CONTAINER ID   IMAGE     COMMAND       CREATED              STATUS          PORTS     NAMES
c8a2438ad58e   netns     "/bin/bash"   About a minute ago   Up 59 seconds             dazzling_lewi

Inspect the running container to determine the PID:

$ docker inspect -f '{{.State.Pid}}'  c8a2438ad58e
36897

36897, there we have it.

Lets see which namespaces it has:

$ sudo ls -al /proc/36897/ns
total 0
dr-x--x--x 2 root root 0 15 dec 09.59 .
dr-xr-xr-x 9 root root 0 15 dec 09.59 ..
lrwxrwxrwx 1 root root 0 15 dec 10.01 cgroup -> 'cgroup:[4026533596]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 ipc -> 'ipc:[4026533536]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 mnt -> 'mnt:[4026533533]'
lrwxrwxrwx 1 root root 0 15 dec 09.59 net -> 'net:[4026533538]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 pid -> 'pid:[4026533537]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 pid_for_children -> 'pid:[4026533537]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 uts -> 'uts:[4026533534]'

As we can see, the container have different IDs for the most (user and time is shared) namespaces.

Now we have the network namespace for the container, lets bind mount it to var/run/netns/dockerns:

$ sudo mount -o bind /proc/36897/ns/net /var/run/netns/dockerns

And run ip netns list again:

$ ip netns list
dockerns (id: 0)

Nice.

Start socat in the dockerns network namespace and connect to localhost:1234:

$ sudo ip netns exec dockerns socat - TCP:localhost:1234
hello
Hello from the outside world
Hello from inside docker

It works! We are now connected to the service running in the container!

Conclusion

It's fun to play around, but there are room for improvements.

For example, a better way to list namespaces is to use lsns as this tool looks after namespaces in more paths including /run/docker/netns/.

Also, a more "correct" way is probably to create a virtual ethernet device and attach it to the same namespace.

veth example

To do that, we first need to determine the PID for the container:

$ lsns -t net
        NS TYPE NPROCS     PID USER       NETNSID NSFS                           COMMAND
	...
	4026533538 net       3   36897 root             0 /run/docker/netns/06a083424158 /bin/bash

Create the public dockerns namespace (create /var/run/netns/dockerns as we did earlier but with ip netns attach):

$ sudo ip netns attach dockerns 36897

Create virtual interfaces, assign network namespace and create routes:

$ sudo ip link add veth0 type veth peer name veth1
$ sudo ip link set veth1 netns dockerns
$ sudo ip address add 192.168.3.1/24 dev veth0
$ sudo ip netns exec dockerns ip a add 192.168.3.2/24 dev veth1
$ sudo ip link set up veth0
$ sudo ip netns exec dockerns ip l set up veth1
$ sudo ip route add 10.0.42.0/24 via 192.168.3.2

That is pretty much it.

Skip flashing unused blocks with UUU

Skip flashing unused blocks with UUU

TL;DR: UUU does now (or will shortly) support blockmaps for flashing images. Use it. It will shorten your flashing time *a lot.*

It will soon be time to manufacture circuit boards for a project I'm currently working on. After manufacturing, it will need some firmware for sure, but how do we flash it in the most efficient way? The board is based on an i.MX.8 chip from NXP and has an eMMC as storage medium.

Overall, short iteration is nice to have as it will benifit the development cycles for the developers on a daily basis, but the time it takes to flash a device with its firmware is not just "nice to have" in the production line, it's critical. We do not want to have a (another?) bottleneck in the production line, so we have to make some effort to make it as fast as possible.

The Scenario

The scenario we have in front of us is that:

  • We want to write a full disk image to the eMMC in the production
  • Most space in the disk image is reserved for empty partitions
  • The disk image is quite big (~12GB)
  • Only ~2GB is of the disk image is "real data" (not just chunks of zeros)
  • Flash firmware to eMMC takes ~25min which is unacceptable

Sparse files is to me a very familiar concept and the .wic file created by the Yocto project is indeed a sparse file. But the flashing tool (UUU) did not seems to support write sparse images without expanding them (that is not completely true, as we will see).

To speed up the production line, I thought of several different solutions including:

  • Do not include empty partitions in the .wic file, create those at the first boot instead
  • Only use UUU to boot the board and then flash the disk image using dfu-utils
  • ...

But all have its own drawbacks. Best would be if we could stick to a uniform way of flashing everything we need, and UUU will be needed regardless.

Before we go any further, lets look at what a sparse file is.

Sparse images

A sparse image [2] is a type of file that contains metadata to describe empty sections (let's call them holes) instead of filling it up with zeros (that take up actual space). The advantage is that sparse files only take as much space as they actually need.

For example; If the logical file size is, say 128 GiB, and the areas with real data is only 8GiB, the physical file size will only be 8GiB+metadata. The allocated size are significantly reduced compared to a full 128GiB image.

/media/sparse-file.png

Most of the filesystems in Linux (btrfs, XFS, ext[234], ...) and even Windows (NTFS) handle sparse files transparently to the user. This means that the filesystem will not allocate more memory than necessary. Reading these holes returns just zeros, and the filesystem will allocate blocks on the filesystem only when someone actually writes to those holes.

Think on it as the Copy-On-Write (COW) feature that many modern filesystems implements.

An example

It could be a little bit confusing when trying to figure out the file size for sparse files.

For example, ls reports 14088667136 bytes, which is quite a big file!

$ ls -alrt image.wic
-rw-r--r-- 1 marcus marcus 14088667136 12 dec 11.08 image.wic

But du only reports 1997888 bytes?!

$ du  image.wic
1997888    image.wic

To get the real occupied size, we need to print the allocated size in blocks (-s):

$ ls -alrts image.wic
1997888 -rw-r--r-- 1 marcus marcus 14088667136 12 dec 11.08 image.wic

If you want to see it with your own eyes, you can easily create a 2GB file with the truncate command:

$ truncate  -s  2G image.sparse
$ stat image.sparse 
  File: image.sparse
    Size: 2147483648	Blocks: 0          IO Block: 4096   regular file

As you can see, the file size is 2GB, but it occupies zero blocks on the disk. The file does only consists of a big chunk of zeros which is described in the metadata.

Handle the holes with care

It's easy to expand the sparse files if you are not careful. The most common applications that is used to copy files have a --sparse option to preserve the holes. If not used, it will allocate blocks and fill them out with zeros. For some applications the --sparse option is set as their default behavior, e.g., see the relevant part of the manpage for cp(1):

--sparse=WHEN
              control creation of sparse files. See below

 ...
 By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well.
 That  is  the  behavior  selected  by --sparse=auto.
 Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes.
 Use --sparse=never to inhibit creation of sparse files.

While some applications does not use it as default, e.g. rsync(1):

--sparse, -S
        Try to handle sparse files efficiently so they take up less space on the destination.  If combined with --inplace the file created might not end  up  with
        sparse  blocks  with some combinations of kernel version and/or filesystem type.  If --whole-file is in effect (e.g. for a local copy) then it will always
        work because rsync truncates the file prior to writing out the updated version.

        Note that versions of rsync older than 3.1.3 will reject the combination of --sparse and --inplace.

All kind of compression/decompression of files usually expands sparse files as well.

bmap-tools

bmap-tools [4] is a handy tool for creating block maps (bmaps) for a file and then use that information and copy that file to a media in a more efficient way.

The advantages of bmap-tools compared to e.g. dd is (as they state on their Github page) :

  • Faster. Depending on various factors, like write speed, image size, how full is the image, and so on, bmaptool was 5-7 times faster than dd in the Tizen IVI project.
  • Integrity. bmaptool verifies data integrity while flashing, which means that possible data corruptions will be noticed immediately.
  • Usability. bmaptool can read images directly from the remote server, so users do not have to download images and save them locally.
  • Protects user's data. Unlike dd, if you make a mistake and specify a wrong block device name, bmaptool will less likely destroy your data because it has protection mechanisms which, for example, prevent bmaptool from writing to a mounted block device.

The tool comes with two sub commands, create and copy. The create subcommand generates the block maps and copy commands copy a file to a certain destinatin.

A typical usage of dd could be to write an image (bigimage) to to a SD-card ( /dev/sdX):

 $ bzcat bigimage | dd of=/dev/sdX bs=1M conv=fsync

The dd command will copy all data, including those zeroes for the empty partitions. Instead of copy all data with dd, we could generate a block map file with metadata that describes all the "empty" sections:

bmaptool create bigimage > bigimage.bmap

The bmap file is a human readable XML file that shows all the block maps and also checksums for each block:

<?xml version="1.0" ?>
<!-- This file contains the block map for an image file, which is basically
     a list of useful (mapped) block numbers in the image file. In other words,
     it lists only those blocks which contain data (boot sector, partition
     table, file-system metadata, files, directories, extents, etc). These
     blocks have to be copied to the target device. The other blocks do not
     contain any useful data and do not have to be copied to the target
     device.

     The block map an optimization which allows to copy or flash the image to
     the image quicker than copying of flashing the entire image. This is
     because with bmap less data is copied: <MappedBlocksCount> blocks instead
     of <BlocksCount> blocks.

     Besides the machine-readable data, this file contains useful commentaries
     which contain human-readable information like image size, percentage of
     mapped data, etc.

     The 'version' attribute is the block map file format version in the
     'major.minor' format. The version major number is increased whenever an
     incompatible block map format change is made. The minor number changes
     in case of minor backward-compatible changes. -->

<bmap version="2.0">
    <!-- Image size in bytes: 13.1 GiB -->
    <ImageSize> 14088667136 </ImageSize>

    <!-- Size of a block in bytes -->
    <BlockSize> 4096 </BlockSize>

    <!-- Count of blocks in the image file -->
    <BlocksCount> 3439616 </BlocksCount>

    <!-- Count of mapped blocks: 1.9 GiB or 14.5%     -->
    <MappedBlocksCount> 499471  </MappedBlocksCount>

    <!-- Type of checksum used in this file -->
    <ChecksumType> sha256 </ChecksumType>

    <!-- The checksum of this bmap file. When it's calculated, the value of
         the checksum has be zero (all ASCII "0" symbols).  -->
    <BmapFileChecksum> 19337986514b3952866af5fb80054c40166f608f42efbf0551956e80678aba46 </BmapFileChecksum>

    <!-- The block map which consists of elements which may either be a
         range of blocks or a single block. The 'chksum' attribute
         (if present) is the checksum of this blocks range. -->
    <BlockMap>
        <Range chksum="a53e6dcf984243a7d87a23fed5d28f3594c72a83081e90e296694256cfc26395"> 0-2 </Range>
        <Range chksum="a60a80a89aa832b3480da91fe0c75e77e24b29c1de0fcdffb9d2b50102013ff5"> 8-452 </Range>
        <Range chksum="7b07df777cd3441ac3053c2c39be52baef308e4befabd500d0c72ef4ab7c5565"> 2048-12079 </Range>
        <Range chksum="cf6ac0e4aec163a228bad0ab85722e011dd785fc8d047dc4d0f86e886fa6684d"> 24576-24902 </Range>
        <Range chksum="7521285590601370cc063cc807237eaf666f879d84d4fcae001026a7bb5a7eff"> 24904-24905 </Range>
        <Range chksum="9661e72b75fe483d53d10585bff79316ea38f15c48547b99da3e0c8b38634ceb"> 24920-26173 </Range>
        <Range chksum="93d69200bd59286ebf66ba57ae7108cc46ef81b3514b1d45864d4200cf4c4182"> 40824-57345 </Range>
        <Range chksum="56ed03479c8d200325e5ef6c1c88d8f86ef243bf11394c6a27faa8f1a98ab30e"> 57656-122881 </Range>
        <Range chksum="cad5eeff8c937438ced0cba539d22be29831b0dae986abdfd3edc8ecf841afdb"> 123192-188417 </Range>
        <Range chksum="88992bab7f6a4d06b55998e74e8a4635271592b801d642c1165528d5e22f23ff"> 188728-253953 </Range>
        <Range chksum="f5332ff218a38bf58e25b39fefc8e00f5f95e01928d0a5edefa29340f4a24b42"> 254264-319489 </Range>
        <Range chksum="714c8373aa5f7b55bd01d280cb182327f407c434e1dcc89f04f9d4c522c24522"> 319800-509008 </Range>
        <Range chksum="b1ec6c807ea2f47e0d937ac0032693a927029ff5d451183a5028b8af59fb3dd2"> 548864 </Range>
        <Range chksum="554448825ffae96d000e4b965ef2c53ae305a6734d1756c72cd68d6de7d2a8b0"> 548867 </Range>
        <Range chksum="f627ca4c2c322f15db26152df306bd4f983f0146409b81a4341b9b340c365a16"> 660672-660696 </Range>
        <Range chksum="fa01901c7f34cdb08467000c3b52a4927e208a6c128892018c0f5039fedb93c0"> 661504-661569 </Range>
        <Range chksum="ee3b76f97d2b039924e60726a165f688c93c87add940880a5d6a4a62fe4f7876"> 661573 </Range>
        <Range chksum="db3aa8c65438e58fc0140b81da36ec37c36fa94549b43ad32d36e206a9ec7819"> 661577-661578 </Range>
        <Range chksum="726f800d4405552f2406fb07e158a3966a89006f9bf474104c822b425aed0f18"> 662593-662596 </Range>
        <Range chksum="e9085dd345859b72dbe86b9c33ce44957dbe742ceb5221330ad87d83676af6cb"> 663552-663553 </Range>
        <Range chksum="1f3660636b045d5fc31af89f6e276850921d34e3e008a2f5337f1a1372b61114"> 667648-667649 </Range>
        <Range chksum="b3f3b8a0818aeade05b5e7af366663130fdc502c63a24a236bb54545d548d458"> 671744-671745 </Range>
        <Range chksum="b4a759d879bc4a47e8b292a5a02565d551350e9387227016d8aed62f3a668a1d"> 675840-675841 </Range>
        <Range chksum="1ff00c7f6feca0b42a957812832b89e7839c6754ca268b06b3c28ba6124c36fd"> 679936-679937 </Range>
        <Range chksum="9bbedd9ac9ab58f06abc2cdc678ac5ddf77c117f0004139504d2aaa321c7475f"> 694272 </Range>
        <Range chksum="c8defb770c630fa1faca3d8c2dff774f7967a0b9902992f313fc8984a9bb43a8"> 696320-698368 </Range>
        <Range chksum="9d5fad916ab4a1f7311397aa3f820ae29ffff4c3937b43a49a0a2a62bf8d6846"> 712704-712705 </Range>
        <Range chksum="48f179784918cf9bf366201edc1585a243f892e6133aa296609486734398be7f"> 716800-716801 </Range>
        <Range chksum="cc0dd36d4891ae52c5626759b21edc4894c76ba145da549f85316646f00e8f2d"> 727040 </Range>
        <Range chksum="955be1419c8df86dc56151737426c53f5ba9df523f66d2b6056d3b2028e5240d"> 759808 </Range>
        <Range chksum="ab6cafa862c5a250b731e41e373ad49767535f6920297376835f04a9e81de023"> 759811 </Range>
        <Range chksum="0ec8e7c7512276ad5fabe76591aeda7ce6c3e031b170454f800a00b330eea2de"> 761856-761857 </Range>
        <Range chksum="de2f256064a0af797747c2b97505dc0b9f3df0de4f489eac731c23ae9ca9cc31"> 789488-789503 </Range>
        <Range chksum="8c8853e6be07575b48d3202ba3e9bb4035b025cd7b546db7172e82f97bd53b1c"> 790527-791555 </Range>
        <Range chksum="918fc7f7a7300f41474636bf6ac4e7bf18e706a341f5ee75420625b85c66c47e"> 791571 </Range>
        <Range chksum="c25cb25875f56f2832b2fcd7417b7970c882cc73ffd6e8eaa7b9b09cbf19f172"> 791587 </Range>
        <Range chksum="52aec994d40f31ce57d57391a8f6db6c23393dfca6d53b99fb8f0ca4faad0253"> 807971-807976 </Range>
        <Range chksum="8ffcde1a642eb739a05bde4d74c73430d5fb4bcffe4293fed7b35e3ce9fe16df"> 823296-823298 </Range>
        <Range chksum="1d21f93952243d3a6ba80e34edcac9a005ad12c82dec79203d9ae6e0c2416043"> 888832-888834 </Range>
        <Range chksum="cdb117909634c406611acef34bedc52de88e8e8102446b3924538041f36a6af9"> 954368-954370 </Range>
        <Range chksum="18945d38f9ac981a16fb5c828b42da4ce86c43fe7b033b7c69b21835c4a3945b"> 1019904-1019906 </Range>
        <Range chksum="1ad3cc0ed83dad0ea03b9fcc4841332361ded18b8e29c382bfbb142a962aef81"> 1085440-1085442 </Range>
        <Range chksum="c19fc180da533935f50551627bd3913d56ff61a3ecf6740fbc4be184c5385bc9"> 1314816 </Range>
        <Range chksum="b045e6499699f9dea08702e15448f5da3e4736f297bbe5d91bb899668d488d22"> 1609728-1609730 </Range>
        <Range chksum="a25229feee84521831f652f3ebcbb23b86192c181ef938ed0af258bfde434945"> 1675264-1675266 </Range>
        <Range chksum="741a9ded7aee9d4812966c66df87632d3e1fde97c68708a10c5f71b9d76b0adf"> 1839104-1839105 </Range>
        <Range chksum="d17c50d57cfb768c3308eff3ccc11cf220ae41fd52fe94de9c36a465408dd268"> 1871872-1888255 </Range>
        <Range chksum="c19fc180da533935f50551627bd3913d56ff61a3ecf6740fbc4be184c5385bc9"> 2363392 </Range>
        <Range chksum="1747cefe9731aa9a0a9e07303199f766beecaef8fd3898d42075f833950a3dd2"> 2396160-2396162 </Range>
        <Range chksum="c19fc180da533935f50551627bd3913d56ff61a3ecf6740fbc4be184c5385bc9"> 2887680 </Range>
        <Range chksum="ad7facb2586fc6e966c004d7d1d16b024f5805ff7cb47c7a85dabd8b48892ca7"> 2887695 </Range>
        <Range chksum="de2f256064a0af797747c2b97505dc0b9f3df0de4f489eac731c23ae9ca9cc31"> 3411952-3411967 </Range>
        <Range chksum="7d501c17f812edb240475aa0f1ee9e15c4c8619df82860b2bd008c20bb58dc6d"> 3414000 </Range>
        <Range chksum="f71e4b72a8f3d4d8d4ef3c045e4a35963b2f5c0e61c9a31d3035c75a834b502c"> 3414016-3414080 </Range>
        <Range chksum="98033bc2b57b7220f131ca9a5daa6172d87ca91ff3831a7e91b153329bf52eb1"> 3414082-3414084 </Range>
        <Range chksum="b5a2b3aaf6dcfba16895e906a5d98d6167408337a3b259013fa3057bafe80918"> 3414087 </Range>
        <Range chksum="81b766d472722dbc326e921a436e16c0d7ad2c061761b2baeb3f6723665997c5"> 3414886-3414890 </Range>
        <Range chksum="2250b13e06cc186d3a6457a16824a3c9686088a1c0866ee4ebe8e36dfe3e17d7"> 3416064 </Range>
        <Range chksum="a82ec47f112bff20b91d748aebc769b2c9003b7eb5851991337586115f31da62"> 3420160 </Range>
        <Range chksum="99dcc3517a44145d057b63675b939baa7faf6b450591ebfccd2269a203cab194"> 3424256 </Range>
        <Range chksum="f5a94286ab129fb41eea23bf4195ca85e425148b151140dfa323405e6c34b01c"> 3426304-3427328 </Range>
        <Range chksum="d261d607d8361715c880e288ccb5b6602af50ce99e13362e8c97bce7425749d2"> 3428352 </Range>
        <Range chksum="d6ff528a483c3a6c581588000d925f38573e7a8fd84f6bb656b19e61cb36bd06"> 3432448 </Range>
        <Range chksum="de2f256064a0af797747c2b97505dc0b9f3df0de4f489eac731c23ae9ca9cc31"> 3439600-3439615 </Range>
    </BlockMap>
</bmap>

Once the bmap file is generated, we can copy the file to , e.g. our SD-card (/dev/sdX) using the copy subcommand:

bmaptool copy bigimage /dev/sdX

In my case, this was 6 times faster compared to dd.

Yocto

Yocto supports OpenEmbedded Kickstart Reference (.wks) to describe how to create a disk image.

A .wks file could look like this:

part u-boot --source rawcopy --sourceparams="file="${IMAGE_BOOTLOADER}   --no-table    --align ${IMX_BOOT_SEEK}
part /boot  --source bootimg-partition                        --fstype=vfat --label boot --active --align 8192 --size 64
part        --source rootfs                                   --fstype=ext4 --label root    --align 8192 --size 1G
part                                                          --fstype=ext4 --label overlay --align 8192 --size 500M
part /mnt/data                                                --fstype=ext4 --label data --align 8192 --size 10G
bootloader --ptable msdos

The format is rather self explained. Pay some attention to the --size parameters. Here we create several (empty) partitions which is actually quite big.

The resulting image occupy 14088667136 bytes in total, but only 1997888 bytes is non-sparse data.

Yocto is able to generate a block map file using the bmaptool for the resulting ẁic` file. Just specify that you want to generate a wic.bmap image among your other fs types:

IMAGE_FSTYPES = "wic.gz wic.bmap"

UUU

The Universal Update Utility (UUU) [1] is a image deploy tool for Freescale/NXP I.MX chips. It allows downloading and executing code via the Serial Download Protocol (SDP) when booted into manufacturing mode.

The tool let you flash all of most the common types of storage devices such as NAND, NOR, eMMC and SD-cards.

That is good, but the USB-OTG port that is used for data transfer is only USB 2.0 (Hi-Speed) which limit the transfer speed to 480Mb/s.

A rather common case is to write a full disk image to a storage device using the tool. A full disk image contains, as you may guessed, a full disk, this includes all unpartionated data, partition table, all partitions, both those that contains data and eventually any empty partitions. In other words, if you have a 128MiB storage, your disk image will occupy 128MiB, and 128MiB of data will be downloaded via SDP.

Support for sparse files could be handy here.

UUU claims to support sparse images by its raw2sparse type, but that is simply a lie, the code that detects continous chunks of data is actually commented out [3] for unknown reason:

//int type = is_same_value(data, pheader->blk_sz) ? CHUNK_TYPE_FILL : CHUNK_TYPE_RAW;
int type = CHUNK_TYPE_RAW;

Pending pull request

Luckily enough, there is a pull request [5] that add support for using bmap files with UUU!

The PR is currently (2023-12-11) not merged but I hope is is soon. I've tested it out and it works like a charm.

EDIT: The PR was merged 2023-12-12, one day after I wrote this blog entry :-)

With this PR you can provide a bmap file to the flash command via the -bmap <image> option in your .uuu script. E.g. :

uuu_version 1.2.39

SDPS: boot -f ../imx-boot

FB: ucmd setenv emmc_dev 2
FB: ucmd setenv mmcdev ${emmc_dev}
FB: ucmd mmc dev ${emmc_dev}
FB: flash -raw2sparse -bmap rawimage.wic.bmap all rawimage.wic
FB: done

It cut the time flashing down to a sixth!

As this is a rather new feature in the UUU tool, I would like to promote it and thank dnbazhenov @Github for implement this.

Mutex guards in the Linux kernel

Mutex guards in the Linux kernel

I found an interresting thread [1] while searching my inbox for something completely unrelated.

Peter Zijistra has written a few cleanup functions that where introduced in v6.4 with this commit:

commit 54da6a0924311c7cf5015533991e44fb8eb12773
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri May 26 12:23:48 2023 +0200

    locking: Introduce __cleanup() based infrastructure

    Use __attribute__((__cleanup__(func))) to build:

     - simple auto-release pointers using __free()

     - 'classes' with constructor and destructor semantics for
       scope-based resource management.

     - lock guards based on the above classes.

    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

It adds functionality to "guard" locks. The guard wraps the lock, takes ownership of the given mutex and release it as soon the guard leaves the scope. In other words - no more forgotten locks due to early exits.

Compare this to the std::lock_guard class we have in C++.

Although this adds valuable functionality to the core, it's currently not widely used. In fact, it only has two users in the latest (v.6.6) kernel:

	$ git grep -l "guard(mutex)" 
	drivers/gpio/gpio-sim.c
	kernel/sched/core.c

Hands on

I have adapted ( [2], [3]) two of my drivers to make use of the guard locks. The adaptation is quickly made.

The features is located in linux/cleanup.h:

+#include <linux/cleanup.h>

Then we can start to make use of the guards. What I like is that the code will be simplier in two ways:

  • All the mutex_lock-pairs in the same scope could be replaced with guard(mutex)(&your->mutex).
  • The code can now return without taking any taken locks into account.

Together with device managed ( devm ) resources, you will end up with a code that clean up itself pretty good.

A typical adaption to guarded mutexes could look likt this:

	@@ -83,31 +85,26 @@ static int pxrc_open(struct input_dev *input)
		struct pxrc *pxrc = input_get_drvdata(input);
		int retval;

	-       mutex_lock(&pxrc->pm_mutex);
	+       guard(mutex)(&pxrc->pm_mutex);
		retval = usb_submit_urb(pxrc->urb, GFP_KERNEL);
		if (retval) {
			dev_err(&pxrc->intf->dev,
				"%s - usb_submit_urb failed, error: %d\n",
				__func__, retval);
	-               retval = -EIO;
	-               goto out;
	+               return -EIO;
		}

		pxrc->is_open = true;
	-
	-out:
	-       mutex_unlock(&pxrc->pm_mutex);
	-       return retval;
	+       return 0;
	 }

What it does is:

  • Removes the mutex_lock/mutex_unlock pair
  • Simplifies the error handling to just return in case of error
  • No need for the out: label anymore so remove it

Under the hood

The implementation makes use of the __attribute__((cleanup())) attribute that is available for both LLVM [4] and GCC [5].

Here is what the GCC documentation [5] says about the cleanup_function:

cleanup (cleanup_function)
The cleanup attribute runs a function when the variable goes out of scope. This attribute can only be applied to auto function scope variables; it may not be applied to parameters or variables with static storage duration.
The function must take one parameter, a pointer to a type compatible with the variable. The return value of the function (if any) is ignored.

If -fexceptions is enabled, then cleanup_function is run during the stack unwinding that happens during the processing of the exception.
Note that the cleanup attribute does not allow the exception to be caught, only to perform an action. It's undefined what happens if cleanup_function does not return normally.

To illustrate this, consider the following example:

#include <stdio.h>

void cleanup_func (int *x)
{
	printf("Tidy up for x as it's leaving its scope\n");
}

int main(int argc, char **argv)
{
	printf("Start\n");
	{
		int x __attribute__((cleanup(cleanup_func)));
		/* Do stuff */
	}
	printf("Exit\n");
}

We create a variable, x, declared with the cleanup attribute inside of its own scope. This makes that the cleanup_func() will be called as soon x goes out of scope.

Here is the output of the example above:

$ gcc main.c -o main && ./main
Start
Tidy up for x as it's leaving its scope
Exit

As you can see, the cleanup_func() is called in between of Start and Exit - as expected.

Test packages in Buildroot

Test packages in Buildroot

When writing packages for Buildroot there are several conditions that you have to test your package against.

This includes different toolchains, architectures, C-libraries, thread-implementations and more. To help you with that, Buildroot provides the utils/test-pkg script.

Nothing describes the script better than its own help text [1]:

test-pkg: test-build a package against various toolchains and architectures

The supplied config snippet is appended to each toolchain config, the
resulting configuration is checked to ensure it still contains all options
specified in the snippet; if any is missing, the build is skipped, on the
assumption that the package under test requires a toolchain or architecture
feature that is missing.

In case failures are noticed, you can fix the package and just re-run the
same command again; it will re-run the test where it failed. If you did
specify a package (with -p), the package build dir will be removed first.

The list of toolchains is retrieved from support/config-fragments/autobuild/toolchain-configs.csv.
Only the external toolchains are tried, because building a Buildroot toolchain
would take too long. An alternative toolchains CSV file can be specified with
the -t option. This file should have lines consisting of the path to the
toolchain config fragment and the required host architecture, separated by a
comma. The config fragments should contain only the toolchain and architecture
settings.

By default, a useful subset of toolchains is tested. If needed, all
toolchains can be tested (-a), an arbitrary number of toolchains (-n
in order, -r for random).

Hands on

In these examples I'm going to build the CRIU package that i recently worked on.

First I will create a config snippet that contains the necessary options to enable my packet. In my case it's CRIU and HOST_PYTHON3:

     cat > criu.config <<EOF
     BR2_PACKAGE_HOST_PYTHON3=y
     BR2_PACKAGE_CRIU=y
     EOF

I can now start test-pkg and provide the config snippet:

     utils/test-pkg -c criu.config -p criu
                     bootlin-armv5-uclibc [1/6]: FAILED
                      bootlin-armv7-glibc [2/6]: FAILED
                    bootlin-armv7m-uclibc [3/6]: SKIPPED
                      bootlin-x86-64-musl [4/6]: OK
                       br-arm-full-static [5/6]: SKIPPED
                             sourcery-arm [6/6]: FAILED
     6 builds, 2 skipped, 3 build failed, 0 legal-info failed, 0
show-info failed

Some of them fails, the end of the build log is available at ~/br-test-pkg/*/logfile.

Now start read the logfiles and fix the errors for each failed test.

Once the test-pkg (without -a) will have no failures, it will be a good test to retry it with the -a (run all tests) option:

     utils/test-pkg -a -c criu.config -p criu

As soon as more and more tests passes, it takes unnecessary amount of time to run the same (successful) test again, so it's better to hand-pick those tests that actually fails.

To do this you may provide a specific list of toolchain configurations:

     cp support/config-fragments/autobuild/toolchain-configs.csv
criu-toolchains.csv
     # edit criu-toolchains.csv to keep your toolchains of interest.
     utils/test-pkg -a -t criu-toolchains.csv -c criu.config -p criu

This will retest only the toolchains kept in the csv.

Git version in cmake

Git version in CMake

All applications have versions. The version should somehow be exposed in the application to make it possible to determine which application we are actually running.

I've seen a plenty of variants on how this is achieved, some are good and some are really bad. Since it's such a common thing, I thought I'd show how I usually do it.

I use to let CMake determine the version based on git describe and tags, the benefit's that it is part of the build process (i.e no extra step on the build server - I've seen it all...), and you will get right version information also for local builds.

git describe also gives you useful information which allows you to derive the build to the actual commit. Incredible useful.

The version information will be stored as a define in a header file which should be included into the project.

Hands on

The version file we intend to include in the project is generated from Version.h.in:

#pragma once
#define SOFTWARE_VERSION "@FOO_VERSION@"

@FOO_VERSION@ is a placeholder that will be replaced by configure_file() later on.

The "core" functionality is in GenerateVersion.cmake:

find_package(Git)

if(GIT_EXECUTABLE)
  get_filename_component(WORKING_DIR ${SRC} DIRECTORY)
  execute_process(
    COMMAND ${GIT_EXECUTABLE} describe --tags --dirty
    WORKING_DIRECTORY ${WORKING_DIR}
    OUTPUT_VARIABLE FOO_VERSION
    RESULT_VARIABLE ERROR_CODE
    OUTPUT_STRIP_TRAILING_WHITESPACE
    )
endif()

if(FOO_VERSION STREQUAL "")
  set(FOO_VERSION 0.0.0-unknown)
  message(WARNING "Failed to determine version from Git tags. Using default version \"${FOO_VERSION}\".")
endif()

configure_file(${SRC} ${DST} @ONLY)

It will look for the git package, receive the tag information with git describe and store the version in the FOO_VERSION variable.

configure_file() is then used to create a version.h out of version.h.in where the FOO_VERSION placeholder is replaced by the actual version.

To use GenerateVersion.cmake, we have to create a custom target in the CMakeLists.txt file and provide the source and destination path for the version file:

add_custom_target(version
  ${CMAKE_COMMAND} -D SRC=${CMAKE_SOURCE_DIR}/Version.h.in
                   -D DST=${CMAKE_SOURCE_DIR}/Version.h
                   -P ${CMAKE_SOURCE_DIR}/GenerateVersion.cmake
  )

Finally, make sure that the target (${PROJECT_NAME} in this case) depends on the version target:

add_dependencies(${PROJECT_NAME} version)

That's basically it.