Changing the root of your Linux filesystem

After my previous post [1], I got a few questions from people about the difference between chroot, pivot_root and switch_root. They all seem to do the same thing, right? Almost.

Let's shed some light on this topic.

Rootfs

First of all, we need to specify what we mean when we say rootfs.

rootfs is a special instance of a RAM filesystem (usually tmpfs) that is created at an early boot stage. You cannot unmount rootfs for pretty much the same reason you can't kill the init process - you need it to have a functional system.

The kernel build process always creates a gzipped cpio (newc format) archive and links it into the resulting kernel binary. This is our rootfs! The empty archive consumes ~130 bytes, and that is something we have to live with.

The rootfs is empty - unless you populate it. You may set the CONFIG_INITRAMFS_SOURCE config option to include files and directories in your rootfs.
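
For example (the path and file list below are purely illustrative), CONFIG_INITRAMFS_SOURCE can point to a file list in the format understood by usr/gen_init_cpio:

CONFIG_INITRAMFS_SOURCE="/path/to/initramfs.list"

# /path/to/initramfs.list
dir  /dev 0755 0 0
nod  /dev/console 0600 0 0 c 5 1
file /init /path/to/my-init 0755 0 0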

When the system boots, it searches for an executable /init in the rootfs and executes it if it exists. If it does not exist, the system will continue and try to mount whatever is specified on the root= kernel command line.

One important thing to remember is that the system mounts the root= partition on / (which is basically rootfs).

initramfs vs initrd

So, now we know what rootfs is, and that rootfs is basically an initramfs (empty or not).

Now we need to distinguish between initramfs and initrd. At first glance, they may seem like the same thing, but they are not.

An initrd is basically a complete image of a filesystem that is mounted on a ramdev, a RAM-based block device. As it's a filesystem, you need to have support for that filesystem statically compiled (=y) into your kernel.

An initramfs, on the other hand, is not a filesystem. It's a cpio archive which is unpacked into a tmpfs.

As already mentioned, this archive is embedded into the kernel itself. As a side effect, the initramfs can be loaded earlier in the kernel boot process and keeps a smaller footprint, as the kernel can adapt the size of the tmpfs to what is actually loaded.

The two are also fundamentally different in how they're treated as a root filesystem. As an initrd is an actual filesystem unpacked into RAM, the root filesystem will be that ramdisk. When changing the root filesystem from an initrd to a "real" root device, you have to switch the root device.

The initramfs, on the other hand, is some kind of interim filesystem with no underlying device, so the kernel doesn't need to switch any devices - just "jump" to one.

Jump into a new root filesystem

So, when it's time to "jump" to a new root filesystem, it can be done in several ways.

chroot

chroot runs a command with a specified root directory. Nothing more, nothing less.

The basic syntax of the chroot command is:

chroot [option] newroot [command [args]]
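
A minimal, illustrative example, assuming a populated root filesystem is already mounted at /mnt/newroot:

# Make device nodes and /proc visible inside the new root
mount --bind /dev  /mnt/newroot/dev
mount -t proc proc /mnt/newroot/proc

# Run a shell with /mnt/newroot as its root directory
chroot /mnt/newroot /bin/sh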

pivot_root

pivot_root moves the root file system of the current process to the directory put_old and makes new_root the new root file system. It's often used in combination with chroot.

From the manpage [2]:

The typical use of pivot_root() is during system startup,
when the system mounts a temporary root file
system (e.g., an initrd), then mounts the real root file system,
and eventually turns the latter into the current root
of all relevant processes or threads.

The basic syntax of the pivot_root command is:

pivot_root new_root put_old

Example on how to use pivot_root:

mount /dev/hda1 /new-root
cd /new-root
pivot_root . old-root
exec chroot . sh <dev/console >dev/console 2>&1
umount /old-root

pivot_root should be used for an initrd only; it does not work for an initramfs.

switch_root

switch_root [3] switches to another filesystem as the root of the mount tree. Pretty much the same thing as pivot_root does, but with a few differences.

switch_root moves the already mounted /proc, /dev, /sys and /run to newroot, makes newroot the new root filesystem and starts the init process.

Another important difference is that switch_root recursively removes all files and directories in the current root filesystem!

The basic syntax of the switch_root command is:

switch_root newroot init [arg...]

Also, you are not allowed to umount the old mount (as you do with pivot_root) as this is your rootfs (remember?).
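
A sketch of how the tail end of an /init script in an initramfs could look (the device node and init path are just examples):

#!/bin/sh
# ... earlier: mount /proc, /sys and /dev, load modules, etc.

# Mount the real root filesystem
mkdir -p /newroot
mount -o ro /dev/mmcblk0p2 /newroot

# Hand over to the real init; switch_root must run as PID 1, hence the exec
exec switch_root /newroot /sbin/init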

Summary

To summarize:

If you use an initrd - use pivot_root combined with chroot.

If you use an initramfs - use switch_root.

It's technically possible to only use chroot, but it will not free resources and you could end up with some restrictions (see unshare(2) for example).

Also, keep in mind that you must invoke either command via exec, otherwise the new process will not run as PID 1.

chroot and user namespaces

When playing around with libcamera [1] and br2-readonly-rootfs-overlay [2] I found something.. well.. unexpected. At least at first glance.

What happened was that I encountered this error:

 $ libcamera-still
Preview window unavailable
[0:02:54.785102683] [517]  INFO Camera camera_manager.cpp:299 libcamera v0.0.0+67319-2023.02-22-gd530afad-dirty (2024-02-20T16:56:34+01:00)
[0:02:54.885731084] [518] ERROR Process process.cpp:312 Failed to unshare execution context: Operation not permitted

Failed to unshare execution context: Operation not permitted... what?

I know that libcamera executes proprietary IPAs (Image Processing Algorithms) as black boxes, and that the execution is isolated in their own namespace. But.. not permitted..?

Let's look into the code for libcamera (src/libcamera/process.cpp) [3]:

int Process::isolate()
{
	int ret = unshare(CLONE_NEWUSER | CLONE_NEWNET);
	if (ret) {
		ret = -errno;
		LOG(Process, Error) << "Failed to unshare execution context: "
				    << strerror(-ret);
		return ret;
	}

	return 0;
}

libcamera does indeed create a user and a network namespace for the execution. In Linux, new namespaces are created with unshare(2).

The unshare(2) system call is used to disassociate parts of the calling process's execution context by creating new namespaces. As you can see in the manpage [4], there are a few scenarios that can result in -EPERM being returned:

EPERM  The calling process did not have the required privileges for this operation.

EPERM  CLONE_NEWUSER  was  specified in flags, but either the effective user ID or the effective group ID of the caller does not have a mapping in
       the parent namespace (see user_namespaces(7)).

EPERM (since Linux 3.9)
       CLONE_NEWUSER was specified in flags and the caller is in a chroot environment (i.e., the caller's root directory does not match  the  root
       directory of the mount namespace in which it resides).

The last one caught my eye. I was not aware of this, even though it makes a lot of sense to have it this way.
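
It's easy to reproduce outside libcamera with the unshare(1) tool from util-linux (the rootfs path is just an example):

# Outside a chroot this usually works (if unprivileged user namespaces are enabled)
unshare --user --net true

# Inside a chroot the kernel refuses to create the user namespace
sudo chroot /path/to/rootfs /bin/sh -c 'unshare --user --net true'
# -> fails with EPERM (Operation not permitted)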

Unfortunately, the last step of the init script in the initramfs for br2-readonly-rootfs-overlay is to chroot(1) into the actual root filesystem. I initially chose chroot because it works with both an initrd and an initramfs root filesystem, and the code was intended to be generic.

pivot_root and switch_root can (and should) be used with advantage, but they require you to know which type of filesystem you are working with.

As br2-readonly-rootfs-overlay is now only designed to be used with an initramfs, I switched to switch_root and the problem was solved.

For the sake of fun, let's follow the code down to where the error occurs.

Code digging

When the application makes a call to unshare(2), which simply is a wrapper in the libc implementation for the corresponding system call, it ends up in the kernel via syscall(2).

The system call on the kernel side is defined as follows:

SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
{
	return ksys_unshare(unshare_flags);
}

It only calls ksys_unshare(), which interprets the unshare_flags and "unshares" namespaces depending on the flags. The flag for unsharing a user namespace is CLONE_NEWUSER.

int ksys_unshare(unsigned long unshare_flags)
{
	struct fs_struct *fs, *new_fs = NULL;
	struct files_struct *new_fd = NULL;
	struct cred *new_cred = NULL;
	struct nsproxy *new_nsproxy = NULL;
	int do_sysvsem = 0;
	int err;

	/*
	 * If unsharing a user namespace must also unshare the thread group
	 * and unshare the filesystem root and working directories.
	 */
	if (unshare_flags & CLONE_NEWUSER)
		unshare_flags |= CLONE_THREAD | CLONE_FS;
	/*
	 * If unsharing vm, must also unshare signal handlers.
	 */
	if (unshare_flags & CLONE_VM)
		unshare_flags |= CLONE_SIGHAND;
	/*
	 * If unsharing a signal handlers, must also unshare the signal queues.
	 */
	if (unshare_flags & CLONE_SIGHAND)
		unshare_flags |= CLONE_THREAD;
	/*
	 * If unsharing namespace, must also unshare filesystem information.
	 */
	if (unshare_flags & CLONE_NEWNS)
		unshare_flags |= CLONE_FS;

	err = check_unshare_flags(unshare_flags);
	if (err)
		goto bad_unshare_out;
	/*
	 * CLONE_NEWIPC must also detach from the undolist: after switching
	 * to a new ipc namespace, the semaphore arrays from the old
	 * namespace are unreachable.
	 */
	if (unshare_flags & (CLONE_NEWIPC|CLONE_SYSVSEM))
		do_sysvsem = 1;
	err = unshare_fs(unshare_flags, &new_fs);
	if (err)
		goto bad_unshare_out;
	err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd);
	if (err)
		goto bad_unshare_cleanup_fs;
	err = unshare_userns(unshare_flags, &new_cred);
	....

unshare_userns() prepares credentials and then creates the user namespace by calling create_user_ns():

int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
{
	struct cred *cred;
	int err = -ENOMEM;

	if (!(unshare_flags & CLONE_NEWUSER))
		return 0;

	cred = prepare_creds();
	if (cred) {
		err = create_user_ns(cred);
		if (err)
			put_cred(cred);
		else
			*new_cred = cred;
	}

	return err;
}

Before create_user_ns() creates the actual user namespace, it makes various sanity checks, one of which is if (current_chrooted()):

int create_user_ns(struct cred *new)
{
	struct user_namespace *ns, *parent_ns = new->user_ns;
	kuid_t owner = new->euid;
	kgid_t group = new->egid;
	struct ucounts *ucounts;
	int ret, i;

	ret = -ENOSPC;
	if (parent_ns->level > 32)
		goto fail;

	ucounts = inc_user_namespaces(parent_ns, owner);
	if (!ucounts)
		goto fail;

	/*
	 * Verify that we can not violate the policy of which files
	 * may be accessed that is specified by the root directory,
	 * by verifying that the root directory is at the root of the
	 * mount namespace which allows all files to be accessed.
	 */
	ret = -EPERM;
	if (current_chrooted())
		goto fail_dec;
	
	....

This is how the kernel checks whether the current process is chrooted or not:

bool current_chrooted(void)
{
	/* Does the current process have a non-standard root */
	struct path ns_root;
	struct path fs_root;
	bool chrooted;

	/* Find the namespace root */
	ns_root.mnt = &current->nsproxy->mnt_ns->root->mnt;
	ns_root.dentry = ns_root.mnt->mnt_root;
	path_get(&ns_root);
	while (d_mountpoint(ns_root.dentry) && follow_down_one(&ns_root))
		;

	get_fs_root(current->fs, &fs_root);

	chrooted = !path_equal(&fs_root, &ns_root);

	path_put(&fs_root);
	path_put(&ns_root);

	return chrooted;
}

Summary

Do not create user namespaces from chrooted processes. It does not work, for good reasons.

Streamline your kernel config

When working with embedded systems, it's not that uncommon that you need to compile your own kernel for your hardware platform.

The configuration file you use is probably based on some default configuration that you borrowed from the kernel tree, a vendor tree, a SoM vendor or simply from your colleague.

As the configuration is somewhat generic, it probably contains tons of kernel modules that are not needed for your application. Walking through menuconfig and tidying up among all modules can be a lot of work.

What you want is to only enable those modules that are currently in use and disable all others. Maybe you have used make localmodconfig or make localyesconfig during compilation of your host kernel, but did you know that those targets are just as applicable to embedded systems?

streamline_config.pl

Both localmodconfig and localyesconfig make use of ./scripts/kconfig/streamline_config.pl [1], which usually executes lsmod, looks into your .config and removes all modules that are currently not loaded.

Instead of executing the default lsmod binary, you can set the LSMOD environment variable to point either to an executable file or to a text file containing the output from lsmod.

Get your current config

In order to streamline your configuration file, you will of course need to know the configuration. Hopefully you already have it, otherwise it could be extracted from your running kernel if CONFIG_IKCONFIG is set.

Either directly from the target:

 $ zcat /proc/config.gz > /tmp/.config

Or by using the helper script in the kernel source:

 $ ./scripts/extract-ikconfig ./vmlinux > /tmp/.config

and make it available to the host.

Get a list of loaded modules

Save the output from lsmod (run on target) to a file:

 $ lsmod > /tmp/mymodules

And make it available to the host.

Streamline

Time to streamline. Assuming you have the configuration file stored as .config in the root of your kernel tree, you can now run:

 $ make LSMOD=/path/to/mymodules localmodconfig

This updates the current config and disables all modules not specified in /path/to/mymodules.

Or:

 $ make LSMOD=/path/to/mymodules localyesconfig

This updates the current config and converts all modules specified in /path/to/mymodules to built-in (=y).

Power button and embedded Linux

Not all embedded Linux systems have a power button, but for consumer electronic devices it can be a good thing to be able to turn the device off. But how does it work in practice?

Physical button

It all starts with a physical button.

At the board level, the button is usually connected to a General-Purpose Input/Output (GPIO) pin on the processor. It doesn't have to be directly connected to the processor though; all that matters is that the button press can somehow be mapped to the Linux input subsystem.

The input subsystem reports certain key codes to a process in user space that listens for those codes and performs the corresponding action.

All possible keycodes are listed in linux-event-codes.h [1]. The ones that are relevant to us are:

  • KEY_POWER
  • KEY_SLEEP
  • KEY_SUSPEND
  • KEY_RESTART

Map the button to the input system

To map a certain GPIO to the input system we make use of the gpio-keys kernel driver [2]. Make sure you have the CONFIG_KEYBOARD_GPIO option selected in your kernel configuration.

The mapping is described in the devicetree. The key (pun intended) is to set the right key-code on the linux,code property.

See example below:

gpio-keys {
        compatible = "gpio-keys";
        pinctrl-names = "default";
        pinctrl-0 = <&button_pins>;
        status = "okay";

        key-power {
                label = "Power Button";
                gpios = <&gpio 7 GPIO_ACTIVE_LOW>;
                linux,code = <KEY_POWER>;
                wakeup-source;
        };
        key-restart {
                label = "Restart Button";
                gpios = <&gpio 8 GPIO_ACTIVE_LOW>;
                linux,code = <KEY_RESTART>;
                wakeup-source;
        };
        key-suspend {
                label = "Suspend Button";
                gpios = <&gpio 9 GPIO_ACTIVE_LOW>;
                linux,code = <KEY_SUSPEND>;
                wakeup-source;
        };
};

Hopefully you will now have the input device registered when you boot up:

 $ dmesg | grep input
 [    6.886635] input: gpio-keys as /devices/platform/gpio-keys/input/input0
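
If you want to inspect the events themselves, a tool like evtest (assuming it's available on your target) will print them as you press the button:

 $ evtest /dev/input/event0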

systemd-logind

We now have an input device that reports key events when we press the button. The next step is to have something on the userspace side that reads these events and acts on them.

On systems with systemd, we have the systemd-logind service. This service is responsible for a lot of things including:

  • Keeping track of users and sessions, their processes and their idle state. This is implemented by allocating a systemd slice unit for each user below user.slice, and a scope unit below it for each concurrent session of a user. Also, a per-user service manager is started as system service instance of user@.service for each logged in user.
  • Generating and managing session IDs. If auditing is available and an audit session ID is already set for a session, then this ID is reused as the session ID. Otherwise, an independent session counter is used.
  • Providing polkit[1]-based access for users for operations such as system shutdown or sleep
  • Implementing a shutdown/sleep inhibition logic for applications
  • Handling of power/sleep hardware keys
  • Multi-seat management
  • Session switch management
  • Device access management for users
  • Automatic spawning of text logins (gettys) on virtual console activation and user runtime directory management
  • Scheduled shutdown
  • Sending "wall" messages

It is of course possible to have your own listener, but as we have systemd-logind and ... I sort of started to like systemd, we will go that way.

Configure systemd-logind

systemd-logind has its configuration file [3] in /etc/systemd/logind.conf where it's possible to set an action for each event. Possible actions are:

Action Description
ignore systemd-logind will never handle these keys
poweroff System will poweroff
reboot System will reboot
halt System will halt
kexec System will start kexec
suspend System will suspend
hibernate System will hibernate
hybrid-sleep  
suspend-then-hibernate  
lock All running sessions will be screen-locked
factory-reset  

The events that are configurable are:

Configuration item Listen on Default action
HandlePowerKey KEY_POWER poweroff
HandlePowerKeyLongPress KEY_POWER ignore
HandleRebootKey KEY_RESTART reboot
HandleRebootKeyLongPress KEY_RESTART poweroff
HandleSuspendKey KEY_SUSPEND suspend
HandleSuspendKeyLongPress KEY_SUSPEND hibernate
HandleHibernateKey KEY_SLEEP hibernate
HandleHibernateKeyLongPress KEY_SLEEP ignore
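
As an example, a minimal logind.conf that powers off on a short power-key press but ignores a long press could look like this (values are illustrative):

[Login]
HandlePowerKey=poweroff
HandlePowerKeyLongPress=ignore
HandleRebootKey=reboot
HandleSuspendKey=suspend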

It's not obvious how long a long press is; I could not find anything about it in the documentation. However, the code [4] tells us that it's 5 seconds.

#define LONG_PRESS_DURATION (5 * USEC_PER_SEC)

power-switch tag

The logind service will only watch input devices with the power-switch udev tag for keycodes, so there has to be a udev rule that sets such a tag. The default rule that comes with systemd, 70-power-switch.rules, looks as follows:

ACTION=="remove", GOTO="power_switch_end"

SUBSYSTEM=="input", KERNEL=="event*", ENV{ID_INPUT_SWITCH}=="1", TAG+="power-switch"
SUBSYSTEM=="input", KERNEL=="event*", ENV{ID_INPUT_KEY}=="1", TAG+="power-switch"

LABEL="power_switch_end"

All input devices of the type ID_INPUT_SWITCH or ID_INPUT_KEY will then be monitored.

We can confirm that the tag is set for the device we created earlier:

 $ udevadm info /dev/input/event0
P: /devices/platform/gpio-keys/input/input0/event0
M: event0
R: 0
U: input
D: c 13:64
N: input/event0
L: 0
S: input/by-path/platform-gpio-keys-event
E: DEVPATH=/devices/platform/gpio-keys/input/input0/event0
E: DEVNAME=/dev/input/event0
E: MAJOR=13
E: MINOR=64
E: SUBSYSTEM=input
E: USEC_INITIALIZED=7790298
E: ID_INPUT=1
E: ID_INPUT_KEY=1
E: ID_PATH=platform-gpio-keys
E: ID_PATH_TAG=platform-gpio-keys
E: DEVLINKS=/dev/input/by-path/platform-gpio-keys-event
E: TAGS=:power-switch:
E: CURRENT_TAGS=:power-switch:

Summary

The signal path from a physical power button up to a service handler in userspace is not that complicated, and systemd makes handling the key events and configuring their behavior both smooth and flexible.

br2-readonly-rootfs-overlay

This is a Buildroot external module that can be used as a reference design when building your own system with an overlayed root filesystem. It's created as an external module to make it easy to adapt to your own application.

The goal is to achieve the same functionality I have in meta-readonly-rootfs-overlay [1] but for Buildroot.

Why does this exist?

Having a read-only root file system is useful for many scenarios:

  • Separate user specific changes from system configuration, and being able to find differences
  • Allow factory reset, by deleting the user specific changes
  • Have a fallback image in case the user specific changes made the root file system no longer bootable.

Because some data on the root file system changes on first boot or while the system is running, just mounting the complete root file system as read-only breaks many applications. There are different solutions to this problem:

  • Symlinking/Bind mounting files and directories that could potentially change while the system is running to a writable partition
  • Instead of having a read-only root file system, mounting a writable overlay root file system that uses a read-only file system as its base and writes changed data to another writable partition.

To implement the first solution, the developer needs to analyse which files need to change and then create symlinks for them. When doing a factory reset, the developer needs to overwrite every file that is linked with the factory configuration, to avoid dangling symlinks/binds. While this is more work on the developer side, it might increase security, because only files that are symlinked/bind-mounted can be changed.

This br2-external provides the second solution. Here no investigation of writable files is needed, and a factory reset can be done by just deleting all files or formatting the writable volume.

How does it work?

This external module makes use of Dracut [2] to create a rootfs.cpio that is embedded into the Linux kernel as an initramfs.

The initramfs mounts the base root filesystem as read-only and the read-write filesystem as the upper layer in an overlay filesystem structure.
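
In essence, a simplified sketch of what the init script in the initramfs ends up doing (paths match the mount output shown further down; the real script differs in the details):

# Mount the read-only lower and writable upper filesystems
mount -t ext2 -o ro /dev/vda1 /media/rfs/ro
mount -t ext4 -o rw /dev/vda2 /media/rfs/rw

# Combine them into an overlay and use it as the new root
mkdir -p /media/rfs/rw/upperdir /media/rfs/rw/work
mount -t overlay overlay \
      -o lowerdir=/media/rfs/ro,upperdir=/media/rfs/rw/upperdir,workdir=/media/rfs/rw/work \
      /mnt
exec switch_root /mnt /sbin/init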

Dependencies

The setup requires the following kernel configuration (added as fragment file in board/qemu-x86_64/linux.fragment):

CONFIG_OVERLAY_FS=y
CONFIG_BLK_DEV_INITRD=y

Test it out

This module will build an x86_64 target that is prepared to be emulated with QEMU.

Clone this repository

git clone https://github.com/marcusfolkesson/br2-readonly-rootfs-overlay.git

Build

Run build.sh to make a full build.

The script is simple, it just updates the Buildroot submodule and starts a build:

#!/bin/bash

# Update submodules
git submodule init
git submodule update

# Build buildroot
cd ./buildroot
make BR2_EXTERNAL=../ readonly-rootfs-overlay_defconfig
make

Artifacts

The following artifacts are generated in ./buildroot/output/images/ :

  • bzImage - The Linux kernel
  • start-qemu.sh - script that will start qemu-system-x86_64 and emulate the whole setup
  • sdcard.img - The disk image containing two partitions, one for the read-only rootfs and one for the writable upper filesystem
  • rootfs.cpio - The initramfs file system that is embedded into the kernel
  • rootfs.ext2 - The root filesystem image
  • overlay.ext4 - Empty filesystem image used for the writable layer

Emulate in QEMU

A script to emulate the whole thing is generated in the output directory. Execute the ./buildroot/output/images/start-qemu.sh script to start the emulator.

Once the system has booted, you are able to log in as root:

Welcome to Buildroot
buildroot login: root

And as you can see, the root filesystem is overlayed as it should be:

$ mount
...
/dev/vda1 on /media/rfs/ro type ext2 (ro,noatime,nodiratime)
/dev/vda2 on /media/rfs/rw type ext4 (rw,noatime,errors=remount-ro)
overlay on / type overlay (rw,relatime,lowerdir=/media/rfs/ro,upperdir=/media/rfs/rw/upperdir,workdir=/media/rfs/rw/work)
...

Kernel command line parameters

These examples are not meant to be complete. They just contain parameters that are used by the initscript of this repository. Some additional parameters might be necessary.

Example without initramfs

root=/dev/sda1 rootrw=/dev/sda2 init=/init

This cmd line starts /sbin/init with the /dev/sda1 partition as the read-only rootfs and the /dev/sda2 partition as the read-write persistent state. When using this init script without an initrd, init=/init has to be set.

root=/dev/sda1 rootrw=/dev/sda2 init=/init rootinit=/bin/sh

The same as before, but it now starts /bin/sh instead of /sbin/init.

Details

All kernel parameters that are used to configure br2-readonly-rootfs-overlay:

  • root - specifies the read-only root file system device. If this is not specified, the current rootfs is used.
  • rootfstype - if support for the read-only file system is not built into the kernel, you can specify the required module name here. It will also be used in the mount command.
  • rootoptions - specifies the mount options of the read-only file system. Defaults to noatime,nodiratime.
  • rootinit - if the init parameter was used to specify this init script, rootinit can be used to overwrite the default (/sbin/init).
  • rootrw - specifies the read-write file system device. If this is not specified, tmpfs is used.
  • rootrwfstype - if support for the read-write file system is not built into the kernel, you can specify the required module name here. It will also be used in the mount command.
  • rootrwoptions - specifies the mount options of the read-write file system. Defaults to rw,noatime,mode=755.
  • rootrwreset - set to yes if you want to delete all the files in the read-write file system prior to building the overlay root file system.

Expose network namespace created by Docker

Disclaimer: this is probably *not* the best way to do this, but it's pretty good for educational purposes.

During a debug session I wanted to connect an application to a service that ran in a Docker container. This was for test purposes only, so hackish and fast are the keywords.

First of all, I'm not a Docker expert, but I have a pretty good understanding of Linux internals, namespaces and how things work on a Linux system. So I started to use the tools I had.

Create a container to work with

This is not the container I wanted to debug, but it gives us something to demonstrate the concept with:

FROM ubuntu

MAINTAINER Marcus Folkesson <marcus.folkesson@gmail.com>

RUN apt-get update
RUN apt-get install -y socat

CMD ["/bin/sh"]

Create and start the container

docker build -t netns .
docker run -ti netns /bin/bash

Inside the container, create a TCP-server with socat:

root@c8a2438ad58e:/# socat - TCP-L:1234

Let's play around

I used to list network namespaces with ip netns list, but the command gave me no output even while the Docker container was running.

That was unexpected. I wondered where ip actually looks for namespaces. To find out I used strace and looked for the openat system call:

$ strace -e openat  ip netns list
...
openat(AT_FDCWD, "/var/run/netns", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 5
...

OK, ip netns list probably looks for files representing network namespaces in /var/run/netns.

Let's try to create an entry (dockerns) in that location:

$ sudo touch /var/run/netns/dockerns
$ ip netns  list
Error: Peer netns reference is invalid.
Error: Peer netns reference is invalid.
dockerns

Good. It tries to dereference the namespace!

All namespaces for a certain PID are exposed in procfs. For example, here are the namespaces that my bash session belongs to:

$ ls -al /proc/self/ns/
total 0
dr-x--x--x 2 marcus marcus 0 15 dec 12.35 .
dr-xr-xr-x 9 marcus marcus 0 15 dec 12.35 ..
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 mnt -> 'mnt:[4026531841]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 net -> 'net:[4026531840]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 time -> 'time:[4026531834]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 user -> 'user:[4026531837]'
lrwxrwxrwx 1 marcus marcus 0 15 dec 12.35 uts -> 'uts:[4026531838]'

Now we only need to do the same for the container.

First, find the ID of the running container with docker ps:

$ docker ps
CONTAINER ID   IMAGE     COMMAND       CREATED              STATUS          PORTS     NAMES
c8a2438ad58e   netns     "/bin/bash"   About a minute ago   Up 59 seconds             dazzling_lewi

Inspect the running container to determine the PID:

$ docker inspect -f '{{.State.Pid}}'  c8a2438ad58e
36897

36897, there we have it.

Let's see which namespaces it has:

$ sudo ls -al /proc/36897/ns
total 0
dr-x--x--x 2 root root 0 15 dec 09.59 .
dr-xr-xr-x 9 root root 0 15 dec 09.59 ..
lrwxrwxrwx 1 root root 0 15 dec 10.01 cgroup -> 'cgroup:[4026533596]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 ipc -> 'ipc:[4026533536]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 mnt -> 'mnt:[4026533533]'
lrwxrwxrwx 1 root root 0 15 dec 09.59 net -> 'net:[4026533538]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 pid -> 'pid:[4026533537]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 pid_for_children -> 'pid:[4026533537]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 time -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 time_for_children -> 'time:[4026531834]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 user -> 'user:[4026531837]'
lrwxrwxrwx 1 root root 0 15 dec 10.01 uts -> 'uts:[4026533534]'

As we can see, the container has different IDs for most of the namespaces (user and time are shared).

Now that we have the network namespace for the container, let's bind mount it to /var/run/netns/dockerns:

$ sudo mount -o bind /proc/36897/ns/net /var/run/netns/dockerns

And run ip netns list again:

$ ip netns list
dockerns (id: 0)

Nice.

Start socat in the dockerns network namespace and connect to localhost:1234:

$ sudo ip netns exec dockerns socat - TCP:localhost:1234
hello
Hello from the outside world
Hello from inside docker

It works! We are now connected to the service running in the container!

Conclusion

It's fun to play around, but there is room for improvement.

For example, a better way to list namespaces is to use lsns, as this tool looks for namespaces in more places, including /run/docker/netns/.

Also, a more "correct" way is probably to create a virtual ethernet device and attach it to the same namespace.

veth example

To do that, we first need to determine the PID for the container:

$ lsns -t net
        NS TYPE NPROCS     PID USER       NETNSID NSFS                           COMMAND
	...
	4026533538 net       3   36897 root             0 /run/docker/netns/06a083424158 /bin/bash

Create the public dockerns namespace (create /var/run/netns/dockerns as we did earlier but with ip netns attach):

$ sudo ip netns attach dockerns 36897

Create virtual interfaces, assign network namespace and create routes:

$ sudo ip link add veth0 type veth peer name veth1
$ sudo ip link set veth1 netns dockerns
$ sudo ip address add 192.168.3.1/24 dev veth0
$ sudo ip netns exec dockerns ip a add 192.168.3.2/24 dev veth1
$ sudo ip link set up veth0
$ sudo ip netns exec dockerns ip l set up veth1
$ sudo ip route add 10.0.42.0/24 via 192.168.3.2

That is pretty much it.

Skip flashing unused blocks with UUU

TL;DR: UUU now supports (or will shortly support) block maps for flashing images. Use it. It will shorten your flashing time *a lot.*

It will soon be time to manufacture circuit boards for a project I'm currently working on. After manufacturing, the boards will need some firmware for sure, but how do we flash it in the most efficient way? The board is based on an i.MX8 chip from NXP and has an eMMC as storage medium.

Overall, short iterations are nice to have as they benefit the developers' daily development cycles, but the time it takes to flash a device with its firmware is not just "nice to have" in the production line, it's critical. We do not want to have a (another?) bottleneck in the production line, so we have to make some effort to make it as fast as possible.

The Scenario

The scenario we have in front of us is that:

  • We want to write a full disk image to the eMMC in production
  • Most space in the disk image is reserved for empty partitions
  • The disk image is quite big (~12GB)
  • Only ~2GB of the disk image is "real data" (not just chunks of zeros)
  • Flashing the firmware to the eMMC takes ~25 min, which is unacceptable (see the rough arithmetic below)
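
Rough arithmetic behind those numbers (the effective transfer rate is my own estimate, not a measured figure):

 12 GiB / ~8 MiB/s ≈ 25 min   (full image)
  2 GiB / ~8 MiB/s ≈  4 min   (only the real data)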

Sparse files are a familiar concept to me, and the .wic file created by the Yocto project is indeed a sparse file. But the flashing tool (UUU) did not seem to support writing sparse images without expanding them (that is not completely true, as we will see).

To speed up the production line, I thought of several different solutions including:

  • Do not include empty partitions in the .wic file; create those at first boot instead
  • Only use UUU to boot the board and then flash the disk image using dfu-util
  • ...

But they all have their own drawbacks. Best would be if we could stick to a uniform way of flashing everything we need, and UUU will be needed regardless.

Before we go any further, let's look at what a sparse file is.

Sparse images

A sparse image [2] is a type of file that contains metadata to describe empty sections (let's call them holes) instead of filling them up with zeros (which take up actual space). The advantage is that sparse files only take as much space as they actually need.

For example: if the logical file size is, say, 128 GiB, and the areas with real data are only 8 GiB, the physical file size will only be 8 GiB plus metadata. The allocated size is significantly reduced compared to a full 128 GiB image.

/media/sparse-file.png

Most of the filesystems in Linux (btrfs, XFS, ext[234], ...) and even Windows (NTFS) handle sparse files transparently to the user. This means that the filesystem will not allocate more space than necessary. Reading these holes returns just zeros, and the filesystem will allocate blocks only when someone actually writes to those holes.

Think of it as the copy-on-write (COW) feature that many modern filesystems implement.

An example

It can be a little bit confusing to figure out the file size of sparse files.

For example, ls reports 14088667136 bytes, which is quite a big file!

$ ls -alrt image.wic
-rw-r--r-- 1 marcus marcus 14088667136 12 dec 11.08 image.wic

But du only reports 1997888 bytes?!

$ du  image.wic
1997888    image.wic

To get the real occupied size, we need to print the allocated size in blocks (-s):

$ ls -alrts image.wic
1997888 -rw-r--r-- 1 marcus marcus 14088667136 12 dec 11.08 image.wic

If you want to see it with your own eyes, you can easily create a 2GB file with the truncate command:

$ truncate  -s  2G image.sparse
$ stat image.sparse 
  File: image.sparse
    Size: 2147483648	Blocks: 0          IO Block: 4096   regular file

As you can see, the file size is 2 GB, but it occupies zero blocks on the disk. The file consists only of one big hole, which is described in the metadata.

Handle the holes with care

It's easy to expand sparse files if you are not careful. The most common applications used to copy files have a --sparse option to preserve the holes. If it's not used, they will allocate blocks and fill them with zeros. For some applications sparse handling is the default behavior, e.g. see the relevant part of the manpage for cp(1):

--sparse=WHEN
              control creation of sparse files. See below

 ...
 By default, sparse SOURCE files are detected by a crude heuristic and the corresponding DEST file is made sparse as well.
 That  is  the  behavior  selected  by --sparse=auto.
 Specify --sparse=always to create a sparse DEST file whenever the SOURCE file contains a long enough sequence of zero bytes.
 Use --sparse=never to inhibit creation of sparse files.

While some applications do not use it by default, e.g. rsync(1):

--sparse, -S
        Try to handle sparse files efficiently so they take up less space on the destination.  If combined with --inplace the file created might not end  up  with
        sparse  blocks  with some combinations of kernel version and/or filesystem type.  If --whole-file is in effect (e.g. for a local copy) then it will always
        work because rsync truncates the file prior to writing out the updated version.

        Note that versions of rsync older than 3.1.3 will reject the combination of --sparse and --inplace.

Compression and decompression of files usually expand sparse files as well.
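
If you need to move a sparse image around, tools like cp and GNU tar can preserve the holes (illustrative commands):

# Keep the holes when copying
cp --sparse=always image.wic /mnt/backup/image.wic

# GNU tar can detect holes when archiving and recreate them on extraction
tar --sparse -czf image.wic.tar.gz image.wic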

bmap-tools

bmap-tools [4] is a handy tool for creating a block map (bmap) for a file and then using that information to copy the file to a medium in a more efficient way.

The advantages of bmap-tools compared to e.g. dd are (as stated on their GitHub page):

  • Faster. Depending on various factors, like write speed, image size, how full is the image, and so on, bmaptool was 5-7 times faster than dd in the Tizen IVI project.
  • Integrity. bmaptool verifies data integrity while flashing, which means that possible data corruptions will be noticed immediately.
  • Usability. bmaptool can read images directly from the remote server, so users do not have to download images and save them locally.
  • Protects user's data. Unlike dd, if you make a mistake and specify a wrong block device name, bmaptool will less likely destroy your data because it has protection mechanisms which, for example, prevent bmaptool from writing to a mounted block device.

The tool comes with two subcommands, create and copy. The create subcommand generates the block map and the copy subcommand copies a file to a certain destination.

A typical usage of dd could be to write an image (bigimage) to an SD card (/dev/sdX):

 $ bzcat bigimage | dd of=/dev/sdX bs=1M conv=fsync

The dd command will copy all data, including the zeros for the empty partitions. Instead of copying all data with dd, we can generate a block map file with metadata that describes all the "empty" sections:

bmaptool create bigimage > bigimage.bmap

The bmap file is a human-readable XML file that lists all the mapped block ranges and also a checksum for each range:

<?xml version="1.0" ?>
<!-- This file contains the block map for an image file, which is basically
     a list of useful (mapped) block numbers in the image file. In other words,
     it lists only those blocks which contain data (boot sector, partition
     table, file-system metadata, files, directories, extents, etc). These
     blocks have to be copied to the target device. The other blocks do not
     contain any useful data and do not have to be copied to the target
     device.

     The block map an optimization which allows to copy or flash the image to
     the image quicker than copying of flashing the entire image. This is
     because with bmap less data is copied: <MappedBlocksCount> blocks instead
     of <BlocksCount> blocks.

     Besides the machine-readable data, this file contains useful commentaries
     which contain human-readable information like image size, percentage of
     mapped data, etc.

     The 'version' attribute is the block map file format version in the
     'major.minor' format. The version major number is increased whenever an
     incompatible block map format change is made. The minor number changes
     in case of minor backward-compatible changes. -->

<bmap version="2.0">
    <!-- Image size in bytes: 13.1 GiB -->
    <ImageSize> 14088667136 </ImageSize>

    <!-- Size of a block in bytes -->
    <BlockSize> 4096 </BlockSize>

    <!-- Count of blocks in the image file -->
    <BlocksCount> 3439616 </BlocksCount>

    <!-- Count of mapped blocks: 1.9 GiB or 14.5%     -->
    <MappedBlocksCount> 499471  </MappedBlocksCount>

    <!-- Type of checksum used in this file -->
    <ChecksumType> sha256 </ChecksumType>

    <!-- The checksum of this bmap file. When it's calculated, the value of
         the checksum has be zero (all ASCII "0" symbols).  -->
    <BmapFileChecksum> 19337986514b3952866af5fb80054c40166f608f42efbf0551956e80678aba46 </BmapFileChecksum>

    <!-- The block map which consists of elements which may either be a
         range of blocks or a single block. The 'chksum' attribute
         (if present) is the checksum of this blocks range. -->
    <BlockMap>
        <Range chksum="a53e6dcf984243a7d87a23fed5d28f3594c72a83081e90e296694256cfc26395"> 0-2 </Range>
        <Range chksum="a60a80a89aa832b3480da91fe0c75e77e24b29c1de0fcdffb9d2b50102013ff5"> 8-452 </Range>
        <Range chksum="7b07df777cd3441ac3053c2c39be52baef308e4befabd500d0c72ef4ab7c5565"> 2048-12079 </Range>
        <Range chksum="cf6ac0e4aec163a228bad0ab85722e011dd785fc8d047dc4d0f86e886fa6684d"> 24576-24902 </Range>
        <Range chksum="7521285590601370cc063cc807237eaf666f879d84d4fcae001026a7bb5a7eff"> 24904-24905 </Range>
        <Range chksum="9661e72b75fe483d53d10585bff79316ea38f15c48547b99da3e0c8b38634ceb"> 24920-26173 </Range>
        <Range chksum="93d69200bd59286ebf66ba57ae7108cc46ef81b3514b1d45864d4200cf4c4182"> 40824-57345 </Range>
        <Range chksum="56ed03479c8d200325e5ef6c1c88d8f86ef243bf11394c6a27faa8f1a98ab30e"> 57656-122881 </Range>
        <Range chksum="cad5eeff8c937438ced0cba539d22be29831b0dae986abdfd3edc8ecf841afdb"> 123192-188417 </Range>
        <Range chksum="88992bab7f6a4d06b55998e74e8a4635271592b801d642c1165528d5e22f23ff"> 188728-253953 </Range>
        <Range chksum="f5332ff218a38bf58e25b39fefc8e00f5f95e01928d0a5edefa29340f4a24b42"> 254264-319489 </Range>
        <Range chksum="714c8373aa5f7b55bd01d280cb182327f407c434e1dcc89f04f9d4c522c24522"> 319800-509008 </Range>
        <Range chksum="b1ec6c807ea2f47e0d937ac0032693a927029ff5d451183a5028b8af59fb3dd2"> 548864 </Range>
        <Range chksum="554448825ffae96d000e4b965ef2c53ae305a6734d1756c72cd68d6de7d2a8b0"> 548867 </Range>
        <Range chksum="f627ca4c2c322f15db26152df306bd4f983f0146409b81a4341b9b340c365a16"> 660672-660696 </Range>
        <Range chksum="fa01901c7f34cdb08467000c3b52a4927e208a6c128892018c0f5039fedb93c0"> 661504-661569 </Range>
        <Range chksum="ee3b76f97d2b039924e60726a165f688c93c87add940880a5d6a4a62fe4f7876"> 661573 </Range>
        <Range chksum="db3aa8c65438e58fc0140b81da36ec37c36fa94549b43ad32d36e206a9ec7819"> 661577-661578 </Range>
        <Range chksum="726f800d4405552f2406fb07e158a3966a89006f9bf474104c822b425aed0f18"> 662593-662596 </Range>
        <Range chksum="e9085dd345859b72dbe86b9c33ce44957dbe742ceb5221330ad87d83676af6cb"> 663552-663553 </Range>
        <Range chksum="1f3660636b045d5fc31af89f6e276850921d34e3e008a2f5337f1a1372b61114"> 667648-667649 </Range>
        <Range chksum="b3f3b8a0818aeade05b5e7af366663130fdc502c63a24a236bb54545d548d458"> 671744-671745 </Range>
        <Range chksum="b4a759d879bc4a47e8b292a5a02565d551350e9387227016d8aed62f3a668a1d"> 675840-675841 </Range>
        <Range chksum="1ff00c7f6feca0b42a957812832b89e7839c6754ca268b06b3c28ba6124c36fd"> 679936-679937 </Range>
        <Range chksum="9bbedd9ac9ab58f06abc2cdc678ac5ddf77c117f0004139504d2aaa321c7475f"> 694272 </Range>
        <Range chksum="c8defb770c630fa1faca3d8c2dff774f7967a0b9902992f313fc8984a9bb43a8"> 696320-698368 </Range>
        <Range chksum="9d5fad916ab4a1f7311397aa3f820ae29ffff4c3937b43a49a0a2a62bf8d6846"> 712704-712705 </Range>
        <Range chksum="48f179784918cf9bf366201edc1585a243f892e6133aa296609486734398be7f"> 716800-716801 </Range>
        <Range chksum="cc0dd36d4891ae52c5626759b21edc4894c76ba145da549f85316646f00e8f2d"> 727040 </Range>
        <Range chksum="955be1419c8df86dc56151737426c53f5ba9df523f66d2b6056d3b2028e5240d"> 759808 </Range>
        <Range chksum="ab6cafa862c5a250b731e41e373ad49767535f6920297376835f04a9e81de023"> 759811 </Range>
        <Range chksum="0ec8e7c7512276ad5fabe76591aeda7ce6c3e031b170454f800a00b330eea2de"> 761856-761857 </Range>
        <Range chksum="de2f256064a0af797747c2b97505dc0b9f3df0de4f489eac731c23ae9ca9cc31"> 789488-789503 </Range>
        <Range chksum="8c8853e6be07575b48d3202ba3e9bb4035b025cd7b546db7172e82f97bd53b1c"> 790527-791555 </Range>
        <Range chksum="918fc7f7a7300f41474636bf6ac4e7bf18e706a341f5ee75420625b85c66c47e"> 791571 </Range>
        <Range chksum="c25cb25875f56f2832b2fcd7417b7970c882cc73ffd6e8eaa7b9b09cbf19f172"> 791587 </Range>
        <Range chksum="52aec994d40f31ce57d57391a8f6db6c23393dfca6d53b99fb8f0ca4faad0253"> 807971-807976 </Range>
        <Range chksum="8ffcde1a642eb739a05bde4d74c73430d5fb4bcffe4293fed7b35e3ce9fe16df"> 823296-823298 </Range>
        <Range chksum="1d21f93952243d3a6ba80e34edcac9a005ad12c82dec79203d9ae6e0c2416043"> 888832-888834 </Range>
        <Range chksum="cdb117909634c406611acef34bedc52de88e8e8102446b3924538041f36a6af9"> 954368-954370 </Range>
        <Range chksum="18945d38f9ac981a16fb5c828b42da4ce86c43fe7b033b7c69b21835c4a3945b"> 1019904-1019906 </Range>
        <Range chksum="1ad3cc0ed83dad0ea03b9fcc4841332361ded18b8e29c382bfbb142a962aef81"> 1085440-1085442 </Range>
        <Range chksum="c19fc180da533935f50551627bd3913d56ff61a3ecf6740fbc4be184c5385bc9"> 1314816 </Range>
        <Range chksum="b045e6499699f9dea08702e15448f5da3e4736f297bbe5d91bb899668d488d22"> 1609728-1609730 </Range>
        <Range chksum="a25229feee84521831f652f3ebcbb23b86192c181ef938ed0af258bfde434945"> 1675264-1675266 </Range>
        <Range chksum="741a9ded7aee9d4812966c66df87632d3e1fde97c68708a10c5f71b9d76b0adf"> 1839104-1839105 </Range>
        <Range chksum="d17c50d57cfb768c3308eff3ccc11cf220ae41fd52fe94de9c36a465408dd268"> 1871872-1888255 </Range>
        <Range chksum="c19fc180da533935f50551627bd3913d56ff61a3ecf6740fbc4be184c5385bc9"> 2363392 </Range>
        <Range chksum="1747cefe9731aa9a0a9e07303199f766beecaef8fd3898d42075f833950a3dd2"> 2396160-2396162 </Range>
        <Range chksum="c19fc180da533935f50551627bd3913d56ff61a3ecf6740fbc4be184c5385bc9"> 2887680 </Range>
        <Range chksum="ad7facb2586fc6e966c004d7d1d16b024f5805ff7cb47c7a85dabd8b48892ca7"> 2887695 </Range>
        <Range chksum="de2f256064a0af797747c2b97505dc0b9f3df0de4f489eac731c23ae9ca9cc31"> 3411952-3411967 </Range>
        <Range chksum="7d501c17f812edb240475aa0f1ee9e15c4c8619df82860b2bd008c20bb58dc6d"> 3414000 </Range>
        <Range chksum="f71e4b72a8f3d4d8d4ef3c045e4a35963b2f5c0e61c9a31d3035c75a834b502c"> 3414016-3414080 </Range>
        <Range chksum="98033bc2b57b7220f131ca9a5daa6172d87ca91ff3831a7e91b153329bf52eb1"> 3414082-3414084 </Range>
        <Range chksum="b5a2b3aaf6dcfba16895e906a5d98d6167408337a3b259013fa3057bafe80918"> 3414087 </Range>
        <Range chksum="81b766d472722dbc326e921a436e16c0d7ad2c061761b2baeb3f6723665997c5"> 3414886-3414890 </Range>
        <Range chksum="2250b13e06cc186d3a6457a16824a3c9686088a1c0866ee4ebe8e36dfe3e17d7"> 3416064 </Range>
        <Range chksum="a82ec47f112bff20b91d748aebc769b2c9003b7eb5851991337586115f31da62"> 3420160 </Range>
        <Range chksum="99dcc3517a44145d057b63675b939baa7faf6b450591ebfccd2269a203cab194"> 3424256 </Range>
        <Range chksum="f5a94286ab129fb41eea23bf4195ca85e425148b151140dfa323405e6c34b01c"> 3426304-3427328 </Range>
        <Range chksum="d261d607d8361715c880e288ccb5b6602af50ce99e13362e8c97bce7425749d2"> 3428352 </Range>
        <Range chksum="d6ff528a483c3a6c581588000d925f38573e7a8fd84f6bb656b19e61cb36bd06"> 3432448 </Range>
        <Range chksum="de2f256064a0af797747c2b97505dc0b9f3df0de4f489eac731c23ae9ca9cc31"> 3439600-3439615 </Range>
    </BlockMap>
</bmap>

Once the bmap file is generated, we can copy the file to e.g. our SD card (/dev/sdX) using the copy subcommand:

bmaptool copy bigimage /dev/sdX

In my case, this was 6 times faster compared to dd.

Yocto

Yocto supports the OpenEmbedded kickstart (.wks) format to describe how to create a disk image.

A .wks file could look like this:

part u-boot --source rawcopy --sourceparams="file=${IMAGE_BOOTLOADER}"   --no-table    --align ${IMX_BOOT_SEEK}
part /boot  --source bootimg-partition                        --fstype=vfat --label boot --active --align 8192 --size 64
part        --source rootfs                                   --fstype=ext4 --label root    --align 8192 --size 1G
part                                                          --fstype=ext4 --label overlay --align 8192 --size 500M
part /mnt/data                                                --fstype=ext4 --label data --align 8192 --size 10G
bootloader --ptable msdos

The format is rather self-explanatory. Pay some attention to the --size parameters. Here we create several (empty) partitions which are actually quite big.

The resulting image occupies 14088667136 bytes in total, but only about 2 GiB of that is non-sparse data.

Yocto is able to generate a block map file for the resulting wic file using bmaptool. Just specify that you want to generate a wic.bmap image among your other fs types:

IMAGE_FSTYPES = "wic.gz wic.bmap"

UUU

The Universal Update Utility (UUU) [1] is an image deployment tool for Freescale/NXP i.MX chips. It allows downloading and executing code via the Serial Download Protocol (SDP) when booted into manufacturing mode.

The tool lets you flash most of the common types of storage devices such as NAND, NOR, eMMC and SD cards.

That is good, but the USB OTG port that is used for data transfer is only USB 2.0 (Hi-Speed), which limits the transfer speed to 480 Mb/s.

A rather common case is to write a full disk image to a storage device using the tool. A full disk image contains, as you may have guessed, a full disk; this includes any unpartitioned data, the partition table and all partitions - both those that contain data and any empty partitions. In other words, if you have a 128 MiB storage, your disk image will occupy 128 MiB, and 128 MiB of data will be downloaded via SDP.

Support for sparse files could be handy here.

UUU claims to support sparse images via its raw2sparse type, but that is simply a lie; the code that detects continuous chunks of identical data is actually commented out [3] for unknown reasons:

//int type = is_same_value(data, pheader->blk_sz) ? CHUNK_TYPE_FILL : CHUNK_TYPE_RAW;
int type = CHUNK_TYPE_RAW;

Pending pull request

Luckily enough, there is a pull request [5] that adds support for using bmap files with UUU!

The PR is currently (2023-12-11) not merged, but I hope it is soon. I've tested it out and it works like a charm.

EDIT: The PR was merged 2023-12-12, one day after I wrote this blog entry :-)

With this PR you can provide a bmap file to the flash command via the -bmap <image> option in your .uuu script. E.g. :

uuu_version 1.2.39

SDPS: boot -f ../imx-boot

FB: ucmd setenv emmc_dev 2
FB: ucmd setenv mmcdev ${emmc_dev}
FB: ucmd mmc dev ${emmc_dev}
FB: flash -raw2sparse -bmap rawimage.wic.bmap all rawimage.wic
FB: done

It cut the flashing time down to a sixth!

As this is a rather new feature in the UUU tool, I would like to promote it and thank dnbazhenov @Github for implementing this.

Burn eFuses for MAC address on iMX8MP

The iMX processors (iMX6, iMX7, iMX8) all have a similar OCOTP (On-Chip One Time Programmable) module that stores, for example, the MAC addresses for the internal ethernet controllers.

The reference manual is not clear on either the byte order or which bytes belong to which MAC address when there are several. In fact, I had to look at the U-Boot implementation [1] to know for sure how these fuses are used:

void imx_get_mac_from_fuse(int dev_id, unsigned char *mac)
{
	struct imx_mac_fuse *fuse;
	u32 offset;
	bool has_second_mac;

	offset = is_mx6() ? MAC_FUSE_MX6_OFFSET : MAC_FUSE_MX7_OFFSET;
	fuse = (struct imx_mac_fuse *)(ulong)(OCOTP_BASE_ADDR + offset);
	has_second_mac = is_mx7() || is_mx6sx() || is_mx6ul() || is_mx6ull() || is_imx8mp();

	if (has_second_mac && dev_id == 1) {
		u32 value = readl(&fuse->mac_addr2);

		mac[0] = value >> 24;
		mac[1] = value >> 16;
		mac[2] = value >> 8;
		mac[3] = value;

		value = readl(&fuse->mac_addr1);
		mac[4] = value >> 24;
		mac[5] = value >> 16;

	} else {
		u32 value = readl(&fuse->mac_addr1);

		mac[0] = value >> 8;
		mac[1] = value;

		value = readl(&fuse->mac_addr0);
		mac[2] = value >> 24;
		mac[3] = value >> 16;
		mac[4] = value >> 8;
		mac[5] = value;
	}
}

OCOTP Layout

The fuses related to MAC addresses start at offset 0x640 for iMX7 and iMX8MP, and at offset 0x620 for all iMX6 processors.

The MAC fuses belong to fuse bank 9, as seen in the table below:

/media/imx8-mac-fuse-layout.png

Burn fuses

There are several ways to burn the fuses nowadays. A few years ago, the only way (that I'm aware of) was via the non-mainlined fsl_otp driver provided in the Freescale kernel tree. I'm not going to describe how to use it since it should not be used anyway.

The fuses are mapped to the MAC addresses as described in this picture:

/media/imx8-mac-fuse-example.png

The iMX8MP has two MACs, and we will assign the MAC address 00:bb:cc:dd:ee:ff to MAC0 and 00:22:33:44:55:66 to MAC1.
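
Following the U-Boot code above, those two addresses end up in the three fuse words like this:

# MAC0 = 00:bb:cc:dd:ee:ff, MAC1 = 00:22:33:44:55:66
#
# word 0 (MAC_ADDR0) = 0xccddeeff   -> MAC0 bytes 2..5
# word 1 (MAC_ADDR1) = 0x556600bb   -> low half = MAC0 bytes 0..1, high half = MAC1 bytes 4..5
# word 2 (MAC_ADDR2) = 0x00223344   -> MAC1 bytes 0..3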

Via U-boot

With the CONFIG_CMD_FUSE option set, U-Boot is able to burn and sense eFuses via the fuse command:

u-boot=> fuse
fuse - Fuse sub-system

Usage:
fuse read <bank> <word> [<cnt>] - read 1 or 'cnt' fuse words,
    starting at 'word'
fuse sense <bank> <word> [<cnt>] - sense 1 or 'cnt' fuse words,
    starting at 'word'
fuse prog [-y] <bank> <word> <hexval> [<hexval>...] - program 1 or
    several fuse words, starting at 'word' (PERMANENT)
fuse override <bank> <word> <hexval> [<hexval>...] - override 1 or
    several fuse words, starting at 'word'

Burn the fuses with fuse prog:

fuse prog -y 9 0 0xccddeeff
fuse prog -y 9 1 0x556600bb
fuse prog -y 9 2 0x00223344

And read it back with fuse sense:

u-boot=> fuse sense 9 0
Sensing bank 9:

Word 0x00000000: ccddeeff
u-boot=> fuse sense 9 1
Sensing bank 9:

Word 0x00000001: 556600bb
u-boot=> fuse sense 9 2
Sensing bank 9:

Word 0x00000002: 00223344

As it's a U-boot command, it is also possible to burn the fuses with UUU [2] (Universal Update Utility) via the SDP protocol. It could be handy e.g. in production.

Example of a uuu script:

$ cat imx8mplus-emmc-all.uuu 
uuu_version 1.2.39

# This script will flash u-boot to mmc on bus 1
# Usage: uuu <script>

SDPS: boot -f ../imx-boot

#Burn fuses
FB: ucmd fuse prog -y 9 0 0xccddeeff
FB: ucmd fuse prog -y 9 1 0x556600bb
FB: ucmd fuse prog -y 9 2 0x00223344

#Burn image
FB: ucmd setenv emmc_dev 2
FB: ucmd setenv emmc_ack 1
FB: ucmd setenv fastboot_dev mmc
FB: ucmd setenv mmcdev ${emmc_dev}
FB: ucmd mmc dev ${emmc_dev}
FB: flash -raw2sparse all ../distro-image-dev-imx8mp.wic
FB: ucmd mmc partconf ${emmc_dev} ${emmc_ack} 1 0
FB: done

Via nvmem-imx-ocotp

The OCOTP module is exposed by the nvmem-imx-ocotp driver (CONFIG_NVMEM_IMX_OCOTP), and the fuses can be read and written via the sysfs entry /sys/devices/platform/soc@0/30000000.bus/30350000.efuse/imx-ocotp0/nvmem.

Note that it's not the full OCOTP module but only the eFuses that are exposed this way, so MAC_ADDR0 is placed at offset 0x90, not 0x640!

We could read out our MAC addresses at offset 0x90:

root@imx8mp:~# hexdump /sys/devices/platform/soc@0/30000000.bus/30350000.efuse/imx-ocotp0/nvmem 
0000000 a9eb ffaf aaff 0002 52bb ea35 6000 1119
0000010 4591 2002 0000 0100 007f 0000 2000 9800
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040 bada bada bada bada bada bada bada bada
*
0000060 0000 0000 0000 0000 0000 0000 0000 0000
*
0000080 0000 0000 0000 0000 0000 0000 0004 0000
0000090 eeff ccdd 00bb 5566 3344 0022 0000 0000

We can also see we have the expected MAC addresses set for our interfaces:

root@imx8mp:~# ip a
4: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 00:bb:cc:dd:ee:ff brd ff:ff:ff:ff:ff:ff
5: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 00:22:33:44:55:66 brd ff:ff:ff:ff:ff:ff

Loopback with two (physical) ethernet interfaces

Imagine that you have an embedded device with two physical ethernet ports. You want to verify the functionality of both ports in the manufacturing process, so you connect an ethernet cable between the ports, set up IP addresses, and now what?

As Linux (actually the default network namespace) is aware of both adapters and their IP/MAC addresses, the system sees no reason to send any traffic out. Instead, Linux will loop all traffic between the interfaces internally.

To avoid that and actually force traffic out on the cable, we have to make the adapters unaware of each other. This is done by putting them into different network namespaces!

/media/loopback.png

Hands on

To do this, all you need is support for network namespaces in the kernel (CONFIG_NET_NS=y) and the iproute2 [1] package, both of which are probably included in every standard Linux distribution nowadays.

We will create two network namespaces, let's call them netns_eth0 and netns_eth1:

ip netns add netns_eth0
ip netns add netns_eth1

Move each adapter to their new home:

ip link set eth0 netns netns_eth0
ip link set eth1 netns netns_eth1

Assign IP addresses:

ip netns exec netns_eth0 ip addr add dev eth0 192.168.0.1/24
ip netns exec netns_eth1 ip addr add dev eth1 192.168.0.2/24

Bring up the interfaces:

ip netns exec netns_eth0 ip link set eth0 up
ip netns exec netns_eth1 ip link set eth1 up

Now we can ping each interface and know for sure that the traffic is actually on the cable:

ip netns exec netns_eth0 ping 192.168.0.2
ip netns exec netns_eth1 ping 192.168.0.1
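
When the test is done, simply delete the namespaces; the physical interfaces are then moved back to the default network namespace:

ip netns del netns_eth0
ip netns del netns_eth1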

Support for CRIU in Buildroot

A couple of months ago I started to evaluate [1] CRIU [2] for a project I'm working on. The project itself is using Buildroot to build and generate the root filesystem. Unfortunately, Buildroot lacked support for CRIU, so there was some work to do.

/media/buildroot-plus-criu.png

Writing the package was not straightforward. The package is only supported on certain architectures, and the utils/test-pkg script failed for a few toolchains. Julien Olivain was really helpful in sorting it out, and he even wrote runtime test scripts for it. Thanks for that.

I do not understand why projects still use custom Makefiles instead of CMake or Autotools though. Is it something essential that I've completely missed?

Kernel configuration

CRIU makes use of a lot of features that have to be enabled in the Linux kernel for full usage.

CONFIG_CHECKPOINT_RESTORE will be set by the package itself, but there are more configuration options that could be useful depending on how you intend to use the tool.

Relevant configuration options are:

General setup options
  • CONFIG_CHECKPOINT_RESTORE=y (Checkpoint/restore support)
  • CONFIG_NAMESPACES=y (Namespaces support)
  • CONFIG_UTS_NS=y (Namespaces support -> UTS namespace)
  • CONFIG_IPC_NS=y (Namespaces support -> IPC namespace)
  • CONFIG_SYSVIPC_SYSCTL=y
  • CONFIG_PID_NS=y (Namespaces support -> PID namespaces)
  • CONFIG_NET_NS=y (Namespaces support -> Network namespace)
  • CONFIG_FHANDLE=y (Open by fhandle syscalls)
  • CONFIG_EVENTFD=y (Enable eventfd() system call)
  • CONFIG_EPOLL=y (Enable eventpoll support)
  • CONFIG_RSEQ=y (Enable rseq() system call)
Networking support -> Networking options, for the sock-diag subsystem
  • CONFIG_UNIX_DIAG=y (Unix domain sockets -> UNIX: socket monitoring interface)
  • CONFIG_INET_DIAG=y (TCP/IP networking -> INET: socket monitoring interface)
  • CONFIG_INET_UDP_DIAG=y (TCP/IP networking -> INET: socket monitoring interface -> UDP: socket monitoring interface)
  • CONFIG_PACKET_DIAG=y (Packet socket -> Packet: sockets monitoring interface)
  • CONFIG_NETLINK_DIAG=y (Netlink socket -> Netlink: sockets monitoring interface)
  • CONFIG_NETFILTER_XT_MARK=y (Networking support -> Networking options -> Network packet filtering framework (Netfilter) -> Core Netfilter Configuration -> Netfilter Xtables support (required for ip_tables) -> nfmark target and match support)
  • CONFIG_TUN=y (Networking support -> Universal TUN/TAP device driver support)
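
With the kernel configured, a minimal checkpoint/restore round trip could look something like this (the PID and image directory are just examples):

# Checkpoint a running process tree into /tmp/ckpt
mkdir -p /tmp/ckpt
criu dump -t 1234 -D /tmp/ckpt --shell-job

# Later (or after a reboot): restore it from the images
criu restore -D /tmp/ckpt --shell-job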

In the beginning, the CRIU project had its own custom kernel which contained some experimental CRIU-related patches. Nowadays many of those patches have been mainlined.

One such patch [3] that I missed in my current kernel version (v5.10) was introduced in v5.12. It is related to how CRIU gets the process state and is essential for creating a checkpoint of a running process:

commit 90f093fa8ea48e5d991332cee160b761423d55c1
Author: Piotr Figiel <figiel@google.com>
Date:   Fri Feb 26 14:51:56 2021 +0100

    rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request

    For userspace checkpoint and restore (C/R) a way of getting process state
    containing RSEQ configuration is needed.

    There are two ways this information is going to be used:
     - to re-enable RSEQ for threads which had it enabled before C/R
     - to detect if a thread was in a critical section during C/R

    Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
    using the address registered before C/R.

    Detection whether the thread is in a critical section during C/R is needed
    to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
    before registers are dumped itself doesn't cause RSEQ abort.
    Restoring the instruction pointer within the critical section is
    problematic because rseq_cs may get cleared before the control is passed
    to the migrated application code leading to RSEQ invariants not being
    preserved. C/R code will use RSEQ ABI address to find the abort handler
    to which the instruction pointer needs to be set.

    To achieve above goals expose the RSEQ ABI address and the signature value
    with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.

    This new ptrace request can also be used by debuggers so they're aware
    of stops within restartable sequences in progress.

    Signed-off-by: Piotr Figiel <figiel@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Michal Miroslaw <emmir@google.com>
    Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20210226135156.1081606-1-figiel@google.com

With that said, to make use of CRIU with the latest features it's highly recommended to use a recent kernel.

And soon it will be available as a package in Buildroot.