Burn eFuses for MAC address on iMX8MP

The iMX family (iMX6, iMX7, iMX8) shares a similar OCOTP (On-Chip One Time Programmable) module that stores, among other things, the MAC addresses for the internal Ethernet controllers.

The reference manual is not clear on either the byte order or which bytes belong to which MAC address when there are several. In fact, I had to look at the U-boot implementation [1] to know for sure how these fuses are used:

void imx_get_mac_from_fuse(int dev_id, unsigned char *mac)
{
	struct imx_mac_fuse *fuse;
	u32 offset;
	bool has_second_mac;

	offset = is_mx6() ? MAC_FUSE_MX6_OFFSET : MAC_FUSE_MX7_OFFSET;
	fuse = (struct imx_mac_fuse *)(ulong)(OCOTP_BASE_ADDR + offset);
	has_second_mac = is_mx7() || is_mx6sx() || is_mx6ul() || is_mx6ull() || is_imx8mp();

	if (has_second_mac && dev_id == 1) {
		u32 value = readl(&fuse->mac_addr2);

		mac[0] = value >> 24;
		mac[1] = value >> 16;
		mac[2] = value >> 8;
		mac[3] = value;

		value = readl(&fuse->mac_addr1);
		mac[4] = value >> 24;
		mac[5] = value >> 16;

	} else {
		u32 value = readl(&fuse->mac_addr1);

		mac[0] = value >> 8;
		mac[1] = value;

		value = readl(&fuse->mac_addr0);
		mac[2] = value >> 24;
		mac[3] = value >> 16;
		mac[4] = value >> 8;
		mac[5] = value;
	}
}

OCOTP Layout

The fuses related to MAC addresses start at offset 0x640 for iMX7 and iMX8MP, and at offset 0x620 for all iMX6 processors.

The MAC fuses belong to fuse bank 9 as seen in the table below:

/media/imx8-mac-fuse-layout.png

Burn fuses

There are several ways to burn the fuses nowadays. A few years ago, the only way (that I'm aware of) was the non-mainlined fsl_otp driver provided in the Freescale kernel tree. I'm not going to describe how to use it since it should not be used anyway.

The fuses are mapped to the MAC address as described in this picture:

/media/imx8-mac-fuse-example.png

The iMX8MP has two MACs and we will assign the MAC address 00:bb:cc:dd:ee:ff for MAC0 and 00:22:33:44:55:66 for MAC1.

Via U-boot

With the CONFIG_CMD_FUSE config set, U-boot is able to burn and sense eFuses via the fuse command:

u-boot=> fuse
fuse - Fuse sub-system

Usage:
fuse read <bank> <word> [<cnt>] - read 1 or 'cnt' fuse words,
    starting at 'word'
fuse sense <bank> <word> [<cnt>] - sense 1 or 'cnt' fuse words,
    starting at 'word'
fuse prog [-y] <bank> <word> <hexval> [<hexval>...] - program 1 or
    several fuse words, starting at 'word' (PERMANENT)
fuse override <bank> <word> <hexval> [<hexval>...] - override 1 or
    several fuse words, starting at 'word'

Burn the fuses with fuse prog:

fuse prog -y 9 0 0xccddeeff
fuse prog -y 9 1 0x556600bb
fuse prog -y 9 2 0x00223344

And read it back with fuse sense:

u-boot=> fuse sense 9 0
Sensing bank 9:

Word 0x00000000: ccddeeff
u-boot=> fuse sense 9 1
Sensing bank 9:

Word 0x00000001: 556600bb
u-boot=> fuse sense 9 2
Sensing bank 9:

Word 0x00000002: 00223344

As it is a U-boot command, it is also possible to burn the fuses with UUU [2] (Universal Update Utility) via the SDP protocol. It could be handy e.g. in production.

Example of a uuu script:

$ cat imx8mplus-emmc-all.uuu 
uuu_version 1.2.39

# This script will flash u-boot to mmc on bus 1
# Usage: uuu <script>

SDPS: boot -f ../imx-boot

#Burn fuses
FB: ucmd fuse prog -y 9 0 0xccddeeff
FB: ucmd fuse prog -y 9 1 0x556600bb
FB: ucmd fuse prog -y 9 2 0x00223344

#Burn image
FB: ucmd setenv emmc_dev 2
FB: ucmd setenv emmc_ack 1
FB: ucmd setenv fastboot_dev mmc
FB: ucmd setenv mmcdev ${emmc_dev}
FB: ucmd mmc dev ${emmc_dev}
FB: flash -raw2sparse all ../distro-image-dev-imx8mp.wic
FB: ucmd mmc partconf ${emmc_dev} ${emmc_ack} 1 0
FB: done

Via nvmem-imx-ocotp

The OCOTP module is exposed by the nvmem-imx-ocotp (CONFIG_NVMEM_IMX_OCOTP) driver, and the fuses can be read and written via the sysfs entry /sys/devices/platform/soc@0/30000000.bus/30350000.efuse/imx-ocotp0/nvmem.

Note that it is not the full OCOTP module but only the eFuses that are exposed this way, so MAC_ADDR0 is placed at offset 0x90, not 0x640!

We could read out our MAC addresses at offset 0x90:

root@imx8mp:~# hexdump /sys/devices/platform/soc@0/30000000.bus/30350000.efuse/imx-ocotp0/nvmem 
0000000 a9eb ffaf aaff 0002 52bb ea35 6000 1119
0000010 4591 2002 0000 0100 007f 0000 2000 9800
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
0000040 bada bada bada bada bada bada bada bada
*
0000060 0000 0000 0000 0000 0000 0000 0000 0000
*
0000080 0000 0000 0000 0000 0000 0000 0004 0000
0000090 eeff ccdd 00bb 5566 3344 0022 0000 0000

We can also see we have the expected MAC addresses set for our interfaces:

root@imx8mp:~# ip a
4: eth0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 00:bb:cc:dd:ee:ff brd ff:ff:ff:ff:ff:ff
5: eth1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 00:22:33:44:55:66 brd ff:ff:ff:ff:ff:ff

Loopback with two (physical) ethernet interfaces

Imagine that you have an embedded device with two physical Ethernet ports. You want to verify the functionality of both ports in the manufacturing process, so you connect an Ethernet cable between the ports, set up IP addresses, and now what?

As Linux (actually the default network namespace) is aware of both adapters and their IP/MAC addresses, the system sees no reason to send any traffic out. Instead, Linux will loop all traffic between the interfaces internally.

To avoid that and actually force traffic out on the cable, we have to make the adapters unaware of each other. This is done by putting them into different network namespaces!

/media/loopback.png

Hands on

To do this, all you need is support for network namespaces in the kernel (CONFIG_NET_NS=y) and the iproute2 [1] package, both of which are probably included in every standard Linux distribution nowadays.

We will create two network namespaces, let's call them netns_eth0 and netns_eth1:

ip netns add netns_eth0
ip netns add netns_eth1

Move each adapter to their new home:

ip link set eth0 netns netns_eth0
ip link set eth1 netns netns_eth1

Assign IP addresses:

ip netns exec netns_eth0 ip addr add dev eth0 192.168.0.1/24
ip netns exec netns_eth1 ip addr add dev eth1 192.168.0.2/24

Bring up the interfaces:

ip netns exec netns_eth0 ip link set eth0 up
ip netns exec netns_eth1 ip link set eth1 up

Now we can ping each interface and know for sure that the traffic is actually on the cable:

ip netns exec netns_eth0 ping 192.168.0.2
ip netns exec netns_eth1 ping 192.168.0.1

Support for CRIU in Buildroot

A couple of months ago I started to evaluate [1] CRIU [2] for a project I'm working on. The project itself uses Buildroot to build and generate the root filesystem. Unfortunately, Buildroot lacked support for CRIU, so there was some work to do.

/media/buildroot-plus-criu.png

Writing the package was not straightforward. The package is only supported on certain architectures and the utils/test-pkg script failed for a few toolchains. Julien Olivain was really helpful in sorting it out and he even wrote runtime scripts for it. Thanks for that.

I do not understand why projects still use custom Makefiles instead of CMake or Autotools though. Is it something essential that I've completely missed?

Kernel configuration

CRIU makes use of a lot of features that have to be enabled in the Linux kernel for full usage.

CONFIG_CHECKPOINT_RESTORE will be set by the package itself, but there are more configuration options that could be useful depending on how you intend to use the tool.

Relevant configuration options are:

General setup options
  • CONFIG_CHECKPOINT_RESTORE=y (Checkpoint/restore support)
  • CONFIG_NAMESPACES=y (Namespaces support)
  • CONFIG_UTS_NS=y (Namespaces support -> UTS namespace)
  • CONFIG_IPC_NS=y (Namespaces support -> IPC namespace)
  • CONFIG_SYSVIPC_SYSCTL=y
  • CONFIG_PID_NS=y (Namespaces support -> PID namespaces)
  • CONFIG_NET_NS=y (Namespaces support -> Network namespace)
  • CONFIG_FHANDLE=y (Open by fhandle syscalls)
  • CONFIG_EVENTFD=y (Enable eventfd() system call)
  • CONFIG_EPOLL=y (Enable eventpoll support)
  • CONFIG_RSEQ=y (Enable rseq() system call)
Networking support -> Networking options (options for the sock-diag subsystem)
  • CONFIG_UNIX_DIAG=y (Unix domain sockets -> UNIX: socket monitoring interface)
  • CONFIG_INET_DIAG=y (TCP/IP networking -> INET: socket monitoring interface)
  • CONFIG_INET_UDP_DIAG=y (TCP/IP networking -> INET: socket monitoring interface -> UDP: socket monitoring interface)
  • CONFIG_PACKET_DIAG=y (Packet socket -> Packet: sockets monitoring interface)
  • CONFIG_NETLINK_DIAG=y (Netlink socket -> Netlink: sockets monitoring interface)
  • CONFIG_NETFILTER_XT_MARK=y (Networking support -> Networking options -> Network packet filtering framework (Netfilter) -> Core Netfilter Configuration -> Netfilter Xtables support (required for ip_tables) -> nfmark target and match support)
  • CONFIG_TUN=y (Networking support -> Universal TUN/TAP device driver support)

In the beginning of the project, CRIU had its own custom kernel which contained some experimental CRIU-related patches. Nowadays many of those patches have been mainlined.

One such patch [3] that was missing in my current kernel version (v5.10) was introduced in v5.12. It is related to how CRIU gets the process state and is essential to create a checkpoint of a running process:

commit 90f093fa8ea48e5d991332cee160b761423d55c1
Author: Piotr Figiel <figiel@google.com>
Date:   Fri Feb 26 14:51:56 2021 +0100

    rseq, ptrace: Add PTRACE_GET_RSEQ_CONFIGURATION request

    For userspace checkpoint and restore (C/R) a way of getting process state
    containing RSEQ configuration is needed.

    There are two ways this information is going to be used:
     - to re-enable RSEQ for threads which had it enabled before C/R
     - to detect if a thread was in a critical section during C/R

    Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
    using the address registered before C/R.

    Detection whether the thread is in a critical section during C/R is needed
    to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
    before registers are dumped itself doesn't cause RSEQ abort.
    Restoring the instruction pointer within the critical section is
    problematic because rseq_cs may get cleared before the control is passed
    to the migrated application code leading to RSEQ invariants not being
    preserved. C/R code will use RSEQ ABI address to find the abort handler
    to which the instruction pointer needs to be set.

    To achieve above goals expose the RSEQ ABI address and the signature value
    with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.

    This new ptrace request can also be used by debuggers so they are aware
    of stops within restartable sequences in progress.

    Signed-off-by: Piotr Figiel <figiel@google.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Michal Miroslaw <emmir@google.com>
    Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Acked-by: Oleg Nesterov <oleg@redhat.com>
    Link: https://lkml.kernel.org/r/20210226135156.1081606-1-figiel@google.com

With that said, to make use of CRIU with the latest features it is highly recommended to use a recent kernel.

And soon it will be available as a package in Buildroot.

Checkpoint-restore in Linux

I'm working on power saving features for a project based on a Raspberry Pi Zero. Unfortunately, the RPi does not support features such as hibernation to disk or suspend to RAM because of how the processor is constructed (the GPU is actually the main processor). So I was looking for alternatives.

That's when I stumbled upon CRIU ( [1], [2] ), Checkpoint-Restore In Userspace. (I actually started to read about PTRACE_SEIZE [4] and ptrace parasite code [3] and found out that CRIU is one of their users.)

/media/CRIU.png

CRIU

CRIU is a project that implements checkpoint/restore functionality by freezing the state of a process and its subtasks. CRIU uses ptrace [4] to attach to the process with a PTRACE_SEIZE request and stop it. Then it injects parasite code to dump the process's memory pages into image files to create a recoverable checkpoint.

Such process information includes memory pages (collected from /proc/$PID/smaps, /proc/$PID/map_files/ and /proc/$PID/pagemap), but also information about opened files, credentials, registers, task states and more.

My first concern was that this could not work very well. What about open sockets (especially clients)? It turns out that CRIU already handles most of that. There are only a few scenarios that cannot be dumped [5] yet.

Usage

CRIU has many possible use-cases. Some of those are:

  • Container live migration
  • Slow-boot services speed up
  • Seamless kernel upgrade
  • "Save" ability in apps (games), that don't have such
  • Snapshots of apps

My use case for now is just to save a snapshot of an application and power off the CPU module, to later be able to power it on and restore the application.

PTRACE

For those not familiar with ptrace(2):

The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of another process (the "tracee"), and examine and change the tracee's memory and registers. It is primarily used to implement breakpoint debugging and system call tracing.

ptrace is the only interface that the Linux kernel provides to poke around and fetch information from inside another application (think debugger and/or tracers).

The PTRACE_SEIZE was introduced in Linux 3.4:

PTRACE_SEIZE (since Linux 3.4)
       Attach  to  the  process  specified  in  pid,  making  it  a  tracee  of the calling process.  Unlike PTRACE_ATTACH,
       PTRACE_SEIZE does not stop the process.  Group-stops are reported as PTRACE_EVENT_STOP and WSTOPSIG(status)  returns
       the  stop  signal.  Automatically attached children stop with PTRACE_EVENT_STOP and WSTOPSIG(status) returns SIGTRAP
       instead of having SIGSTOP signal delivered  to  them.   execve(2)  does  not  deliver  an  extra  SIGTRAP.   Only  a
       PTRACE_SEIZEd  process can accept PTRACE_INTERRUPT and PTRACE_LISTEN commands.  The "seized" behavior just described
       is inherited by  children  that  are  automatically  attached  using  PTRACE_O_TRACEFORK,  PTRACE_O_TRACEVFORK,  and
       PTRACE_O_TRACECLONE.  addr must be zero.  data contains a bit mask of ptrace options to activate immediately.

       Permission to perform a PTRACE_SEIZE is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see below.

But it took a while until the checkpoint/restore capability was created for this purpose, see capabilities(7):

CAP_CHECKPOINT_RESTORE (since Linux 5.9)
       •  Update /proc/sys/kernel/ns_last_pid (see pid_namespaces(7));
       •  employ the set_tid feature of clone3(2);
       •  read the contents of the symbolic links in /proc/pid/map_files for other processes.

       This capability was added in Linux  5.9  to  separate  out  checkpoint/restore  functionality  from  the  overloaded
       CAP_SYS_ADMIN capability.

Example

I wrote a simple C application that just counts a variable up each second and prints the value:

    #include <stdio.h>
    #include <unistd.h>
    int main()
    {
        printf("My PID is %i\n", getpid());
        int count = 0;
        while (1) {
            printf("%d\n", count++);
            sleep(1);
        }
    }

Compile the code:

    gcc main.c -o main

Start the application:

    [17:26:03]marcus@goliat:~/tmp/count$ ./main 
    My PID is 2483855
    0
    1
    2
    3
    4
    5
    6

The process is started with process ID 2483855.

We can now dump the process and store its state. We have to add the --shell-job flag to tell CRIU that it was spawned from a shell (and therefore has some file descriptors open to PTYs that need to be restored).

    [17:27:26]marcus@goliat:~/tmp/criu$ sudo criu dump -t 2483855 --shell-job
    Warn  (compel/arch/x86/src/lib/infect.c:356): Will restore 2483855 with interrupted system call

CRIU needs to have the CAP_SYS_ADMIN or the CAP_CHECKPOINT_RESTORE capability. Set it by:

    setcap cap_checkpoint_restore+eip /usr/bin/criu

The criu dump command will now generate a bunch of files that store the current state of the application. These include open file descriptors, registers, stack frames, memory maps and more:

    [17:28:00]marcus@goliat:~/tmp/criu$ ls -1
    core-2483855.img
    fdinfo-2.img
    files.img
    fs-2483855.img
    ids-2483855.img
    inventory.img
    mm-2483855.img
    pagemap-2483855.img
    pages-1.img
    pstree.img
    seccomp.img
    stats-dump
    timens-0.img
    tty-info.img

We can now restore the application from where we stopped:

    [17:29:07]marcus@goliat:~/tmp/criu$ sudo criu restore --shell-job
    27
    28
    29
    30

This is cool. But what is even cooler is that you may restore the application on a different host(!).

Summary

I do not know if CRIU is applicable for what I want to achieve right now, but it is a cool project that I will probably find usage for in the future, so it is a welcome tool to my toolbag.

meta-readonly-rootfs-overlay

meta-readonly-rootfs-overlay [1] is a meta layer for the Yocto project [2] originally written by Claudius Heine. I took over the maintainership in May 2022 to keep it updated with recent Yocto releases and to keep adding functionality.

I've implemented it in a couple of industrial products so far and think it needs some extra attention as I find it so useful.

Why does this exist?

Having a read-only root file system is useful for many scenarios:

  • Separate user specific changes from system configuration, and being able to find differences
  • Allow factory reset, by deleting the user specific changes
  • Have a fallback image in case the user specific changes made the root file system no longer bootable.

Because some data on the root file system changes on first boot or while the system is running, just mounting the complete root file system as read-only breaks many applications. There are different solutions to this problem:

  • Symlinking/Bind mounting files and directories that could potentially change while the system is running to a writable partition
  • Instead of having a read-only root file system, mounting a writable overlay root file system that uses a read-only file system as its base and writes changed data to another writable partition.

To implement the first solution, the developer needs to analyse which files need to change and then create symlinks for them. When doing a factory reset, the developer needs to overwrite every file that is linked with the factory configuration, to avoid dangling symlinks/binds. While this is more work on the developer side, it might increase security, because only files that are symlinked/bind-mounted can be changed.

This meta-layer provides the second solution. Here no investigation of writable files is needed and a factory reset can be done by just deleting all files or formatting the writable volume.

How does it work?

The implementation makes use of OverlayFS [3], which is a union mount filesystem that combines multiple underlying mount points into one. The filesystem uses the terms upper and lower filesystem, where the upper filesystem is applied as an overlay on the lower filesystem.

The resulting merged directory is a combination of these two, where files in the upper filesystem override files with the same path in the lower.

/media/meta-readonly-rootfs-overlay.png

Dependencies

This layer only depends on:

URI: git://git.openembedded.org/bitbake
branch: kirkstone

and

URI: git://git.openembedded.org/openembedded-core
layers: meta
branch: kirkstone

Usage

Adding the readonly-rootfs-overlay layer to your build

In order to use this layer, you need to make the build system aware of it.

Assuming the readonly-rootfs-overlay layer exists at the top-level of your OpenEmbedded source tree, you can add it to the build system by adding the location of the readonly-rootfs-overlay layer to bblayers.conf, along with any other layers needed. e.g.:

BBLAYERS ?= " \
  /path/to/layers/meta \
  /path/to/layers/meta-poky \
  /path/to/layers/meta-yocto-bsp \
  /path/to/layers/meta-readonly-rootfs-overlay \
  "

To add the script to your image, just add:

IMAGE_INSTALL:append = " initscripts-readonly-rootfs-overlay"

to your local.conf or image recipe. Or use core-image-rorootfs-overlay-initramfs as initrd.

Read-only root filesystem

If you use this layer you do not need to set read-only-rootfs in the IMAGE_FEATURES or EXTRA_IMAGE_FEATURES variable.

Kernel command line parameters

These examples are not meant to be complete. They just contain parameters that are used by the initscript of this repository. Some additional parameters might be necessary.

Example using initrd

root=/dev/sda1 rootrw=/dev/sda2

This command line starts /sbin/init with the /dev/sda1 partition as the read-only rootfs and the /dev/sda2 partition as the read-write persistent state.

root=/dev/sda1 rootrw=/dev/sda2 init=/bin/sh

The same as before but it now starts /bin/sh instead of /sbin/init.

Example without initrd

root=/dev/sda1 rootrw=/dev/sda2 init=/init

This command line starts /sbin/init with the /dev/sda1 partition as the read-only rootfs and the /dev/sda2 partition as the read-write persistent state. When using this init script without an initrd, init=/init has to be set.

root=/dev/sda1 rootrw=/dev/sda2 init=/init rootinit=/bin/sh

The same as before but it now starts /bin/sh instead of /sbin/init

Details

All kernel parameters that are used to configure meta-readonly-rootfs-overlay:

  • root - specifies the read-only root file system device. If this is not specified, the current rootfs is used.
  • rootfstype - if support for the read-only file system is not built into the kernel, you can specify the required module name here. It will also be used in the mount command.
  • rootoptions - specifies the mount options of the read-only file system. Defaults to noatime,nodiratime.
  • rootinit - if the init parameter was used to specify this init script, rootinit can be used to overwrite the default (/sbin/init).
  • rootrw - specifies the read-write file system device. If this is not specified, tmpfs is used.
  • rootrwfstype - if support for the read-write file system is not built into the kernel, you can specify the required module name here. It will also be used in the mount command.
  • rootrwoptions - specifies the mount options of the read-write file system. Defaults to rw,noatime,mode=755.
  • rootrwreset - set to yes if you want to delete all the files in the read-write file system prior to building the overlay root file system.

Embedded Open Source Summit 2023

This year the Embedded Linux Conference was colocated with Automotive Linux Summit, Embedded IOT summit, Safety-critical software summit, LFEnergy and Zephyr Summit. The event was held in Prague, Czech Republic this time.

It is the second time I've been at a Linux conference in the Czech Republic, and it clearly is my favorite place for such an event. Not only for the cheap beer but also for the architecture and the culture.

I've collected notes from some of the talks. Mostly for my own good, but here they are:

9 Years in the making, the story of Zephyr [1]

Much has happened since the project started with an announcement at an internal event at Intel in 2014. Two years later it went public and was quickly picked up by the Linux Foundation, and now it is listed as one of the top critical open source projects by Google.

Now, in June 2023, it has made 40 releases and has over a million lines of code. What a trip, huh?

The project has made huge progress, but the road has not been straightforward. Many design decisions have been made and changed over time. Not only technical decisions but in all areas. For example, Zephyr was originally BSD licensed. The current license, Apache 2.0, was not the first choice. The license was changed upon requests from other vendors. I think it is good that no single company has full dominance over the project.

Even the name was up for discussion before it landed on Zephyr. One fun thing is that Zephyr has completely taken over all search results; it is hard to find anything that is not related to the Zephyr project as it masks out all other hits... oopsie.

Some major transitions and transformations made by the project:

  • The build system, which was initially a bunch of custom-made Makefiles, then became Kbuild and finally CMake.
  • The kernel itself moved from a nano/micro kernel model to a unified kernel.
  • Even the review system has changed, from Gerrit to GitHub.

The change from the dual kernel model to a unified kernel was made in 2016. The motivation was that the older model suffers from a few drawbacks:

  • Non-intuitive nature of the nano/micro kernel split
  • Double context switch affecting the performance
  • Duplication of object types for nano and micro
  • System initialization in the idle task

Instead, we ended up with something that:

  • Made the nanokernel 'pre-emptible thread' aware
  • Unified fibers and tasks as one type of threads by dropping the Microkernel server
  • Allowed cooperative threads to operate on all types of objects
  • Clarified duplicated object types
  • Created a new, more streamlined API, without any loss of functionality

Many things indicate that Zephyr has a healthy ecosystem. If we look at the contributions, we can see that the member/community contributions are increasing every year while the commits by Intel are decreasing.

It shows us that the project is evolving and becoming more and more of a self-sustaining open ecosystem.

System device trees [2]

As the current usage of device tree does not scale well, especially when working with multi-core AMP SoCs, we have to come up with some alternatives.

One such alternative is the System Device Tree. It is an extension of the DT specification that is developed in the open. To me it sounded uncomfortable at first glance, but the speaker made it clear that the work is done in close cooperation with the DT specification and the Linux device tree maintainers.

The main problem is that there is one instance of everything, available to all CPUs, and that is not suitable for AMP architectures where each core could be of a completely different type. The CPU cores are normally instantiated by one CPU node. One thing that the System Device Tree contributes is to change that to independent CPU clusters instead.

Also, in a normal setup, many peripherals are attached to the global simple bus and are shared across cores. The new indirect-bus, which is introduced in System Device Tree, addresses this problem by mapping the bus to a particular CPU cluster, which makes the peripherals visible only to a specific set of cores.

System Device Tree will also introduce independent execution domains, of course also mapped to a specific set of CPU clusters. With this we can encapsulate which peripherals should be accessible from which application.

But how does it work? The suggestion is to let a tool, sysbuild, post-process the standard DT structure into several standard devicetrees, one for each execution domain.

Manifests: Project sanity in the ever-changing Zephyr world [3]

Mike Szczys talked about manifest files and why you should use them in your project.

But first, what is a manifest file?

It is a file that manages the project hierarchy by specifying all repositories by URL, which branch/tag/hash to use and the local path for checkout. The manifest file also supports some more advanced features such as:

  • Inheritance
  • Allow/block lists
  • Grouping
  • West support for validation

The Zephyr tree already uses manifest files to manage versions of modules and libraries, and there is no reason not to use the same method in your application. It lets you keep control of which versions of all modules your application requires in a clear way. Besides, as the manifest file is part of your application repository, it also has a commit history, and all changes to the manifest are trackable and hopefully explained in the commit message.

The inheritance feature in the manifest file is a powerful tool. It lets you import other manifest files and explicitly allow or exclude parts of them. This lets you reduce the size of your project significantly.

West will handle everything for you. It will parse the manifest file, recursively clone all repositories and update those to a certain commit/tag/branch. It is preferred to not use branches (or even tags) in the manifest files as those may change. Use the hash if possible. Generally speaking, this is the preferred way in any such system (Yocto, Buildroot, ...).

The biggest benefit that I see is that you keep all dependencies separate from your application and that those dependencies are locked to known versions. Zephyr itself will be treated as a dependency of your application, not the other way around.

It is easy to draw parallels to the Yocto project. My first impression of Yocto was that it is REALLY hard to maintain, pretty much for the same reason that we are talking about here: how do I keep track of every layer in a controllable way? The solution for me was to use KAS, which does pretty much exactly the same thing: it creates a manifest file with all layers (read: dependencies) that you can version control.

Zbus [4]

Rodrigo Peixoto, the maintainer and author of the Zbus subsystem, gave a talk with an introduction to what it is.

(Rodrigo is a nice guy. If you see him, throw a snowball at him and say hi from me - he will understand).

Zephyr has support for many IPC mechanisms such as LIFO, FIFO, Stack, Message Queue, Mailbox and pipes. All of those work great for one-to-one communication, but that is not always what we need. Even one-to-many could be tricky with the existing mechanisms that Zephyr provides.

ZBus is an internal bus used in Zephyr for many-to-many communication. As a bonus, such an infrastructure covers all cases (1:1, 1:N, N:M).

I like this kind of infrastructure. It reminds me of dbus (and kbus..) but in a simpler manner (and that is a good thing). It allows you to have an event-driven architecture in your application and a unified way to make threads talk and share data. Testability is also a selling point for ZBus. You may easily swap a real sensor for stubbed code and the rest of the system would not notice.

The conference

/media/myself-embedded-open-source-summit.jpg

(I got stuck on a picture. Don't know which talk, but it seems like I enjoyed it)

Route priorities - metric values

Brief

It is not an uncommon scenario that a Linux system has several network interfaces that are all up and routeable. For example, consider a laptop with both Ethernet and WiFi.

But how does the system determine which route to use when trying to reach another host?

I was about to set up a system with both a 4G modem and a WiFi connection. My use case was that when WiFi is available, that interface should be prioritized over 4G. This is achieved by adjusting the route metric values for those interfaces.

/media/route-metric.png

Metric values

The metric value is one of many fields in the routing table and indicates the cost of the route. This becomes useful if multiple routes exist to a given destination and the system has to decide which route to use. With that said, the lower the metric value (lower cost) a route has, the higher priority it gets.

It is up to you or your network manager to set proper metric values for your routes. The actual value could be determined based on several different factors depending on what is important for your setup. E.g.:

  • Hop count - The number of routers (hops) in a path to reach a certain network. This is a common metric.
  • Delay - Some interfaces have higher delays than others. Compare a 4G modem with a fiber connection.
  • Throughput - The expected throughput of the route.
  • Reliability - If some links are more prone to link failures than others, prefer to use other interfaces.

The ip route command will show you all the routes that your system currently has. The last number in the output is the metric value:

$ ip route
default via 192.168.20.1 dev enp0s13f0u1u4 proto dhcp src 192.168.20.173 metric 100
default via 192.168.20.1 dev wlp0s20f3 proto dhcp src 192.168.20.197 metric 600

I have two default routes that both go via 192.168.20.1.

As you can see, my wlp0s20f3 (Wireless) interface has a higher metric value than my enp0s13f0u1u4 (Ethernet) interface, which will cause the system to choose the ethernet interface over WiFi. In my case, these values are chosen by NetworkManager.
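To make the selection rule concrete, here is a minimal Python sketch that parses ip route-style lines and picks the default route with the lowest metric. The function names are my own, it only looks at the dev and metric fields, and it treats a missing metric as 0. The real kernel lookup considers more than this (prefix length, for example), so take it as an illustration only.

```python
def parse_route(line):
    """Return (device, metric) for one line of `ip route` output."""
    fields = line.split()
    dev = fields[fields.index("dev") + 1]
    # `ip route` does not print the metric when it is 0
    metric = int(fields[fields.index("metric") + 1]) if "metric" in fields else 0
    return dev, metric


def preferred_default(lines):
    """Pick the default route with the lowest metric (lowest cost wins)."""
    defaults = [parse_route(line) for line in lines if line.startswith("default")]
    return min(defaults, key=lambda route: route[1])


routes = [
    "default via 192.168.20.1 dev enp0s13f0u1u4 proto dhcp src 192.168.20.173 metric 100",
    "default via 192.168.20.1 dev wlp0s20f3 proto dhcp src 192.168.20.197 metric 600",
]
print(preferred_default(routes))  # ('enp0s13f0u1u4', 100)
```

With the two routes from my laptop, the Ethernet interface wins since 100 is lower than 600.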

Set metric value

If you want to set specific metric values for your routes, the way will differ depending on how your routes are created.

iproute2

The ip command could be handy to manually create or change the metric value for a certain route:

$ ip route replace default via {IP} dev {DEVICE} metric {METRIC}

ifmetric

ifmetric is a tool for setting the metric value for IPv4 routes attached to a given network interface. Compared to the raw ip command above, ifmetric works on interfaces rather than routes.

$ ifmetric INTERFACE [METRIC]

dhcpcd

Metric values could be set in /etc/dhcpcd.conf according to the manual [1]:

metric metric
Metrics are used to prefer an interface over another one, lowest wins.

e.g.:

interface wlan0
metric 200

If no metric value is given, the default metric is calculated by 200 + if_nametoindex(3). An extra 100 will be added for wireless interfaces.
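The default calculation is simple enough to write down. This is a small sketch of the rule as quoted from the manual above; the function name is made up, the constants come straight from that text, and they may differ between dhcpcd versions.

```python
def dhcpcd_default_metric(if_index, wireless=False):
    """Default metric as described in the dhcpcd.conf text quoted above."""
    metric = 200 + if_index      # 200 + if_nametoindex(3)
    if wireless:
        metric += 100            # extra cost for wireless interfaces
    return metric


print(dhcpcd_default_metric(2))                 # 202
print(dhcpcd_default_metric(3, wireless=True))  # 303
```

The interface index is assigned by the kernel. In Python, socket.if_nametoindex("wlan0") returns it for an existing interface.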

NetworkManager

Set route-metric=METRIC in the [ipv4] section of your /etc/NetworkManager/system-connections/<connection>.nmconnection file (the property is called ipv4.route-metric in nmcli).
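If you want to script the file edit, the keyfile is an INI-style file, so a sketch with Python's configparser could look like the one below. It assumes the metric is stored as a route-metric key in the [ipv4] section. The function name and the example connection are made up, and comments in the original file are not preserved by configparser.

```python
import configparser
import io


def set_route_metric(keyfile_text, metric):
    """Return keyfile_text with route-metric set in the [ipv4] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(keyfile_text)
    if not parser.has_section("ipv4"):
        parser.add_section("ipv4")
    parser.set("ipv4", "route-metric", str(metric))
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Hypothetical, heavily trimmed connection file
example = """[connection]
id=tuxnet
type=wifi

[ipv4]
method=auto
"""
print(set_route_metric(example, 600))
```

NetworkManager has to re-read the file afterwards, for example with nmcli connection reload.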

You could also use the command line tool:

    $ nmcli connection edit tuxnet

    ===| nmcli interactive connection editor | ===

    Editing existing '802-11-wireless' connection: 'tuxnet'

    Type 'help' or '?' for available commands.
    Type 'print' to show all the connection properties.
    Type 'describe [<setting>.<prop>]' for detailed property description.

    You may edit the following settings: connection, 802-11-wireless (wifi), 802-11-wireless-security (wifi-sec), 802-1x, ethtool, match, ipv4, ipv6, tc, proxy
    nmcli> set ipv4.route-metric 600
    nmcli> save
    nmcli> quit

PPPD

PPP is a protocol used for establishing internet links over dial-up modems. These links are usually not the preferred links when the device has other more reliable and/or cheaper connections.

The pppd daemon has a few options as specified in the manual [2] for creating a default route and setting the metric value:

defaultroute
       Add a default route to the system routing tables, using
       the peer as the gateway, when IPCP negotiation is
       successfully completed.  This entry is removed when the
       PPP connection is broken.  This option is privileged if
       the nodefaultroute option has been specified.

defaultroute-metric
       Define the metric of the defaultroute and only add it if
       there is no other default route with the same metric.
       With the default value of -1, the route is only added if
       there is no default route at all.

replacedefaultroute
       This option is a flag to the defaultroute option. If
       defaultroute is set and this flag is also set, pppd
       replaces an existing default route with the new default
       route.  This option is privileged.

E.g.

replacedefaultroute
defaultroute-metric 900

Summary

It is not that often you actually have to set the metric value yourself. The network manager usually does a great job.

In my system, NetworkManager did not manage the PPP interface so its metric logic did not apply to that interface. Therefore I had to let pppd create a default route with a fixed metric.

Lund Linux Conference 2023

The conference

Lund Linux Conference (LLC) [1] is a "half-open" conference located in Lund. It is a conference of high quality and I appreciate that the atmosphere is more familiar than at the larger conferences. I've been at the conference a couple of times before and the quality of the talks this year was as good as usual. (The talks are by the way available on Youtube [3].)

We are growing though. Axis generously assists with premises, but it remains to be seen whether we will get a place next year.

Anyway, I took some notes as usual, and this blog post is nothing more than the notes I took during the talks.

The RISC-V Linux port; past/current/next

Björn Töpel talked about the current status of RISC-V architecture in the Linux kernel.

For those who don't know - RISC-V is an open and royalty-free Instruction Set Architecture. In practice, this means for example that whenever you want to implement your own CPU core in your FPGA, you are free to do so using the RISC-V ISA. Compare that to ARM, where you are strictly not allowed to even think about it without paying royalties and other fees.

RISC-V is a rather new port; the first proposal was sent out to the mailing list in 2016. This makes it a pretty good target to get involved in if you want to get to know the kernel in depth, as the implementation is still quite small in lines of code, which makes it easier to get an overview.

Björn told us that kernel support for RISC-V has made huge progress in the embedded area, but still lacks some important functionality to be useful on the server side. Parts that are missing are e.g. support for ACPI, UEFI, AP-TEE, hotplug and an advanced interrupt controller.

The architecture gets more support for each kernel release though. Some of the news for RISC-V in Linux v6.4 are:

  • Support for Kernel Address Space Layout Randomization (KASLR)
  • Relocatable kernel
  • HWprobe syscall

Vector support is on its way, but it currently breaks the ABI, so there are a few things left that need to be addressed before we can expect any merge.

One giant leap for security: leveraging capabilities in Linux

Kevin Brodsky talked about self-aware pointers, which I found interesting. That we can use address bits for other purposes than addresses is nothing new. In a 64-bit ARM kernel we often use only 52 bits anyway (4 PiB of addressable memory is more than enough for all people (pun intended)).

What Kevin and his team have done is to extend the address to 129 bits to also include metadata for boundaries, capabilities and validity tags. The 129-bit reservation has of course a huge impact on the system as it uses more than double the size compared to a normal 64-bit system, but it also gives us much in return.

These 129 bits are by the way already a compressed version of the 256-bit variant they started with..
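To get a feeling for what such a pointer carries, here is a toy Python model. The field names and the check() method are my own invention for illustration; the real hardware packs this into a compressed bit layout that I am not trying to reproduce here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    address: int      # the ordinary pointer value
    base: int         # lowest address the pointer may touch
    length: int       # size of the region starting at base
    writable: bool    # one example permission
    tag: bool = True  # validity tag, False for a forged or corrupted pointer

    def check(self, size, write=False):
        """Raise if an access of `size` bytes at `address` is not allowed."""
        if not self.tag:
            raise PermissionError("invalid capability")
        if write and not self.writable:
            raise PermissionError("capability is read-only")
        if self.address < self.base or self.address + size > self.base + self.length:
            raise IndexError("access outside capability bounds")


buf = Capability(address=0x1000, base=0x1000, length=16, writable=False)
buf.check(16)        # reading 16 bytes inside the bounds is fine
try:
    buf.check(17)    # one byte too many
except IndexError as err:
    print(err)       # access outside capability bounds
```

The point is that the bounds and permissions travel with the pointer itself, so every access can be checked without any extra bookkeeping in the application.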

Unfortunately, the implementation is for userspace only, which is a little sad because we already have tons of tools to run applications in a protected and constrained environment, but it proves that the concept works and maybe we will see something like this for kernel space in the future.

The implementation requires changes in several parts of the system. The memory allocator and unwind code are most affected, but even the kernel itself and glibc have to be modified. Most of the applications and libraries are not affected at all though.

There is a working meta-layer for Yocto called Morello that can be used to test it out. It contains a usage guide and even a little tutorial on how to build and run Doom :-)

Supporting zoned storage in Ublk

Andreas Hindborg has been working on support for zoned storage [2] in the ublk driver. Zoned storage is basically about splitting the address space into regions called zones that can only be written sequentially. This leads to higher throughput and increased capacity. It also eliminates the need for a Flash Translation Layer (FTL) for e.g. SSD devices.
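The sequential-write rule is easy to model. Below is a tiny Python sketch of a single zone with a write pointer; the class and its method names are hypothetical and have nothing to do with the actual ublk code.

```python
class Zone:
    """A zone that only accepts writes at its write pointer."""

    def __init__(self, start, size):
        self.start = start          # first sector of the zone
        self.size = size            # number of sectors in the zone
        self.write_pointer = start  # next sector that may be written

    def write(self, sector, count):
        if sector != self.write_pointer:
            raise ValueError("zone must be written sequentially")
        if sector + count > self.start + self.size:
            raise ValueError("write crosses the end of the zone")
        self.write_pointer += count

    def reset(self):
        """Rewind the zone so it can be rewritten from the start."""
        self.write_pointer = self.start


zone = Zone(start=0, size=8)
zone.write(0, 4)
zone.write(4, 4)            # continues where the last write ended
print(zone.write_pointer)   # 8
```

Writing to sector 2 at this point would raise an error; the only way to reuse the zone is to reset it and start from the beginning again.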

ublk makes use of io_uring internally, which by the way is a cool feature. io_uring lets you queue system calls into a ring buffer, which makes it possible to do more work every time you enter kernel space. This has an impact on performance as you do not need to context switch back and forth to userspace between each system call.

It is quite easy to add support for io_uring operations to normal character devices, as struct file_operations now has a uring_cmd callback function that could be populated. This makes it a high-performance alternative to the IOCTL we are used to.

ublk is used to create a block device driver in userspace. It works by redirecting all requests and results to/from the block device to a userspace daemon. The userspace daemon used for this is called ublk-rs, which is entirely written in Rust (of course..). Unfortunately, the source code is not yet available due to legal reasons, but it is on its way.

His work was to add zoned storage support on top of this.

Rust

Then there were a couple of talks about the hippest programming language right now: Rust.

Linus Walleij gave us a history lecture on programming languages in his talk "Rust: Abstraction and Productivity" and shared his thoughts about why Rust could be something good for the kernel. Andreas Hindborg continued and showed how he implemented a null_blk driver completely in Rust.

But why should we even consider Rust for the kernel? In fact, the language is guaranteed to have a few properties C does not, and basic Rust support was introduced in Linux v6.1.

We say that Rust is safe, and when we state that, we mean that Rust gives us:

  • No buffer overflows
  • No use after free
  • No dereferencing null or invalid pointers
  • No double free
  • No pointer aliasing
  • No type errors
  • No data races
  • ... and more

What was new to me is that a Rust application does not even compile if you try any of the above.

Together, this makes Rust memory safe, type safe and thread safe. Consider that 20-60% of the bug fixes in the kernel are for memory safety bugs. These memory bugs take a lot of productivity away as it often takes a long time to find and fix them. Maybe Rust is not that bad after all.

Many cool projects are going on in Rust. Examples of those are:

  • TLS handshake in the kernel
  • Ethernet-SPI drivers
  • M1&M3 GPU drivers.

The goal with Andreas's null_blk driver is to first write a solid Rust API for the blk-mq implementation and then use it in the null_blk driver to provide a reference implementation for Linux kernel developers to get started with.

Summary

This was far from all the talks, only those that I had taken some meaningful notes from.

Hope to see you there next year!

/media/lund-linuxcon-2018.jpg

Encrypted storage on i.MX

Brief

Many embedded Linux systems have some kind of sensitive information in file storage. It could be private keys, passwords or whatever. There is always a risk that this information could be revealed by an unauthorized person who gets their physical hands on the device. The only protection against attackers who simply bypass the system and access the data storage directly is encryption.

Let's say that we encrypt our sensitive data. Where should we then store the decryption key?

We need to store even that sensitive key in a secure place.

i.MX CAAM

Most of the i.MX SoCs have the Cryptographic Accelerator and Assurance Module (CAAM). This includes both the i.MX6 and i.MX8 SoC series. The only i.MX SoC that I have worked with that does not have the CAAM module is i.MX6ULL, but there could be more.

The CAAM module has many use cases and one of those is to generate and handle secure keys: keys that we could use to encrypt/decrypt a file, a partition or a whole disk.

Device mapper

Device mapper is a framework that adds an extra abstraction layer on block devices that lets you create virtual block devices to offer additional features. Such features could be snapshots, RAID, or as in our case, disk encryption.

As you can see in the picture below, the device mapper is a layer in between the Block layer and the Virtual File System (VFS) layer:

/media/device-mapper.png

The Linux kernel does support a bunch of different mappers. The current kernel (v6.2) does support the following mappers [1]:

  • dm-delay
  • dm-clone
  • dm-crypt
  • dm-dust
  • dm-ebs
  • dm-flakey
  • dm-ima
  • dm-integrity
  • dm-io
  • dm-queue-length
  • dm-raid
  • dm-service-time
  • dm-zoned
  • dm-era
  • dm-linear
  • dm-log-writes
  • dm-stripe
  • dm-switch
  • dm-verity
  • dm-zero

Of these, dm-crypt [2] is the one we will focus on. One cool feature of device mappers is that they are stackable. You could for example use dm-crypt on top of a dm-raid mapping. How cool is that?

DM-Crypt

DM-Crypt is a device mapper implementation that uses the Crypto API [3] to transparently encrypt/decrypt all access to the block device. Once the device is mounted, users will not even notice that the data read from or written to that mount point is encrypted.

Normally you will use cryptsetup [4] or cryptmount [5] as those are the preferred ways to handle the dm-crypt layer. For this we will use dmsetup though, which is a very low-level (and difficult) tool to use.

CAAM Secure Keys

Now it is time to answer the question in the introduction section:

Let's say that we encrypt our sensitive data. Where should we then store the decryption key?

The CAAM module has a way to handle these keys securely by storing them in a protected area that is only readable by the CAAM module itself. In other words, it is not even possible to read out the key. Together with dm-crypt, we can create a master key that will never leave this protected area. On each boot, we will generate a derived (session) key that is the key we could use from userspace. These session keys are called black keys.

How to use it?

Installation

We need to build and install keyctl_caam in order to generate black keys and encapsulate them into black blobs. Download the source code:

git clone https://github.com/nxp-imx/keyctl_caam.git
cd keyctl_caam

And build:

CC=aarch64-linux-gnu-gcc make

I build with an external toolchain prefixed with aarch64-linux-gnu-. If you have a Yocto environment, you could use the toolchain from that SDK instead by sourcing the environment setup script, e.g.:

. ./environment-setup-aarch64-poky-linux
make

You also have to make sure that the following kernel configurations are enabled:

CONFIG_BLK_DEV_DM=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD=y
CONFIG_DM_CRYPT=y
CONFIG_DM_MULTIPATH=y
CONFIG_CRYPTO_DEV_FSL_CAAM_TK_API=y

Usage

Create a black key from random data, use ECB encryption:

caam-keygen create randomkey ecb -s 16

The files are written to the /data/caam/ folder unless the application is built to use another location (specified with KEYBLOB_LOCATION). Two files should now have been generated:

ls -l /data/caam/
total 8
-rw-r--r-- 1 root root 36 apr 3 21.09 randomkey
-rw-r--r-- 1 root root 96 apr 3 21.09 randomkey.bb

Add the generated black key to the kernel key retention service. To do this we use the keyctl command:

cat /data/caam/randomkey | keyctl padd logon logkey: @s

Create a device-mapper device named $ENCRYPTED_LABEL and map it to the block device $DEVICE:

dmsetup -v create $ENCRYPTED_LABEL --table "0 $(blockdev --getsz $DEVICE) crypt capi:tk(cbc(aes))-plain :36:logon:logkey: 0 $DEVICE 0 1 sector_size:512"
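The table string is rather cryptic, so here is a small Python sketch that builds the same line field by field, with a comment on each field as I understand it from the dm-crypt documentation [2]. The function name, the device path and the sector count in the example are made up; the sector count is the number you would get from blockdev --getsz.

```python
def dm_crypt_table(sectors, device, key_desc="logkey:", key_size=36,
                   cipher="capi:tk(cbc(aes))-plain", sector_size=512):
    """Build the dm-crypt table line passed to `dmsetup create --table`."""
    fields = [
        "0",                               # start sector of the mapping
        str(sectors),                      # length in 512-byte sectors
        "crypt",                           # device-mapper target
        cipher,                            # cipher in Crypto API syntax
        f":{key_size}:logon:{key_desc}",   # key fetched from the kernel keyring
        "0",                               # IV offset
        device,                            # backing block device
        "0",                               # start offset on the backing device
        "1",                               # number of optional parameters
        f"sector_size:{sector_size}",      # the optional parameter itself
    ]
    return " ".join(fields)


# Hypothetical 100 MiB partition
print(dm_crypt_table(204800, "/dev/mmcblk2p3"))
```

Note that the 36 in the key field matches the size of the randomkey file listed above, and logkey: is the description we gave the key when adding it with keyctl.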

Create a filesystem on our newly created mapper device:

mkfs.ext4 -L $VOLUME_LABEL /dev/mapper/$ENCRYPTED_LABEL

Mount it on $MOUNT_POINT:

mount /dev/mapper/$ENCRYPTED_LABEL ${MOUNT_POINT}

Congrats! Your encrypted device is now ready to use! All data written to $MOUNT_POINT will be encrypted on the fly and decrypted upon read.

To illustrate this, create a file on the encrypted volume:

echo "Encrypted data" > ${MOUNT_POINT}/encrypted-file

Clean up and reboot:

umount $MOUNT_POINT
dmsetup remove $ENCRYPTED_LABEL
keyctl clear @s
reboot

A new session key will be generated upon each cold boot. So we have to import the key from the blob and add it to the key retention service. We also have to create the device mapper. This has to be done at each boot:

caam-keygen import $KEYPATH/$KEYNAME.bb $IMPORTKEY
cat $IMPORTKEYPATH/$IMPORTKEY | keyctl padd logon logkey: @s
dmsetup -v create $ENCRYPTED_LABEL --table "0 $(blockdev --getsz $DEVICE) crypt capi:tk(cbc(aes))-plain :36:logon:logkey: 0 $DEVICE 0 1 sector_size:512"
mount /dev/mapper/$ENCRYPTED_LABEL ${MOUNT_POINT}

We will now be able to read back the data from the encrypted device:

cat ${MOUNT_POINT}/encrypted-file
Encrypted data

That was it!

Conclusion

Encryption could be hard, but the CAAM module makes it pretty much straightforward. It protects your secrets from physical attacks, which could be hard to protect against otherwise.

However, keep in mind that as soon as the encrypted device is mounted and available to the system, it is free to read for any intruder that has access to the system.

The device security chain is no stronger than its weakest link and you have to identify and handle all potential security risks. This is only one.

Bug in the iMX8MP ECSPI module?

Background

I do have a system where I can swap between iMX8M Mini and iMX8M Plus CPU modules on the same carrier board.

I wrote an SPI driver for a device on the carrier board. The device is connected to ECSPI1 (the CPU contains several ECSPI modules) and uses the hardware chipselect 0 (SS0). The driver has been used with the iMX8MM CPU module for a while, but as soon as I swapped to the iMX8MP it suddenly stopped working.

Both iMX8MM and iMX8MP have the same ECSPI IP block that is managed by the spi-imx [1] Linux kernel driver. The application and root filesystem are the same as well.

Same driver, same application, different module. What is happening?

The driver layer did not report anything suspicious either; all SPI transactions contained the data I expected and were successfully sent out on the bus. After debugging the application, driver and devicetree for a while, I took a closer look at the actual SPI signals.

SPI signals

I'm not going to describe the SPI interface specifications, please see Wikipedia [2] or such for more details.

It turns out that the chip select goes inactive after each sent byte, which is a weird behavior. The chipselect should stay low during the whole data transaction.

Here are the signals of one transaction of two bytes:

/media/imx8mp-spi-ss0.jpg

The ECSPI modules support dynamic burst size, so I experimented with that without any success.

Workaround

The best workaround I came up with was to MUX the chipselect pin to the GPIO function instead of SS0 and map that GPIO as chipselect to ECSPI1 by overriding the affected properties in the device tree file:

&ecspi1 {
          cs-gpios =
                      <&gpio5 9 GPIO_ACTIVE_LOW>,
                      <&gpio2 8 GPIO_ACTIVE_LOW>;
};

&pinctrl_ecspi1_cs0 {
        fsl,pins = <
                MX8MP_IOMUXC_ECSPI1_SS0__GPIO5_IO09         0x40000
                    >;
};

Then the signals look better:

/media/imx8mp-spi-gpio.jpg

Conclusion

I do not know if all ECSPI modules with all HW chipselects are affected or only SS0 @ ECSPI1. I could not find anything about it in the iMX8MP Errata.

The fact that the workaround did work makes me suspect a hardware bug in the iMX8MP processor. I guess we will see if it shows up in the errata later on.