Mounting with systemd and udev

Systemd hasn't always been my first choice of init system for embedded systems, but I can't ignore that it has many good and handy features that other init systems lack. At the same time, that is exactly what I dislike about systemd: it does not follow the "do one thing and do it well" philosophy that I like so much. I'm very torn about it.

However, when you try to do things with systemd the way you used to with other init systems, you sometimes run into difficulties. Mostly this is simply because there is another way to accomplish what you want, the "systemd way", which is usually better and safer, but sometimes you simply don't want to do it that way.

One such thing I encountered was mounting file systems from udev. This used to work, but since systemd v239 two separate directives have been added to the udev service unit that change this default behavior.

units: switch from system call blacklist to whitelist

Commit ee8f26180d01e3ddd4e5f20b03b81e5e737657ae [1]

units: switch from system call blacklist to whitelist

This is generally the safer approach, and is what container managers
(including nspawn) do, hence let's move to this too for our own
services. This is particularly useful as this this means the new
@system-service system call filter group will get serious real-life
testing quickly.

This also switches from firing SIGSYS on unexpected syscalls to
returning EPERM. This would have probably been a better default anyway,
but it's hard to change that these days. When whitelisting system calls
SIGSYS is highly problematic as system calls that are newly introduced
to Linux become minefields for services otherwise.

Note that this enables a system call filter for udev for the first time,
and will block @clock, @mount and @swap from it. Some downstream
distributions might want to revert this locally if they want to permit
unsafe operations on udev rules, but in general this shiuld be mostly
safe, as we already set MountFlags=shared for udevd, hence at least
@mount won't change anything.

This patch changes the default filter from a blacklist to a whitelist. Since @mount is not on the list, mount system calls made from udev are no longer allowed and fail with EPERM:

+ SystemCallFilter=@system-service @module @raw-io
+ SystemCallErrorNumber=EPERM
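
To make the effect of the new filter concrete, here is a minimal Python sketch that collects the entries of every SystemCallFilter= line in a unit file and checks whether @mount is listed. The function name and the sample text are made up for illustration. The check only looks at what is literally listed: it does not expand groups such as @system-service, and it does not interpret deny entries prefixed with "~".

```python
def syscall_filter_entries(unit_text):
    """Collect the entries of all SystemCallFilter= lines in a unit file.

    Hypothetical helper for illustration only. Multiple assignments are
    merged, and an empty assignment resets the list, as systemd does.
    """
    entries = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if not line.startswith("SystemCallFilter="):
            continue
        value = line.split("=", 1)[1].strip()
        if value == "":
            entries = []          # empty assignment resets the filter
        else:
            entries.extend(value.split())
    return entries


# Sample text: the two lines added by the commit above.
SAMPLE_UNIT = """\
[Service]
SystemCallFilter=@system-service @module @raw-io
SystemCallErrorNumber=EPERM
"""

if __name__ == "__main__":
    entries = syscall_filter_entries(SAMPLE_UNIT)
    print("listed:", entries)
    print("@mount listed:", "@mount" in entries)
```

Running the same helper on the text of the unit file actually installed on a target shows whether a distribution has already added @mount back.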

units: switch udev service to use PrivateMounts=yes

Commit b2e8ae7380d009ab9f9260a34e251ac5990b01ca [2]

units: switch udev service to use PrivateMounts=yes

Given that PrivateMounts=yes is the "successor" to MountFlags=slave in
unit files, let's make use of it for udevd.

What does the systemd documentation say about PrivateMounts? [3]

PrivateMounts=
Takes a boolean parameter. If set, the processes of this unit will be run in their own private file system (mount) namespace with all mount propagation from the processes towards the host's main file system namespace turned off. This means any file system mount points established or removed by the unit's processes will be private to them and not be visible to the host. However, file system mount points established or removed on the host will be propagated to the unit's processes. See mount_namespaces(7) for details on file system namespaces. Defaults to off.

When turned on, this executes three operations for each invoked process: a new CLONE_NEWNS namespace is created, after which all existing mounts are remounted to MS_SLAVE to disable propagation from the unit's processes to the host (but leaving propagation in the opposite direction in effect). Finally, the mounts are remounted again to the propagation mode configured with MountFlags=, see below.

File system namespaces are set up individually for each process forked off by the service manager. Mounts established in the namespace of the process created by ExecStartPre= will hence be cleaned up automatically as soon as that process exits and will not be available to subsequent processes forked off for ExecStart= (and similar applies to the various other commands configured for units). Similarly, JoinsNamespaceOf= does not permit sharing kernel mount namespaces between units, it only enables sharing of the /tmp/ and /var/tmp/ directories.

Other file system namespace unit settings — PrivateMounts=, PrivateTmp=, PrivateDevices=, ProtectSystem=, ProtectHome=, ReadOnlyPaths=, InaccessiblePaths=, ReadWritePaths=, … — also enable file system namespacing in a fashion equivalent to this option. Hence it's primarily useful to explicitly request this behaviour if none of the other settings are used.

This option is only available for system services, or for services running in per-user instances of the service manager when PrivateUsers= is enabled.

With PrivateMounts=yes, udevd runs in its own mount namespace, so any file system it mounts is visible only to udevd itself and is not propagated to the rest of the system.
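
One way to see this on a running system is to compare mount namespace identifiers. The kernel exposes them as the symlink /proc/<pid>/ns/mnt, and two processes share a mount namespace exactly when the link targets are equal. The following is a small sketch of that check; the function names are my own, and reading the link of another user's process (such as PID 1 or udevd) normally requires root.

```python
import os


def mount_namespace_id(pid="self"):
    """Return the mount namespace identifier of a process.

    The value is the target of /proc/<pid>/ns/mnt and has the form
    'mnt:[<inode number>]'.
    """
    return os.readlink(f"/proc/{pid}/ns/mnt")


def shares_mount_namespace(pid_a, pid_b="self"):
    """True if both processes live in the same mount namespace."""
    return mount_namespace_id(pid_a) == mount_namespace_id(pid_b)


if __name__ == "__main__":
    # Print the namespace of the current process as a quick smoke test.
    print(mount_namespace_id())
```

Called as root with the PID of udevd and PID 1, shares_mount_namespace() should return False while PrivateMounts=yes is in effect, which matches the symptom that mounts made from udev rules never show up on the host.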

Conclusion

There are certainly good reasons not to let udev mount file systems, but if you still want to do it you have to revert these changes by modifying /lib/systemd/system/systemd-udevd.service with: