chroot and user namespaces

While playing around with libcamera [1] and br2-readonly-rootfs-overlay [2], I found something... well... unexpected. At least at first glance.

What happened was that I encountered this error:

 $ libcamera-still
Preview window unavailable
[0:02:54.785102683] [517]  INFO Camera camera_manager.cpp:299 libcamera v0.0.0+67319-2023.02-22-gd530afad-dirty (2024-02-20T16:56:34+01:00)
[0:02:54.885731084] [518] ERROR Process process.cpp:312 Failed to unshare execution context: Operation not permitted

Failed to unshare execution context: Operation not permitted... what?

I know that libcamera executes proprietary IPAs (Image Processing Algorithms) as black boxes, and that the execution is isolated in their own namespace. But.. not permitted..?

Let's look into the code for libcamera (src/libcamera/process.cpp) [3]:

int Process::isolate()
{
	int ret = unshare(CLONE_NEWUSER | CLONE_NEWNET);
	if (ret) {
		ret = -errno;
		LOG(Process, Error) << "Failed to unshare execution context: "
				    << strerror(-ret);
		return ret;
	}

	return 0;
}

Libcamera does indeed create a user namespace and a network namespace for the execution. In Linux, new namespaces can be created with unshare(2).
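
To make the flag mask a bit more tangible, here is a minimal sketch that decodes which namespaces a set of unshare flags asks for. The helper describe_ns_flags() and its short names are my own invention for illustration; they are not part of libcamera or libc.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>

/*
 * Hypothetical helper: write the names of the namespaces requested in
 * 'flags' into 'buf', separated by spaces. Only the six classic
 * namespace flags are covered here.
 */
static void describe_ns_flags(unsigned long flags, char *buf, size_t len)
{
	static const struct {
		unsigned long flag;
		const char *name;
	} table[] = {
		{ CLONE_NEWNS,   "mount" },
		{ CLONE_NEWUTS,  "uts" },
		{ CLONE_NEWIPC,  "ipc" },
		{ CLONE_NEWUSER, "user" },
		{ CLONE_NEWPID,  "pid" },
		{ CLONE_NEWNET,  "net" },
	};
	size_t i;

	if (len == 0)
		return;

	buf[0] = '\0';
	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (!(flags & table[i].flag))
			continue;
		/* Separate entries with a single space. */
		if (buf[0] != '\0')
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, table[i].name, len - strlen(buf) - 1);
	}
}
```

Feeding it the mask from Process::isolate(), CLONE_NEWUSER | CLONE_NEWNET, gives the string "user net".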

The unshare(2) library call lets a process disassociate parts of its execution context by creating new namespaces. As you can see in the manpage [4], there are a few scenarios that can make it fail with EPERM:

EPERM  The calling process did not have the required privileges for this operation.

EPERM  CLONE_NEWUSER  was  specified in flags, but either the effective user ID or the effective group ID of the caller does not have a mapping in
       the parent namespace (see user_namespaces(7)).

EPERM (since Linux 3.9)
       CLONE_NEWUSER was specified in flags and the caller is in a chroot environment (i.e., the caller's root directory does not match  the  root
       directory of the mount namespace in which it resides).

The last one caught my eye. I was not aware of this, even though it makes a lot of sense to have it this way.
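
If you want to see which errno you get on your own system, a minimal sketch could look like the one below. It tries the unshare in a forked child, so the calling process keeps its own context. The helper probe_userns() is a name I made up for this example; it is not taken from libcamera.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to create a user namespace in a forked child.
 * Returns 0 if unshare(CLONE_NEWUSER) succeeded in the child, or a
 * negative errno value if it (or the plumbing around it) failed.
 */
static int probe_userns(void)
{
	int pipefd[2];
	int err = 0;
	pid_t pid;

	if (pipe(pipefd) < 0)
		return -errno;

	pid = fork();
	if (pid < 0) {
		err = -errno;
		close(pipefd[0]);
		close(pipefd[1]);
		return err;
	}

	if (pid == 0) {
		/* Child: attempt the unshare and report the result. */
		close(pipefd[0]);
		if (unshare(CLONE_NEWUSER) < 0)
			err = -errno;
		if (write(pipefd[1], &err, sizeof(err)) != sizeof(err))
			_exit(1);
		_exit(0);
	}

	/* Parent: read the child's result and reap it. */
	close(pipefd[1]);
	if (read(pipefd[0], &err, sizeof(err)) != sizeof(err))
		err = -EIO;
	close(pipefd[0]);
	waitpid(pid, NULL, 0);

	return err;
}
```

Going by the manpage, I would expect -EPERM when this is called from a chrooted process. Outside a chroot the result depends on the system, since user namespaces can also be disabled or restricted by configuration.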

Unfortunately, the last step of the init script in the initramfs for br2-readonly-rootfs-overlay is to chroot(1) into the actual root filesystem. I initially chose chroot because it works with both an initrd and an initramfs root filesystem, and the code was intended to be generic.

pivot_root and switch_root can (and should) be used instead, but that requires you to know which type of root filesystem you are working with (pivot_root for an initrd, switch_root for an initramfs).

As br2-readonly-rootfs-overlay is now only designed to be used with an initramfs, I switched to switch_root and the problem was solved.

For the sake of fun, let's follow the code down to where the error occurs.

Code digging

When the application calls unshare(2), which is simply a thin wrapper in the libc implementation around the system call, execution ends up in the kernel.
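
To illustrate how thin that wrapper is, here is a small sketch that makes the same call twice: once through the libc wrapper and once through the generic syscall(2) entry point. The two function names are mine and only exist for this example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Call unshare through the libc wrapper; returns 0 or a negative errno. */
static int unshare_via_wrapper(int flags)
{
	return unshare(flags) < 0 ? -errno : 0;
}

/* Issue the same system call by number through syscall(2). */
static int unshare_via_syscall(int flags)
{
	return syscall(SYS_unshare, flags) < 0 ? -errno : 0;
}
```

With flags set to 0 nothing is unshared, so both variants can be called safely and should return the same value.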

The system call on the kernel side is defined as follows:

SYSCALL_DEFINE1(unshare, unsigned long, unshare_flags)
{
	return ksys_unshare(unshare_flags);
}

It only calls ksys_unshare(), which interprets unshare_flags and "unshares" namespaces depending on which flags are set. The flag for unsharing a user namespace is CLONE_NEWUSER (CLONE_NEWNS, despite its name, is the flag for the mount namespace).

int ksys_unshare(unsigned long unshare_flags)
{
	struct fs_struct *fs, *new_fs = NULL;
	struct files_struct *new_fd = NULL;
	struct cred *new_cred = NULL;
	struct nsproxy *new_nsproxy = NULL;
	int do_sysvsem = 0;
	int err;

	/*
	 * If unsharing a user namespace must also unshare the thread group
	 * and unshare the filesystem root and working directories.
	 */
	if (unshare_flags & CLONE_NEWUSER)
		unshare_flags |= CLONE_THREAD | CLONE_FS;
	/*
	 * If unsharing vm, must also unshare signal handlers.
	 */
	if (unshare_flags & CLONE_VM)
		unshare_flags |= CLONE_SIGHAND;
	/*
	 * If unsharing a signal handlers, must also unshare the signal queues.
	 */
	if (unshare_flags & CLONE_SIGHAND)
		unshare_flags |= CLONE_THREAD;
	/*
	 * If unsharing namespace, must also unshare filesystem information.
	 */
	if (unshare_flags & CLONE_NEWNS)
		unshare_flags |= CLONE_FS;

	err = check_unshare_flags(unshare_flags);
	if (err)
		goto bad_unshare_out;
	/*
	 * CLONE_NEWIPC must also detach from the undolist: after switching
	 * to a new ipc namespace, the semaphore arrays from the old
	 * namespace are unreachable.
	 */
	if (unshare_flags & (CLONE_NEWIPC|CLONE_SYSVSEM))
		do_sysvsem = 1;
	err = unshare_fs(unshare_flags, &new_fs);
	if (err)
		goto bad_unshare_out;
	err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd);
	if (err)
		goto bad_unshare_cleanup_fs;
	err = unshare_userns(unshare_flags, &new_cred);
	....
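
The first part of ksys_unshare() only widens the flag mask: some flags imply others. The sketch below copies that logic into a plain user-space function so the effect on a given mask can be inspected. expand_unshare_flags() is a name I made up; the kernel has no such function and does this inline.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * User-space copy of the flag implications at the top of ksys_unshare(),
 * for illustration only. Returns the widened flag mask.
 */
static unsigned long expand_unshare_flags(unsigned long flags)
{
	/* A new user namespace also unshares the thread group and fs info. */
	if (flags & CLONE_NEWUSER)
		flags |= CLONE_THREAD | CLONE_FS;
	/* Unsharing vm also unshares signal handlers. */
	if (flags & CLONE_VM)
		flags |= CLONE_SIGHAND;
	/* Unsharing signal handlers also unshares the signal queues. */
	if (flags & CLONE_SIGHAND)
		flags |= CLONE_THREAD;
	/* A new mount namespace also unshares filesystem information. */
	if (flags & CLONE_NEWNS)
		flags |= CLONE_FS;

	return flags;
}
```

For the mask used by libcamera, CLONE_NEWUSER | CLONE_NEWNET, the result also contains CLONE_THREAD and CLONE_FS. CLONE_FS is the interesting one here, since the filesystem information it covers includes the root directory of the process.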

unshare_userns() prepares credentials and then creates the user namespace by calling create_user_ns():

int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
{
	struct cred *cred;
	int err = -ENOMEM;

	if (!(unshare_flags & CLONE_NEWUSER))
		return 0;

	cred = prepare_creds();
	if (cred) {
		err = create_user_ns(cred);
		if (err)
			put_cred(cred);
		else
			*new_cred = cred;
	}

	return err;
}

Before create_user_ns() creates the actual user namespace, it performs various sanity checks, one of which is current_chrooted():

int create_user_ns(struct cred *new)
{
	struct user_namespace *ns, *parent_ns = new->user_ns;
	kuid_t owner = new->euid;
	kgid_t group = new->egid;
	struct ucounts *ucounts;
	int ret, i;

	ret = -ENOSPC;
	if (parent_ns->level > 32)
		goto fail;

	ucounts = inc_user_namespaces(parent_ns, owner);
	if (!ucounts)
		goto fail;

	/*
	 * Verify that we can not violate the policy of which files
	 * may be accessed that is specified by the root directory,
	 * by verifying that the root directory is at the root of the
	 * mount namespace which allows all files to be accessed.
	 */
	ret = -EPERM;
	if (current_chrooted())
		goto fail_dec;
	
	....

This is how the kernel checks whether the current process is chrooted or not:

bool current_chrooted(void)
{
	/* Does the current process have a non-standard root */
	struct path ns_root;
	struct path fs_root;
	bool chrooted;

	/* Find the namespace root */
	ns_root.mnt = &current->nsproxy->mnt_ns->root->mnt;
	ns_root.dentry = ns_root.mnt->mnt_root;
	path_get(&ns_root);
	while (d_mountpoint(ns_root.dentry) && follow_down_one(&ns_root))
		;

	get_fs_root(current->fs, &fs_root);

	chrooted = !path_equal(&fs_root, &ns_root);

	path_put(&fs_root);
	path_put(&ns_root);

	return chrooted;
}
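
The kernel compares the root of the process with the root of its mount namespace by comparing mount and dentry pointers. User space has no access to those, but a rough analogue is to compare the device and inode numbers of two paths. The sketch below does only that; same_dir() is a hypothetical helper and not an equivalent of current_chrooted().

```c
#include <sys/stat.h>

/*
 * Hypothetical helper: returns 1 if both paths resolve to the same inode
 * on the same device, 0 if they differ, and -1 if either stat(2) fails.
 */
static int same_dir(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) < 0 || stat(b, &sb) < 0)
		return -1;

	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

A commonly used heuristic is same_dir("/", "/proc/1/root"), which normally needs root privileges. Note that it compares against the root of PID 1, not against the root of the mount namespace, so it would miss the situation in this post if init itself was started inside the chroot. Trying the unshare and looking at errno remains the more reliable test.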

Summary

Do not try to create user namespaces from chrooted processes. It does not work, and for good reasons.