PID 1 in containers

What is PID 1

The top-most process in a UNIX system has PID (Process ID) 1 and is usually the init process.
The init process is the first userspace application started on a system and is started by the kernel at boot time.
The kernel looks for it in a few predefined paths (and in the path given by the init kernel parameter). If no such application is found, the system will panic().

See init/main.c:kernel_init

if (!try_to_run_init_process("/sbin/init") ||
    !try_to_run_init_process("/etc/init") ||
    !try_to_run_init_process("/bin/init") ||
    !try_to_run_init_process("/bin/sh"))
    return 0;
panic("No working init found.  Try passing init= option to kernel. "
      "See Linux Documentation/admin-guide/init.rst for guidance.");

All processes in UNIX have a parent/child relationship, which builds up a big relationship tree.
Some resources and permissions are inherited from parent to child, such as the UID and cgroup restrictions.

As in the real world, with parenthood come obligations.
For example: what is usually the last line of your main() function? Hopefully something like

return EXIT_SUCCESS;

All processes exit with an exit code that tells us whether the operation was successful or not.
Who is interested in this exit code anyway?
In the real world, parents are interested in their children's results, and so it is here too. The parent is responsible for calling wait(2) on its children when they terminate, just to fetch their exit codes.
But what if the parent dies before the child?

Let's go back to the init process.
The init process has several tasks, and one of them is to adopt "orphaned" child processes (processes whose parent has died).
Why? Because every process will return an exit code and will not terminate completely until someone listens to what it has to say.
The init process simply wait(2)s for the exit code, throws it away and lets the child die. Sad but true, but the child may not rest in peace otherwise.
The operating system expects the init process to reap its adopted children. Otherwise the children will remain in the system as zombies, taking up some kernel resources and consuming a slot in the kernel process table.
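
To make this concrete, here is a minimal sketch of such a reap loop in plain userspace C. This is not the kernel's or tini's actual code; the forked children and the printouts are just for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* Pretend we are an init-like process that has spawned (or adopted)
     * a number of children. Here we just fork a few that exit at once. */
    for (int i = 0; i < 3; i++) {
        if (fork() == 0)
            exit(EXIT_SUCCESS);
    }

    /* The reap loop: wait(2) for every child, fetch its exit code and
     * let it rest in peace. Without this, the children would remain
     * as zombies in the process table. */
    int status;
    pid_t pid;
    while ((pid = wait(&status)) > 0) {
        if (WIFEXITED(status))
            printf("reaped %d, exit code %d\n", (int)pid, WEXITSTATUS(status));
    }

    return EXIT_SUCCESS;
}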

PID 1 in containers

Containers are a concept that isolates processes in different namespaces. Examples of such namespaces are PID, user, networking and filesystem.
Creating a container is quite simple: just create a new process with clone(2) and provide the relevant flags to create new namespaces for the process (see the sketch after the flag list below).

The flags related to namespaces are listed in include/uapi/linux/sched.h:

#define CLONE_NEWNS                 0x00020000      // New mount namespace
#define CLONE_NEWCGROUP             0x02000000      // New cgroup namespace
#define CLONE_NEWUTS                0x04000000      // New utsname namespace
#define CLONE_NEWIPC                0x08000000      // New ipc namespace
#define CLONE_NEWUSER               0x10000000      // New user namespace
#define CLONE_NEWPID                0x20000000      // New pid namespace
#define CLONE_NEWNET                0x40000000      // New network namespace
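
As a rough sketch (not a complete container runtime), creating a process in a new PID namespace with clone(2) could look like the following. The stack size and the printouts are my own choices, and creating a PID namespace requires root (CAP_SYS_ADMIN):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static int child_main(void *arg)
{
    /* Inside the new PID namespace this process is PID 1. */
    printf("child sees itself as PID %d\n", (int)getpid());
    return 0;
}

int main(void)
{
    static char child_stack[STACK_SIZE];

    /* CLONE_NEWPID gives the child its own PID namespace.
     * The stack grows downwards, so pass the top of the buffer. */
    pid_t pid = clone(child_main, child_stack + STACK_SIZE,
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");
        exit(EXIT_FAILURE);
    }

    printf("parent sees the child as PID %d\n", (int)pid);
    waitpid(pid, NULL, 0);
    return EXIT_SUCCESS;
}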

All processes run in a "container context", because a process always executes inside namespaces.
On a system "without containers", all processes simply share one common set of namespaces.

When CLONE_NEWPID is used, the kernel will create a new PID namespace and let the newly created process have PID 1.
As we already know, the PID 1 process has a very special task, namely to adopt and reap orphaned children.
This PID 1 process could be any application (make, bash, nginx, an FTP server or whatever) that is missing this essential adopt-and-reap mechanism.
If the reaping is not handled, the result is zombie processes. This was a real problem not long ago for Docker containers (google Docker and zombies to see what I mean).
Nowadays we have the --init flag on docker run to tell the container to use tini (https://github.com/krallin/tini), a zombie-reaping init process, as PID 1.

When PID 1 dies

This is the reason to why I’m writing this post. I was wondering who is killing PID 1 in a container since we learned that a PID 1 may not die under any circumstances.
PID 1 in cointainers is obviosly an exception from this golden rule, but how does the kernel differentiate between init processes in different PID namespaces?

Let's follow a process to its very last breath.

The call chain we will look at is the following:
do_exit()->exit_notify()->forget_original_parent()->find_child_reaper().

do_exit()

kernel/exit.c:do_exit() is called when a process is going to be cleaned up from the system after it has exited or been terminated.
The function collects the exit code, deletes timers, frees up resources and so on.
Here is an extract of the function:

......

<<<<< Collect exit code >>>>>
tsk->exit_code = code;
taskstats_exit(tsk, group_dead);


exit_mm(tsk);

if (group_dead)
    acct_process();
trace_sched_process_exit(tsk);

<<<<< Free up resources >>>>>
exit_sem(tsk);
exit_shm(tsk);
exit_files(tsk);
exit_fs(tsk);
if (group_dead)
    disassociate_ctty(1);
exit_task_namespaces(tsk);
exit_task_work(tsk);
exit_thread(tsk);

perf_event_exit_task(tsk);

sched_autogroup_exit_task(tsk);
cgroup_exit(tsk);

<<<<< Notify tasks in the same group >>>>>
exit_notify(tsk, group_dead);

.........

exit_notify() notifies our "dead group" that we are going down.
One important thing to notice is that almost all resources have been freed at this point.
Even if the process goes into a zombie state, its footprint is relatively small; still, the zombie consumes a slot in the process table.

The size of the process table in Linux is defined by PID_MAX_LIMIT in include/linux/threads.h:

/*
 * A maximum of 4 million PIDs should be enough for a while.
 * [NOTE: PID/TIDs are limited to 2^29 ~= 500+ million, see futex.h.]
 */
#define PID_MAX_LIMIT (CONFIG_BASE_SMALL ? PAGE_SIZE * 8 : \
        (sizeof(long) > 4 ? 4 * 1024 * 1024 : PID_MAX_DEFAULT))

The process table is indeed quite big. But if you are, for example, running a web server as PID 1 that fork(2)s on each HTTP request and never reaps its children, every one of these forks will end up as a zombie, and the number will escalate quite fast.
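
To see the problem in isolation, here is a small, contrived sketch of a "PID 1" that forks a worker per request but never wait(2)s. While it pauses, ps will show the workers as <defunct> zombies:

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* A contrived "server" that forks a worker per request but never
     * calls wait(2). Every exited child stays behind as a zombie and
     * occupies a slot in the kernel process table. */
    for (int i = 0; i < 10; i++) {
        if (fork() == 0)
            exit(EXIT_SUCCESS);   /* worker done, now a zombie */
    }

    pause();   /* parent lives on without ever reaping */
    return EXIT_SUCCESS;
}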

exit_notify()

kernel/exit.c:exit_notify() sends signals to all the closest relatives so that they know how to properly mourn this process.
At the beginning of this function, a call is made to forget_original_parent():

static void exit_notify(struct task_struct *tsk, int group_dead)
{
    bool autoreap;
    struct task_struct *p, *n;
    LIST_HEAD(dead);

    write_lock_irq(&tasklist_lock);
  >>>>>  forget_original_parent(tsk, &dead);

forget_original_parent()

This function simply does two things:

  1. Make init (PID 1) inherit all the child processes
  2. Check to see if any process groups have become orphaned as a result of our exiting, and if they have any stopped jobs, send them a SIGHUP and then a SIGCONT.

find_child_reaper() will help us find a proper reaper:

>>>>> reaper = find_child_reaper(father);
if (list_empty(&father->children))
    return;

find_child_reaper()

kernel/exit.c:find_child_reaper() looks for a suitable new parent (reaper) for the children.
If no father (or other relative) is available at all, the exiting process must be the PID 1 of its namespace.

This is the interesting part:

if (unlikely(pid_ns == &init_pid_ns)) {
    panic("Attempted to kill init! exitcode=0x%08x\n",
        father->signal->group_exit_code ?: father->exit_code);
}
zap_pid_ns_processes(pid_ns);

init_pid_ns (declared in kernel/pid.c) refers to the PID namespace of our real init process.
If the real init process exits, the whole system panics, since it cannot continue without an init process.
If it is not the real init, zap_pid_ns_processes() is called instead; here we have the PID-1-cannot-be-killed exception we were looking for!
We continue following the call chain down to zap_pid_ns_processes().

zap_pid_ns_processes()

The zap_pid_ns_processes() function is part of the PID namespace code and is located in kernel/pid_namespace.c.
The function iterates through all tasks in the namespace and sends SIGKILL to each of them.

nr = next_pidmap(pid_ns, 1);
while (nr > 0) {

    rcu_read_lock();

    task = pid_task(find_vpid(nr), PIDTYPE_PID);
    if (task && !__fatal_signal_pending(task))
    >>>>> send_sig_info(SIGKILL, SEND_SIG_FORCED, task);

    rcu_read_unlock();

    nr = next_pidmap(pid_ns, nr);

}

Conclusion

PID 1 in containers is handled in a different way than the real init process.
This is obvious, but now we know where the code flow differs for PID 1 in different namespaces.

We have also seen that if PID 1 in a PID namespace dies, all of its subprocesses will be terminated with SIGKILL.
This behavior reflects the fact that the init process is essential for the correct operation of any PID namespace.

Memory management in the kernel

Memory management is among the most complex parts of the Linux kernel. There are so many critical parts: the page allocator, the slab allocator, virtual memory handling, memory mapping, the MMU, the IOMMU and so on. All these parts have to work perfectly (or at least almost perfectly 🙂 ) because every part of the system uses them, whether it wants to or not.
If there is a bug or a performance issue, you will notice it quite soon.

My goal is to produce a few posts on the topic, to try to sort out the different parts and describe how they work and how they are connected.
I will begin at the physical bottom and work my way up to how userspace allocates memory in its little blue world with pink clouds. (Everything is so easy on the user side.)

struct page

A page is the smallest unit that matters in terms of virtual memory, because the MMU (Memory Management Unit, described in an upcoming post) only deals with those pages. A typical page size is 4KB, at least on 32-bit architectures; some 64-bit architectures use larger pages such as 8KB.

Every one of those physical pages is represented by a struct page that is defined in include/linux/mm_types.h. That is a lot of pages. Let's do a simple calculation:
A 32-bit system with 512MB of physical memory has that memory divided into 131,072 4KB pages. And 512MB is not even that much memory on a modern system today.

What I want to say is that struct page should be kept as small as possible, because its footprint scales up a lot as physical memory increases.

Ok, so there is a struct page allocated for each physical page, which is a lot of them, but what does it do?
It does a lot of housekeeping; let's look at the set of members that I think are most interesting:

struct page {
    unsigned long flags;
    unsigned long private;
    void    *virtual;
    atomic_t    _count;
    pgoff_t    index;
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
    spinlock_t  *ptl;
#else
    spinlock_t  ptl;
#endif
#endif
};

flags keeps track of the page status, which could be dirty (needs to be written to media), locked in memory (not allowed to be paged out), permissions and so on. See enum pageflags in include/linux/page-flags.h for more information.

private is a field without a predefined meaning. It may be used as a long or interpreted as a pointer. (Shared with ptl in a union!)

virtual is the virtual address of the page. If the page belongs to high memory (memory that is not permanently mapped), this field will be NULL and dynamic mapping is required.

_count is a simple reference counter to determine when the page is free for allocation.

index is the offset within a mapping.

ptl is an interesting one! I think it deserves a special section in this post. (Shared with private in a union!)

Page Table Lock

PTL stands for Page Table Lock and is a per page-table lock.

In the next part of these memory management posts I will describe the struct mm_struct, how PGD, PMD and PTE are related, but for now it’s enough that you just have heard the words.

Ok, there is one thing that is good to know. The struct mm_struct (also defined in mm_types.h) is a structure that represents a process's address space and contains all information related to the process memory. The structure has pointers to virtual memory areas that refer to one or more struct page.
This structure also has the member mm->page_table_lock, which is a spinlock that protects all page tables of the mm_struct. This was the original approach and it is still used by several architectures.

However, this mm->page_table_lock is a little bit clumsy, since it locks all page tables at once. This is no real problem on a single-CPU system without SMP, but nowadays that is not a very common scenario.

Instead, the split page table lock was introduced; it has a separate per-table lock to allow concurrent access to page tables in the same mm_struct. Remember that the mm_struct is per process? So this only increases page-fault/page-access performance in multi-threaded applications.

When are split page table locks enabled?
They are enabled at compile time if CONFIG_SPLIT_PTLOCK_CPUS (I have never seen another value than 4 for this one) is less than or equal to NR_CPUS.

Here are a few defines from the beginning of the mm_types.h header file:

#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
    IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
#define ALLOC_SPLIT_PTLOCKS (SPINLOCK_SIZE > BITS_PER_LONG/8)

ALLOC_SPLIT_PTLOCKS is a little bit clever. If the size of a spinlock is less than or equal to the size of a long, the spinlock is embedded in the struct page, which saves a cache line by avoiding an indirect access.
If a spinlock does not fit into a long, page->ptl is instead used as a pointer to a dynamically allocated spinlock. As I said, this is a clever construction since it allows the size of a spinlock to grow without any problem. Examples of when the spinlock does not fit are when using DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC, or when applying the PREEMPT_RT patchset.

The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in pgtable_pmd_page_ctor() for PMD tables. These functions (and the corresponding free functions) should be called in every place that allocates/frees page tables. This is already done in mainline, but I know there are evil hardware vendors out there that do not. For example, if you use their evil code and apply the PREEMPT_RT patchset (which increases the size of spinlock_t), you have to verify that their code behaves.

Also, pgtable_*page_ctor() can fail; this must be handled properly.

Remember that page->ptl should never be accessed directly; use the appropriate helper functions for that.

Here are some examples of such helper functions; a usage sketch follows the list:

pte_offset_map_lock()
pte_unmap_lock()
pte_alloc_map_lock()
pte_lockptr()
pmd_lock()
pmd_lockptr()
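
As a hedged sketch of how these helpers are typically used (the exact page-table walk differs between kernel versions; newer kernels add a p4d level, and the error handling here is reduced to a bare minimum), inspecting a PTE under its split lock could look roughly like this:

#include <linux/errno.h>
#include <linux/mm.h>

/*
 * Sketch: look up the PTE for 'addr' in 'mm' and inspect it while holding
 * the (possibly split) page table lock. Classic four-level walk without
 * the p4d level, so it matches older kernels; not production-ready code.
 */
static int inspect_pte(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    spinlock_t *ptl;
    int present;

    pgd = pgd_offset(mm, addr);
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return -EFAULT;

    pud = pud_offset(pgd, addr);
    if (pud_none(*pud) || pud_bad(*pud))
        return -EFAULT;

    pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return -EFAULT;

    /* Takes the right lock for us: either mm->page_table_lock or the split
     * lock embedded in (or pointed to by) the page table's struct page. */
    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
    present = pte_present(*pte);
    pte_unmap_unlock(pte, ptl);

    return present;
}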

MMAP memory between kernel and userspace

Letting the kernel allocate memory and letting userspace map it sounds like an easy task, and sure it is.
There are just a few things that are good to know about page mapping.

The MMU (Memory Management Unit) contains page tables with entries for mapping between virtual and physical addresses. These pages are the smallest unit that the MMU deals with.
The size of a page is given by the PAGE_SIZE macro in asm/page.h and is typically 4KB for most architectures.

There are a few more useful macros in asm/page.h (a small example follows the list):

PAGE_SHIFT: How many steps to shift 1 to the left to get PAGE_SIZE.
PAGE_SIZE: The size of a page, defined as (1 << PAGE_SHIFT).
PAGE_ALIGN(len): Rounds len up to the closest multiple of PAGE_SIZE.
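
As a quick illustration, assuming 4KB pages (PAGE_SHIFT == 12):

#include <linux/mm.h>   /* PAGE_SIZE, PAGE_SHIFT, PAGE_ALIGN */

/* With 4KB pages: */
size_t a = PAGE_ALIGN(100);    /* rounded up to 4096 (one page)  */
size_t b = PAGE_ALIGN(5000);   /* rounded up to 8192 (two pages) */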

How does mmap(2) work?

Every page table entry has a bit that tells us whether the entry is valid in supervisor mode (kernel mode) only. And sure, all memory allocated in kernel space will have this bit set.
What the mmap(2) system call does is simply create a new page table entry, with a different virtual address, that points to the same physical memory page. The difference is that this supervisor bit is not set.
This lets userspace access the memory as if it were part of the application, and for now it is!
The kernel is not involved in those accesses at all, so it is really fast.

Magic? Kind of.
The magic is called remap_pfn_range().
What remap_pfn_range() does is essentially to update the process's page table with these new entries.

Example, please

Allocate memory

As we already know, the smallest unit that the MMU handles is PAGE_SIZE, and mmap(2) only works with whole pages. Even if you only want to share 100 bytes, a whole page frame will be remapped and must therefore be allocated in the kernel.
The allocated memory must also be page aligned.

__get_free_pages()

One way to allocate pages is with __get_free_pages():

unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)

gfp_mask is commonly set to GFP_KERNEL in process/kernel context and GFP_ATOMIC in interrupt context. The order parameter means that 2^order contiguous pages are allocated.

For example:

u8 *vbuf = (u8 *)__get_free_pages(GFP_KERNEL, get_order(size));

Allocated memory is freed with __free_pages().

vmalloc()

A more common (and preferred) way to allocate virtually contiguous memory is with vmalloc().
vmalloc() will always allocate a whole set of pages, no matter what. This is exactly what we want!

Read about vmalloc() in kmalloc(9).

Allocated memory is freed with vfree().

alloc_page()

If you need only one page, alloc_page() will give you that.
If this is the case, instead of using remap_pfn_range(), vm_insert_page() will do the work for you.
Notice that vm_insert_page() only works on order-0 (single-page) allocations, so if you want to map N pages, you will have to call vm_insert_page() N times (see the sketch below).
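
A rough sketch of that loop inside an mmap handler could look like the following; it assumes the buffer was allocated with vmalloc() so that vmalloc_to_page() can find each backing struct page, and the function name and parameters are made up for illustration:

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Sketch: back a userspace VMA with the pages of a vmalloc()ed buffer,
 * one vm_insert_page() call per page. Error handling kept minimal.
 */
static int map_vmalloc_buf(struct vm_area_struct *vma, void *vbuf, size_t size)
{
    unsigned long uaddr = vma->vm_start;
    size_t offset;
    int ret;

    for (offset = 0; offset < size; offset += PAGE_SIZE) {
        ret = vm_insert_page(vma, uaddr + offset,
                             vmalloc_to_page((char *)vbuf + offset));
        if (ret)
            return ret;
    }
    return 0;
}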

Now some code

Allocation

priv->a_size = ATTRIBUTE_N * ATTRIBUTE_SIZE;

/* page align */
priv->a_size = PAGE_ALIGN(priv->a_size);
priv->a_area = vmalloc(priv->a_size);

file_operations.mmap

static int scan_mmap(struct file *file, struct vm_area_struct *vma)
{
    struct mmap_priv *priv = file->private_data;
    unsigned long start = vma->vm_start;
    unsigned long vaddr = (unsigned long)priv->a_area;
    size_t size = vma->vm_end - vma->vm_start;

    if (size > priv->a_size)
        return -EINVAL;

    /*
     * vmalloc()ed memory is only virtually contiguous, so look up
     * and remap the backing page frames one page at a time.
     */
    while (size > 0) {
        if (remap_pfn_range(vma, start, vmalloc_to_pfn((void *)vaddr),
                            PAGE_SIZE, PAGE_SHARED))
            return -EAGAIN;
        start += PAGE_SIZE;
        vaddr += PAGE_SIZE;
        size -= PAGE_SIZE;
    }

    vma->vm_flags |= VM_RESERVED; /* avoid swapping out this VMA */
    return 0;
}
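
For completeness, a hypothetical userspace counterpart could look like the sketch below; the device node name /dev/scan and the mapping size are assumptions, not something defined by the driver code above:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* /dev/scan is an assumed device node created by the driver above. */
    int fd = open("/dev/scan", O_RDWR);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Map one page of the kernel buffer into our address space. */
    size_t size = 4096;
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return EXIT_FAILURE;
    }

    /* From here on, reads and writes go straight to the kernel buffer
     * without any system call involved. */
    ((char *)buf)[0] = 42;

    munmap(buf, size);
    close(fd);
    return EXIT_SUCCESS;
}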