Checkpoint-restore in Linux

I'm working on power-saving features for a project based on a Raspberry Pi Zero. Unfortunately, the RPi does not support features such as hibernation to disk or suspend to RAM because of how the processor is constructed (the GPU is actually the main processor). So I was looking for alternatives.

That's when I stumbled upon CRIU ( [1], [2] ), Checkpoint-Restore In Userspace. (I actually started out reading about PTRACE_SEIZE [4] and ptrace parasite code [3] and found out that CRIU is one of their users.)

CRIU

CRIU is a project that implements checkpoint/restore functionality by freezing the state of a process and its subtasks. CRIU uses ptrace [4] to attach to the process with a PTRACE_SEIZE request and stop it. It then injects parasite code that dumps the process's memory pages into image files, creating a recoverable checkpoint.

The dumped process information includes memory pages (collected from /proc/$PID/smaps, /proc/$PID/map_files/ and /proc/$PID/pagemap), but also information about open files, credentials, registers, task states and more.

My first concern was that this could not possibly work very well. What about open sockets, for example (especially client connections)? It turns out that CRIU already handles most of that. There are only a few scenarios that cannot be dumped [5] yet.

Usage

CRIU has many possible use-cases. Some of those are:

  • Container live migration
  • Slow-boot services speed up
  • Seamless kernel upgrade
  • "Save" ability in apps (games) that don't have one
  • Snapshots of apps

My use case for now is just to save a snapshot of an application and power off the CPU module, to later be able to power it on again and restore the application.

PTRACE

For those not familiar with ptrace(2):

The ptrace() system call provides a means by which one process (the "tracer") may observe and control the execution of
another process (the "tracee"), and examine and change the tracee's memory and registers. It's primarily used to implement
breakpoint debugging and system call tracing.

ptrace is the main interface that the Linux kernel provides for poking around in, and fetching information from, another running process (think debuggers and tracers).

The PTRACE_SEIZE request was introduced in Linux 3.4:

PTRACE_SEIZE (since Linux 3.4)
       Attach  to  the  process  specified  in  pid,  making  it  a  tracee  of the calling process.  Unlike PTRACE_ATTACH,
       PTRACE_SEIZE does not stop the process.  Group-stops are reported as PTRACE_EVENT_STOP and WSTOPSIG(status)  returns
       the  stop  signal.  Automatically attached children stop with PTRACE_EVENT_STOP and WSTOPSIG(status) returns SIGTRAP
       instead of having SIGSTOP signal delivered  to  them.   execve(2)  does  not  deliver  an  extra  SIGTRAP.   Only  a
       PTRACE_SEIZEd  process can accept PTRACE_INTERRUPT and PTRACE_LISTEN commands.  The "seized" behavior just described
       is inherited by  children  that  are  automatically  attached  using  PTRACE_O_TRACEFORK,  PTRACE_O_TRACEVFORK,  and
       PTRACE_O_TRACECLONE.  addr must be zero.  data contains a bit mask of ptrace options to activate immediately.

       Permission to perform a PTRACE_SEIZE is governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see
       below.

It took a while longer before a dedicated checkpoint/restore capability was introduced, see capabilities(7):

CAP_CHECKPOINT_RESTORE (since Linux 5.9)
       •  Update /proc/sys/kernel/ns_last_pid (see pid_namespaces(7));
       •  employ the set_tid feature of clone3(2);
       •  read the contents of the symbolic links in /proc/pid/map_files for other processes.

       This capability was added in Linux  5.9  to  separate  out  checkpoint/restore  functionality  from  the  overloaded
       CAP_SYS_ADMIN capability.

Example

I wrote a simple C application that just counts a variable up every second and prints the value:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Print the PID so we know which process to dump. */
        printf("My PID is %i\n", getpid());

        /* Count up once per second, forever. */
        int count = 0;
        while (1) {
            printf("%d\n", count++);
            sleep(1);
        }
    }

Compile the code:

    gcc main.c -o main

Start the application:

    [17:26:03]marcus@goliat:~/tmp/count$ ./main 
    My PID is 2483855
    0
    1
    2
    3
    4
    5
    6

The process is started with process ID 2483855.

We can now dump the process and store its state. We have to add the --shell-job flag to tell CRIU that the process was spawned from a shell (and therefore has some file descriptors open to PTYs that need to be restored).

    [17:27:26]marcus@goliat:~/tmp/criu$ sudo criu dump -t 2483855 --shell-job
    Warn  (compel/arch/x86/src/lib/infect.c:356): Will restore 2483855 with interrupted system call

CRIU needs either the CAP_SYS_ADMIN or the CAP_CHECKPOINT_RESTORE capability. The latter can be set on the binary with:

    setcap cap_checkpoint_restore+eip /usr/bin/criu

The criu dump command generates a bunch of files that store the current state of the application. These include open file descriptors, registers, stack frames, memory maps and more:

    [17:28:00]marcus@goliat:~/tmp/criu$ ls -1
    core-2483855.img
    fdinfo-2.img
    files.img
    fs-2483855.img
    ids-2483855.img
    inventory.img
    mm-2483855.img
    pagemap-2483855.img
    pages-1.img
    pstree.img
    seccomp.img
    stats-dump
    timens-0.img
    tty-info.img

We can now restore the application from where we stopped:

    [17:29:07]marcus@goliat:~/tmp/criu$ sudo criu restore --shell-job
    27
    28
    29
    30

This is cool. But what is even cooler is that you may restore the application on a different host(!).

Summary

I do not know yet whether CRIU is applicable to what I want to achieve right now, but it's a cool project that I will probably find a use for in the future, so it is a welcome addition to my toolbox.