Memory management in the kernel

Memory management is among the most complex parts of the Linux kernel. There are many critical components, such as the page allocator, the slab allocator, virtual memory handling, memory mapping, the MMU, the IOMMU and so on.

All these parts have to work perfectly (or at least almost perfectly :-) ) because every part of the system uses them, whether it wants to or not.

If there is a bug or a performance issue, you will notice it quite soon.

My goal is to produce a few posts on the topic, sorting out the different parts and describing how they work and how they connect. I will begin from the physical bottom and work my way up to how userspace allocates memory in its little blue world with pink clouds. (Everything is so easy on the user side.)

struct page

A page is the smallest unit that matters in terms of virtual memory, because the MMU (Memory Management Unit) only deals in pages. A typical page size is 4KB, at least on 32-bit architectures. Many 64-bit architectures, x86-64 among them, use 4KB pages as well, while others use larger ones (Alpha uses 8KB, for example).

Every one of those physical pages is represented by a struct page that is defined in include/linux/mm_types.h. That is a lot of pages.

Let us do a simple calculation: on a 32-bit system with 512MB of physical memory, that memory is divided into 131,072 pages of 4KB each.

And 512MB is not even much memory on a modern system.

What I want to say is that struct page should be kept as small as possible, because its total cost grows in step with the amount of physical memory.

OK, so a struct page is allocated somewhere for each physical page, which adds up, but what does it do? It does a lot of housekeeping. Let's look at a few of the members that I find most interesting:

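/* Simplified excerpt: the real definition packs several of these members into unions. */
struct page {
    unsigned long flags;        /* page status bits, see enum pageflags */
    unsigned long private;      /* owner-defined data */
    void    *virtual;           /* kernel virtual address, NULL if not mapped */
    atomic_t    _count;         /* reference counter */
    pgoff_t    index;           /* offset within a mapping */
#if USE_SPLIT_PTE_PTLOCKS
#if ALLOC_SPLIT_PTLOCKS
    spinlock_t  *ptl;           /* lock is allocated separately */
#else
    spinlock_t  ptl;            /* lock is embedded */
#endif
#endif
};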

flags keeps track of the page status, which could be dirty (needs to be written to media), locked in memory (not allowed to be paged out), permissions and so on. See enum pageflags in include/linux/page-flags.h for more information.

private has no fixed meaning. It may be used as an unsigned long or interpreted as a pointer, depending on who is currently using the page. (Shared with ptl in a union!)

virtual is the virtual address of the page. If the page belongs to high memory (memory that is not permanently mapped), this field will be NULL and the page must be mapped dynamically.

_count is a simple reference counter to determine when the page is free for allocation.

index is the offset within a mapping.

ptl is an interesting one! I think it requires a section of its own in this post. (Shared with private in a union!)

Page Table Lock

PTL stands for Page Table Lock. It is a per-page lock that lives in the struct page of a page holding a page table.

In the next part of these memory management posts I will describe struct mm_struct and how PGD, PMD and PTE are related, but for now it is enough that you have heard the words.

OK, there is one thing that is good to know. struct mm_struct (also defined in mm_types.h) is a structure that represents a process's address space and contains all information related to the process's memory. The structure has a pointer to the virtual memory areas, which in turn refer to one or more struct page. It also has the member mm->page_table_lock, a spinlock that protects all page tables of the mm_struct. This was the original approach and is still used by several architectures.

However, mm->page_table_lock is a little bit clumsy since it locks all page tables at once. This is no real problem on a single-CPU system without SMP, but nowadays that is not a very common scenario.

Instead, the split page table lock was introduced. It gives each page table its own lock, which allows concurrent access to different page tables in the same mm_struct. Remember that the mm_struct is per process? That means this only improves page-fault/page-access performance in multi-threaded applications.

When are split page table locks enabled? They are enabled at compile time if CONFIG_SPLIT_PTLOCK_CPUS (I have never seen any value other than 4 for this one) is less than or equal to NR_CPUS.

Here are a few defines from the beginning of the mm_types.h header file:

#define USE_SPLIT_PTE_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
#define USE_SPLIT_PMD_PTLOCKS (USE_SPLIT_PTE_PTLOCKS && \
    IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
#define ALLOC_SPLIT_PTLOCKS (SPINLOCK_SIZE > BITS_PER_LONG/8)

ALLOC_SPLIT_PTLOCKS is a little bit clever. If the size of a spinlock is less than or equal to the size of a long, the spinlock is embedded in struct page, which can save a cache line by avoiding an indirect access. If a spinlock does not fit into a long, page->ptl is instead a pointer to a dynamically allocated spinlock. As I said, this is a clever construction, since it allows the size of a spinlock to grow without causing any problems. Examples of when a spinlock does not fit are when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC is enabled, or when the PREEMPT_RT patchset is applied.

The spinlock_t is allocated in pgtable_page_ctor() for PTE tables and in pgtable_pmd_page_ctor() for PMD tables. These functions (and the corresponding free functions) should be called in every place that allocates or frees page-table pages. This is already done in mainline, but I know there are evil hardware vendors out there that do not. For example, if you use their evil code and apply the PREEMPT_RT patchset (which increases the size of spinlock_t), you have to verify that their code behaves.

Also, pgtable_*page_ctor() can fail, and this must be handled properly.

Remember that page->ptl should never be accessed directly; use the appropriate helper functions instead.

Examples of such helper functions are:

pte_offset_map_lock()
pte_unmap_unlock()
pte_alloc_map_lock()
pte_lockptr()
pmd_lock()
pmd_lockptr()