
Let's say the buffer is allocated using a page-based scheme. One way to implement mmap would be to use remap_pfn_range, but LDD3 says this does not work for conventional memory. It appears we can work around this by marking the page(s) reserved using SetPageReserved so that they get locked in memory. But isn't all kernel memory already non-swappable, i.e. already reserved? Why the need to set the reserved bit explicitly?

Does this have something to do with pages allocated from HIGHMEM?
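
For context, the remap_pfn_range approach described above might be sketched roughly as follows. This is illustrative only, not a tested implementation; `my_buf` and `MY_BUF_SIZE` are assumed names for a kmalloc'd, page-aligned driver buffer whose pages were marked with SetPageReserved at allocation time:

```c
/* Sketch only: map a kmalloc'd buffer with remap_pfn_range.
 * Assumes each page of my_buf was SetPageReserved'd when allocated,
 * so the VM will not try to manage the mapped pages. */
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long pfn = virt_to_phys(my_buf) >> PAGE_SHIFT;
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > MY_BUF_SIZE)
        return -EINVAL;
    return remap_pfn_range(vma, vma->vm_start, pfn, size,
                           vma->vm_page_prot);
}
```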

ravi
    Not sure if this helps but as far as I know, [Perf](http://lxr.free-electrons.com/source/tools/perf/design.txt) subsystem in the kernel provides a set of pages from the kernel memory (a ring buffer, actually) that can be mmap'ed by user-space applications. Its implementation could possibly give some hints concerning your question, may be it is worth it to look at its source code. – Eugene May 26 '12 at 18:40

3 Answers


The simplest way to map a set of pages from the kernel in your mmap method is to use the fault handler to map the pages. Basically you end up with something like:

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    vma->vm_ops = &my_vm_ops;
    return 0;
}

static const struct file_operations my_fops = {
    .owner  = THIS_MODULE,
    .open   = nonseekable_open,
    .mmap   = my_mmap,
    .llseek = no_llseek,
};

(where the other file operations are whatever your module needs). Also in my_mmap you do whatever range checking etc. is needed to validate the mmap parameters.
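For concreteness, a hedged sketch of what that validation might look like, assuming the driver exposes a buffer of `MY_BUF_SIZE` bytes (a name not in the original answer):

```c
/* Illustrative only: reject mappings that fall outside the buffer. */
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;
    unsigned long off = vma->vm_pgoff << PAGE_SHIFT;

    if (off >= MY_BUF_SIZE || size > MY_BUF_SIZE - off)
        return -EINVAL;
    vma->vm_ops = &my_vm_ops;
    return 0;
}
```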

Then the vm_ops look like:

static int my_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    vmf->page = my_page_at_index(vmf->pgoff);
    get_page(vmf->page);

    return 0;
} 

static const struct vm_operations_struct my_vm_ops = {
    .fault      = my_fault,
};

where you just need to figure out for a given vma / vmf passed to your fault function which page to map into userspace. This depends on exactly how your module works. For example, if you did

my_buf = vmalloc_user(MY_BUF_SIZE);

then the page you use would be something like

vmalloc_to_page(my_buf + (vmf->pgoff << PAGE_SHIFT));

But you could easily create an array and allocate a page for each entry, use kmalloc, whatever.
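Putting the vmalloc_user case together, the fault handler might look something like this (a sketch using the same 2012-era fault prototype as above; `my_buf` and `MY_BUF_SIZE` are assumed names):

```c
static int my_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    /* Translate the faulting file offset into a page of the buffer;
     * my_buf was allocated with vmalloc_user(MY_BUF_SIZE). */
    if (vmf->pgoff >= (MY_BUF_SIZE >> PAGE_SHIFT))
        return VM_FAULT_SIGBUS;
    vmf->page = vmalloc_to_page(my_buf + (vmf->pgoff << PAGE_SHIFT));
    get_page(vmf->page);
    return 0;
}
```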

[just noticed that my_fault is a slightly amusing name for a function]

Roland
  • Thank you. This is quite helpful. Don't we however need to call vm_insert_page in the fault handler? Also, who will undo the get_page to allow page to be freed later? I suppose once user-space does munmap, we can get some code exercised from vma_close in which we could put_page for all pages that fault-ed. Is this the right approach? – ravi May 28 '12 at 06:11
    No, you don't need to do vm_insert_page if you set vmf->page. If you're doing fancier stuff around mapping non-page-backed device memory, then you might need vm_insert_pfn() but really you probably don't want to worry about that. The put_page() is handled by the core vm code when the mapping is torn down. Really, for a simple driver that maps kernel memory into userspace, I showed you pretty much everything you need. – Roland May 29 '12 at 02:57
  • Hello. What would be the body of my_fault() method if it would be impossible to vmalloc()-ate the my_buf buffer? (because too large). Imean a page-by-page allocation, on demand. – user1284631 Jun 29 '13 at 06:31
  • If you want to allocate a page on demand, then your fault routine should allocate that page and set vmf->page to the page you allocated. – Roland Jul 01 '13 at 17:14
  • Can you explain when callback fault() is called? – Micheal XIV Dec 03 '20 at 08:36
  • @Roland I want to do it mmap implementation with pci driver. Can u please take a look at this question https://stackoverflow.com/questions/65749351/in-order-to-write-pci-ethernet-driver-how-to-implement-mmap-in-the-pci-ethernet – user786 Jan 19 '21 at 06:19

Minimal runnable example and userland test

Kernel module:

#include <linux/fs.h>
#include <linux/init.h>
#include <linux/kernel.h> /* min */
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/uaccess.h> /* copy_from_user, copy_to_user */
#include <linux/slab.h>

static const char *filename = "lkmc_mmap";

enum { BUFFER_SIZE = 4 };

struct mmap_info {
    char *data;
};

/* After unmap. */
static void vm_close(struct vm_area_struct *vma)
{
    pr_info("vm_close\n");
}

/* First page access. */
static vm_fault_t vm_fault(struct vm_fault *vmf)
{
    struct page *page;
    struct mmap_info *info;

    pr_info("vm_fault\n");
    info = (struct mmap_info *)vmf->vma->vm_private_data;
    if (info->data) {
        page = virt_to_page(info->data);
        get_page(page);
        vmf->page = page;
    }
    return 0;
}

/* After mmap. TODO vs mmap, when can this happen at a different time than mmap? */
static void vm_open(struct vm_area_struct *vma)
{
    pr_info("vm_open\n");
}

static struct vm_operations_struct vm_ops =
{
    .close = vm_close,
    .fault = vm_fault,
    .open = vm_open,
};

static int mmap(struct file *filp, struct vm_area_struct *vma)
{
    pr_info("mmap\n");
    vma->vm_ops = &vm_ops;
    vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
    vma->vm_private_data = filp->private_data;
    vm_open(vma);
    return 0;
}

static int open(struct inode *inode, struct file *filp)
{
    struct mmap_info *info;

    pr_info("open\n");
    info = kmalloc(sizeof(struct mmap_info), GFP_KERNEL);
    pr_info("virt_to_phys = 0x%llx\n", (unsigned long long)virt_to_phys((void *)info));
    info->data = (char *)get_zeroed_page(GFP_KERNEL);
    memcpy(info->data, "asdf", BUFFER_SIZE);
    filp->private_data = info;
    return 0;
}

static ssize_t read(struct file *filp, char __user *buf, size_t len, loff_t *off)
{
    struct mmap_info *info;
    ssize_t ret;

    pr_info("read\n");
    if ((size_t)BUFFER_SIZE <= *off) {
        ret = 0;
    } else {
        info = filp->private_data;
        ret = min(len, (size_t)BUFFER_SIZE - (size_t)*off);
        if (copy_to_user(buf, info->data + *off, ret)) {
            ret = -EFAULT;
        } else {
            *off += ret;
        }
    }
    return ret;
}

static ssize_t write(struct file *filp, const char __user *buf, size_t len, loff_t *off)
{
    struct mmap_info *info;

    pr_info("write\n");
    info = filp->private_data;
    if (copy_from_user(info->data, buf, min(len, (size_t)BUFFER_SIZE))) {
        return -EFAULT;
    } else {
        return len;
    }
}

static int release(struct inode *inode, struct file *filp)
{
    struct mmap_info *info;

    pr_info("release\n");
    info = filp->private_data;
    free_page((unsigned long)info->data);
    kfree(info);
    filp->private_data = NULL;
    return 0;
}

static const struct file_operations fops = {
    .mmap = mmap,
    .open = open,
    .release = release,
    .read = read,
    .write = write,
};

static int myinit(void)
{
    proc_create(filename, 0, NULL, &fops);
    return 0;
}

static void myexit(void)
{
    remove_proc_entry(filename, NULL);
}

module_init(myinit)
module_exit(myexit)
MODULE_LICENSE("GPL");

GitHub upstream.

Userland test:

#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h> /* uintmax_t */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h> /* sysconf */

/* Format documented at:
 * https://github.com/torvalds/linux/blob/v4.9/Documentation/vm/pagemap.txt
 */
typedef struct {
    uint64_t pfn : 55;
    unsigned int soft_dirty : 1;
    unsigned int file_page : 1;
    unsigned int swapped : 1;
    unsigned int present : 1;
} PagemapEntry;

/* Parse the pagemap entry for the given virtual address.
 *
 * @param[out] entry      the parsed entry
 * @param[in]  pagemap_fd file descriptor to an open /proc/pid/pagemap file
 * @param[in]  vaddr      virtual address to get entry for
 * @return                0 for success, 1 for failure
 */
int pagemap_get_entry(PagemapEntry *entry, int pagemap_fd, uintptr_t vaddr)
{
    size_t nread;
    ssize_t ret;
    uint64_t data;

    nread = 0;
    while (nread < sizeof(data)) {
        ret = pread(pagemap_fd, ((uint8_t*)&data) + nread, sizeof(data),
                (vaddr / sysconf(_SC_PAGE_SIZE)) * sizeof(data) + nread);
        if (ret <= 0) {
            return 1;
        }
        nread += ret;
    }
    entry->pfn = data & (((uint64_t)1 << 55) - 1);
    entry->soft_dirty = (data >> 55) & 1;
    entry->file_page = (data >> 61) & 1;
    entry->swapped = (data >> 62) & 1;
    entry->present = (data >> 63) & 1;
    return 0;
}

/* Convert the given virtual address to physical using /proc/PID/pagemap.
 *
 * @param[out] paddr physical address
 * @param[in]  pid   process to convert for
 * @param[in] vaddr  virtual address to get entry for
 * @return           0 for success, 1 for failure
 */
int virt_to_phys_user(uintptr_t *paddr, pid_t pid, uintptr_t vaddr)
{
    char pagemap_file[BUFSIZ];
    int pagemap_fd;

    snprintf(pagemap_file, sizeof(pagemap_file), "/proc/%ju/pagemap", (uintmax_t)pid);
    pagemap_fd = open(pagemap_file, O_RDONLY);
    if (pagemap_fd < 0) {
        return 1;
    }
    PagemapEntry entry;
    if (pagemap_get_entry(&entry, pagemap_fd, vaddr)) {
        return 1;
    }
    close(pagemap_fd);
    *paddr = (entry.pfn * sysconf(_SC_PAGE_SIZE)) + (vaddr % sysconf(_SC_PAGE_SIZE));
    return 0;
}

enum { BUFFER_SIZE = 4 };

int main(int argc, char **argv)
{
    int fd;
    long page_size;
    char *address1, *address2;
    char buf[BUFFER_SIZE];
    uintptr_t paddr;

    if (argc < 2) {
        printf("Usage: %s <mmap_file>\n", argv[0]);
        return EXIT_FAILURE;
    }
    page_size = sysconf(_SC_PAGE_SIZE);
    printf("open pathname = %s\n", argv[1]);
    fd = open(argv[1], O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open");
        assert(0);
    }
    printf("fd = %d\n", fd);

    /* mmap twice for double fun. */
    puts("mmap 1");
    address1 = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (address1 == MAP_FAILED) {
        perror("mmap");
        assert(0);
    }
    puts("mmap 2");
    address2 = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (address2 == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }
    assert(address1 != address2);

    /* Read and modify memory. */
    puts("access 1");
    assert(!strcmp(address1, "asdf"));
    /* vm_fault */
    puts("access 2");
    assert(!strcmp(address2, "asdf"));
    /* vm_fault */
    strcpy(address1, "qwer");
    /* Also modified. So both virtual addresses point to the same physical address. */
    assert(!strcmp(address2, "qwer"));

    /* Check that the physical addresses are the same.
     * They are, but TODO why virt_to_phys on kernel gives a different value? */
    assert(!virt_to_phys_user(&paddr, getpid(), (uintptr_t)address1));
    printf("paddr1 = 0x%jx\n", (uintmax_t)paddr);
    assert(!virt_to_phys_user(&paddr, getpid(), (uintptr_t)address2));
    printf("paddr2 = 0x%jx\n", (uintmax_t)paddr);

    /* Check that modifications made from userland are also visible from the kernel. */
    read(fd, buf, BUFFER_SIZE);
    assert(!memcmp(buf, "qwer", BUFFER_SIZE));

    /* Modify the data from the kernel, and check that the change is visible from userland. */
    write(fd, "zxcv", 4);
    assert(!strcmp(address1, "zxcv"));
    assert(!strcmp(address2, "zxcv"));

    /* Cleanup. */
    puts("munmap 1");
    if (munmap(address1, page_size)) {
        perror("munmap");
        assert(0);
    }
    puts("munmap 2");
    if (munmap(address2, page_size)) {
        perror("munmap");
        assert(0);
    }
    puts("close");
    close(fd);
    return EXIT_SUCCESS;
}

GitHub upstream.

Tested on kernel 5.4.3.

Ciro Santilli OurBigBook.com
    Thx for the code. Userland test doesn't compile due to `#include "commom.h"` (do we need it?) Also, what does mean `#define _XOPEN_SOURCE 700`? – Mixaz Sep 18 '18 at 12:09
  • 1
    @Mixaz thanks for letting me know, I forgot that, let me know if fixed. Note that I had links to my upstream, and those pointed to: https://github.com/cirosantilli/linux-kernel-module-cheat/blob/8d668d6ed3617cc47425e1413513a2d1f99a25fd/kernel_module/user/common.h BTW, just use that repo and be forever happy: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/e11483015813f720d0bc5e62bdc2e9ba00a9fd83#qemu-buildroot-setup :-) – Ciro Santilli OurBigBook.com Sep 18 '18 at 12:19
  • 1
    Thanks for prompt update, now it compiles and works just fine! Indeed I didn't notice the links, let me make them more visible in your post ) – Mixaz Sep 18 '18 at 12:47
  • 1
    From version 4.10, in `struct vm_operations_struct`, `vm_fault`'s prototype is changed. `vm_area_struct` should now be accessed from `vm_fault` (`vmf->vma`). [link](https://elixir.bootlin.com/linux/v4.10-rc1/source/include/linux/mm.h#L294) – Digvijay Chougale Jun 07 '20 at 06:47
  • @DigvijayChougale yes, thanks! I had updated it on my upstream but forgot to propagate it here: https://github.com/cirosantilli/linux-kernel-module-cheat/blob/8a4fc8e9aedae01ece6cb997099e2f3e8bb992ec/kernel_modules/mmap.c#L33 Updated here now. – Ciro Santilli OurBigBook.com Jun 07 '20 at 07:50
  • 1
    The code was really useful. Reading `/proc/lkmc_mmap` leads to *infinite-loop*. I think that you should update `off` in `read()`/`write()`. Using *anonymous mapping*s seems more appropriate. But the implementation will be much harder. Could I have your opinion? – TheAhmad Jul 21 '20 at 21:10
  • 1
    @TheAhmad yes, `cat /proc/lkmc_mmap` leads to an infinite loop, I was mostly focused on the very minimal `read`, but I guess it is better to have the more normal `read` [like I had in another example](https://github.com/cirosantilli/linux-kernel-module-cheat/blob/2ea5e17d23553334c23934d83965de8a47df3780/kernel_modules/fops.c), updated. About anonymous mapping, do you mean with `MAP_ANONYMOUS` on the `mmap` call? If so, how would you communicate with the device driver since that ignores the file pointer? – Ciro Santilli OurBigBook.com Jul 25 '20 at 11:44
  • Your suggestion (similar to `perf_events`) seems to be the **best** approach **without** modifying the *user/kernel interface*. I think that a *file-backed mapping* is **not** the best choice for such page sharing. Do you agree? This is the reason that the current *user/kernel interface* provides **no** suitable alternative (perhaps adding a *flag*, etc to `mmap()` would do this). We can mark the next `mmap()` operation using a new *syscall* and add a `KProbe` to hook `mmap()` and then save the **kernel address** for the shared memory. But this also *violates* the current *syscall interface*. – TheAhmad Jul 25 '20 at 21:23

Though the pages are allocated by a kernel driver, they are meant to be accessed from user space. As a result, the page table entries (PTEs) do not record whether the PFN belongs to user space or kernel space, even though the pages were allocated by the kernel driver.

This is why they are marked with SetPageReserved.
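
A hedged sketch of that reservation step, assuming a kmalloc'd buffer (`buf` and `size` are illustrative names, not from this answer):

```c
/* Sketch: mark each page of a kmalloc'd buffer reserved before
 * handing it to remap_pfn_range. */
static void reserve_buf_pages(void *buf, size_t size)
{
    unsigned long addr;

    for (addr = (unsigned long)buf;
         addr < (unsigned long)buf + size;
         addr += PAGE_SIZE)
        SetPageReserved(virt_to_page(addr));
}

/* And the inverse, to be called before kfree(). */
static void unreserve_buf_pages(void *buf, size_t size)
{
    unsigned long addr;

    for (addr = (unsigned long)buf;
         addr < (unsigned long)buf + size;
         addr += PAGE_SIZE)
        ClearPageReserved(virt_to_page(addr));
}
```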

Zombo