
I'm trying to walk the page tables of a process in Linux. In a kernel module I wrote the following function:

static struct page *walk_page_table(unsigned long addr)
{
    pgd_t *pgd;
    pte_t *ptep, pte;
    pud_t *pud;
    pmd_t *pmd;

    struct page *page = NULL;
    struct mm_struct *mm = current->mm;

    pgd = pgd_offset(mm, addr);
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        goto out;
    printk(KERN_NOTICE "Valid pgd");

    pud = pud_offset(pgd, addr);
    if (pud_none(*pud) || pud_bad(*pud))
        goto out;
    printk(KERN_NOTICE "Valid pud");

    pmd = pmd_offset(pud, addr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        goto out;
    printk(KERN_NOTICE "Valid pmd");

    ptep = pte_offset_map(pmd, addr);
    if (!ptep)
        goto out;
    pte = *ptep;

    page = pte_page(pte);
    if (page)
        printk(KERN_INFO "page frame struct is @ %p", page);

 out:
    return page;
}

This function is called from the ioctl handler, and addr is a virtual address in the process address space:

static int my_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long addr)
{
   struct page *page = walk_page_table(addr);
   ...
   return 0;
}

The strange thing is that when a user-space process calls ioctl, it segfaults. Yet the way I look up the page table entry seems to be correct, because for each ioctl call dmesg shows, for example:

[ 1721.437104] Valid pgd
[ 1721.437108] Valid pud
[ 1721.437108] Valid pmd
[ 1721.437110] page frame struct is @ c17d9b80

So why can't the process complete the `ioctl` call correctly? Do I have to lock something before walking the page tables?

I'm working with kernel 2.6.35-22 and three-level page tables.
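
For reference, here is a rough user-space sketch of how a virtual address splits into the indices that pgd_offset(), pmd_offset() and pte_offset_map() use on a three-level layout. It assumes 32-bit x86 with PAE (2 + 9 + 9 + 12 bits), which is only a guess based on the 32-bit pointer in the dmesg output; the SKETCH_* constants, struct va_parts and split_va() are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 32-bit x86 with PAE, i.e. 2 + 9 + 9 + 12 bits. */
#define SKETCH_PAGE_SHIFT   12
#define SKETCH_PMD_SHIFT    21
#define SKETCH_PGDIR_SHIFT  30
#define SKETCH_PTRS_PER_PTE 512u
#define SKETCH_PTRS_PER_PMD 512u
#define SKETCH_PTRS_PER_PGD 4u

/* Hypothetical container for the pieces of one virtual address. */
struct va_parts {
    unsigned pgd_index;   /* which entry in the page global directory */
    unsigned pmd_index;   /* which entry in the page middle directory */
    unsigned pte_index;   /* which entry in the page table */
    unsigned page_offset; /* byte offset inside the 4 KiB page */
};

/* Split a 32-bit virtual address into its table indices and offset. */
struct va_parts split_va(uint32_t addr)
{
    struct va_parts p;

    p.pgd_index = (addr >> SKETCH_PGDIR_SHIFT) & (SKETCH_PTRS_PER_PGD - 1);
    p.pmd_index = (addr >> SKETCH_PMD_SHIFT) & (SKETCH_PTRS_PER_PMD - 1);
    p.pte_index = (addr >> SKETCH_PAGE_SHIFT) & (SKETCH_PTRS_PER_PTE - 1);
    p.page_offset = addr & ((1u << SKETCH_PAGE_SHIFT) - 1);

    /* Sanity check: the four pieces must reassemble the original address. */
    assert((((uint32_t)p.pgd_index << SKETCH_PGDIR_SHIFT) |
            ((uint32_t)p.pmd_index << SKETCH_PMD_SHIFT) |
            ((uint32_t)p.pte_index << SKETCH_PAGE_SHIFT) |
            p.page_offset) == addr);
    return p;
}
```

With three levels the pud level is folded into the pgd, so pud_offset() is expected to hand back the same entry that pgd_offset() found rather than index a separate table.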

Thank you all!

MirkoBanchi
  • Is it possible that ioctl syscall returns successfully and the code is segfaulting after that? – Rumple Stiltskin Jan 24 '12 at 00:19
  • No, because the ioctl syscall is the last instruction in `main` before `return`. If I comment out the `ioctl` call, the process doesn't segfault. – MirkoBanchi Jan 24 '12 at 10:49
  • Why did you hide the part where you use the address of the `struct page`? Are you sure your segfault does not come from there? Have you tried debugging this on qemu? – Quentin Casasnovas Jan 24 '12 at 23:07
  • After the call to `walk_page_table` I only do a `printk` if `page` is `NULL`. I also tried keeping only the call to `walk_page_table`, but the process still segfaults. Maybe yes, the fastest way to find the problem is debugging. Thank you Quentin. – MirkoBanchi Jan 25 '12 at 09:19
  • Compile the code with debugging and force a stack trace during dumps so that you absolutely know what is happening. Or use kgdb. Also, are you positively sure you're not using the new unlocked_ioctl feature of the recent kernels? – sessyargc.jp Jan 25 '12 at 10:26
  • I have never used kgdb. I'll debug a UML kernel with gdb. However, I'm not using `unlocked_ioctl`: kernel 2.6.35 still has the `ioctl` function pointer in `struct file_operations`. Thanks sessyargc.jp! – MirkoBanchi Jan 25 '12 at 20:54

2 Answers

pte_unmap(ptep); 

is missing just before the `out` label. Try changing the code like this:

    ...
    page = pte_page(pte);
    if (page)
        printk(KERN_INFO "page frame struct is @ %p", page);

    pte_unmap(ptep); 

out:
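
Why the missing pte_unmap() matters can be illustrated with a small user-space model. This is only a sketch, not kernel code: it assumes that, with CONFIG_HIGHPTE set, pte_offset_map() takes a temporary atomic mapping that raises the preempt count and that pte_unmap() drops it again. All mock_* names and the two read_pte_* helpers are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's preempt count. */
int mock_preempt_count = 0;

/* Hypothetical stand-in for one page table entry. */
unsigned long mock_pte_value = 0x1234;

/* Models pte_offset_map() under CONFIG_HIGHPTE: mapping bumps the count. */
unsigned long *mock_pte_offset_map(void)
{
    mock_preempt_count++;
    return &mock_pte_value;
}

/* Models pte_unmap(): releases the mapping and restores the count. */
void mock_pte_unmap(unsigned long *ptep)
{
    assert(ptep != NULL);
    mock_preempt_count--;
}

/* Buggy shape: reads the entry but never unmaps it. */
unsigned long read_pte_unbalanced(void)
{
    unsigned long *ptep = mock_pte_offset_map();

    return *ptep;
}

/* Fixed shape: every map is paired with an unmap before returning. */
unsigned long read_pte_balanced(void)
{
    unsigned long *ptep = mock_pte_offset_map();
    unsigned long pte = *ptep;

    mock_pte_unmap(ptep);
    return pte;
}
```

Calling read_pte_balanced() leaves mock_preempt_count where it started, while each call to read_pte_unbalanced() leaves it one higher. In the real kernel that leftover count is what the return path to user space would trip over.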
  • Thank you. I was sure that the kernel was compiled without CONFIG_HIGHPTE, but that option was actually set, so `pte_offset_map` did a `kmap`. – MirkoBanchi Aug 23 '12 at 13:21
  • Thanks! I kept getting a crash with messages like " returned with preemption imbalance". Finally traced the preempt_count() increment to pte_offset_map()! Adding the pte_unmap decremented it and all was well. – kaiwan Feb 26 '14 at 10:36

Look at /proc/<pid>/smaps, where you can see the userspace memory mappings:

cat smaps 
bfa60000-bfa81000 rw-p 00000000 00:00 0          [stack]
Size:                136 kB
Rss:                  44 kB

This output is produced by fs/proc/task_mmu.c (in the kernel source):

http://lxr.linux.no/linux+v3.0.4/fs/proc/task_mmu.c

        if (vma->vm_mm && !is_vm_hugetlb_page(vma))
                walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
        show_map_vma(m, vma.....);
        seq_printf(m,
                   "Size:           %8lu kB\n"
                   "Rss:            %8lu kB\n"
                   "Pss:            %8lu kB\n"

Your function is similar to walk_page_range(). Looking into walk_page_range(), you can see that the smaps_walk structure is not supposed to change while the walk is in progress:

http://lxr.linux.no/linux+v3.0.4/mm/pagewalk.c#L153

For example:

                }
                if (walk->pgd_entry)
                        err = walk->pgd_entry(pgd, addr, next, walk);
                if (!err &&
                    (walk->pud_entry || walk->pmd_entry || walk->pte_entry

If the memory contents were to change during the walk, all of the checks above could become inconsistent.

All this means that you have to hold mmap_sem while walking the page table:

   if (!down_read_trylock(&mm->mmap_sem)) {
            /*
             * Activate page so shrink_inactive_list is unlikely to unmap
             * its ptes while lock is dropped, so swapoff can make progress.
             */
            activate_page(page);
            unlock_page(page);
            down_read(&mm->mmap_sem);
            lock_page(page);
    }

followed by unlocking afterwards:

up_read(&mm->mmap_sem);

Also, when you call printk() on the page table from inside your kernel module, the module is running in the process context of your insmod process (just printk the "comm" field and you will see "insmod"). That means mmap_sem is locked and the process is NOT running, so there is no console output until the process has completed (all printk() output goes to memory only).

Sounds logical?

Peter Teoh
  • Thank you Peter. I tried holding `mmap_sem` before the first instruction that reads the page tables, but it doesn't work: I get the same segmentation fault. However, when I call `walk_page_table` I'm not in the context of the `insmod` process: I call it inside `my_ioctl`, so I'm in the context of the process invoking the `ioctl` syscall. Could this make a difference? – MirkoBanchi Feb 04 '12 at 11:45
  • Yes, it makes a difference, because different processes have different per-process page tables if you walk the non-kernel part. But all processes share the same page tables when it comes to kernel addresses. – Peter Teoh Dec 15 '12 at 00:26