18

Due to some obscure reasons which are not relevant for this question, I need to resort to use MAP_FIXED in order to obtain a page close to where the text section of libc lives in memory.

Before reading mmap(2) (which I should had done in the first place), I was expecting to get an error if I called mmap with MAP_FIXED and a base address overlapping an already-mapped area.

However that is not the case. For instance, here is part of /proc/maps for certain process

7ffff7299000-7ffff744c000 r-xp 00000000 08:05 654098                     /lib/x86_64-linux-gnu/libc-2.15.so

Which, after making the following mmap call ...

  mmap(0x7ffff731b000,
       getpagesize(),
       PROT_READ | PROT_WRITE | PROT_EXEC,
       MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED,
       0,
       0);

... turns into:

7ffff7299000-7ffff731b000 r-xp 00000000 08:05 654098                     /lib/x86_64-linux-gnu/libc-2.15.so
7ffff731b000-7ffff731c000 rwxp 00000000 00:00 0 
7ffff731c000-7ffff744c000 r-xp 00083000 08:05 654098                     /lib/x86_64-linux-gnu/libc-2.15.so

Which means I have overwritten part of the virtual address space dedicated to libc with my own page. Clearly not what I want ...

In the MAP_FIXED part of the mmap(2) manual, it clearly states:

If the memory region specified by addr and len overlaps pages of any existing mapping(s), then the overlapped part of the existing mapping(s) will be discarded.

Which explains what I am seeing, but I have a couple of questions:

  1. Is there a way to detect if something was already mapped to certain address? without accesing /proc/maps?
  2. Is there a way to force mmap to fail in the case of finding overlapping pages?
fons
  • 4,905
  • 4
  • 29
  • 49
  • 1
    (+1) This may be of *some* help: http://stackoverflow.com/questions/8362747/how-can-i-detect-whether-a-specific-page-is-mapped-in-memory – NPE Feb 18 '13 at 19:49
  • 4
    I think using "MAP_FIXED" is one of those things where "here, have this gun, but please be careful you don't shoot yourself in the foot". In other words, "it's your task to ensure you use it correctly". I've looked over some of the mmap code in the past, and as far as I know, it does what you ask for, whatever you actually ask for - as long as it's not breaching security [and your application destroying it's own copy of the C library is not a security breach, since all that will happen is that YOUR application dies...] – Mats Petersson Feb 18 '13 at 19:56
  • 1
    @MatsPetersson - fair enough, but is it even _possible_ to use it correct? I.e., if you are aware that `MAP_FIXED` overwrites existing mappings, and you are willing to _check_ for existing mappings, is there a reasonable way to do it? – BeeOnRope Dec 30 '16 at 21:58
  • @BeeOnRope If you know the correct addresses and know what is mapped in the system, it's not impossible to use it correctly. Just hard. – Mats Petersson Dec 31 '16 at 08:23
  • 1
    @MatsPetersson - the problem is how will you know what's mapped in the system? Especially things like the C runtime are going to map things at arbitrary locations beyond your control. – BeeOnRope Dec 31 '16 at 14:10
  • @BeeOnRope: Indeed, that can be a big problem. But it technically CAN be used... ;) – Mats Petersson Jan 01 '17 at 10:02

4 Answers4

9
  1. Use page = sysconf(SC_PAGE_SIZE) to find out the page size, then scan each page-sized block you wish to check using msync(addr, page, 0) (with (unsigned long)addr % page == 0, i.e. addr aligned to pages). If it returns -1 with errno == ENOMEM, that page is not mapped.

    Edited: As fons commented below, mincore(addr,page,&dummy) is superior to msync(). (The implementation of the syscall is in mm/mincore.c in the Linux kernel sources, with C libraries usually providing a wrapper that updates errno. As the syscall does the mapping check immediately after making sure addr is page aligned, it is optimal in the not-mapped case (ENOMEM). It does some work if the page is already mapped, so if performance is paramount, try to avoid checking pages you know are mapped.

    You must do this individually, separately per each page, because for regions larger than a single page, ENOMEM means that the region was not fully mapped; it might still be partially mapped. Mapping is always granular to page-sized units.

  2. As far as I can tell, there is no way to tell mmap() to fail if the region is already mapped, or contains already mapped pages. (The same applies to mremap(), so you cannot create a mapping, then move it to the desired region.)

    This means you run a risk of a race condition. It would be best to execute the actual syscalls yourself, instead of the C library wrappers, just in case they do memory allocation or change memory mappings internally:

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    
    static size_t page = 0;
    static inline size_t page_size(void)
    {
        if (!page)
            page = (size_t)sysconf(_SC_PAGESIZE);
        return page;
    }
    
    
    static inline int raw_msync(void *addr, size_t length, int flags)
    {
        return syscall(SYS_msync, addr, length, flags);
    }
    
    static inline void *raw_mmap(void *addr, size_t length, int prot, int flags)
    {
        return (void *)syscall(SYS_mmap, addr, length, prot, flags, -1, (off_t)0);
    }
    

However, I suspect that whatever it is you are trying to do, you eventually need to parse /proc/self/maps anyway.

  • I recommend avoiding standard I/O stdio.h altogether (as the various operations will allocate memory dynamically, and thus change the mappings), and instead use the lower-level unistd.h interfaces, which are much less likely to affect the mappings. Here is a set of simple, crude functions, that you can use to find out each mapped region and the protections enabled in that region (and discard the other info). In practice, it uses about a kilobyte of code and less than that in stack, so it is very useful even on limited architectures (say, embedded devices).

    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <string.h>
    
    #ifndef   INPUT_BUFFER
    #define   INPUT_BUFFER   512
    #endif /* INPUT_BUFFER */
    
    #ifndef   INPUT_EOF
    #define   INPUT_EOF     -256
    #endif /* INPUT_EOF */
    
    #define   PERM_PRIVATE  16
    #define   PERM_SHARED    8
    #define   PERM_READ      4
    #define   PERM_WRITE     2
    #define   PERM_EXEC      1
    
    typedef struct {
        int            descriptor;
        int            status;
        unsigned char *next;
        unsigned char *ends;
        unsigned char  buffer[INPUT_BUFFER + 16];
    } input_buffer;
    
    /* Refill input buffer. Returns the number of new bytes.
     * Sets status to ENODATA at EOF.
    */
    static size_t input_refill(input_buffer *const input)
    {
        ssize_t n;
    
        if (input->status)
            return (size_t)0;
    
        if (input->next > input->buffer) {
            if (input->ends > input->next) {
                memmove(input->buffer, input->next,
                        (size_t)(input->ends - input->next));
                input->ends = input->buffer + (size_t)(input->ends - input->next);
                input->next = input->buffer;
            } else {
                input->ends = input->buffer;
                input->next = input->buffer;
            }
        }
    
        do {
            n = read(input->descriptor, input->ends,
                     INPUT_BUFFER - (size_t)(input->ends - input->buffer));
        } while (n == (ssize_t)-1 && errno == EINTR);
        if (n > (ssize_t)0) {
            input->ends += n;
            return (size_t)n;
    
        } else
        if (n == (ssize_t)0) {
            input->status = ENODATA;
            return (size_t)0;
        }
    
        if (n == (ssize_t)-1)
            input->status = errno;
        else
            input->status = EIO;
    
        return (size_t)0;
    }
    
    /* Low-lever getchar() equivalent.
    */
    static inline int input_next(input_buffer *const input)
    {
        if (input->next < input->ends)
            return *(input->next++);
        else
        if (input_refill(input) > 0)
            return *(input->next++);
        else
            return INPUT_EOF;
    }
    
    /* Low-level ungetc() equivalent.
    */
    static inline int input_back(input_buffer *const input, const int c)
    {
        if (c < 0 || c > 255)
            return INPUT_EOF;
        else
        if (input->next > input->buffer)
            return *(--input->next) = c;
        else
        if (input->ends >= input->buffer + sizeof input->buffer)
            return INPUT_EOF;
    
        memmove(input->next + 1, input->next, (size_t)(input->ends - input->next));
        input->ends++;
        return *(input->next) = c;
    }
    
    /* Low-level fopen() equivalent.
    */
    static int input_open(input_buffer *const input, const char *const filename)
    {
        if (!input)
            return errno = EINVAL;
    
        input->descriptor = -1;
        input->status = 0;
        input->next = input->buffer;
        input->ends = input->buffer;
    
        if (!filename || !*filename)
            return errno = input->status = EINVAL;
    
        do {
            input->descriptor = open(filename, O_RDONLY | O_NOCTTY);
        } while (input->descriptor == -1 && errno == EINTR);
        if (input->descriptor == -1)
            return input->status = errno;
    
        return 0;
    }
    
    /* Low-level fclose() equivalent.
    */
    static int input_close(input_buffer *const input)
    {
        int result;
    
        if (!input)
            return errno = EINVAL;
    
        /* EOF is not an error; we use ENODATA for that. */
        if (input->status == ENODATA)
            input->status = 0;
    
        if (input->descriptor != -1) {
            do {
                result = close(input->descriptor);
            } while (result == -1 && errno == EINTR);
            if (result == -1 && !input->status)
                input->status = errno;
        }
    
        input->descriptor = -1;
        input->next = input->buffer;
        input->ends = input->buffer;
    
        return errno = input->status;
    }
    
    /* Read /proc/self/maps, and fill in the arrays corresponding to the fields.
     * The function will return the number of mappings, even if not all are saved.
    */
    size_t read_maps(size_t const n,
                     void **const ptr, size_t *const len,
                     unsigned char *const mode)
    {
        input_buffer    input;
        size_t          i = 0;
        unsigned long   curr_start, curr_end;
        unsigned char   curr_mode;
        int             c;
    
        errno = 0;
    
        if (input_open(&input, "/proc/self/maps"))
            return (size_t)0; /* errno already set. */
    
        c = input_next(&input);
        while (c >= 0) {
    
            /* Skip leading controls and whitespace */
            while (c >= 0 && c <= 32)
                c = input_next(&input);
    
            /* EOF? */
            if (c < 0)
                break;
    
            curr_start = 0UL;
            curr_end = 0UL;
            curr_mode = 0U;
    
            /* Start of address range. */
            while (1)
                if (c >= '0' && c <= '9') {
                    curr_start = (16UL * curr_start) + c - '0';
                    c = input_next(&input);
                } else
                if (c >= 'A' && c <= 'F') {
                    curr_start = (16UL * curr_start) + c - 'A' + 10;
                    c = input_next(&input);
                } else
                if (c >= 'a' && c <= 'f') {
                    curr_start = (16UL * curr_start) + c - 'a' + 10;
                    c = input_next(&input);
                } else
                    break;
            if (c == '-')
                c = input_next(&input);
            else {
                errno = EIO;
                return (size_t)0;
            }
    
            /* End of address range. */
            while (1)
                if (c >= '0' && c <= '9') {
                    curr_end = (16UL * curr_end) + c - '0';
                    c = input_next(&input);
                } else
                if (c >= 'A' && c <= 'F') {
                    curr_end = (16UL * curr_end) + c - 'A' + 10;
                    c = input_next(&input);
                } else
                if (c >= 'a' && c <= 'f') {
                    curr_end = (16UL * curr_end) + c - 'a' + 10;
                    c = input_next(&input);
                } else
                    break;
            if (c == ' ')
                c = input_next(&input);
            else {
                errno = EIO;
                return (size_t)0;
            }
    
            /* Permissions. */
            while (1)
                if (c == 'r') {
                    curr_mode |= PERM_READ;
                    c = input_next(&input);
                } else
                if (c == 'w') {
                    curr_mode |= PERM_WRITE;
                    c = input_next(&input);
                } else
                if (c == 'x') {
                    curr_mode |= PERM_EXEC;
                    c = input_next(&input);
                } else
                if (c == 's') {
                    curr_mode |= PERM_SHARED;
                    c = input_next(&input);
                } else
                if (c == 'p') {
                    curr_mode |= PERM_PRIVATE;
                    c = input_next(&input);
                } else
                if (c == '-') {
                    c = input_next(&input);
                } else
                    break;
            if (c == ' ')
                c = input_next(&input);
            else {
                errno = EIO;
                return (size_t)0;
            }
    
            /* Skip the rest of the line. */
            while (c >= 0 && c != '\n')
                c = input_next(&input);
    
            /* Add to arrays, if possible. */
            if (i < n) {
                if (ptr) ptr[i] = (void *)curr_start;
                if (len) len[i] = (size_t)(curr_end - curr_start);
                if (mode) mode[i] = curr_mode;
            }
            i++;
        }
    
        if (input_close(&input))
            return (size_t)0; /* errno already set. */
    
        errno = 0;
        return i;
    }
    

    The read_maps() function reads up to n regions, start addresses as void * into the ptr array, lengths into the len array, and permissions into the mode array, returning the total number of maps (may be greater than n), or zero with errno set if an error occurs.

    It is quite possible to use syscalls for the low-level I/O above, so that you don't use any C library features, but I don't think it is at all necessary. (The C libraries, as far as I can tell, use very simple wrappers around the actual syscalls for these.)

I hope you find this useful.

Nominal Animal
  • 38,216
  • 5
  • 59
  • 86
  • Wouldn't it be better performance-wise to use mincore() rather than msync() since it has no direct IO implications? – fons Feb 19 '13 at 08:38
  • @fons: `read_maps()` is implemented at end of the second source snippet. I wrote it from scratch (was bored), against the kernel docs. These are stable interfaces, so if procfs is mounted at `/proc`, it should work. As to `mincore()`, absolutely -- but again, only if you check each individual page. (I just checked the latest kernel sources for the `mincore()` syscall in `mm/mincore.c`, and it is pretty much optimal: it does the `access_ok()` immediately after verifying page alignment.) So yes; mincore() would be the better choice. – Nominal Animal Feb 19 '13 at 17:33
  • 2
    There is an easy way to tell `mmap` to fail if the address is already taken: omit `MAP_FIXED`. In that case it will "spuriously" succeed choosing a different address than the one you requested. If the differently-chosen address is acceptable, simply use it. If not, call `munmap` and try again at a different address. – R.. GitHub STOP HELPING ICE Sep 13 '14 at 20:31
  • 1
    Like the `dup2` function for file descriptors, `MAP_FIXED` should never be used except when you know the target is already allocated, won't be deallocated before/during your call, and you want to atomically replace it with something new. – R.. GitHub STOP HELPING ICE Sep 13 '14 at 20:32
  • NOTICE: both `msync()` and `mincore()` may return `0` if a page is NOT mapped, that is, for a mapped page, you will get result `0`, but for a unmapped page, you may get `0` or may get `-1` with `ENOMEM`. – Kelvin Hu Nov 22 '17 at 11:27
  • @KelvinHu: The [`mincore()`](http://man7.org/linux/man-pages/man2/mincore.2.html) man page explicitly states that if the range contains unmapped pages, it returns `errno == ENOMEM`; the comments for the kernel [`linux/mincore.c:sys_mincore()`](https://github.com/torvalds/linux/blob/master/mm/mincore.c) says the exact same. What is the basis of your claim? – Nominal Animal Nov 22 '17 at 12:33
  • @NominalAnimal Sorry for didn't make things clear, I mean that for a unmapped page, `mincore()` may NOT always return `errno == ENOMEM`, it may also return `errno == 0(no error)`, so one can not depend on the following code: `if (mincore(...) == -1 && errno == ENOMEM) printf("unmapped"); else printf("mapped");` In the `else` statement it should be `printf("might be mapped");` – Kelvin Hu Nov 23 '17 at 03:02
  • @KelvinHu: The only case where that can occur, as far as I can tell (based on the documentation and comments in the code; I haven't actually checked the entire call graph on the kernel side), is if another thread in the same process causes the page or pages to be mapped just after the `mincore()` syscall. If that is not what you mean, then it looks like it would be a kernel bug for that to occur. So, what is the basis of your claim? Have you observed this in some situation, or is it just your (or someone elses) personal belief that it might occur? – Nominal Animal Nov 23 '17 at 07:40
  • @NominalAnimal I observed this behavior, the code and description is posted at: https://pastebin.com/Zm52Y88w, my environment is **Ubuntu 16.04, kernel: 4.4.0** – Kelvin Hu Nov 23 '17 at 09:27
  • 1
    @KelvinHu: Ah, now I understand. You see, even in your case `mincore()` output **is correct**; it is just that the C library allocates extra guard pages that are not released when you unmap the original mapping. You can verify this by running `strace ./yourbinary`, looking at the `mmap()` calls. You'll notice that the `/proc/self/maps` output contains those guard pages, and its output matches what `mincore()` reports. In other words: **No, `mincore()` output IS reliable**; it's just that you may have unexpected "guard pages" mapped for you. – Nominal Animal Nov 23 '17 at 15:21
  • @KelvinHu: If you want, I can try and write a freestanding C program (using syscalls directly, avoiding the standard C library altogether), that demonstrates the issue. Essentially, it uses static buffers: reads and writes `/proc/self/maps` and `mincore()` test results before mapping, after mapping, and after unmapping. You'll see that the `/proc/self/maps` output and `mincore()` output match perfectly at all times. Because the former is kernel's view of the process mapping, this shows that `mincore()` is reliable, it's just that there are sometimes unexpected extra mappings nearby. – Nominal Animal Nov 23 '17 at 15:24
  • @NominalAnimal Wow, thanks for your great explanation! I never realized that the glibc wrapper is doing things not mentioned in man page! I will do further investigation myself, thanks. – Kelvin Hu Nov 24 '17 at 03:15
  • @KelvinHu: No worries! I do not recall seeing the extra mappings before, but then again, I haven't really looked for them, and might just have missed them as library-internal allocations. (Usually, in simulations, I do tend to do very large fixed-size mappings (with no-swap file backing for restart capability), so I haven't had opportunity to really see those few extra pages.) So, if you do find additional notable information, please do let me know; either in this comment chain, or direct to my email. Thanks! – Nominal Animal Nov 24 '17 at 10:56
7

"Which explains what I am seeing, but I have a couple of questions:"

"Is there a way to detect if something was already mapped to certain address? without accessing /proc/maps?"

Yes, use mmap without MAP_FIXED.

"Is there a way to force mmap to fail in the case of finding overlapping pages?"

Apparently not, but simply use munmap after the mmap if mmap returns a mapping at other than the requested address.

When used without MAP_FIXED, mmap on both linux and Mac OS X (and I suspect elsewhere also) obeys the address parameter iff no existing mapping in the range [address, address + length) exists. So if mmap answers a mapping at a different address to the one you supply you can infer there already exists a mapping in that range and you need to use a different range. Since mmap will typically answer a mapping at a very high address when it ignores the address parameter, simply unmap the region using munmap, and try again at a different address.

Using mincore to check for use of an address range is not only a waste of time (one has to probe a page at a time), it may not work. Older linux kernels will only fail mincore appropriately for file mappings. They won't answer anything at all for MAP_ANON mappings. But as I've pointed out, all you need is mmap and munmap.

I've just been through this exercise in implementing a memory manager for a Smalltalk VM. I use sbrk(0) to find out the first address at which I can map the first segment, and then use mmap and an increment of 1Mb to search for room for subsequent segments:

static long          pageSize = 0;
static unsigned long pageMask = 0;

#define roundDownToPage(v) ((v)&pageMask)
#define roundUpToPage(v) (((v)+pageSize-1)&pageMask)

void *
sqAllocateMemory(usqInt minHeapSize, usqInt desiredHeapSize)
{
    char *hint, *address, *alloc;
    unsigned long alignment, allocBytes;

    if (pageSize) {
        fprintf(stderr, "sqAllocateMemory: already called\n");
        exit(1);
    }
    pageSize = getpagesize();
    pageMask = ~(pageSize - 1);

    hint = sbrk(0); /* the first unmapped address above existing data */

    alignment = max(pageSize,1024*1024);
    address = (char *)(((usqInt)hint + alignment - 1) & ~(alignment - 1));

    alloc = sqAllocateMemorySegmentOfSizeAboveAllocatedSizeInto
                (roundUpToPage(desiredHeapSize), address, &allocBytes);
    if (!alloc) {
        fprintf(stderr, "sqAllocateMemory: initial alloc failed!\n");
        exit(errno);
    }
    return (usqInt)alloc;
}

/* Allocate a region of memory of at least size bytes, at or above minAddress.
 *  If the attempt fails, answer null.  If the attempt succeeds, answer the
 * start of the region and assign its size through allocatedSizePointer.
 */
void *
sqAllocateMemorySegmentOfSizeAboveAllocatedSizeInto(sqInt size, void *minAddress, sqInt *allocatedSizePointer)
{
    char *address, *alloc;
    long bytes, delta;

    address = (char *)roundUpToPage((unsigned long)minAddress);
    bytes = roundUpToPage(size);
    delta = max(pageSize,1024*1024);

    while ((unsigned long)(address + bytes) > (unsigned long)address) {
        alloc = mmap(address, bytes, PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_PRIVATE, -1, 0);
        if (alloc == MAP_FAILED) {
            perror("sqAllocateMemorySegmentOfSizeAboveAllocatedSizeInto mmap");
            return 0;
        }
        /* is the mapping both at or above address and not too far above address? */
        if (alloc >= address && alloc <= address + delta) {
            *allocatedSizePointer = bytes;
            return alloc;
        }
        /* mmap answered a mapping well away from where Spur prefers.  Discard
         * the mapping and try again delta higher.
         */
        if (munmap(alloc, bytes) != 0)
            perror("sqAllocateMemorySegment... munmap");
        address += delta;
    }
    return 0;
}

This appears to work well, allocating memory at ascending addresses while skipping over any existing mappings.

HTH

Eliot Miranda
  • 291
  • 2
  • 5
4

It seems that posix_mem_offset() is what I was looking for.

Not only it tells you if an address is mapped but also, in case it happens to be mapped, it implicitly gives you the boundaries of the mapped area to which it belongs (by providing SIZE_MAX in the len argument).

So, before enforcing MAP_FIXED, I can use posix_mem_offset() to verify that the address I am using is not mapped yet.

I could use msync() or mincore() too (checking for an ENOMEM error tells you that an address is already mapped), but then I would be blinder (no information about the area where the address is mapped). Also, msync() has side effects which may have a performance impact and mincore() is BSD-only (not POSIX).

fons
  • 4,905
  • 4
  • 29
  • 49
  • 2
    Current Linux kernels do not provide such a syscall, and at least `libc6-2.15-0ubuntu10.3` does not provide a `posix_mem_offset()` function, so `posix_mem_offset()` may not be as portable as you think. – Nominal Animal Feb 19 '13 at 17:26
  • @NominalAnimal, True. I wrote this before getting to test the function, I will use mincore then. – fons Feb 20 '13 at 08:35
2

MAP_FIXED_NOREPLACE exists since 4.17 and seems to me that it is exactly what you are looking for: "If the requested range would collide with an existing mapping, then this call fails with the error EEXIST."

(4.17 was released ~2 years after you posted this question.)

Saturnus
  • 99
  • 6