
As is known from the malloc(3) man page: http://linux.die.net/man/3/malloc

By default, Linux follows an optimistic memory allocation strategy. This means that when malloc() returns non-NULL there is no guarantee that the memory really is available. In case it turns out that the system is out of memory, one or more processes will be killed by the OOM killer.

And we can successfully allocate 1 petabyte of VMA (virtual memory area) by calling malloc(petabyte): http://ideone.com/1yskmB

#include <stdio.h>
#include <stdlib.h>

int main(void) {

    long long int petabyte = 1024LL * 1024LL * 1024LL * 1024LL * 1024LL;    // 2^50
    printf("petabyte %lld \n", petabyte);

    volatile char *ptr = (volatile char *)malloc(petabyte);
    printf("malloc() - success, ptr = %p \n", ptr);

    ptr[petabyte - 1LL] = 10;
    printf("ptr[petabyte - 1] = 10; - success \n");

    printf("ptr[petabyte - 1] = %d \n", (int)(ptr[petabyte - 1LL]));

    free((void*)ptr);   // why does the error happen here?
    //printf("free() - success \n");

    return 0;
}

Result:

Error   time: 0 memory: 2292 signal:6
petabyte 1125899906842624 
malloc() - success, ptr = 0x823e008 
ptr[petabyte - 1] = 10; - success 
ptr[petabyte - 1] = 10 

And we can successfully access (store/load) the last byte of the petabyte-sized array, but why do we get an error on free((void*)ptr);?

Note: https://en.wikipedia.org/wiki/Petabyte

  • 1000^5 bytes = 1 PB (petabyte)
  • 1024^5 bytes = 1 PiB (pebibyte) - this is what I use

So if we really want to allocate more than RAM + swap and work around the overcommit_memory limit, we can allocate memory by using VirtualAllocEx() on Windows, or mmap() on Linux, for example:
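
Here is a minimal sketch of the Linux variant (assuming a 64-bit build; the MAP_NORESERVE / MAP_ANONYMOUS flags follow the mmap() discussion in the answers and comments below, and the 16 TiB size is only illustrative):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* Reserve 16 TiB of virtual address space without committing RAM or swap;
       MAP_NORESERVE asks the kernel not to reserve swap for this mapping. */
    size_t bytes = (size_t)16 << 40;   /* 16 TiB, assumes a 64-bit size_t */
    void *ptr = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (ptr == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("mmap() - success, ptr = %p \n", ptr);
    munmap(ptr, bytes);
    return 0;
}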

Alex
  • Any reason you qualify as `volatile`? It is very likely nonsense and defies optimisations. – too honest for this site Jul 25 '16 at 12:55
  • @Olaf `volatile` is there to avoid an optimization in which there is no memory access and the work happens in registers, to show that we really are working with memory. – Alex Jul 25 '16 at 12:56
  • That said, just nitpick, `(int)` cast is not required in `... %d \n", (int)(ptr[petabyte - 1LL]));`, it's implicit. – Sourav Ghosh Jul 25 '16 at 12:56
  • Nitpick. Your `petabyte` comment is wrong. That's not 10^15. It is 2^50. (If I counted bits correctly). – Zan Lynx Jul 25 '16 at 13:00
  • Memory allocation knows nothing about your data structures. When you write to the last byte of the array, only the last page of memory is actually being allocated. – stark Jul 25 '16 at 13:00
  • @Alex: That is quite useless, as there will be something to access if `malloc` does not return a null pointer. – too honest for this site Jul 25 '16 at 13:01
  • Enable warnings, especially `-Wconversion`. – too honest for this site Jul 25 '16 at 13:06
  • If you [uncomment the `free`](http://ideone.com/dlC4vw), you'll note that it crashes. By writing to `ptr[-1]`, you clobbered the data malloc stores (in this implementation) before the allocated memory block (of 0 bytes), and thus calling `free((void*)ptr)` crashes. – Daniel Fischer Jul 27 '16 at 09:35
  • [don't cast malloc in C](http://stackoverflow.com/q/605845/995714) – phuclv Apr 26 '17 at 15:35

2 Answers


I believe that your problem is that malloc() does not take a long long int as its argument. It takes a size_t.

After changing your code to define petabyte as a size_t your program no longer returns a pointer from malloc. It fails instead.

I think that your array access setting petabyte-1 to 10 is writing far, far outside of the array malloc returned. That's the crash.

Always use the correct data types when calling functions.

Use this code to see what's going on:

long long int petabyte = 1024LL * 1024LL * 1024LL * 1024LL * 1024LL;
size_t ptest = petabyte;
printf("petabyte %lld %lu\n", petabyte, ptest);

If I compile in 64 bit mode it fails to malloc 1 petabyte. If I compile in 32 bit mode it mallocs 0 bytes, successfully, then attempts to write outside its array and segfaults.
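
As a sketch, a type-correct version would look something like this (the SIZE_MAX guard is just one way to handle the 32-bit case):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
    unsigned long long petabyte = 1024ULL * 1024ULL * 1024ULL * 1024ULL * 1024ULL;   /* 2^50 */

    if (petabyte > SIZE_MAX) {
        /* On a 32-bit build the request cannot even be expressed as a size_t. */
        fprintf(stderr, "request does not fit in size_t\n");
        return EXIT_FAILURE;
    }

    char *ptr = malloc((size_t)petabyte);
    if (ptr == NULL) {
        /* The expected outcome for a petabyte request on a 64-bit build. */
        fprintf(stderr, "malloc() failed\n");
        return EXIT_FAILURE;
    }

    ptr[(size_t)petabyte - 1] = 10;
    free(ptr);
    return EXIT_SUCCESS;
}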

Zan Lynx
  • Disagree with "...your program no longer returns a pointer from malloc. It fails instead." Should `1024LL * 1024LL * 1024LL * 1024LL * 1024LL` exceed `SIZE_MAX`, the expected value of `size_t petabyte` would be 0. Certainly `malloc(0)` did not fail. `malloc(0)` returns a pointer or `NULL` and both are valid values for a pointer and do not indicate OOM. The error is in later code attempting to de-reference `ptr`. – chux - Reinstate Monica Jul 25 '16 at 14:06
  • @chux: On my 64 bit machine it refuses to allocate a petabyte, no matter if overcommit_memory is set to 0 or 1. As for the expected value being 0, why do you think so? 2^50 wraps around in 2^32, and I don't see how that ever becomes 0. Remember that signed integer wrap is *undefined behavior* in C. – Zan Lynx Jul 25 '16 at 14:09
  • There is no _signed_ integer wrap. `size_t` is an unsigned type. Assigning 2^50 to a `size_t` is well defined regardless of its bit-width. With `size_t` as a 50-bit or narrower type, `size_t petabyte = 1024LL ...` will be 0. – chux - Reinstate Monica Jul 25 '16 at 14:12
  • @chux When a *signed* integer of size 64 is forced into an *unsigned* 32 bit value, it wraps. Maybe. It's undefined. The key words to search for are `c integer narrowing conversion` – Zan Lynx Jul 25 '16 at 14:19
  • @chux: OK, I see what you're saying now. Yes, on regular x86 hardware it appears to come out as a zero. – Zan Lynx Jul 25 '16 at 14:25
  • 6.3.1.3 Signed and unsigned integers 2 "if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type." It is not _signed_ overflow. – chux - Reinstate Monica Jul 25 '16 at 14:26

(This is not an answer, but an important note for anybody working with large datasets in Linux)

That is not how you use very large -- on the order of terabytes and up -- datasets in Linux.

When you use malloc() or mmap() (the GNU C library will use mmap() internally for large allocations anyway) to allocate private memory, the kernel limits the size to the size of (theoretically) available RAM and SWAP, multiplied by the overcommit factor.

Simply put, we know that larger-than-RAM datasets may have to be swapped out, so the size of the current swap will affect how large allocations are allowed.
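
You can see the limits the kernel is currently applying by looking at /proc/meminfo. A small sketch (the field names are standard; note that CommitLimit is strictly enforced only when vm.overcommit_memory is set to 2):

#include <stdio.h>
#include <string.h>

int main(void) {
    /* CommitLimit: the commit cap derived from swap plus a fraction of RAM
       (enforced only in overcommit mode 2).
       Committed_AS: how much memory is currently committed. */
    FILE *in = fopen("/proc/meminfo", "r");
    char line[256];

    if (in == NULL) {
        perror("/proc/meminfo");
        return 1;
    }

    while (fgets(line, sizeof line, in) != NULL)
        if (strncmp(line, "CommitLimit:", 12) == 0 || strncmp(line, "Committed_AS:", 13) == 0)
            fputs(line, stdout);

    fclose(in);
    return 0;
}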

To work around that, we create a file to be used as "swap" for the data, and map it using the MAP_NORESERVE flag. This tells the kernel that we don't want to use standard swap for this mapping. (It also means that if, for any reason, the kernel cannot get a new backing page, the application will get a SIGSEGV signal and die.)

Most filesystems in Linux support sparse files. This means that you can have a terabyte-sized file that only takes a few kilobytes of actual disk space, if most of its contents are not written (and are thus zeroes). (Creating sparse files is easy; you simply skip over long runs of zeroes. Hole-punching is more difficult, as writing zeroes does use normal disk space; other methods need to be used instead.)
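
For hole-punching specifically, here is a sketch using Linux's fallocate() with FALLOC_FL_PUNCH_HOLE (illustrative only; the flag needs filesystem support, which ext4 and xfs provide):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 4) {
        fprintf(stderr, "Usage: %s FILE OFFSET LENGTH\n", argv[0]);
        return EXIT_FAILURE;
    }

    int fd = open(argv[1], O_WRONLY);
    if (fd == -1) {
        perror("open");
        return EXIT_FAILURE;
    }

    off_t offset = (off_t)strtoull(argv[2], NULL, 10);
    off_t length = (off_t)strtoull(argv[3], NULL, 10);

    /* Deallocate the byte range but keep the file size: the range reads back
       as zeroes and no longer occupies disk blocks. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, length) == -1) {
        perror("fallocate");
        close(fd);
        return EXIT_FAILURE;
    }

    close(fd);
    return EXIT_SUCCESS;
}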

Here is an example program that you can use for exploration, mapfile.c:

#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    const char    *filename;
    size_t         page, size;
    int            fd, result;
    unsigned char *data;
    char           dummy;

    if (argc != 3 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
        fprintf(stderr, "\n");
        fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
        fprintf(stderr, "       %s MAPFILE BYTES\n", argv[0]);
        fprintf(stderr, "\n");
        return EXIT_FAILURE;
    }

    /* sysconf() returns a long and may return -1 on error, so check the
       value before converting it to the unsigned size_t. */
    long pagesize = sysconf(_SC_PAGESIZE);
    if (pagesize < 1) {
        fprintf(stderr, "Unknown page size.\n");
        return EXIT_FAILURE;
    }
    page = (size_t)pagesize;

    filename = argv[1];
    if (!filename || !*filename) {
        fprintf(stderr, "No map file name specified.\n");
        return EXIT_FAILURE;
    }

    if (sscanf(argv[2], " %zu %c", &size, &dummy) != 1 || size < 3) {
        fprintf(stderr, "%s: Invalid size in bytes.\n", argv[2]);
        return EXIT_FAILURE;
    }

    if (size % page) {
        /* Round up to next multiple of page */
        size += page - (size % page);
        fprintf(stderr, "Adjusted to %zu pages (%zu bytes)\n", size / page, size);
    }

    do {
        fd = open(filename, O_RDWR | O_CREAT | O_EXCL, 0600);
    } while (fd == -1 && errno == EINTR);
    if (fd == -1) {
        fprintf(stderr, "Cannot create %s: %s.\n", filename, strerror(errno));
        return EXIT_FAILURE;
    }

    do {
        result = ftruncate(fd, (off_t)size);
    } while (result == -1 && errno == EINTR);
    if (result == -1) {
        fprintf(stderr, "Cannot resize %s: %s.\n", filename, strerror(errno));
        unlink(filename);
        close(fd);
        return EXIT_FAILURE;
    }

    data = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
    if ((void *)data == MAP_FAILED) {
        fprintf(stderr, "Mapping failed: %s.\n", strerror(errno));
        unlink(filename);
        close(fd);
        return EXIT_FAILURE;
    }

    fprintf(stderr, "Created file '%s' to back a %zu-byte mapping at %p successfully.\n", filename, size, (void *)data);

    fflush(stdout);
    fflush(stderr);

    data[0] = 1U;
    data[1] = 255U;

    data[size-2] = 254U;
    data[size-1] = 127U;

    fprintf(stderr, "Mapping accessed successfully.\n");

    munmap(data, size);
    unlink(filename);
    close(fd);

    fprintf(stderr, "All done.\n");
    return EXIT_SUCCESS;
}

Compile it using e.g.

gcc -Wall -O2 mapfile.c -o mapfile

and run it without arguments to see the usage.

The program simply sets up a mapping (adjusted to a multiple of the current page size), and accesses the first two and last two bytes of the mapping.

On my machine, running a 4.2.0-42-generic #49~14.04.1-Ubuntu SMP kernel on x86-64, on an ext4 filesystem, I cannot map a full petabyte. The maximum seems to be about 17,592,186,040,320 bytes (2^44 - 4096) -- 16 TiB - 4 KiB --, which comes to 4,294,967,296 pages of 4096 bytes (2^32 pages of 2^12 bytes each). It looks like the limit is imposed by the ext4 filesystem, as the failure occurs in the ftruncate() call (before the mapping is even tried).

(On a tmpfs I can get up to about 140,187,732,541,440 bytes or 127.5 TiB, but that's just a gimmick, because tmpfs is backed by RAM and swap, not an actual storage device. So it's not an option for real big data work. I seem to recall xfs would work for really large files, but I'm too lazy to format a partition or even look up the specs; I don't think anybody will actually read this post, even though the information herein has been very useful to me over the last decade or so.)

Here's how that example run looks on my machine (using a Bash shell):

$ ./mapfile datafile $[(1<<44)-4096]
Created file 'datafile' to back a 17592186040320-byte mapping at 0x6f3d3e717000 successfully.
Mapping accessed successfully.
All done.


Nominal Animal
  • Thank you! Can we simply map `/dev/zero` (or use MAP_ANONYMOUS) instead of `datafile`, to allocate uncommited array, and to work around limits of filesystem: 2^32 pages and supporting sparse files? As said there: http://stackoverflow.com/questions/2782628/any-way-to-reserve-but-not-commit-memory-in-linux/2782910#2782910 – Alex Jul 25 '16 at 18:58
  • `I don't think anybody will actually read this post` I read the entire post and learned something. – Aloha Jul 25 '16 at 18:59
  • @Alex: No, not really. The idea is, after all, to have the kernel *page out* pages when they have not been accessed recently, because we do not have enough RAM for the entire dataset. You can, instead, just use `mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0)` to get private anonymous pages, up to as many as the kernel can support (unless limited for the current process). If you run out of RAM, you then get a `SIGSEGV`, because there is nowhere to evict the unused pages to in order to make room for new ones. – Nominal Animal Jul 25 '16 at 19:14
  • Yes, you are right, so we will be able to allocate **16 TiB** of memory, which is much larger than the size of RAM, and when it fills up, **it will swap** (to datafile) instead of bringing down the entire application. Also we can allocate even more memory, **127 TiB**, if we use the flags `MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS` and `fd=-1`, **but when RAM is full, the application will crash** with `SIGSEGV`: http://coliru.stacked-crooked.com/a/c69ce8ad7fbe4560 – Alex Jul 25 '16 at 19:46
  • @Alex: Exactly. The precise limits do vary between Linux architectures, file systems, and kernel versions, and can even be set per process ([ulimit](http://man7.org/linux/man-pages/man3/ulimit.3.html), often set in a PAM module at login time). You can check these in Bash using `ulimit -a` -- `ulimit` being a built-in function in Bash, not a separate binary. – Nominal Animal Jul 25 '16 at 22:24