1

I asked this question about how to read a text file starting from offset pos to offset end through mmap(). In particular the text file is read by multiple threads with the following code:

void getNextKeyValue() {
    key = pos;//value is the actual file offset
    char *mmappedData = (char*) mmap(NULL, end-pos+1, PROT_READ, MAP_PRIVATE , fd, pos);
    assert(mmappedData != NULL);
    value.assign(mmappedData);
    assert(munmap(mmappedData, end-pos+1)==0);
    morePairs = false;
}

The unreported variables are declared and initialized somewhere else. What By the way, the following code read the whole text file, and not from offset pos to end.

In addiction, the program terminates abruptly (no error output) with multiple threads, while it terminates correctly with only one thread that read the whole file.

UPDATE:

Following this example (you can try my version, using cout insted of write, HERE with ./main main.cpp 10 20) I found out that what I was doing wrong was that I printed the read data through cout<<mmappedData<<endl. Insted if I use write(STDOUT_FILENO, mmappedData+pos-pa_offset, end-pos); the right portion of text is printed.

What I still do not understand is why the whole text is stored inside mmappedData (or addr following the linked example): the mmap usage clearly states that the number of bytes read are the 2nd arg starting from the 4th arg.

Community
  • 1
  • 1
justHelloWorld
  • 6,478
  • 8
  • 58
  • 138

3 Answers3

1

Update for new info:

Your problem is that you misunderstand how mmap works. mmap maps pages of memory, not bytes; even if you only ask to map 24 bytes, it will actually map several kilobytes (on most systems 4KB), with no guarantee that the data is NUL terminated (and in fact, unless the mapping reaches the end of the file, for a text input file it probably isn't NUL terminated). And the std::string::assign method that takes just a char * as the argument is using C-style string logic; it just keeps reading characters until it hits a NUL. If you're "lucky", the page after the mmap is readable and contains a NUL and it just copies everything up to that NUL into the string (or equivalently when doing cout << mmappedData, writes it out), if not, you segfault trying to access unmapped memory addresses after your mapping.

If the goal is to assign a specific number of bytes to a std::string from a char*, you need to use the two-arg form of assign to make the start and length explicit just like you do with write, then it will only use the data you need:

value.assign(mmappedData+pos-pa_offset, end-pos);

Original speculative answer:

The code given is not clear enough to rule out other issues, but if you're constantly rereading values, and NDEBUG is set (disabling asserts), then you've got two overlapping problems:

  1. You're not checking the return value for mmap (and even with asserts enabled, it's not being checked correctly; the error return for mmap is MAP_FAILED, not NULL)
  2. You're never munmap-ing; the munmap call is inside the assert, and with asserts disabled, the code you're asserting on is removed by the preprocessor.

So if you've got a ton of threads calling this function over and over (particularly if the region being mapped is large), you'll eventually run out of virtual memory space; if you map 1 MB each call, even on a 64 bit system (which actually has only 47 bits of user mode virtual address space), you'd run out of virtual address space after ~134M calls. If you map 1 GB each call, you'd run out after ~134K calls. On a 32 bit system of course you'd hit the limits much faster.

Aside from that, I've got no way to know whether there are other problems; value.assign could easily be the issue as well if it's shared data.

ShadowRanger
  • 143,180
  • 12
  • 188
  • 271
1

Reading data from the file this way is going to be effectively single-threaded. Each mmap/munmap pair requires two modifications to the process's address space memory mappings - and only one thread can cause changes to the address space mappings at any one time. Even worse, changing the address space mappings of a process is an expensive operation in and of itself.

Either map the entire file - and leave it mapped - and copy data from the mapping as needed, or use low-level IO calls that are not dependent upon any shared value or state, like pread(). Since you already have an open file descriptor fd to the file , something like this:

void getNextKeyValue() {
    char buffer[end-pos+1];
    ssize_t result = pread(fd, buffer, end-pos+1L, pos);
    assert(result == ( end-pos+1L ) );
    value.assign(buffer);
    morePairs = false;
}

Note that you can not use lseek() then read() - that is not an atomic operation - another thread can modify the file pointer after a thread's call to lseek() but before its call to read().

And do not use new/delete or malloc()/free() for the buffer on each call to getNextKeyValue() - again that effectively single-threads reading data.

Andrew Henle
  • 32,625
  • 3
  • 24
  • 56
0

There are a few things that concern me with your approach. Firstly, I can't imagine that the file in question is really a text file, because it seems to work if you just call assign() with a raw pointer and somehow the size seems to be determined. This means that either you have embedded NULs or that you always copy everything up to the next NUL (which is probably somewhere beyond the mapped range). In any case, the code smells! This is also the reason why cout << ptr "fails" while write(stdout, ptr, size) works, you have buffer overflows in a strlen() call in the first case.

Secondly, why do you use mmap() at all? I guess you hope to gain some benefits concerning the size or time required. However, plain old C++ IOStreams will probably mmap() a sliding window into the file themselves, which is pretty effective. However, you will have to make it easy for IOStreams to do so:

// turn off some syncing that you shouldn't need
std::ios_base::sync_with_stdio(false);
// disable any character conversions
std::ifstream in;
in.imbue(std::locale::classic());
in.open(filename, std::ios_base::binary);

The classic locale doesn't contain any non-trivial character conversion facets. This shouldn't be necessary, but it also doesn't hurt to make it explicit. The binary flag is used to tell the streambuffer that it shouldn't perform any CR/LF conversions, which makes a difference on MS Windows in particular.

Last thing which is a bit tricky with C++ IOStreams is positioning. Other than the start position (a default-constructed streampos), you are only supposed to give values to seekp() and seekg() that you received from tellp() or tellg(). However, there is hope: Without any character conversions in place, the byte offset might just work for you. If not, you could still structure the code that keeps track of positions around that, though it might be a bit tricky. I could imagine that you will have to rely on implementation-defined behaviour here even.

Ulrich Eckhardt
  • 16,572
  • 3
  • 28
  • 55