
I am working on an application which needs to deal with large amounts of data (in the order of gigabytes). I don't need all the data at once at any moment in time. It is OK to section the data and work on (and thus bring into memory) only one section at any given instance.

I have read that most applications which need to manipulate large amounts of data usually do so by making use of memory-mapped files. Reading further about memory-mapped files, I found that reading/writing data from/into memory-mapped files is faster than normal file I/O because we end up using highly optimized paging algorithms for performing the reads and writes.

Here are the queries that I have:

  1. How different is using memory-mapped files (I am planning to use boost::file_mapping and I am working on Windows) for file I/O from using file streams?
  2. How much faster can I expect data reads/writes to be with memory-mapped files compared to file streams (on a traditional 7200 rpm hard disk)?
  3. Are memory-mapped files the only way to deal with such huge amounts of data? Are there better ways of doing this (considering my use case)?
Arun
  • Interesting case: https://stackoverflow.com/a/25150519/85371 (where file mapping turned out slower than stream-based IO) – sehe Nov 23 '15 at 15:16
  • Possible duplicate of [mmap() vs. reading blocks](http://stackoverflow.com/questions/45972/mmap-vs-reading-blocks) – Colonel Thirty Two Nov 23 '15 at 22:01

4 Answers


(Disclaimer: I am the author of proposed Boost.AFIO)

How different is using memory-mapped files (I am planning to use boost::file_mapping and I am working on Windows) for file I/O from using file streams?

Grossly simplified answer:

Memory-mapped files do reads in 4KB chunks lazily, i.e. when you first touch a given 4KB page. File streams do the read at the moment you ask for the data.

More accurate answer:

Memory-mapped files give you direct access to the kernel's page cache for file I/O. You see exactly what the kernel keeps cached for a given open file. Reads and writes go directly to the kernel page cache - you cannot go faster than that for buffered I/O.
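
To illustrate the two access styles, here is a minimal sketch (assuming a file named data.bin exists; error handling omitted):

```cpp
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <fstream>
#include <vector>

namespace bip = boost::interprocess;

int main()
{
    // File stream: the read happens when you ask for it, into your own buffer.
    std::ifstream in("data.bin", std::ios::binary);
    std::vector<char> buffer(4096);
    in.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));

    // Memory mapping: no explicit read call. Touching a page for the first
    // time faults it in from the kernel page cache (reading from disk only
    // if it is not already cached).
    bip::file_mapping  fm("data.bin", bip::read_only);
    bip::mapped_region region(fm, bip::read_only);   // maps the whole file
    const char* p = static_cast<const char*>(region.get_address());
    volatile char first = p[0];                      // first touch of the first 4KB page
    (void)first;
}
```

In the stream case the bytes are copied into your buffer; in the mapped case p points straight at the page cache and the first dereference of a page is what triggers the actual read.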

How much faster can I expect data reads/writes to be with memory-mapped files compared to file streams (on a traditional 7200 rpm hard disk)?

Probably not noticeable. If you benchmark a difference, it's likely due to confounding factors such as differing caching algorithms. A hard drive is so slow that it will always be the dominant factor.

Now if you were really asking how efficient the two are in terms of load on the system, then memory-mapped files are likely to be far more efficient. STL iostreams copy memory at least once, and on Windows most "immediate" I/O is really a memcpy from a small internal memory map configured by the Windows kernel for your process, so that's at least two memory copies of everything you read.

The most efficient of all is always O_DIRECT/FILE_FLAG_NO_BUFFERING, with all the gotchas that come with it, but it is very rare that you'll write a caching algorithm much better than the operating system's. They have, after all, spent decades tuning their algorithms.
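
For reference, a minimal sketch of the unbuffered route on Windows; the file name and chunk size are placeholders, and FILE_FLAG_NO_BUFFERING requires the buffer address, transfer size and file offset to all be multiples of the volume's sector size (page-aligned memory from VirtualAlloc satisfies the usual 512-byte or 4KB sectors):

```cpp
#include <windows.h>

int main()
{
    // Open without the kernel page cache. All transfers must now be
    // sector-aligned in address, size and file offset.
    HANDLE h = ::CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                             OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    const DWORD chunk = 64 * 1024;                   // multiple of any common sector size
    void* buf = ::VirtualAlloc(nullptr, chunk, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

    DWORD bytesRead = 0;
    ::ReadFile(h, buf, chunk, &bytesRead, nullptr);  // goes straight to the device, no page cache

    ::VirtualFree(buf, 0, MEM_RELEASE);
    ::CloseHandle(h);
    return 0;
}
```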

Are memory-mapped files the only way to deal with such huge amounts of data? Are there better ways of doing this (considering my use case)?

Memory-mapped files let the kernel cache a very large dataset for you using general-purpose caching algorithms which make use of all free memory in your system. Generally speaking, you will not beat them with your own algorithms for most use cases.
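
Applied to your "work on one section at a time" use case, a windowed mapping over a huge file might look like the sketch below (process(), the file path and the window size are assumptions for illustration; offsets passed to mapped_region must be page aligned):

```cpp
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <algorithm>
#include <cstddef>
#include <cstdint>

namespace bip = boost::interprocess;

// Placeholder for whatever work you do on one section of the data.
void process(const char* data, std::size_t n);

void scan_in_windows(const char* path, std::uint64_t file_size)
{
    bip::file_mapping fm(path, bip::read_only);

    const std::size_t page   = bip::mapped_region::get_page_size();
    const std::size_t window = 256 * page;           // ~1MB sections, page aligned

    for (std::uint64_t off = 0; off < file_size; off += window)
    {
        const std::size_t len = static_cast<std::size_t>(
            std::min<std::uint64_t>(window, file_size - off));

        // Map only this window; the rest of the file stays out of your address space.
        bip::mapped_region region(fm, bip::read_only,
                                  static_cast<bip::offset_t>(off), len);
        process(static_cast<const char*>(region.get_address()), region.get_size());
    }   // the region is unmapped here, but the kernel keeps the pages cached while RAM allows
}
```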

Niall Douglas
Some advantages of memory-mapped files over plain file I/O:
  • the contents of the file never end up in the swap file
  • once the file is mapped, there is no need for a system call to access the data
  • the system will optimize its usage of RAM
  • if your process crashes while writing to a memory-mapped file, the contents of the file will still match the contents of memory, without any need for a final write/flush system call (see the sketch after this list)
  • multiple processes (on the same machine) can see the contents of the same file and have changes propagate immediately (reader/writer), and the contents of the file will not end up in the swap file for every reader/writer
  • multiple processes will share the same RAM for mappings of the same file
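
A small sketch of the write/flush behaviour mentioned in the list above, assuming a pre-existing, pre-sized file named shared.bin:

```cpp
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstring>

namespace bip = boost::interprocess;

int main()
{
    // "shared.bin" must already exist and be large enough to hold what we write.
    bip::file_mapping  fm("shared.bin", bip::read_write);
    bip::mapped_region region(fm, bip::read_write);

    char* p = static_cast<char*>(region.get_address());
    std::memcpy(p, "hello", 5);   // lands in the shared page cache; other processes
                                  // mapping shared.bin see it immediately

    region.flush();               // optional: ask the OS to write dirty pages out now
    return 0;
}
```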

How different is using memory-mapped files (I am planning to use boost::file_mapping and I am working on Windows) for file I/O from using file streams?

It's very different. When using a memory-mapped file you just access the file as if it were memory. There is no explicit loading or saving of the file.

This puts requirements on your application and data storage. You have to make sure you can access your data in this way. You also have to make sure that you can fit the data in addressable memory - on a 32-bit system you would be limited to a few GB of data.
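
In practice, "access the file as if it were memory" means your on-disk layout has to be usable in place, e.g. an array of fixed-size, trivially copyable records. A sketch under that assumption (Record and the file name are made up for illustration):

```cpp
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstddef>

namespace bip = boost::interprocess;

// Plain, fixed-layout data: no pointers, no std::string, no virtual functions.
struct Record
{
    int    id;
    double value;
};

int main()
{
    bip::file_mapping  fm("records.bin", bip::read_only);
    bip::mapped_region region(fm, bip::read_only);

    const Record*     records = static_cast<const Record*>(region.get_address());
    const std::size_t count   = region.get_size() / sizeof(Record);

    double sum = 0;
    for (std::size_t i = 0; i < count; ++i)
        sum += records[i].value;  // pages are faulted in as the loop touches them

    (void)sum;
    return 0;
}
```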

How much faster can I expect data reads/writes to be with memory-mapped files compared to file streams (on a traditional 7200 rpm hard disk)?

Don't expect that. If your accesses are page aligned it could very well be about the same performance. Also note that if you read in the data and it doesn't fit in physical RAM, it will be paged out just as it would be if you had memory-mapped the file.

Are memory-mapped files the only way to deal with such huge amounts of data? Are there better ways of doing this (considering my use case)?

That depends on what your actual case is.

skyking
  • If I had not used memory-mapped files, and instead used traditional file I/O, apart from the advantage of being able to read/write as if writing to memory (in the back end the file still needs to be updated by calling a flush), what benefit do memory-mapped files offer? – Arun Nov 23 '15 at 15:09
  • @Arun It mainly offers the benefit that you will learn how to use it and what benefits it offers... The answer is pretty descriptive and generally correct. You just need to go **try it** and **profile** the results for yourself. – sehe Nov 23 '15 at 15:13
  • @sehe: I actually have tried using it, and it does seem to offer some performance benefit compared to reading/writing to a file, but nothing very significant (about 2x). I posted this question to find out what kind of performance gain I should expect... maybe 10x :-) – Arun Nov 23 '15 at 15:22

  1. Basically a memory-mapped file is just a block from the hard disk moved into memory. So it just copies whatever size of block you mapped, and then manipulating that block is as fast as your memory can go, compared to how fast your hard disk can go.
  2. As I said, the difference is basically the difference between your memory speed and your hard disk speed.
  3. I don't have much experience with big data, so I don't feel qualified to answer this one.

Neijwiert
  • It's not quite that simple... memory-mapped files still need to be backed by a file on disk... and the data written into them (as if to memory) still needs to be flushed to disk, which carries the disk-write overhead, so it is not equivalent to writing to memory – Arun Nov 23 '15 at 15:11