
Currently I am debugging a program on Linux that looks like this:

int main(){
    loadHugeFile();
    processTheDataOfTheFile();
    return 0;
}

The thing is that loadHugeFile needs to load a very large file, gigabytes in size, which takes about 5 minutes, while processTheDataOfTheFile takes less than 10 seconds to calculate the required data and return some values. In the future the file's size might increase even further, and it will take even more time to load. The file is an inverted index, so the whole file is needed.

Is it possible to have one process load this file into RAM, keep it there, and have other processes access the loaded data? This is to skip those many minutes of loading. I recall Windows has a function that allows you to access/modify another process's memory, but what are my available choices here on Linux?

Karl
    You might consider using Memory Mapped Files, so you can access very, very, very large files very quickly without having to load the whole thing into RAM. – Kieren Johnstone Feb 13 '13 at 15:42
    You definitely want to be using mmap() to both speed up your file access and share the same instance of the file in memory; see http://stackoverflow.com/questions/258091/when-should-i-use-mmap-for-file-access – tgies Feb 13 '13 at 15:54

4 Answers


You can use the mmap function.

In computing, mmap(2) is a POSIX-compliant Unix system call that maps files or devices into memory. It is a method of memory-mapped file I/O.

You get two advantages: extreme speed in loading the file, and the content will live in a memory area that can be shared between many other processes (just use mmap with the MAP_SHARED flag).

You can test the speed of mmap with this short and dirty code. Just compile it and execute it, passing the file you want to load as a parameter.

#include <stdio.h>
#include <stdint.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
    struct stat sb;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    // get the size in bytes of the file
    fstat(fd, &sb);

    // map the file into a memory area
    char *p = mmap(0, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    // print 3 chars of the file to demonstrate it is loaded ;)
    printf("first 3 chars of the file: %c %c %c\n", p[0], p[1], p[2]);

    close(fd);

    // detach
    munmap(p, sb.st_size);
    return 0;
}
Davide Berra

There's more than one way to do this, but a direct way would be to mmap the file into shared memory so the other processes can access it; see the sketch below.
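
For example, a minimal sketch of the reader side, with /data/index.bin as a placeholder path: every process that maps the same file with MAP_SHARED is backed by the same physical pages in the kernel's page cache, so the data is read from disk only once no matter how many readers attach.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Placeholder path to the inverted index. */
#define INDEX_PATH "/data/index.bin"

int main(void)
{
    int fd = open(INDEX_PATH, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat sb;
    if (fstat(fd, &sb) == -1) { perror("fstat"); return 1; }

    /* MAP_SHARED mappings of the same file share the kernel's page cache,
       so the index is held in RAM once for all processes. */
    char *index = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (index == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... run queries against the mapped index here ... */
    printf("mapped %lld bytes\n", (long long)sb.st_size);

    munmap(index, sb.st_size);
    close(fd);
    return 0;
}

As long as those pages stay in the page cache (for instance because one long-lived process keeps the mapping open), a freshly started reader does not pay the multi-minute load again.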

You could also implement a high-level socket read/write API around the file itself and let users access it via that API. You may also want to think about loading the file into an SQL database so that you get an actual database backend; they are designed for this type of thing.

And in case you need to detect changes to your file, you can use inotify/dnotify
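
If you use inotify, a rough sketch (the watched path is again a placeholder) would block until the index file is modified or rewritten, at which point you could remap or reload it:

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(void)
{
    int in = inotify_init();                 /* create an inotify instance */
    if (in == -1) { perror("inotify_init"); return 1; }

    /* Watch the index file for writes; the path is a placeholder. */
    int wd = inotify_add_watch(in, "/data/index.bin",
                               IN_MODIFY | IN_CLOSE_WRITE);
    if (wd == -1) { perror("inotify_add_watch"); return 1; }

    char buf[4096];
    ssize_t n = read(in, buf, sizeof buf);   /* blocks until an event arrives */
    if (n > 0) {
        /* The file changed: remap it or tell readers to reload. */
        printf("index file was modified\n");
    }

    close(in);
    return 0;
}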

Steve Lazaridis

I am guessing that if your file is multiple gigabytes, it is taking so long to load because it is overflowing RAM and forcing data to be pushed out to the swap area on the hard drive.

One way of achieving your aim of reading the file once and keeping it in RAM would be to copy the file to the /dev/shm/ directory. Files in /dev/shm/ are actually stored in RAM and are available to multiple processes. If your file is close to, or larger than, the amount of RAM on your system, though, this would still run into the same swapping problems, so I wouldn't recommend it in that case.
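
For reference, putting the file in /dev/shm is just an ordinary one-time copy; a sketch of that copy in C, with both paths as placeholders, might be:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* One-time copy of the index into tmpfs; both paths are placeholders. */
    int src = open("/data/index.bin", O_RDONLY);
    int dst = open("/dev/shm/index.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src == -1 || dst == -1) { perror("open"); return 1; }

    char buf[1 << 16];                      /* copy in 64 KiB chunks */
    ssize_t n;
    while ((n = read(src, buf, sizeof buf)) > 0) {
        if (write(dst, buf, n) != n) { perror("write"); return 1; }
    }

    close(src);
    close(dst);
    /* Readers now open and mmap /dev/shm/index.bin; tmpfs is RAM-backed,
       so no disk I/O is involved on their side. */
    return 0;
}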

I would suggest using a memory mapped file with mmap. This gives you several advantages:

  • The file looks like and is addressed just like an array of data in RAM.
  • Only parts of the file that are currently being read are loaded into RAM.
  • The OS takes care of pulling data from disk into RAM and pushing it back out to disk, so it is pretty easy to use once it is set up.

The other option is to update your processing function to operate in a streaming mode, which may or may not be possible.
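
If streaming is feasible for your computation, the processing loop would take roughly this shape; the chunk size and update_result are placeholders for your real logic:

#include <stdio.h>
#include <stdlib.h>

/* Placeholder: fold one chunk of the file into a running result. */
static void update_result(const char *chunk, size_t len, long *result)
{
    for (size_t i = 0; i < len; i++)
        *result += (unsigned char)chunk[i];   /* stand-in for real processing */
}

int main(int argc, char *argv[])
{
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    char buf[1 << 16];                        /* 64 KiB chunks */
    long result = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        update_result(buf, n, &result);       /* process as we read */

    printf("result: %ld\n", result);
    fclose(f);
    return 0;
}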

Jason B
  • Our server has 128 GB RAM. – Karl Feb 13 '13 at 16:01
  • Is your processing looking at every byte in the file, or just different sections? If it is only looking at sections, mmap is probably still the best choice. If it is looking at the whole file, you may want to put the file in a ramdisk like /dev/shm or something similar. – Jason B Feb 13 '13 at 16:05
  • My processing looks at only a small portion of the file in each run, but across runs with different parameters it will eventually access all bytes. – Karl Feb 13 '13 at 16:08

Thinking outside the box, why don't you just use a database? Databases are optimized for searching large files, and thanks to caching they will keep part of that file in memory for better performance. Multiple processes can access the file simultaneously.
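
As a purely illustrative sketch, suppose the inverted index were loaded once into an SQLite database with a postings(term, data) table (the file name and schema here are made up); any number of processes could then query it concurrently:

#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    /* Hypothetical database file built from the inverted index. */
    if (sqlite3_open("/data/index.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    /* Look up the postings list for one term; the schema is illustrative. */
    sqlite3_stmt *stmt;
    const char *sql = "SELECT data FROM postings WHERE term = ?";
    sqlite3_prepare_v2(db, sql, -1, &stmt, NULL);
    sqlite3_bind_text(stmt, 1, "example", -1, SQLITE_STATIC);

    if (sqlite3_step(stmt) == SQLITE_ROW) {
        const void *blob = sqlite3_column_blob(stmt, 0);
        int len = sqlite3_column_bytes(stmt, 0);
        printf("postings blob: %d bytes at %p\n", len, blob);
    }

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}

Build with -lsqlite3; the real schema would depend on how your inverted index is structured.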

Kluge