
I would like to read a big (3.5 GB) file as fast as possible - thus I think I should load it into RAM first, instead of using ifstream and getline().

My goal is to find lines of data that contain the same string. Example:

textdata abc123 XD0AA
textdata abc123 XD0AB
textdata abc123 XD0AC
textdata abc123 XD0AA

So I would need to read the first line, then iterate through the whole file until I find the fourth line (in this example) with the same XD0AA string.

This is what I did so far:

    string line;
    ifstream f("../BIG_TEXT_FILE.txt");
    stringstream buffer;
    buffer << f.rdbuf();
    string f_data = buffer.str();
    for (int i = 0; i < f_data.length(); i++)
    {
        getline(buffer, line);//is this correct way to get the line (for iteration)?
        line = line.substr(0, line.find("abc"));
        cout << line << endl;
    }
    f.close();
    return 0;

But it uses twice as much RAM as the file itself (7 GB).

Here is the fixed code:

    string line, token;
    int a;
    ifstream osm("../BIG_TEXT_FILE.txt");
    stringstream buffer;
    buffer << f.rdbuf();
    //string f_data = buffer.str();
    f.close();
    while (true)
    {
        getline(buffer, line);
        if (line.length() == 0)
            break;
        //string delimiter = "15380022";
        if (line.find("15380022") != std::string::npos)
            cout << line << endl;
    }
    return 0;

But how do I make getline() read all over again?

Ri Di
    OS specific: best of both worlds - memory map the file. – Richard Critten Oct 12 '22 at 10:47
  • Read the entire file as a character array for example. – user253751 Oct 12 '22 at 10:49
  • According to this answer, if you are just reading a file sequentially, reading it to memory first does not improve performance significantly. Have you measured if your new approach is faster? https://stackoverflow.com/a/58674894/2527795 – VLL Oct 12 '22 at 10:50
  • Why not read the whole file into a `std::vector<char>`, then close the file and do your processing. RAM consumption should go to ~3.5GB (the size of the vector) as soon as the file stream is closed. – wohlstad Oct 12 '22 at 10:50
  • @wohlstad `std::vector<char>` seems really close to `std::string` but I get the point. – Quimby Oct 12 '22 at 10:51
  • the amount of time it takes to load to RAM is not really relevant - once it is done I can use my program, which needs to run continuously – Ri Di Oct 12 '22 at 10:54
  • @wohlstad Shouldn't that be `std::vector<std::string>`? – Paul Sanders Oct 12 '22 at 10:56
  • @RiDi What do you mean, "continuously"? Are you scanning through the file many times, or just performing a single pass? – user229044 Oct 12 '22 at 11:00
  • no sane text editor would load the whole huge file into RAM. [Even Notepad uses memory mapped file](https://superuser.com/a/1148410/241386) – phuclv Oct 12 '22 at 11:07
  • yes, I need to constantly scan through file. I am making a mapping program, which reads OSM file and displays roads on screen. I figured it out - I put f.close() before loop (which is now while) and got rid of string f_data = buffer.str(); which doubled my RAM usage – Ri Di Oct 12 '22 at 11:08
  • You have the string both in `buffer` and in `f_data`, hence 7GB. – lorro Oct 12 '22 at 11:10
  • yes. How do I make getline() go back to beginning? I want to read it again. – Ri Di Oct 12 '22 at 11:11
  • Not only does the shown code take up twice the amount of RAM, it is completely broken, too. The `for` loop iterates over as many steps as there are bytes in the entire file, but each iteration reads an entire line. If the file has a million bytes but a hundred thousand lines, the `for` loop will iterate a million times, reading the entire file during the first hundred thousand iterations and then spending the next nine hundred thousand iterations doing absolutely nothing useful, at all, whatsoever. – Sam Varshavchik Oct 12 '22 at 11:11
  • @PaulSanders it can indeed be a `std::vector<std::string>` if you read the file line by line. But you can also read it into one buffer (e.g. a `std::vector<char>` or even simply a `std::string`) and then parse the lines (during your processing). – wohlstad Oct 12 '22 at 11:12
  • The second code snippet that you posted does not compile, even if you add the function `main` and all `#include` directives, because there is no variable `f` declared. Please provide a [mre]. – Andreas Wenzel Oct 12 '22 at 12:53
  • With this amount of data it might be worthwhile to use SQLite or some similar database package rather than just reading from a text file manually. Or short of that, you might be able to scan through the file once and compute some sort of an index so that future searches no longer need to scan through the file every time. – Jeremy Friesner Oct 12 '22 at 14:31

2 Answers


> I would like to read a big (3.5 GB) file as fast as possible - thus I think I should load it into RAM first

You will most likely not experience any significant performance benefit by loading the entire file into memory.

All modern common operating systems have a disk cache, which automatically keeps recent and frequently used disk reads in RAM.

Even if you do load the entire file into memory, in most common modern operating systems, this merely means that you are loading the file into virtual memory. It does not guarantee that the file is actually in physical memory, because virtual memory that is not used is often swapped to disk by the operating system. Therefore, it is generally best to simply let the operating system handle everything.

If you really want to ensure that the file is actually in physical memory (which I do not recommend), then you will have to use OS-specific functionality, such as the function `mlock` on Linux or `VirtualLock` on Microsoft Windows, which prevents the operating system from swapping the memory to disk. However, depending on the system configuration, locking such a large amount of memory will probably not be possible for a normal user with default privileges, because it could endanger system stability. Therefore, special user privileges may be required.
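
For illustration only, here is a minimal Linux-only sketch of that approach (the file path is the one from the question; the `mlock` call will typically fail for a 3.5 GB file unless `RLIMIT_MEMLOCK` is raised or the process has elevated privileges):

    #include <sys/mman.h>   // mlock, munlock (Linux/POSIX only)

    #include <cstdio>
    #include <fstream>
    #include <iterator>
    #include <vector>

    int main()
    {
        // Read the whole file into one contiguous buffer.
        std::ifstream f("../BIG_TEXT_FILE.txt", std::ios::binary);
        std::vector<char> data((std::istreambuf_iterator<char>(f)),
                               std::istreambuf_iterator<char>());
        f.close();

        // Ask the kernel to keep these pages resident in physical RAM.
        if (mlock(data.data(), data.size()) != 0)
            std::perror("mlock");   // likely fails without sufficient privileges

        // ... search through data here ...

        munlock(data.data(), data.size());
    }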

> But how do I make getline() read all over again?

The problem is that using `operator<<` on an object of type `std::stringstream` consumes the input. In that respect, it is no different from reading from a file using `std::ifstream`. However, when reading from a file, you can simply seek back to the beginning of the file, using the function `std::istream::seekg`. Therefore, the best solution would probably be to read directly from the file using `std::ifstream`.
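
A minimal sketch of that approach (file name and search string taken from the question, error handling omitted): after the first pass, clearing the stream state and seeking back to offset 0 lets `std::getline` start over.

    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        std::ifstream f("../BIG_TEXT_FILE.txt");
        std::string line;

        // First pass: read line by line and filter.
        while (std::getline(f, line))
        {
            if (line.find("15380022") != std::string::npos)
                std::cout << line << '\n';
        }

        // Rewind: clear the eof/fail flags, then seek back to the beginning.
        f.clear();
        f.seekg(0, std::ios::beg);

        // Second pass: getline() now reads from the start again.
        while (std::getline(f, line))
        {
            // ... process the line again ...
        }
    }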

Andreas Wenzel

I have used compression in situations like this. Decompressing has been faster than the disk I/O, and text like this compresses quite well.

An example of reading a gzipped file line by line is here:

How to read a .gz file line-by-line in C++?
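
As a rough illustration, a minimal sketch using zlib's C API (`gzopen`/`gzgets`, link with `-lz`; the `.gz` file name is hypothetical):

    #include <zlib.h>

    #include <cstdio>
    #include <string>

    int main()
    {
        // gzopen also reads plain (uncompressed) files transparently.
        gzFile gz = gzopen("../BIG_TEXT_FILE.txt.gz", "rb");
        if (gz == nullptr)
        {
            std::fprintf(stderr, "cannot open file\n");
            return 1;
        }

        char buf[4096];
        // gzgets reads one '\n'-terminated line at a time (up to the buffer size).
        while (gzgets(gz, buf, sizeof buf) != Z_NULL)
        {
            std::string line(buf);
            if (line.find("15380022") != std::string::npos)
                std::fputs(buf, stdout);
        }

        gzclose(gz);
    }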

Paxmees