
I am wondering how memory is managed when different files are stored in a map of string vectors. I tried to read several files of about 10 MB each into memory, and when I check the memory usage with KSysGuard, the reported memory is more than twice the total size of my files (~70 MB). Here is a code example: there is a function read_file():

std::vector<std::string> read_file(std::string& path){
    std::ifstream fichier(path);
    std::vector<std::string> fich;
    if(fichier){
        std::string ligne;
        while(std::getline(fichier, ligne)){
            fich.push_back(ligne);
        }
    }
    fichier.close();
    return fich;
}

This function is used in another one that builds my map:

std::map<std::string, std::vector<std::string>> buildmap(std::string folder){
    std::map<std::string, std::vector<std::string>> evaluations;
    std::vector<std::string> vecFiles = {"file1", "file2", "file3"};
    for( std::size_t i = 0; i < vecFiles.size(); i++ )
    {
        std::stringstream strad;
        strad << vecFiles[i];
        std::string path(folder + vecFiles[i]);
        std::vector<std::string> a = read_file(path);
        evaluations[strad.str()] = a;
    }
    return evaluations;
}

So, I do not understand why the memory usage is so high compared to the file sizes. Is there a more efficient way to construct this kind of container?

froz
  • Totally unrelated to your question, but why the string stream `strad`? Why not use `vecFiles[i]` directly? You don't even need the `path` variable (if you make `read_file` take a **`const`** reference instead, which you really should). – Some programmer dude Jan 16 '19 at 09:13
  • Possible duplicate of [Why is vector array doubled?](https://stackoverflow.com/questions/1424826/why-is-vector-array-doubled) – Tom Jan 16 '19 at 09:13
  • `std::vector<std::string> a = read_file(path); evaluations[strad.str()]=a;` That's a copy; you could move it (actually, use a move insertion instead; see the sketch after these comments). – Matthieu Brucher Jan 16 '19 at 09:30
  • A stack of papers doesn't take up much space in a drawer, but it's awfully slow to find the one you want. A stack of papers organized for efficient location is going to take up a lot more space. – David Schwartz Jan 16 '19 at 09:40
  • A `std::string` will have some overhead. A lot of lines in your files would mean a lot of `std::string`s and more overhead. You should be able to reduce the overhead some by using `std::vector` instead but I'm not so sure it's something you want to do. – super Jan 16 '19 at 09:46
  • Some programmer dude: right, it's a mistake in my paste here, I will fix it. Tom, I will read it. Matthieu, yes, I can change this, but it does not reduce the memory. David, even storing it in a vector of vectors does not reduce the memory. Super, I will think about it; I do not really know what it implies. – froz Jan 16 '19 at 10:18
  • @Tom I wouldn't say _duplicate_. The question you linked is _relevant_, but I dare to guess that a non-negligible amount of memory overhead is caused by a lot of separate `std::string` objects as well. – Daniel Langr Jan 16 '19 at 10:36
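
A minimal sketch of the move insertion suggested in the comments above (an illustration, not code from the thread; it also assumes `read_file` is changed to take a `const std::string&`, as recommended, so it accepts temporaries):

#include <map>
#include <string>
#include <vector>

// read_file as in the question, but taking a const reference.
std::vector<std::string> read_file(const std::string& path);

std::map<std::string, std::vector<std::string>> buildmap(const std::string& folder){
    std::map<std::string, std::vector<std::string>> evaluations;
    std::vector<std::string> vecFiles = {"file1", "file2", "file3"};
    for(const auto& name : vecFiles){
        // emplace moves the temporary vector returned by read_file into the map,
        // so the lines are not copied a second time.
        evaluations.emplace(name, read_file(folder + name));
    }
    return evaluations;
}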

1 Answer


There is a lot of memory overhead in your scenario:

  1. You store each file line as a separate std::string object. Each such object occupies some space by itself (typically 24 or 32 bytes on a 64-bit architecture); however, the line characters are stored inside the object only when the string is short enough for the small/short string optimization (SSO) to apply, which common Standard library implementations do since C++11. If lines are long, the space for the characters is dynamically allocated, and each allocation carries some additional memory overhead as well.
  2. You push_back these std::string objects into a std::vector, which typically increases the size of its internal buffer exponentially (such as doubling it when it runs out of space). That is why reserving space (std::vector::reserve) is used when you know the number of vector elements in advance. Both effects are illustrated in the sketch after this list.
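
A small sketch illustrating both points (the concrete numbers are implementation-dependent; 32 bytes per std::string and roughly geometric capacity growth are typical, but not guaranteed):

#include <iostream>
#include <string>
#include <vector>

int main(){
    // 1. Fixed per-object footprint of std::string, before any heap allocation.
    std::cout << "sizeof(std::string): " << sizeof(std::string) << '\n';

    // 2. Without reserve(), capacity usually grows geometrically and can end up
    //    noticeably larger than the final size().
    std::vector<std::string> lines;
    for(int i = 0; i < 1000; ++i)
        lines.push_back("some line");
    std::cout << "size: " << lines.size()
              << ", capacity: " << lines.capacity() << '\n';

    // Reserving up front avoids the over-allocation when the line count is known.
    std::vector<std::string> reserved;
    reserved.reserve(1000);
}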

This is the price for such a "comfortable" approach. What might help is to store the whole file contents as a single std::string and then store just indexes/pointers to the beginnings of the individual lines in a separate array/vector (though you then cannot treat these pointers as strings, since they won't be null-terminated; or, in fact, you can, if you substitute the new-line characters with null characters).
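
For illustration, a minimal sketch of that idea (the struct and function names here are made up; error handling is omitted):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

struct FlatFile {
    std::string data;                 // whole file contents in one allocation
    std::vector<std::size_t> starts;  // index of the first character of each line
};

FlatFile read_file_flat(const std::string& path){
    FlatFile f;
    std::ifstream in(path);
    std::ostringstream ss;
    ss << in.rdbuf();                 // slurp the whole file into one string
    f.data = ss.str();

    // Record where each line begins; line i then runs from starts[i]
    // up to (but not including) the next '\n'.
    if(!f.data.empty())
        f.starts.push_back(0);
    for(std::size_t i = 0; i + 1 < f.data.size(); ++i)
        if(f.data[i] == '\n')
            f.starts.push_back(i + 1);
    return f;
}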

In C++17, you can store the lines as instances of std::string_view that point into the whole file contents kept in a single std::string.

Just note that std::string_view will likely be larger than a pointer/index. For instance, with libstdc++ on x86_64, sizeof(std::string_view) is 16 bytes, whereas a pointer/index occupies 8 bytes. And for files smaller than 4 GB, you can even use 32-bit indexes. If you have a lot of lines in the processed files, these differences can matter.
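
A possible sketch of that C++17 variant, assuming the whole file already lives in a single std::string (again just an illustration; the views stay valid only as long as that string does):

#include <string>
#include <string_view>
#include <vector>

// Builds non-owning views into `data`, one per line, without copying characters.
std::vector<std::string_view> split_lines(const std::string& data){
    std::vector<std::string_view> lines;
    std::size_t begin = 0;
    while(begin < data.size()){
        std::size_t end = data.find('\n', begin);
        if(end == std::string::npos)
            end = data.size();
        lines.emplace_back(data.data() + begin, end - begin);
        begin = end + 1;
    }
    return lines;
}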

UPDATE

This question is highly relevant: C++ Fast way to load large txt file in vector.

Daniel Langr
  • I will try to implement these suggestions and accept the answer if it works. If you have an example, mainly for point 2) (I am not sure point 1) will release a lot of memory, as I do not have a lot of files), it would be great. – froz Jan 16 '19 at 11:58
  • @froz Point 1) might save a lot of memory as well, maybe much more than point 2). I would read the whole file into a single `std::string`, substitute new-line characters with null characters while counting the number of lines, then reserve space for the vector of indexes using that count, and finally pass over the data again and store the indexes. These 2 structures (the string with the file contents and the vector of indexes) would be the result of the `read_file` function (a sketch of this two-pass idea follows below the comments). – Daniel Langr Jan 16 '19 at 13:01
  • @froz BTW it's not the number of files that matters, it's the number of lines in each file. – Daniel Langr Jan 16 '19 at 13:02
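
A minimal sketch of the two-pass approach described in the comment above (the function name is made up; error handling omitted). Because the newlines are turned into '\0', `data.c_str() + starts[i]` can be used directly as a C string for line i:

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

void read_file_nullsep(const std::string& path,
                       std::string& data, std::vector<std::size_t>& starts){
    std::ifstream in(path);
    std::ostringstream ss;
    ss << in.rdbuf();                 // slurp the whole file into one string
    data = ss.str();

    // First pass: replace '\n' by '\0' while counting lines
    // (the count may be one too high if the file ends with a newline,
    //  which only makes the reserve slightly generous).
    std::size_t lineCount = data.empty() ? 0 : 1;
    for(char& c : data)
        if(c == '\n'){ c = '\0'; ++lineCount; }

    // Second pass: reserve once, then record where each line starts.
    starts.clear();
    starts.reserve(lineCount);
    std::size_t begin = 0;
    while(begin < data.size()){
        starts.push_back(begin);
        std::size_t end = data.find('\0', begin);
        if(end == std::string::npos) break;
        begin = end + 1;
    }
}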