
I have a file with ~12,000,000 hex lines, about 1.6 GB. Example of the file:

999CBA166262923D53D3EFA72F5C4E8EE1E1FF1E7E33C42D0CE8B73604034580F2
889CBA166262923D53D3EFA72F5C4E8EE1E1FF1E7E33C42D0CE8B73604034580F2

Example of code:

vector<string> buffer;

ifstream fe1("strings.txt");
string line1;
while (getline(fe1, line1)) {
    buffer.push_back(line1);
}

The loading currently takes about 20 minutes. Any suggestions on how to speed it up? Thanks a lot in advance.

Sabonett
  • Reserve space for the vector upfront, if you know the (approximate) amount of data that will be read (see the sketch after these comments). And you may consider reading raw data (i.e. `std::array`) instead of strings, especially if lines are of constant length. – Yksisarvinen Oct 08 '18 at 09:39
  • Do you need to have it all at once in memory anyway? Maybe you can only load the parts that you need each time? – jdehesa Oct 08 '18 at 09:40
  • `push_back` -> `emplace_back` and reserving memory up front should help. There's also a few suggestions on [Fast textfile reading in c++](https://stackoverflow.com/questions/17925051/fast-textfile-reading-in-c) that you could benchmark. Also make sure that you have optimisations on. – George Oct 08 '18 at 09:40
  • Can you change the `vector`? Are all the lines hex strings like that? There are faster ways to do the actual reading, but that is still a lot of strings and if you can either store them in another format, or even better process sequentially at the same time would be much better. Can you use C++17 `string_view` if you really need strings (or maybe raw `char*`, but not as nice)? – Fire Lancer Oct 08 '18 at 09:40
  • To get a better answer you should probably explain what you're going to do with these lines ... – OrenIshShalom Oct 08 '18 at 09:51
  • @jdehesa Yes, I need it all at once; I am using binary search. – Sabonett Oct 08 '18 at 09:51
  • @Yksisarvinen I know the number of strings, but line length varies between 40 and 130 chars. – Sabonett Oct 08 '18 at 09:52
  • To keep GUI responsive, use a separate thread for the file transfer and use `fread` to transfer chunks of a data at a time with progress indicator. – seccpur Oct 08 '18 at 10:00
  • I have a hard time believing it takes 20 min with the given code; did you enable O3 optimization? – Hugo Maxwell Oct 08 '18 at 10:01
  • @HugoMaxwell thank you! Now 20x faster. – Sabonett Oct 08 '18 at 10:31
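
A minimal sketch of the reserve-and-move suggestions from the comments above, assuming the line count from the question (~12,000,000) is known roughly in advance:

#include <fstream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> buffer;
    buffer.reserve(12000000);               // avoid repeated vector reallocations

    std::ifstream fe1("strings.txt");
    std::string line1;
    while (std::getline(fe1, line1))
        buffer.push_back(std::move(line1)); // move each line instead of copying it
}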

3 Answers


Loading a large text file into std::vector<std::string> is rather inefficient and wasteful, because it allocates heap memory for each std::string and re-allocates the vector multiple times. Each of these heap allocations requires book-keeping information under the hood (normally 8 bytes per allocation on a 64-bit system), and each line requires an std::string object (8-32 bytes depending on the standard library), so a file loaded this way takes considerably more space in RAM than on disk. With ~12 million lines, that per-line overhead alone is on the order of hundreds of megabytes.

One fast way is to map the file into memory and implement iterators to walk over lines in it. This sidesteps the issues mentioned above.

Working example:

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/iterator/iterator_facade.hpp>
#include <boost/range/iterator_range_core.hpp>

#include <algorithm>
#include <iostream>

// Forward iterator that yields each line as a (begin, end) character range,
// including the trailing '\n', without copying any data.
class LineIterator
    : public boost::iterator_facade<
          LineIterator,
          boost::iterator_range<char const*>,
          boost::iterators::forward_traversal_tag,
          boost::iterator_range<char const*>
          >
{
    char const *p_, *q_;
    boost::iterator_range<char const*> dereference() const { return {p_, this->next()}; }
    bool equal(LineIterator b) const { return p_ == b.p_; }
    void increment() { p_ = this->next(); }
    // One past the next '\n', or end of buffer if there is none.
    char const* next() const { auto p = std::find(p_, q_, '\n'); return p + (p != q_); }
    friend class boost::iterator_core_access;

public:
    LineIterator(char const* begin, char const* end) : p_(begin), q_(end) {}
};

inline boost::iterator_range<LineIterator> crange(boost::interprocess::mapped_region const& r) {
    auto p = static_cast<char const*>(r.get_address());
    auto q = p + r.get_size();
    return {LineIterator{p, q}, LineIterator{q, q}};
}

inline std::ostream& operator<<(std::ostream& s, boost::iterator_range<char const*> const& line) {
    return s.write(line.begin(), line.size());
}

int main() {
    boost::interprocess::file_mapping file("/usr/include/gnu-versions.h", boost::interprocess::read_only);
    boost::interprocess::mapped_region memory(file, boost::interprocess::read_only);

    unsigned n = 0;
    for(auto line : crange(memory))
        std::cout << n++ << ' ' << line;
}
Maxim Egorushkin

You can read the entire file into memory. This can be done with C++ streams, or you may be able to get even more performance by using platform-specific APIs, such as memory-mapped files or their own file-reading APIs.
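
For example, one platform-specific route is POSIX mmap; a rough sketch, with error handling omitted and "data.txt" as a placeholder name:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map the whole file read-only; the kernel pages it in on demand.
int fd = open("data.txt", O_RDONLY);
struct stat st;
fstat(fd, &st);
char const* data = static_cast<char const*>(
    mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
// ... scan [data, data + st.st_size) for newlines ...
munmap(const_cast<char*>(data), st.st_size);
close(fd);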

Once you have this block of data, for performance you want to avoid any further copies and use it in place. In C++17 you have std::string_view, which is similar to std::string but uses existing string data, avoiding the copy. Otherwise you might just work with C-style char* strings, either by replacing the newline with a null (\0), by using a pair of pointers (begin/end), or by using a pointer and a size (the null-terminating variant is sketched after the code below).

Here I used string_view. I also assumed newlines are always \n and that there is a newline at the end of the file; you may need to adjust the loop if this is not the case. Reserving a guessed size for the vector will also gain a little performance; you could estimate it from the file length. I also skipped some error handling.

#include <fstream>
#include <memory>
#include <string_view>
#include <vector>

std::fstream is("data.txt", std::ios::in | std::ios::binary);
is.seekg(0, std::ios::end);
size_t data_size = is.tellg();
is.seekg(0, std::ios::beg);
std::unique_ptr<char[]> data(new char[data_size]);
is.read(data.get(), data_size);


std::vector<std::string_view> strings;
strings.reserve(data_size / 40); // If you have some idea, avoid re-allocations as general practice with vector etc.
for (size_t i = 0, start = 0; i < data_size; ++i)
{
    if (data[i] == '\n') // End of line, got string
    {
        strings.emplace_back(data.get() + start, i - start);
        start = i + 1;
    }
}
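
For reference, the char*-with-null variant mentioned above could look roughly like this, reusing data and data_size from the snippet above (note that it mutates the buffer in place):

#include <cstring> // std::memchr

std::vector<char*> lines;
for (char *p = data.get(), *end = p + data_size; p < end;)
{
    lines.push_back(p);
    char* nl = static_cast<char*>(std::memchr(p, '\n', end - p));
    if (!nl) break; // a final line without a trailing '\n' stays unterminated here
    *nl = '\0';     // overwrite the newline so each entry is a C string
    p = nl + 1;
}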

To get a little more performance, you might run the loop to do the CPU work in parallel with the file I/O. This can be done with threads or with platform-specific asynchronous file I/O. However, in this case the loop will be very fast, so there is not much to gain.
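
A minimal sketch of the threading option, assuming it is enough to move the whole read onto a worker thread (true chunk-by-chunk overlap of parsing and I/O takes more bookkeeping):

#include <fstream>
#include <future>
#include <vector>

auto loaded = std::async(std::launch::async, [] {
    std::ifstream is("data.txt", std::ios::binary | std::ios::ate);
    std::vector<char> contents(static_cast<size_t>(is.tellg()));
    is.seekg(0);
    is.read(contents.data(), contents.size());
    return contents;
});
// ... keep the GUI responsive / do other setup here ...
std::vector<char> contents = loaded.get(); // blocks until the read finishes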

Fire Lancer
  • Just a thought: will the severe time-performance penalty of smart pointers forbid using them in such fast file transfers? – seccpur Oct 08 '18 at 10:09
  • What penalty? The reason to use a `unique_ptr` rather than `string` or `vector` is to avoid the initialisation of the entire buffer. All it does is call `delete[]` for me. – Fire Lancer Oct 08 '18 at 10:13
  • @seccpur `std::unique_ptr::operator[]` and `std::unique_ptr::get()` should be inlined; the only overhead when using a reasonable compiler will be the construction of the `unique_ptr`, which is entirely negligible here, especially when the OP claims their code took 20 mins. – George Oct 08 '18 at 10:21
  • Tried these `std::unique_ptr`s in a recursive minimax game tree with 100 thousand nodes; the time overhead was very significant compared to raw pointers. – seccpur Oct 08 '18 at 10:35
  • @seccpur Did you look at why the unique_ptr's were more expensive? – George Oct 08 '18 at 10:38
  • @George: Yes, and it has been discussed several times on SO. So I prefer to exclude these smart pointers from routines where speed is paramount. – seccpur Oct 08 '18 at 10:41
  • @seccpur Your claims are incredible. Share a link with code and benchmarks. – Maxim Egorushkin Oct 08 '18 at 11:11
  • @FireLancer You only add a `std::string_view` to `strings` if the line ends with `'\n'`. If the file does not end with a final `'\n'` you do not include the last line. Perhaps use `(buffer[i] == '\n' || i == file_length - 1)`, or add a check at the end of the loop. Be sure to account for the fact that you do not want to exclude the last character in the special case of the last line without trailing `'\n'`. – Robert Jacobson Oct 04 '19 at 15:14

You can simply allocate enough RAM and read the whole text file almost at once. Then you can access the data in RAM via a memory pointer. I read a whole 4 GB text file in about 3 seconds.
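
A minimal sketch of that approach using the C stdio API (one big allocation, then one large read; error handling omitted and the filename is a placeholder):

#include <cstdio>
#include <vector>

std::FILE* f = std::fopen("data.txt", "rb");
std::fseek(f, 0, SEEK_END);
long size = std::ftell(f);
std::fseek(f, 0, SEEK_SET);
std::vector<char> data(static_cast<size_t>(size)); // allocate once
std::fread(data.data(), 1, data.size(), f);        // read the whole file at once
std::fclose(f);
// access the contents through pointers into data.data()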

Ziggi