12

I have a very large text file (45 GB). Each line of the file contains two space-separated 64-bit unsigned integers, as shown below.

4624996948753406865 10214715013130414417

4305027007407867230 4569406367070518418

10817905656952544704 3697712211731468838 ... ...

I want to read the file and perform some operations on the numbers.

My Code in C++:

#include <fstream>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

using namespace std;

void process_data(string str)
{
    vector<string> arr;
    boost::split(arr, str, boost::is_any_of(" \n"));
    do_some_operation(arr);
}

int main()
{
    unsigned long long int read_bytes = 45 * 1024 *1024;
    const char* fname = "input.txt";
    ifstream fin(fname, ios::in);
    char* memblock;

    while(!fin.eof())
    {
        memblock = new char[read_bytes];
        fin.read(memblock, read_bytes);
        string str(memblock);
        process_data(str);
        delete [] memblock;
    }
    return 0;
}

I am relatively new to C++. When I run this code, I face the following problems:

  1. Because the file is read in fixed-size blocks, the last line of a block is sometimes an unfinished line from the original file ("4624996948753406865 10214" instead of the actual line "4624996948753406865 10214715013130414417").

  2. This code runs very slowly. It takes around 6 seconds to process one block on a 64-bit Intel Core i7 920 system with 6 GB of RAM. Are there any optimization techniques I can use to improve the runtime?

  3. Is it necessary to include "\n" along with the blank character in the boost split call?

I have read about memory-mapped files in C++, but I am not sure whether that is the correct approach here. If it is, please include some links.

Pattu
  • Using `eof` is [not a good thing](http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong). – molbdnilo Nov 04 '14 at 14:08
  • Edit your question to show your compilation command (and version of compiler & `libstdc++`) – Basile Starynkevitch Nov 04 '14 at 14:36
  • Isn't it simpler to save the data in binary form instead of decimally formatted text? I've done something similar in LabVIEW and got roughly a factor 10 improvement for reading binary files instead of parsing text files (of course it depends on the parser implementation, but still..). As an extra benefit, you wouldn't need ~20 bytes to store e.g. the number `4624996948753406865` plus a space, but just 8 bytes. – Gunther Struyf Apr 28 '17 at 09:40

4 Answers

19

I'd redesign this to work in a streaming fashion, instead of reading and processing a block at a time.

A simpler approach would be:

std::ifstream ifs("input.txt");
std::vector<uint64_t> parsed(std::istream_iterator<uint64_t>(ifs), {});

If you know roughly how many values are expected, using std::vector::reserve up front could speed it up further.
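For reference, a complete, compilable version of that approach might look like the sketch below; the reserve figure is only a rough guess for the 45 GB input, not a measured value:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream ifs("input.txt");

    std::vector<uint64_t> parsed;
    // Rough guess: ~45 GB of text at ~20 bytes per value. Adjust to your data.
    parsed.reserve(2250000000ull);

    // istream_iterator skips whitespace (spaces and newlines alike),
    // so it reads every number regardless of line breaks.
    parsed.assign(std::istream_iterator<uint64_t>(ifs),
                  std::istream_iterator<uint64_t>());

    std::cout << "values read: " << parsed.size() << '\n';
}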


Alternatively you can use a memory mapped file and iterate over the character sequence.

**Update** I modified the above program to parse `uint32_t`s into a vector.

With a sample input file of 4.5 GiB[1], the program runs in 9 seconds[2]:

sehe@desktop:/tmp$ make -B && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test smaller.txt
g++ -std=c++0x -Wall -pedantic -g -O2 -march=native test.cpp -o test -lboost_system -lboost_iostreams -ltcmalloc
parse success
trailing unparsed: '
'
data.size():   402653184
0:08.96 elapsed, 6 context switches

Of course it allocates at least 402653184 * 4 bytes ≈ 1.5 GiB. So when you read a 45 GB file, you will need an estimated 15 GiB of RAM just to store the vector (assuming no fragmentation on reallocation). The 45 GiB parse completes in 10 min 45 s:

make && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test 45gib_uint32s.txt 
make: Nothing to be done for `all'.
tcmalloc: large alloc 17570324480 bytes == 0x2cb6000 @  0x7ffe6b81dd9c 0x7ffe6b83dae9 0x401320 0x7ffe6af4cec5 0x40176f (nil)
Parse success
Trailing unparsed: 1 characters
Data.size():   4026531840
Time taken by parsing: 644.64s
10:45.96 elapsed, 42 context switches

By comparison, just running `wc -l 45gib_uint32s.txt` took ~12 minutes (without realtime priority scheduling, though). wc is blazingly fast.

Full Code Used For Benchmark

#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

namespace qi = boost::spirit::qi;

typedef std::vector<uint32_t> data_t;

using hrclock = std::chrono::high_resolution_clock;

int main(int argc, char** argv) {
    if (argc<2) return 255;
    data_t data;
    data.reserve(4392580288);   // for the  45 GiB file benchmark
    // data.reserve(402653284); // for the 4.5 GiB file benchmark

    boost::iostreams::mapped_file mmap(argv[1], boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    using namespace qi;

    auto start_parse = hrclock::now();
    bool ok = phrase_parse(f,l,int_parser<uint32_t, 10>() % eol, blank, data);
    auto stop_time = hrclock::now();

    if (ok)   
        std::cout << "Parse success\n";
    else 
        std::cerr << "Parse failed at #" << std::distance(mmap.const_data(), f) << " around '" << std::string(f,f+50) << "'\n";

    if (f!=l) 
        std::cerr << "Trailing unparsed: " << std::distance(f,l) << " characters\n";

    std::cout << "Data.size():   " << data.size() << "\n";
    std::cout << "Time taken by parsing: " << std::chrono::duration_cast<std::chrono::milliseconds>(stop_time-start_parse).count() / 1000.0 << "s\n";
}

[1] generated with `od -t u4 /dev/urandom -A none -v -w4 | pv | dd bs=1M count=$((9*1024/2)) iflag=fullblock > smaller.txt`

[2] obviously, this was with the file cached in the buffer cache on Linux; the large file doesn't have this benefit

sehe
  • I tried the istream approach. My system has 5GB RAM and 64-bit architecture. On a 4.5GB file, the istream code took only 1 min and 10 secs. But when I tried on a 45GB file it terminated with the following error - "terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc. Aborted". I guess I will have to try the mmap method and see. – Pattu Nov 04 '14 at 21:28
  • @Pattu By all means, but `bad_alloc` indicates that either you simply don't have the memory for the vector in the first place, _or_ the allocations fail due to fragmentation. Try calling `reserve()` with the final capacity (or slightly more) to avoid having to reallocate. – sehe Nov 04 '14 at 23:43
  • @Pattu Make sure you have enough memory. I parsed the 4.5 GiB in 9 seconds and the 45 GiB in under 11 minutes (see **Updated** answer with code). (Note the 9s is due to cached disk reads, of course. Consider these real life timings, not so much parsing benchmarks) – sehe Nov 05 '14 at 11:04
  • One great answer. Is _`wc -l 45gib_uint32s.txt` took ~12 **minutes**_ right? – Maxim Egorushkin Nov 09 '16 at 21:58
  • @MaximEgorushkin Let me reproduce that. Generating the large data is gonna take ~30 minutes. – sehe Nov 09 '16 at 22:17
  • Sadly my disk maxed out at 30GiB. So here's the timing for 30GiB: 1m12s. However, that's on a system with 32GiB of RAM. I haven't checked for the cache effect. – sehe Nov 09 '16 at 23:06
1

I can only guess that the bottleneck is in:

string str(memblock);

because you allocate a 45 MB string in memory (and copy the whole block into it).

You should read the file line by line, as described here:
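For illustration only (this sketch is not from the linked answer), line-by-line reading with std::getline could look like this:

#include <fstream>
#include <sstream>
#include <string>

int main() {
    std::ifstream fin("input.txt");       // file name taken from the question
    std::string line;
    while (std::getline(fin, line)) {     // always reads a complete line
        std::istringstream iss(line);
        unsigned long long a, b;
        if (iss >> a >> b) {
            // do_some_operation(a, b);   // hypothetical per-line processing
        }
    }
}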

In order to profile your program, you can print `clock()` between each line, as described in:

Oren Kishon
  • Re-allocating the 45MB buffer each time through the loop isn't helping, either. Allocate it once, delete it once, outside the loop. – Paul Roub Nov 04 '14 at 13:58
  • It's 45**GB**, not MB. It would take hours if you used the linked method. And printing the time between each line...? Please don't... – starriet Aug 20 '22 at 01:17
0

You can memory-map the file, but that is platform dependent (on Unix it would be mmap, on Windows CreateFileMapping/MapViewOfFile); still, on a 32-bit system you may have problems if there is not a big enough contiguous block of virtual address space left (64-bit systems do not have that problem).

Memory mapping is supposed to be faster than reading the data block by block from disk.
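A minimal POSIX sketch of that idea (error handling trimmed, file name assumed from the question; on Windows the CreateFileMapping/MapViewOfFile calls would take its place):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    int fd = open("input.txt", O_RDONLY);
    if (fd < 0) return 1;

    struct stat sb;
    if (fstat(fd, &sb) != 0) return 1;

    // Map the whole file read-only; the kernel pages it in on demand.
    void* p = mmap(nullptr, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) return 1;
    const char* data = static_cast<const char*>(p);

    // ... iterate over data[0 .. sb.st_size) and parse the numbers ...

    munmap(p, sb.st_size);
    close(fd);
}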

MichaelMoser
0

On Linux, using C <stdio.h> instead of C++ streams might help performance (C++ streams are often built on top of FILE streams). You could use getline(3), fgets(3), or fscanf(3). You might set a larger buffer (e.g. 64 KB or 256 KB) using setbuffer(3), etc. But I guess your (improved) program would be I/O bound, not CPU bound. Then you could play with posix_fadvise(2).
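A rough sketch of that stdio approach (the buffer size is an arbitrary choice, and the portable setvbuf(3) stands in for setbuffer(3)):

#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main() {
    std::FILE* f = std::fopen("input.txt", "r");
    if (!f) return 1;

    // Give stdio a larger buffer than the default (256 KB here).
    static char buf[256 * 1024];
    std::setvbuf(f, buf, _IOFBF, sizeof buf);

    uint64_t a, b;
    // fscanf skips whitespace (including newlines) between the two numbers.
    while (std::fscanf(f, "%" SCNu64 " %" SCNu64, &a, &b) == 2) {
        // ... do_some_operation(a, b) ...   (hypothetical processing)
    }
    std::fclose(f);
}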

You might also consider memory mapping with mmap(2) & madvise(2) (see also the m mode for fopen(3)). See also readahead(2).

Finally, if your algorithm permits it, you could csplit the file into smaller pieces and process them in parallel processes.

Basile Starynkevitch