
I have a couple of ~3MB text files that I need to parse in C++.

The text file looks like this (1024x786):

12,23   45,78   90,12   34,56   78,90   ...
12,23   45,78   90,12   34,56   78,90   ...
12,23   45,78   90,12   34,56   78,90   ...
12,23   45,78   90,12   34,56   78,90   ...
12,23   45,78   90,12   34,56   78,90   ...

means "number blocks" separated by Tab, and the numbers itself containing a , (insted of a .) decimal marker.

First of all I need to read the file. Currently I'm using this:

#include <boost/tokenizer.hpp>
#include <fstream>
#include <string>

using namespace std;
using namespace boost;

string line;
ifstream myfile(file);
if (myfile.is_open())
{
    char_separator<char> sep("\t");
    while (getline(myfile, line))                      // read the file line by line
    {
        tokenizer<char_separator<char>> tokens(line, sep);
        // ... iterate over the tokens of this line here
    }
}
myfile.close();

which is working nicely in terms of getting me the "number blocks", but I still need to convert each token to a float while treating the , as the decimal marker. Due to the file size, I don't think it's a good idea to tokenize this a second time. Furthermore, I need to put all these values into a data structure that I can access afterwards by location (e.g. [x][y]). Any ideas how to accomplish this?

flor1an
  • If this is really about a 3MB file then I'd just load it into memory. That's not really that large compared to memory available on most machines. – edA-qa mort-ora-y Jul 16 '17 at 22:25
  • Memory mapped file (boost.iostreams) and a simple boost.spirit parser. Thread pool to load all the files. Probably overkill, but if you wanna be fast... – Dan Mašek Jul 16 '17 at 23:37

2 Answers


You can use Boost.Spirit to parse the content of the file, and as the final result the parser can give you the data structured however you like, for example as a std::vector<std::vector<float>>. IMO, your typical file size is not big; I believe it's better to read the whole file into memory and then run the parser. An efficient way to read the file is shown below in read_file.

qi::float_ parses a real number whose range and precision are limited by the float type, and it uses a . (dot) as the decimal separator. You can customize the separator through qi::real_policies<T>::parse_dot. Below I use a code snippet from spirit/example/qi/german_floating_point.cpp.

Take a look at this demo:

#include <boost/spirit/include/qi.hpp>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// Read the whole file into a string in one go.
std::string read_file(std::string path)
{
    std::string str;
    std::ifstream file(path, std::ios::binary | std::ios::ate); // open at the end to get the size
    if (!file) return str;
    auto size(file.tellg());
    str.resize(size);
    file.seekg(0, std::ios::beg);
    file.rdbuf()->sgetn(&str[0], size);                         // read the full contents into the buffer
    return str;
}

using namespace boost::spirit;

//From Boost.Spirit example `qi/german_floating_point.cpp`
//Begin
template <typename T>
struct german_real_policies : qi::real_policies<T>
{
    template <typename Iterator>
    static bool parse_dot(Iterator& first, Iterator const& last)
    {
        if (first == last || *first != ',')
            return false;
        ++first;
        return true;
    }
};

qi::real_parser<float, german_real_policies<float> > const german_float;
//End

int main()
{
    std::string in(read_file("input"));
    std::vector<std::vector<float>> out;

    auto first = in.begin();   // keep an lvalue iterator so the parser can advance it
    auto last  = in.end();
    auto ret = qi::phrase_parse(first, last,
                                +(+(german_float - qi::eol) >> qi::eol),
                                boost::spirit::ascii::blank_type{},
                                out);
    if (ret && first == last)  // success only if the whole input was consumed
        std::cout << "Success" << std::endl;
}
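
For the sample layout in the question, out then holds one inner vector per line, so a value can be read back by row and column as out[y][x] (for instance, out[0][1] would hold 45.78f).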
Cosme
  • Yes. See the benchmarks here (both Qi and X3): https://stackoverflow.com/questions/17465061/how-to-parse-space-separated-floats-in-c-quickly/17479702#17479702. Also see these benchmarks for comparison: https://stackoverflow.com/questions/26736742/efficiently-reading-a-very-large-text-file-in-c/26737146#26737146 – sehe Jul 17 '17 at 14:27

What I would do, straightforwardly (no need for boost::tokenizer at all):

#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Make ',' the decimal point for newly created streams. Note that the "de_DE"
// locale has to be available on the system, and that std::setlocale() alone
// would not affect C++ stream extraction.
std::locale::global(std::locale("de_DE"));
std::vector<std::vector<double>> dblmat;
std::string line;
while (std::getline(myfile, line)) {
    dblmat.push_back(std::vector<double>());   // start a new row for this line
    std::istringstream iss(line);              // picks up the global locale set above
    double val;
    while (iss >> val) {
        dblmat.back().push_back(val);
    }
}
user0042
  • Recreating an `istringstream` at each row isn't going to be cheap. Also, setting the global locale just to read a file is a bad idea (especially in a multithreaded program, where locale cannot be set safely). I'd just `imbue` the correct locale in the file stream, and decide whether to begin a new row or add to the last one based on what delimiter you find after each extraction. – Matteo Italia Jul 17 '17 at 03:38
  • @Matteo I'm pretty sure there's room for optimizations. I just wanted to point out that there are pretty straightforward ways to do what the OP wants, without `boost::tokenizer` or other sophisticated parsing tools. – user0042 Jul 17 '17 at 15:10
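
A minimal sketch of the imbue approach suggested in the comment above, assuming the file is named "input"; the facet name comma_decimal is just illustrative, and the per-line istringstream from the answer is kept for brevity:

#include <fstream>
#include <iostream>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Illustrative facet: make ',' the decimal point for one stream only,
// without touching the global locale.
struct comma_decimal : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

int main() {
    std::ifstream myfile("input");
    myfile.imbue(std::locale(myfile.getloc(), new comma_decimal));  // the locale owns the facet

    std::vector<std::vector<double>> dblmat;
    std::string line;
    while (std::getline(myfile, line)) {
        std::istringstream iss(line);
        iss.imbue(myfile.getloc());          // reuse the same facet for the per-line stream
        dblmat.emplace_back();               // start a new row for this line
        double val;
        while (iss >> val)
            dblmat.back().push_back(val);
    }

    std::cout << "rows: " << dblmat.size() << "\n";
}

This way the comma handling doesn't depend on a named locale such as de_DE being installed on the system.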