C++: Fast way to read mapped file into a matrix

Question

I'm trying to read a mapped file into a matrix. The file is something like this:

name;phone;city\n
Luigi Rossi;02341567;Milan\n
Mario Bianchi;06567890;Rome\n
....

and it's quiet big. The code I've written works properly but it's not so fast:

#include <iostream>
#include <fstream>
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>

using namespace std;

int main() {

    int i;
    int j=0;
    int k=0;

    vector< vector<char> > M(10000000, vector<string>(3));

    mapped_file_source file("file.csv");

    // Check if file was successfully opened
    if(file.is_open()) {

      // Get pointer to the data
      const char * c = (const char *)file.data();

      int size=file.size();

      for(i = 0; i < (size+1); i++){

       if(c[i]=='\n' || i==size){
        j=j+1;
        k=0;
       }else if(c[i]==';'){
        k=k+1;
       }else{
        M[j][k]+=c[i];
       }    
     }//end for


   }//end if    

 return(0)


}

Is there a faster way? I've read something about memcyp but I don't know how to use it to speed up my code.

Unless you have close to ten million entries in the file, you are wasting quite a lot of memory. You might want to consider `std::deque` instead perhaps? Or just letting `std::vector` handle it. And if you're just loading this collections once, will the "effectiveness" of this part of the code really impact the total runtime of the program by such a large percentage? — Some programmer dude, Feb 05 '15 at 09:36
Oh, and when writing an [MCVE](http://stackoverflow.com/help/mcve), you should perhaps make sure that it actually builds. The declaration of `M` is inconsistent with the initialization. — Some programmer dude, Feb 05 '15 at 09:38
I'm not sure if boost's `mapped_file_source` supports it, but on POSIX `mmap`'s `MAP_PRIVATE` means you can write to the memory representing the file without affecting the file (or other processes memory mapping it too). With that, one very fast option is to overwrite `;` and `\n` characters with NULs, then keep `const char*`s and/or integral offsets to the start of the textual values instead of separate `std::string` objects. If you're ***really*** serious about speed and memory efficiency, I'd suggest trying that (even if it means bypassing boost). — Tony Delroy, Feb 05 '15 at 09:52
@JoachimPileborg: soory, this is the right code: vector< vector > M(1000000, vector(3)); — Davide, Feb 05 '15 at 10:42

score 7 · Answer 1 · edited May 23 '17 at 10:29

I have numerous examples doing this/similar written up on SO.

Let me list the most relevant:

I've done quite a few of these benchmarks. Yes, for sequential freading, read/scanf have a tiny edge (see e.g. scanf/iostreams and files vs. mappings, and parsing floats, or read being slightly faster for 1-pass sequential read).
An interesting approach is to do parsing lazily (why copy the whole input into memory? What's the point memory mapping then). The answer here shows this approach (emulating a multimap there):
- Using boost::iostreams::mapped_file_source with std::multimap (approach #2)

In all other cases, consider slamming a Spirit Qi job on it, potentially using boost::string_ref instead of vector<char> (unless the mapped file is not "const", of course).

The string_ref is also shown int the last answer linked before. Another interesting example of this (with lazy conversions to un-escaped string values) is here How to parse mustache with Boost.Xpressive correctly?

DEMO

Here's that Qi job slammed on it:

it parses a 994 MiB file of ~32 million lines in 2.9s into a vector of
```
struct Line {
    boost::string_ref name, city;
    long id;
};
```
note that we parse the number, and store the strings by referring to their location in the memory map + length (string_ref)
it pretty-prints the data from 10 random lines
it can run as fast as 2.5s if you reserve 32m elements in the vector at once; the program does only a single memory allocation in that case.
NOTE: on a 64 bit system, the memory representation grows larger than the input size if the average line length is less than 40 bytes. This is because a string_ref is 16 bytes.

Live On Coliru

#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_ref.hpp>

namespace qi = boost::spirit::qi;
using sref   = boost::string_ref;

namespace boost { namespace spirit { namespace traits {
    template <typename It>
    struct assign_to_attribute_from_iterators<sref, It, void> {
        static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f,l)) }; }
    };
} } }

struct Line {
    sref name, city;
    long id;
};

BOOST_FUSION_ADAPT_STRUCT(Line, (sref,name)(long,id)(sref,city))

int main() {
    boost::iostreams::mapped_file_source mmap("input.txt");

    using namespace qi;

    std::vector<Line> parsed;
    parsed.reserve(32000000);
    if (phrase_parse(mmap.begin(), mmap.end(), 
                omit[+graph] >> eol >>
                (raw[*~char_(";\r\n")] >> ';' >> long_ >> ';' >> raw[*~char_(";\r\n")]) % eol,
                qi::blank, parsed))
    {
        std::cout << "Parsed " << parsed.size() << " lines\n";
    } else {
        std::cout << "Failed after " << parsed.size() << " lines\n";
    }

    std::cout << "Printing 10 random items:\n";
    for(int i=0; i<10; ++i) {
        auto& line = parsed[rand() % parsed.size()];
        std::cout << "city: '" << line.city << "', id: " << line.id << ", name: '" << line.name << "'\n";
    }
}

With input generated like

do grep -v "'" /etc/dictionaries-common/words | sort -R | xargs -d\\n -n 3 | while read a b c; do echo "$a $b;$RANDOM;$c"; done

The output is e.g.

Parsed 31609499 lines
Printing 10 random items:
city: 'opted', id: 14614, name: 'baronets theosophy'
city: 'denominated', id: 24260, name: 'insignia ophthalmic'
city: 'mademoiselles', id: 10791, name: 'smelter orienting'
city: 'ducked', id: 32155, name: 'encircled flippantly'
city: 'garotte', id: 3080, name: 'keeling South'
city: 'emirs', id: 14511, name: 'Aztecs vindicators'
city: 'characteristically', id: 5473, name: 'constancy Troy'
city: 'savvy', id: 3921, name: 'deafer terrifically'
city: 'misfitted', id: 14617, name: 'Eliot chambray'
city: 'faceless', id: 24481, name: 'shade forwent'

Added a live demo that parsed ~500mb/s and pretty prints 10 random records to prove it. — sehe, Feb 06 '15 at 01:59
If you find this method taking too much memory I'd suggest taking either approach #1 or #2 from [the linked answer](http://stackoverflow.com/questions/28217301/using-boostiostreamsmapped-file-source-with-stdmultimap/28220864#28220864). Of course it's hard to tell without knowing what you actually _want to do_ with the data (and how it's organized) — sehe, Feb 06 '15 at 12:25
I'm guessing this depended on an older Boost version, although I can't tell which. — Michael Foukarakis, May 09 '18 at 06:37

C++: Fast way to read mapped file into a matrix

1 Answers1

DEMO

Linked