3

I'm trying to read a mapped file into a matrix. The file is something like this:

name;phone;city\n
Luigi Rossi;02341567;Milan\n
Mario Bianchi;06567890;Rome\n
.... 

and it's quiet big. The code I've written works properly but it's not so fast:

#include <iostream>
#include <fstream>
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>

using namespace std;

int main() {

    int i;
    int j=0;
    int k=0;

    vector< vector<char> > M(10000000, vector<string>(3));

    mapped_file_source file("file.csv");

    // Check if file was successfully opened
    if(file.is_open()) {

      // Get pointer to the data
      const char * c = (const char *)file.data();

      int size=file.size();

      for(i = 0; i < (size+1); i++){

       if(c[i]=='\n' || i==size){
        j=j+1;
        k=0;
       }else if(c[i]==';'){
        k=k+1;
       }else{
        M[j][k]+=c[i];
       }    
     }//end for


   }//end if    

 return(0)


}

Is there a faster way? I've read something about memcyp but I don't know how to use it to speed up my code.

Davide
  • 27
  • 2
  • Unless you have close to ten million entries in the file, you are wasting quite a lot of memory. You might want to consider `std::deque` instead perhaps? Or just letting `std::vector` handle it. And if you're just loading this collections once, will the "effectiveness" of this part of the code really impact the total runtime of the program by such a large percentage? – Some programmer dude Feb 05 '15 at 09:36
  • Oh, and when writing an [MCVE](http://stackoverflow.com/help/mcve), you should perhaps make sure that it actually builds. The declaration of `M` is inconsistent with the initialization. – Some programmer dude Feb 05 '15 at 09:38
  • 2
    I'm not sure if boost's `mapped_file_source` supports it, but on POSIX `mmap`'s `MAP_PRIVATE` means you can write to the memory representing the file without affecting the file (or other processes memory mapping it too). With that, one very fast option is to overwrite `;` and `\n` characters with NULs, then keep `const char*`s and/or integral offsets to the start of the textual values instead of separate `std::string` objects. If you're ***really*** serious about speed and memory efficiency, I'd suggest trying that (even if it means bypassing boost). – Tony Delroy Feb 05 '15 at 09:52
  • @JoachimPileborg: soory, this is the right code: vector< vector > M(1000000, vector(3)); – Davide Feb 05 '15 at 10:42
  • @TonyD boost supports it – sehe Feb 05 '15 at 10:50

1 Answers1

7

I have numerous examples doing this/similar written up on SO.

Let me list the most relevant:

In all other cases, consider slamming a Spirit Qi job on it, potentially using boost::string_ref instead of vector<char> (unless the mapped file is not "const", of course).

The string_ref is also shown int the last answer linked before. Another interesting example of this (with lazy conversions to un-escaped string values) is here How to parse mustache with Boost.Xpressive correctly?

DEMO

Here's that Qi job slammed on it:

  • it parses a 994 MiB file of ~32 million lines in 2.9s into a vector of

    struct Line {
        boost::string_ref name, city;
        long id;
    };
    
  • note that we parse the number, and store the strings by referring to their location in the memory map + length (string_ref)

  • it pretty-prints the data from 10 random lines
  • it can run as fast as 2.5s if you reserve 32m elements in the vector at once; the program does only a single memory allocation in that case.
  • NOTE: on a 64 bit system, the memory representation grows larger than the input size if the average line length is less than 40 bytes. This is because a string_ref is 16 bytes.

Live On Coliru

#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_ref.hpp>

namespace qi = boost::spirit::qi;
using sref   = boost::string_ref;

namespace boost { namespace spirit { namespace traits {
    template <typename It>
    struct assign_to_attribute_from_iterators<sref, It, void> {
        static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f,l)) }; }
    };
} } }

struct Line {
    sref name, city;
    long id;
};

BOOST_FUSION_ADAPT_STRUCT(Line, (sref,name)(long,id)(sref,city))

int main() {
    boost::iostreams::mapped_file_source mmap("input.txt");

    using namespace qi;

    std::vector<Line> parsed;
    parsed.reserve(32000000);
    if (phrase_parse(mmap.begin(), mmap.end(), 
                omit[+graph] >> eol >>
                (raw[*~char_(";\r\n")] >> ';' >> long_ >> ';' >> raw[*~char_(";\r\n")]) % eol,
                qi::blank, parsed))
    {
        std::cout << "Parsed " << parsed.size() << " lines\n";
    } else {
        std::cout << "Failed after " << parsed.size() << " lines\n";
    }

    std::cout << "Printing 10 random items:\n";
    for(int i=0; i<10; ++i) {
        auto& line = parsed[rand() % parsed.size()];
        std::cout << "city: '" << line.city << "', id: " << line.id << ", name: '" << line.name << "'\n";
    }
}

With input generated like

do grep -v "'" /etc/dictionaries-common/words | sort -R | xargs -d\\n -n 3 | while read a b c; do echo "$a $b;$RANDOM;$c"; done

The output is e.g.

Parsed 31609499 lines
Printing 10 random items:
city: 'opted', id: 14614, name: 'baronets theosophy'
city: 'denominated', id: 24260, name: 'insignia ophthalmic'
city: 'mademoiselles', id: 10791, name: 'smelter orienting'
city: 'ducked', id: 32155, name: 'encircled flippantly'
city: 'garotte', id: 3080, name: 'keeling South'
city: 'emirs', id: 14511, name: 'Aztecs vindicators'
city: 'characteristically', id: 5473, name: 'constancy Troy'
city: 'savvy', id: 3921, name: 'deafer terrifically'
city: 'misfitted', id: 14617, name: 'Eliot chambray'
city: 'faceless', id: 24481, name: 'shade forwent'
Community
  • 1
  • 1
sehe
  • 374,641
  • 47
  • 450
  • 633
  • Added a live demo that parsed ~500mb/s and pretty prints 10 random records to prove it. – sehe Feb 06 '15 at 01:59
  • I like this approach, I'll try it and let you know! Thanks! – Davide Feb 06 '15 at 12:24
  • If you find this method taking too much memory I'd suggest taking either approach #1 or #2 from [the linked answer](http://stackoverflow.com/questions/28217301/using-boostiostreamsmapped-file-source-with-stdmultimap/28220864#28220864). Of course it's hard to tell without knowing what you actually _want to do_ with the data (and how it's organized) – sehe Feb 06 '15 at 12:25
  • I'm guessing this depended on an older Boost version, although I can't tell which. – Michael Foukarakis May 09 '18 at 06:37