
I have a really huge file with 17 million records in it.

Here is a sample of the file:

Actor Movie
1,2
2,2
3,1
4,3
2,3

I want to skip the first line and start parsing from the second line onward. I am trying to create two things:
1. Movies to actors map
vector<uint64_t> *movie_map = new vector<uint64_t>[1200000];
2. Actors to movies map
vector<uint64_t> *actor_movie_map = new vector<uint64_t>[2000000];

I purposely did not want a hash map since computing the hash takes some time. I tried the Boost library. It reads the file (~250 MB) in about 3 seconds, but a lot of time is consumed while creating the maps. In fact, the time is worse than the normal getline() way of reading the file. Here is my implementation so far.

#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/utility/string_ref.hpp>
#include <cstdint>
#include <vector>

using CsvField = boost::string_ref;
using CsvLine  = std::vector<CsvField>;
using CsvFile  = std::vector<CsvLine>;

namespace qi = boost::spirit::qi;
struct CsvParser : qi::grammar<char const*, CsvFile()> {
    CsvParser() : CsvParser::base_type(lines)
    {
        using boost::phoenix::construct;
        using boost::phoenix::begin;
        using boost::phoenix::size;

        using namespace qi;

        field = raw [*~char_(",\r\n")] [ _val = construct<CsvField>(begin(_1), size(_1)) ];
        line  = field % ',';
        lines = line  % eol;
    }
  private:
    qi::rule<char const*, CsvField()> field;
    qi::rule<char const*, CsvLine()>  line;
    qi::rule<char const*, CsvFile()>  lines;
};

int main()
{
    boost::iostreams::mapped_file_source csv("playedin.csv");

    CsvFile parsed;
    parsed.reserve(18*1000*1000);
    if (qi::parse(csv.data(), csv.data() + csv.size(), CsvParser(), parsed))
    {
        using boost::lexical_cast;
        // Skip the header row at index 0.
        for (uint64_t i = 1; i < parsed.size(); i++) {
            auto& line = parsed[i];
            uint64_t actor = lexical_cast<uint64_t>(line[0]);
            uint64_t movie = lexical_cast<uint64_t>(line[1]);
            movie_map[movie].push_back(actor);
            actor_movie_map[actor].push_back(movie);
        }
    }
}


I do not want to use the normal way of reading the file because of its large size. Please suggest an implementation so that reading the whole file and preparing the maps for all 17 million records takes less than 2-3 seconds. I understand the expectation is a little too much, but I am sure it is possible. I am really looking for the most efficient way of doing this.

Thanks for your help!

asked by Harshita (edited by sehe)

2 Answers

  • vector<uint64_t> *movie_map = new vector<uint64_t>[1200000];

    ✍ Never use new or delete in modern C++

  • I purposefully did not want a HashMap since it takes some time for computing hash.

    Exactly how much time does calculating the hash take? I mean, chances are a hash map isn't the best choice here, but your reasoning is off. On many implementations std::hash<> of a 64-bit integer is a no-op (on a system where size_t is 64 bits)¹; see the quick check below.
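
As a quick check of that claim, a minimal sketch (the id value is arbitrary; the exact hash is implementation-defined, but common 64-bit standard libraries print the same number twice):

#include <cstdint>
#include <functional>
#include <iostream>

int main() {
    uint64_t id = 1234567; // arbitrary example id
    // On many 64-bit implementations std::hash<uint64_t> is the identity.
    std::cout << id << " -> " << std::hash<uint64_t>{}(id) << "\n";
}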

But The Kicker Is:

You ... read all of the data into a CsvFile first (that's a vector of vectors of string_refs...) only to then convert to a map?!

This is ridiculous. Simply skip the middleman; you don't need it!

Mind you, Spirit is a parser generator. It defeats the purpose to parse the text into string fields first, only to convert them with lexical_cast afterwards.

Demo

Here's a C++14 demo that cuts out the middleman, using Boost Spirit X3 for good measure. I chose flat_multimap somewhat arbitrarily, to make two points:

  • your data structure is "like a" multi-map, perhaps even an adjacency list. What does your use-case require of it?
  • there are (many) existing data structures with subtly varying performance characteristics

Live On Coliru

#include <boost/container/flat_map.hpp>
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/spirit/home/x3.hpp>

using Table  = boost::container::flat_multimap<uint64_t, uint64_t>;
using Record = Table::value_type;

namespace Parsing {
    using namespace boost::spirit::x3;

    auto const ignore_header_row 
        = !uint_ >> *(char_ - eol) >> eol;

    auto const record 
        = rule<struct _rl, Record> {"record"}
        = uint_ >> ',' >> uint_ >> eol;

    auto const file 
     // = rule<struct _file, Table> {"file"}
        = omit [*ignore_header_row] >> *record >> eoi;
}

#include <iostream>

int main() {
    boost::iostreams::mapped_file_source mfs("playedin.csv");

    Table table;
    table.reserve(18*1000*1000);
    if (parse(mfs.begin(), mfs.end(), Parsing::file, table)) {
        std::cout << "Parsed " << table.size() << " records\n";
    } else {
        std::cout << "Parse failed\n";
    }
}

Prints (for the five-row sample from the question saved as playedin.csv):

Parsed 5 records

Caveat: In the latest Boost versions there is a regression in X3 attribute handling; you will want to use the fix from this answer: https://stackoverflow.com/a/48393573/85371

Benchmark + Query

Benchmarking predictably shows that inserting 17+ million unsorted rows is not optimal with the flat-map approach:

  • 1,000,000 unsorted rows are read in ~4m39s,
  • with the input sorted, it takes only 0.113s to read those same rows (output screenshot)

The obvious bottleneck is sorting while parsing. That's easily fixed: we don't need to sort while parsing. Just sort after parsing:

  • all 17.4 million rows are now parsed and sorted in 1.922s, or 1.284s if presorted (output screenshot again)

Benchmarked Code Listing

The final version is Live On Coliru:

#include <algorithm>
#include <cstdint>
#include <boost/container/flat_map.hpp>
#include <boost/container/vector.hpp>
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/spirit/home/qi.hpp>

using Table  = boost::container::vector<std::pair<uint64_t, uint64_t> >;
using Record = Table::value_type;

namespace Parsing {
    using namespace boost::spirit::qi;
    using Iterator = char const*;

    static const rule<Iterator> ignore_header_row
        = !uint_ >> *(char_ - eol) >> eol;
    static const rule<Iterator, Record()> record 
        = uint_ >> ',' >> uint_ >> eol;

    static const rule<Iterator, Table()> file 
        = omit [*ignore_header_row] >> *record >> eoi;
}

Table parse_data(std::string const& fname) {
    boost::iostreams::mapped_file_source mfs(fname);

    Table table;
    table.reserve(18*1000*1000);
    if (!parse(mfs.begin(), mfs.end(), Parsing::file, table))
        throw std::runtime_error("Parse failed");

    // Sort once after parsing so lookups can binary-search by actor id.
    std::sort(table.begin(), table.end());
    return table;
}

// Small range wrapper so the pair returned by equal_range works with range-for.
template <typename It> struct iterator_range {
    It b, e;
    iterator_range() = default;
    iterator_range(std::pair<It, It> p) : b(p.first), e(p.second) {}
    It begin() const { return b; }
    It end() const   { return e; }
};

// Heterogeneous comparator: compares Records and bare actor ids for binary search.
struct by_actor {
    template <typename T, typename U>
    bool operator()(T const& a, U const& b) const { return actor(a) < actor(b); }
  private:
    static uint64_t actor(Record const& r) { return r.first; }
    static uint64_t actor(uint64_t i) { return i; }
};

#include <iostream>

int main(int argc, char** argv) {
    Table const& table = parse_data("playedin.csv");

    auto query = [&table](uint64_t actor) { 
        return iterator_range<Table::const_iterator>(std::equal_range(table.begin(), table.end(), actor, by_actor{}));
    };

    for (auto actor : std::vector<std::string>{argv+1, argv+argc}) {
        std::cout << "Actor " << actor << " played in:";
        for (auto movie : query(std::stoull(actor)))
            std::cout << " " << movie.second;
        std::cout << "\n";
    }
}
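
For instance, with the five-row sample from the question saved as playedin.csv, running the compiled program with arguments 1 2 should print movie 2 for actor 1, and movies 2 and 3 for actor 2.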

¹ Likewise, Boost's boost::hash<> defers to boost::hash_value(unsigned long long), which is documented to return val when abs(val) <= std::numeric_limits<std::size_t>::max().
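
A minimal sketch illustrating that footnote (assumes Boost is available; the id value is arbitrary):

#include <boost/functional/hash.hpp>
#include <cstdint>
#include <iostream>

int main() {
    uint64_t id = 987654321; // arbitrary example id
    // boost::hash<> defers to hash_value(), which returns integral values
    // unchanged when they fit in size_t, so this prints the id itself.
    std::cout << boost::hash<uint64_t>{}(id) << "\n";
}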

answered by sehe
  • I tried changing `using CsvField = boost::string_ref;` to `using CsvField = std::uint64_t;`, but it is giving me compilation errors. It would be great if you can host the whole program on your site. Thanks :) – Harshita Jan 31 '18 at 01:51
  • Which site would that be? Anyhow, here's a [version that skips the middle-man and randomly chose `boost::container::flat_multimap<>`](http://coliru.stacked-crooked.com/a/13bda74a9c41388e) because you might not know it exists. I'll update my answer. If you include the `playedin.csv` file somewhere I'll do some benchmarking (likely `std::multimap` vs. `flat_multimap`, maybe vs. `boost::adjacency_list` or something - your question smells of a graph problem) – sehe Jan 31 '18 at 01:56
  • Thank you very much! The file can be downloaded at [link](https://drive.google.com/file/d/1HftwiuS9ljH6t3ZGFktiDTGnYURAqYZK/view). I tried to run the code on my local. It is giving me compilation errors even after applying the fix from git. – Harshita Jan 31 '18 at 02:57
  • Testing with that large file and this [slightly extended test program](http://coliru.stacked-crooked.com/a/e7fa9e984e401b5d) prints [query results in 1.894s](https://i.imgur.com/ZzW36fI.png). Parsing [_alone_](http://coliru.stacked-crooked.com/a/b7e274d9f745b68e) takes [0.819s](https://i.imgur.com/4u9dFyl.png). – sehe Jan 31 '18 at 13:44
  • Here's a [version using just C++11](http://coliru.stacked-crooked.com/a/748d1e798b8ef31d) (Boost Spirit Qi) that runs [in the same amount of time](https://i.imgur.com/oDMlULa.png). It compiles fine [on MSVC](http://rextester.com/UPLL59025) (of course it doesn't run correctly because there's no input file there) – sehe Jan 31 '18 at 13:52
  • Added some benchmark results/analysis to the answer. – sehe Jan 31 '18 at 15:50
  • The benchmark scores look really promising. However, the program returns a Table which is pairs of actors and movies in each row, which is sort of useless. The whole point of this was to return two maps: `vector<uint64_t> *movie_to_actor_map = new vector<uint64_t>[1200000]; vector<uint64_t> *actor_to_movie_map = new vector<uint64_t>[2000000];`. This will give me the list of movies each actor has played in and also the list of actors of each movie. I wanted to create these maps while parsing itself so I save some time in transferring. – Harshita Feb 01 '18 at 00:51
  • If you have noticed in my implementation, I have used `movie_to_actor_map[movie].push_back(actor); actor_to_movie_map[actor].push_back(movie));` where `actor` and `movie` are the first and second columns in the dataset respectively. This allows me to fetch either actors played in a particular movie or movies in which actor has acted by using `[]` indexing. I will store Actor n's movies in the nth index. This way my lookups will be easy and fast ;) Could you please make these last changes if you don't mind? :) – Harshita Feb 01 '18 at 00:57
  • Here is how I have successfully implemented this without memory maps. File reading, parsing, and finding the Bacon number for 50 test cases (called on threads) using bidirectional BFS works in ~13 seconds. `std::ifstream file(lPath); string dummyline; std::getline(file, dummyline); int64_t fir, sec; while(file >> fir && file.ignore(std::numeric_limits<std::streamsize>::max(), ',') && file >> sec){ movie_to_actor_map[sec].push_back(fir); actor_to_movie_map[fir].push_back(sec); }` – Harshita Feb 01 '18 at 01:10
  • Haha. So I was exactly right that I smelled a graph problem. The multimap is _exactly_ like an adjacency-list representation of that, so my sample did already do the building while parsing. – sehe Feb 01 '18 at 01:38
  • Which id is Bacon? If you give it on the command line here, this [test program](http://coliru.stacked-crooked.com/a/1198bb1321245872) calculates the Bacon rank for **every** other movie/actor in the dataset in 12.7s here. As a demo, [it shows the most remote relation (which usually is an actor, but could be a movie), with the connecting path](https://i.imgur.com/iGW7A9e.png). For fun I used Boost Graph this time around. – sehe Feb 01 '18 at 01:47
  • Please find my implementation on [Coliru](http://coliru.stacked-crooked.com/a/73f1dd5f531b2f57). Dijkstra's algorithm is only as good as BFS when the graph is unweighted. I am using Bidirectional BFS which is way faster. Also, I have tweaked the algorithm so it uses two of the maps that I have come up with. My [benchmark](https://imgur.com/AaAYSjh) on bigger dataset. Please note, this is for TWO test cases ;) Using memory maps to read the file will only make it better. Please please implement that? – Harshita Feb 01 '18 at 03:17
  • Can you post that at [CodeReview.SE]? – sehe Feb 01 '18 at 15:07
  • Please find the code [here](https://codereview.stackexchange.com/questions/186530/program-to-parse-csv-and-use-bidirectional-bfs-to-find-the-shortest-degree-of-co) – Harshita Feb 01 '18 at 15:58

A few thoughts:

  • lexical_cast tends to be slow.
  • Memory-mapping the file does not help sequential access. It might even be worse than regular reading by chunks (see the sketch after this list).
  • An array of vectors is inefficient if the vectors are small. Each vector consumes at least three pointers' worth of space when it is empty. You'd be better off with a different container, like std::unordered_map.
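
For comparison, a minimal sketch of plain chunked reading (it assumes the question's playedin.csv; the 1 MiB buffer size is an arbitrary choice):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <vector>

int main() {
    std::ifstream in("playedin.csv", std::ios::binary);
    std::vector<char> buf(1 << 20); // read in 1 MiB chunks
    std::size_t total = 0;
    // read() signals failure on the final short read, so also check
    // gcount() to pick up the trailing partial chunk.
    while (in.read(buf.data(), buf.size()) || in.gcount() > 0)
        total += static_cast<std::size_t>(in.gcount());
    std::cout << "Read " << total << " bytes\n";
}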
answered by John Zwinck
  • Do you recommend implementing thread on Memory mapping? Would that be better? If yes, do you have any resources? I have tried with `std::unordered_map`. This gave better result. – Harshita Jan 31 '18 at 01:30
  • "Implementing thread on Memory mapping?" It sounds a bit like you're regurgitating words you heard. If unodered_map gave good results, why not use it? There's little use in finding complicated optimizations that you don't understand, IMO – sehe Jan 31 '18 at 02:02
  • @sehe I meant, if I should consider having threads to parse/read the memory mapped file. If yes, would that improve the performance? – Harshita Jan 31 '18 at 03:10
  • @sehe unordered maps gave worse results than array of vectors. That is why I stuck to this – Harshita Jan 31 '18 at 03:12
  • 1
    @Harshita: I don't think threading is going to help you at all. Your code is already overly complex relative to an optimized version. – John Zwinck Jan 31 '18 at 08:20