I have a really huge file with 17 million records in it.
Here is a sample of the file:
Actor Movie
1,2
2,2
3,1
4,3
2,3
I want to skip the first line and start parsing from the second line onward. I am trying to create two things:
1. Movies to actors map
vector<uint64_t> *movie_map = new vector<uint64_t>[1200000];
2. Actors to movies map
vector<uint64_t> *actor_movie_map = new vector<uint64_t>[2000000];
I deliberately avoided a hash map because computing hashes takes time. I tried to use the Boost library: it reads the file (~250 MB) in about 3 seconds, but a lot of time is spent building the maps, and overall it is actually slower than the plain getline() way of reading the file (a sketch of that baseline follows the code below). Here is my implementation so far:
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_ref.hpp>
#include <boost/lexical_cast.hpp>
#include <cstdint>
#include <vector>

using CsvField = boost::string_ref;
using CsvLine  = std::vector<CsvField>;
using CsvFile  = std::vector<CsvLine>;

namespace qi = boost::spirit::qi;

struct CsvParser : qi::grammar<char const*, CsvFile()> {
    CsvParser() : CsvParser::base_type(lines)
    {
        using boost::phoenix::construct;
        using boost::phoenix::begin;
        using boost::phoenix::size;
        using namespace qi;

        // Each field is a string_ref into the memory-mapped file, so nothing is copied.
        field = raw [*~char_(",\r\n")] [ _val = construct<CsvField>(begin(_1), size(_1)) ];
        line  = field % ',';
        lines = line % eol;
    }

  private:
    qi::rule<char const*, CsvField()> field;
    qi::rule<char const*, CsvLine()>  line;
    qi::rule<char const*, CsvFile()>  lines;
};

// The two maps described above.
std::vector<uint64_t>* movie_map       = new std::vector<uint64_t>[1200000];
std::vector<uint64_t>* actor_movie_map = new std::vector<uint64_t>[2000000];

int main()
{
    boost::iostreams::mapped_file_source csv("playedin.csv");

    CsvFile parsed;
    parsed.reserve(18 * 1000 * 1000);

    if (qi::parse(csv.data(), csv.data() + csv.size(), CsvParser(), parsed))
    {
        using boost::lexical_cast;

        // Start at i = 1 to skip the header line.
        // (Assumes every remaining line has two numeric fields.)
        for (uint64_t i = 1; i < parsed.size(); i++) {
            auto&    line  = parsed[i];
            uint64_t actor = lexical_cast<uint64_t>(line[0]);
            uint64_t movie = lexical_cast<uint64_t>(line[1]);
            movie_map[movie].push_back(actor);
            actor_movie_map[actor].push_back(movie);
        }
    }
}
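For reference, the plain getline() baseline I compared against looks roughly like this. It is a minimal sketch of that approach, not my exact code; the strtoull-based field splitting is just one way to parse the two numbers.

#include <cstdint>
#include <cstdlib>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    // Same maps as above.
    std::vector<uint64_t>* movie_map       = new std::vector<uint64_t>[1200000];
    std::vector<uint64_t>* actor_movie_map = new std::vector<uint64_t>[2000000];

    std::ifstream in("playedin.csv");
    std::string line;
    std::getline(in, line);                       // skip the header line

    while (std::getline(in, line)) {
        std::size_t comma = line.find(',');
        if (comma == std::string::npos) continue; // skip blank/malformed lines
        // Parse "actor,movie" without allocating substrings.
        uint64_t actor = std::strtoull(line.c_str(), nullptr, 10);
        uint64_t movie = std::strtoull(line.c_str() + comma + 1, nullptr, 10);
        movie_map[movie].push_back(actor);
        actor_movie_map[actor].push_back(movie);
    }
}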
I do not want to use the normal way of reading the file because of its large size. Please suggest a way of implementing this so that reading the whole file and preparing the maps for 17 million records takes less than 2-3 seconds. I understand the expectation is a little too much, but I am sure it is possible. I am really looking for the most efficient way of doing this.
Thanks for your help!