
When I read from files in C++(11) I map them into memory using:

boost::interprocess::file_mapping* fm = new boost::interprocess::file_mapping(path, boost::interprocess::read_only);
boost::interprocess::mapped_region* region = new boost::interprocess::mapped_region(*fm, boost::interprocess::read_only);
char* bytes = static_cast<char*>(region->get_address());

Which is fine when I wish to read byte by byte extremely fast. However, I have created a csv file which I would like to map to memory, read each line and split each line on the comma.

Is there a way I can do this with a few modifications of my above code?

(I am mapping to memory because I have an awful lot of memory and I do not want any bottleneck with disk/IO streaming).

user997112

2 Answers


Here's my take on "fast enough". It zips through 116 MiB of CSV (2.5 million lines[1]) in ~1 second.

The result is then randomly accessible at zero-copy cost, so there is no overhead (unless pages are swapped out).

For comparison:

  • that's ~3x faster than a naive wc csv.txt on the same file
  • it's about as fast as the following perl one liner (which lists the distinct field counts on all lines):

    perl -ne '$fields{scalar split /,/}++; END { map { print "$_\n" } keys %fields  }' csv.txt
    
  • it's only ~1.5x slower than LANG=C wc csv.txt, which avoids the locale functionality

Here's the parser in all its glory:

using CsvField = boost::string_ref;
using CsvLine  = std::vector<CsvField>;
using CsvFile  = std::vector<CsvLine>;  // keep it simple :)

struct CsvParser : qi::grammar<char const*, CsvFile()> {
    CsvParser() : CsvParser::base_type(lines)
    {
        using namespace qi;

        field = raw [*~char_(",\r\n")] 
            [ _val = construct<CsvField>(begin(_1), size(_1)) ]; // semantic action
        line  = field % ',';
        lines = line  % eol;
    }
    // rule declarations; attribute types match the aliases above
    qi::rule<char const*, CsvField()> field;
    qi::rule<char const*, CsvLine()>  line;
    qi::rule<char const*, CsvFile()>  lines;
};

The only tricky thing (and the only optimization there) is the semantic action, which constructs a CsvField from the source iterator and the matched number of characters.

Here's the main:

int main()
{
    boost::iostreams::mapped_file_source csv("csv.txt");

    CsvFile parsed;
    if (qi::parse(csv.data(), csv.data() + csv.size(), CsvParser(), parsed))
    {
        std::cout << (csv.size() >> 20) << " MiB parsed into " << parsed.size() << " lines of CSV field values\n";
    }
}

Printing

116 MiB parsed into 2578421 lines of CSV field values

You can use the values just like std::string:

for (int i = 0; i < 10; ++i)
{
    auto l     = rand() % parsed.size();
    auto& line = parsed[l];
    auto c     = rand() % line.size();

    std::cout << "Random field at L:" << l << "\t C:" << c << "\t" << line[c] << "\n";
}

Which prints, e.g.:

Random field at L:1979500    C:2    sateen's
Random field at L:928192     C:1    sackcloth's
Random field at L:1570275    C:4    accompanist's
Random field at L:479916     C:2    apparel's
Random field at L:767709     C:0    pinks
Random field at L:1174430    C:4    axioms
Random field at L:1209371    C:4    wants
Random field at L:2183367    C:1    Klondikes
Random field at L:2142220    C:1    Anthony
Random field at L:1680066    C:2    pines

The fully working sample is here: Live On Coliru


[1] I created the file by repeatedly appending the output of

while read a && read b && read c && read d && read e
do echo "$a,$b,$c,$d,$e"
done < /etc/dictionaries-common/words

to csv.txt, until it counted 2.5 million lines.

sehe
  • Note: you can do faster than this, but at the cost of inconvenience. In particular there's `madvise`, you can parse lazily (lines only first, and on-demand). Also, see my other answers for much more versatile CSV parsing (e.g. allowing for quoted values and escapes). Next up: you can avoid a lot of allocations if you know the number of columns up front. – sehe May 16 '14 at 20:23
  • where to find quoting and escapes ? – Roby Oct 10 '15 at 19:50
  • My other answers, e.g. http://stackoverflow.com/a/9405546/85371, http://stackoverflow.com/questions/10289985/parse-quoted-strings-with-boostspirit/10294577#10294577, http://stackoverflow.com/questions/18365463/how-to-parse-csv-using-boostspirit/18366335?s=6|0.0000#18366335, http://stackoverflow.com/questions/31536086/parse-tab-delimited-file-with-boost-spirit-where-entries-may-contain-whitespace/31536501?s=21|0.0000#31536501 etc. – sehe Oct 10 '15 at 19:59
  • Most of these were cherry-picked from a simple search like [`user:85371 [boost-spirit] escape`](http://stackoverflow.com/search?page=2&tab=votes&q=user%3a85371%20%5bboost-spirit%5d%20escape) – sehe Oct 10 '15 at 20:00

Simply create an istringstream from your memory-mapped bytes and parse that using boost::tokenizer:

#include <boost/tokenizer.hpp>
#include <sstream>
#include <string>
#include <vector>

const std::string stringBuffer(bytes, region->get_size());
std::istringstream is(stringBuffer);

typedef boost::tokenizer< boost::escaped_list_separator<char> > Tokenizer;

std::string line;
std::vector<std::string> parsed;
while (std::getline(is, line))
{
    Tokenizer tokenizer(line);
    parsed.assign(tokenizer.begin(), tokenizer.end());
    for (auto &column : parsed)
    {
        // process each field here
    }
}

Note that on many systems memory mapping doesn't provide any speed benefit over sequential reads. In both cases you end up reading the data from the disk page by page, probably with the same amount of read-ahead, and both the IO latency and bandwidth will be the same. Whether you have lots of memory or not won't make any difference. Also, depending on the system, memory mapping, even read-only, might lead to surprising behaviours (e.g. reserving swap space) that sometimes keep people busy troubleshooting.

Come Raczy
  • So memory-mapping doesn't load the file upfront? I thought it did! – user997112 May 16 '14 at 18:18
  • loading it into a stringstream makes sure you copy all the memory first – sehe May 16 '14 at 18:45
  • @user997112 no it does not load the file upfront into physical memory and this is a good thing. First there would be a problem with very large files, and then it would prevent you from overlaying IO and processing. Even though loading it into a stringstream would load each page in memory as sehe said, it won't guarantee that the whole file is in RAM at that point (memory pages might be discarded). – Come Raczy May 16 '14 at 20:07
  • I've cooked up a sample using Boost to parse through entire source, while not settling for slowness and copying with `std::string` and `std::stringstream`. It can be done faster, but this would surely be a good start. – sehe May 16 '14 at 20:14