
I'm looking to read from stdin (std::cin) with a syntax as below (it is always int, int, int, char[]/str). What would be the fastest way to parse the data into an int array[3] and either a string or a char array?

#NumberOfLines(i.e.10000000)
1,2,2,'abc'
2,2,2,'abcd'
1,2,3,'ab'
...1M+ to 10M+ more lines, always in the form of (int,int,int,str)

At the moment, I'm doing something along the lines of:

//unsync stdio
std::ios_base::sync_with_stdio(false);
std::cin.tie(NULL);
//read each line from cin
std::string str, label;
int array[3];
while (std::getline(std::cin, str)) {
    // peel off the three leading integers
    for (int i = 0; i < 3; ++i) {
        std::size_t commaindex = str.find(',');
        std::string substring = str.substr(0, commaindex);
        array[i] = atoi(substring.c_str());
        str.erase(0, commaindex + 1);
    }
    // whatever remains is the label
    label = str;
    //assign array and label to other stuff and do other stuff, repeat
}

I'm quite new to C++ and recently learned profiling with Visual Studio, but I'm not the best at interpreting the results. IO takes up 68.2% and the kernel 15.8% of CPU usage, and getline() accounts for 35.66% of the elapsed inclusive time.

Is there any way I can read large chunks at once to avoid calling getline() as much? I've been told fgets() is much faster; however, I'm unsure how to use it when I cannot predict the number of characters to specify.

I've attempted to use scanf as follows; however, it was slower than the getline method. I have also used stringstreams, but that was incredibly slow.

scanf("%i,%i,%i,%s",&array[0],&array[1],&array[2],str);

Also, if it matters, it is run on a server with low memory available, so I think reading the entire input into a buffer would not be viable? Thanks!

Update: Using @ted-lyngmo's approach, I gathered the results below.

time wc datafile

real    4m53.506s
user    4m14.219s
sys     0m36.781s

time ./a.out < datafile

real    2m50.657s
user    1m55.469s
sys     0m54.422s

time ./a.out datafile

real    2m40.367s
user    1m53.523s
sys     0m53.234s
  • It's certainly possible to read larger chunks from the input, but then you have to be able to handle partial lines, as they have no fixed length (see the sketch after these comments). – Some programmer dude Sep 12 '21 at 09:32
  • Don't do the string splitting manually; use `std::getline`. See here: https://stackoverflow.com/questions/1120140/how-can-i-read-and-parse-csv-files-in-c – 463035818_is_not_an_ai Sep 12 '21 at 09:33
  • In most cases when speed matters you will read from a file, not from user input, so there is no point in using std::cin. I recommend using `fscanf()` or other methods that read from a file. – Botond Horváth Sep 12 '21 at 09:36
  • I would read from a file. However, the input comes from the output of a GPU, where the GPU dumps its output to stdout and my program reads it from stdin. – badatpython Sep 12 '21 at 10:27
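
As a rough illustration of the chunked-reading idea from the first comment (not code from the thread; the 1 MiB buffer size and the carry-over handling are illustrative assumptions), reading stdin in big blocks while carrying the unfinished line over to the next block could look like this:

#include <cstddef>  // std::size_t
#include <cstdio>   // std::fread, stdin
#include <string>

int main() {
    // Read stdin in large chunks; lines have no fixed length, so the partial
    // line at the end of one chunk is carried over to the next one.
    constexpr std::size_t chunk_size = 1 << 20; // 1 MiB, arbitrary choice
    std::string buf(chunk_size, '\0');
    std::string carry; // unfinished line left over from the previous chunk

    std::size_t n;
    while ((n = std::fread(buf.data(), 1, buf.size(), stdin)) > 0) {
        std::size_t begin = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (buf[i] == '\n') {
                std::string line = carry + buf.substr(begin, i - begin);
                carry.clear();
                // parse `line` here (three ints and the label)
                begin = i + 1;
            }
        }
        carry.append(buf, begin, n - begin); // keep the unfinished tail
    }
    // if the input did not end with '\n', `carry` holds the last line
}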

1 Answer


You could use std::from_chars (and reserve() the approximate number of lines you have in the file, if you store the values in a vector, for example). I also suggest adding support for reading directly from the file. Reading from a file opened by the program is (at least for me) faster than reading from std::cin (even with sync_with_stdio(false)).

Example:

#include <algorithm> // std::for_each
#include <cctype>    // std::isspace
#include <charconv>  // std::from_chars
#include <cstdint>   // std::uintmax_t
#include <cstdio>    // std::perror
#include <fstream>
#include <iostream>
#include <iterator>  // std::istream_iterator
#include <limits>    // std::numeric_limits
#include <string>

struct foo {
    int a[3];
    std::string s;
};

std::istream& operator>>(std::istream& is, foo& f) {
    if(std::getline(is, f.s)) {
        std::from_chars_result fcr{f.s.data(), {}};
        const char* end = f.s.data() + f.s.size();

        // extract the numbers
        for(unsigned i = 0; i < 3 && fcr.ptr < end; ++i) {
            fcr = std::from_chars(fcr.ptr, end, f.a[i]);
            if(fcr.ec != std::errc{}) {
                is.setstate(std::ios::failbit);
                return is;
            }
            // find next non-whitespace
            do ++fcr.ptr;
            while(fcr.ptr < end &&
                  std::isspace(static_cast<unsigned char>(*fcr.ptr)));
        }

        // extract the string
        if(++fcr.ptr < end)
            f.s = std::string(fcr.ptr, end - 1);
        else
            is.setstate(std::ios::failbit);
    }
    return is;
}

std::ostream& operator<<(std::ostream& os, const foo& f) {
    for(int i = 0; i < 3; ++i) {
        os << f.a[i] << ',';
    }
    return os << '\'' << f.s << "'\n";
}

int main(int argc, char* argv[]) {
    std::ifstream ifs;
    if(argc >= 2) {
        ifs.open(argv[1]); // if a filename is given as argument
        if(!ifs) {
            std::perror(argv[1]);
            return 1;
        }
    } else {
        std::ios_base::sync_with_stdio(false);
        std::cin.tie(nullptr);
    }

    std::istream& is = argc >= 2 ? ifs : std::cin;

    // ignore the first line - it's of no use in this demo
    is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');

    // read all `foo`s from the stream
    std::uintmax_t co = 0;
    std::for_each(std::istream_iterator<foo>(is), std::istream_iterator<foo>(),
                  [&co](const foo& f) {
                      // Process each foo here
                      // Just counting them for demo purposes:
                      ++co;
                  });
    std::cout << co << '\n';
}
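
The reserve() part of the suggestion is not shown in the demo above. If you do store the entries in a std::vector, a minimal sketch (reusing the foo type and operator>> from the example; the read_all helper and the assumption that the header line is just '#' followed by the line count are illustrative, not part of the original code) could look like:

#include <vector>

std::vector<foo> read_all(std::istream& is) {
    std::string header;
    std::getline(is, header); // e.g. "#10000000"

    // parse the count after the '#' so the vector can be sized up front
    std::size_t expected = 0;
    if(header.size() > 1 && header.front() == '#')
        std::from_chars(header.data() + 1, header.data() + header.size(), expected);

    std::vector<foo> entries;
    entries.reserve(expected); // one allocation instead of repeated regrowth

    for(foo f; is >> f;)
        entries.push_back(std::move(f));
    return entries;
}

Note that on a low-memory server this only makes sense if all entries really have to be kept in memory at once.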

My test runs on a file with 1'000'000'000 lines with content looking like below:

2,2,2,'abcd'
2, 2,2,'abcd'
2, 2, 2,'abcd'
2, 2, 2, 'abcd'

time wc datafile

1000000000  2500000000 14500000000 datafile

real    1m53.440s
user    1m48.001s
sys     0m3.215s

time ./my_from_chars_prog datafile

1000000000

real    1m43.471s
user    1m28.247s
sys     0m5.622s

From this comparison I think one can see that my_from_chars_prog is able to parse all entries successfully and pretty fast. It was consistently faster at doing so than wc, a standard Unix tool whose only purpose is to count lines, words and characters.

Ted Lyngmo
  • Thanks! This is certainly interesting, I will have to look into this more and try to implement it. – badatpython Sep 12 '21 at 10:28
  • @badatpython You're welcome! I haven't done any benchmarking myself, so I'm very curious about the result. :-) – Ted Lyngmo Sep 12 '21 at 10:30
  • @badatpython I made a test reading 10M lines like those in your question. That took ~1 second. Did you test to see if this approach improved the speed on your side? – Ted Lyngmo Sep 13 '21 at 05:08
  • Yes! I managed to implement it into my current use case. It took a lot of time as I've never worked with istream_iterators or the charconv library. Surprisingly, it improved my total runtime by ~4-5s. – badatpython Sep 14 '21 at 08:12
  • @badatpython Cool! What's the total runtime now? – Ted Lyngmo Sep 14 '21 at 10:05
  • The total runtime on the biggest data set (~990M lines) went from 20 mins to 16 mins! I think this is enough optimisation of the input, though I wonder if there is a faster way to print the lines in the same format to stdout. Currently using `puts()`. – badatpython Sep 14 '21 at 10:50
  • Is it possible to adapt your code to handle whitespace? Annoyingly, some datasets provided are in the format `0, 0, 0, 'foo'` or `0,w0,w0,w'bar'` (where w represents a whitespace). – badatpython Sep 14 '21 at 10:56
  • @badatpython 990M lines ... wow, that's quite a lot. I'm guessing that you are not storing these internally since you "_run on a server with low memory available_", but that you process the entries one-by-one as you read them from the file? The whitespace case should be simple to fix (but will of course make it a little slower). I'll add a fix for that later. – Ted Lyngmo Sep 14 '21 at 14:18
  • @badatpython I now made it skip whitespaces and added a comparison between running `wc` and `my_from_chars_prog` on the same file with 1000M entries. – Ted Lyngmo Sep 14 '21 at 15:01
  • Thank you very much for your guidance! Currently (for extremely large files), my implementation reads 10M lines and does some processing, then reads another 10M lines, and so on (a sketch of this batching approach follows these comments). This hopefully avoids overflow; would you say this is a sound implementation? – badatpython Sep 14 '21 at 15:20
  • @badatpython Yeah, that could perhaps (depending on disks and caches) be more effective than reading and processing one entry at a time. If you run my program (as it is right now), how long does it take to count all entries in your 990M lines file (if you also give it the filename as an argument)? – Ted Lyngmo Sep 14 '21 at 15:23
  • I've just updated the post with my timings. It was rather fast. I'm unsure why, but whenever I passed the file as an argument, it would just idle and do nothing. That said, I am very pleased with the current performance! – badatpython Sep 14 '21 at 16:47
  • @badatpython Nice times :-) Odd that reading from the file doesn't work, though. Perhaps printing an error if it doesn't succeed in opening the file would be a first step towards finding the reason. I added that to the answer too. – Ted Lyngmo Sep 14 '21 at 16:53
  • Rather odd, I've rerun it with the new answer. It just prints 0, presumably nothing happened. Regardless, I think this satisfies my performance times! Thank you very much for your ongoing support Ted! – badatpython Sep 15 '21 at 01:31
  • @badatpython 0 means that it failed extracting the first line in `datafile`, not that it failed opening the file. That part is unchanged in the answer. Can the old version of the program still extract from the file? – Ted Lyngmo Sep 15 '21 at 03:14
  • Oh! Right, that makes so much sense; I totally forgot. The file given starts with a header line! The header line gives the number of lines the file has; it looks something like `#numberOfLines`. I'm sorry, this would have been handy to know earlier; I've updated the question. – badatpython Sep 15 '21 at 03:53
  • @badatpython Aha, got it :) In that case, just do `is.ignore(std::numeric_limits<std::streamsize>::max(), '\n');` before the `std::for_each` (since my demo program doesn't need to know how many lines there are). – Ted Lyngmo Sep 15 '21 at 05:18
  • @badatpython I added that to the answer to make it fit the question too. – Ted Lyngmo Sep 15 '21 at 05:31
  • It worked! I've updated the times now. Thank you! – badatpython Sep 15 '21 at 06:45
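
The batching described in the comments above (read roughly 10M lines, process them, then read the next 10M) could be sketched as below, reusing foo and its operator>> from the answer; process_batch and the default batch size are placeholders for the real processing, not code from the thread:

#include <vector>

// placeholder for the real per-batch processing step
void process_batch(const std::vector<foo>& batch);

void process_in_batches(std::istream& is, std::size_t batch_size = 10'000'000) {
    std::vector<foo> batch;
    batch.reserve(batch_size);

    for(foo f; is >> f;) {
        batch.push_back(std::move(f));
        if(batch.size() == batch_size) {
            process_batch(batch); // handle this batch of entries
            batch.clear();        // keeps the capacity, drops the contents
        }
    }
    if(!batch.empty())
        process_batch(batch); // final, partially filled batch
}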