-1

I am trying to read a .csv file which has around 50-thousands lines. In the code below I am reading each one line at a time, parsing it into pre-define struct called NinjaInfo_t and then pushing it into a list. The whole process of reading this file took me more than 1 minute. Is there a another way for it to be faster ? Also this is a homework so i cannot use any libraries that support reading csv file and I have to use a linked list to store data.

ifstream is("data.csv",ifstream::binary);
is.seekg(0, ios_base::end);
size_t size = is.tellg();
is.seekg(0, ios_base::beg);
char* buffer = new char[size];
is.read(buffer, size);
stringstream ss;
ss.str(buffer);
// skip the file line in data.csv
string data;
getline(ss, data);
while (getline(ss, data)) {
    NinjaInfo_t newData;
    stringstream dataStream;
    dataStream.str(data);
    string tmp;
    string timestamp;
    string id;
    string longtitude, latitude;

    getline(dataStream, tmp, ',');
    getline(dataStream, timestamp, ',');
    getline(dataStream, id, ',');
    getline(dataStream, longtitude, ',');
    getline(dataStream, latitude, ',');
    getline(dataStream, tmp);

    istringstream(longtitude) >> newData.longitude;

    istringstream(latitude) >> newData.latitude;

    while (id.length() < 4) {
        id = '0' + id;
    }
    istringstream(id) >> newData.id;

    stringstream ss;
    struct tm _tm = {};

    string t = timestamp;
    ss.str(t);
    ss >> std::get_time(&_tm, "%d/%m/%Y %H:%M:%S");
    _tm.tm_isdst = -1;
    time_t time = std::mktime(&_tm);
    newData.timestamp = time;
    db.push_back(newData); //DB IS A LINKED LIST.

P.s: sorry for the bad english :P.

  • 4
    Don't reinvent the wheel. Just use one of the loads and loads of existing CSV parsing libraries. They'll probably be faster. (Not even related to file i/o and parsing here is the fact that your code does *a lot* of run-time memory allocation and deallocation, which might really become tedious at this scale) – Marcus Müller Nov 26 '17 at 14:37
  • 3
    Possible duplicate of [How can I read and parse CSV files in C++?](https://stackoverflow.com/questions/1120140/how-can-i-read-and-parse-csv-files-in-c) – Marcus Müller Nov 26 '17 at 14:40
  • TThere isn't much detail in the question on how getline() works, but technically if you read line by line, most of the time of that minute goes in IO. If you buffer and read chunks of lines at a time, there should be less IO and it should run faster. – Juan Nov 26 '17 at 14:41
  • 1
    @Juan the OS does that for you. – Marcus Müller Nov 26 '17 at 14:42
  • (by the way, `getline` should be used to ... get lines. You probably just want to use normal file IO stream operators `>> variable`, or good ol' `sscanf` (be aware of security implications). – Marcus Müller Nov 26 '17 at 14:44
  • and your way of finding the file size is a *really* bad idea, too. – Marcus Müller Nov 26 '17 at 14:50
  • 1
    Why read the file into a character buffer and then read from that? Why not just .... read from the file? – Martin Bonner supports Monica Nov 26 '17 at 14:59
  • @Marcus i am not allowed to use any others libraries than the ones given. What should i do here if my way is that bad? – Dang Nguyen Nov 26 '17 at 15:03
  • @Martin well it was my attempt to reduce the reading time and it obviously didn't work so i just left it there. – Dang Nguyen Nov 26 '17 at 15:06
  • @DangNguyen "not allowed to use any other libraries" is class-A bullshit; either you need to solve a problem or not. If this is homework, mark it as such. And: if something doesn't work: remove it. Just don't let it stand around. Have you actually even tried to profile your code? In any case you **must** state all restrictions on your code, otherwise how are we supposed to answer? – Marcus Müller Nov 26 '17 at 15:15
  • And: C++17 comes with `string_view` which you could easily use to adapt a lot of the solutions given on the linked question. – Marcus Müller Nov 26 '17 at 15:19

1 Answers1

1

There are two main reasons for poor performance:

  • You generate a lot of overhead by creating throw-away istringstream at every iteration (i.e. constructor/initialization/destructor), instead of just doing a conversion (for example with std::stoi() and alike).
  • You push_back() each record to a db vector (I assume). But as its size grows, it requires frequent reallocation and moving of the data it contains. You could reserve() expected space at the beginning, just to reduce this memory management overhead.

Note 1: It's not clear, if reading the file at once as you do will bring significant performance boost, as file streams are buffered anyway.

Note 2: Just be aware that your CSV processing is not fully aligned with RFC 4180, which allows a newline in the data to be enclosed between quotes.

Christophe
  • 68,716
  • 7
  • 72
  • 138
  • @ Christophe db is a linked list :p. Anymore suggestions for me? – Dang Nguyen Nov 26 '17 at 15:12
  • @DangNguyen try replacing the linked list with a vector (and of course a preliminary reserve() ): for the linked list, you have a new memory allocation for every item that you add a new item. While this avoids moving data around as with growing vectors, this requires nevertheless intensive heap management operations (including for the deletion of the linked list at the end). – Christophe Nov 26 '17 at 15:20
  • @DangNguyen using a linked list for something that is fixed size, continuously allocated is already a performance mistake. – Marcus Müller Nov 26 '17 at 15:21
  • @Christophe by the way, don't underestimate the time it takes to (re)allocate the individual `std::strings`; instead, a view into the string should clearly be used here instead of copying the file data (at least) three times (when reading the whole file at the beginning, when reading from that into a line string, and when reading the elements in that line string into individual strings); the same logic could be implemented without a single byte being copied. – Marcus Müller Nov 26 '17 at 15:23
  • @MarcusMüller yes, I agree that the temporary strings are also time consuming. Just iterating trough the initial string would be better. Or alternatively, use the C++17 [`string_view`](http://en.cppreference.com/w/cpp/string/basic_string_view) – Christophe Nov 26 '17 at 17:10