Parsing a CSV file - C++

Question

C++14

The university staff recommended that we use Boost to parse the file, but I installed it and haven't managed to implement anything with it.

So I have to parse a CSV file line by line, where each line has two columns separated by a comma. Each of these two columns holds a single digit. I have to take the integer values of these two digits and use them to construct my Fractal objects at the end.

The first problem: the file can look like this, for example:

1,1
<HERE WE HAVE A NEWLINE>
<HERE WE HAVE A NEWLINE>

This file format is okay. But my solution outputs "Invalid input" for it, whereas the correct solution is supposed to print the respective fractal (1,1) exactly once.

The second problem: the file can look like this:

1,1
<HERE WE HAVE A NEWLINE>
1,1

This is supposed to be invalid input, but my solution treats it as valid and just skips over the NEWLINE in the middle.

Maybe you can guide me on how to fix these issues. It would really help, as I've been struggling with this exercise for 3 days from morning to evening.

This is my current parser:

#include <iostream>
#include "Fractal.h"
#include <fstream>
#include <stack>
#include <sstream>
const char *usgErr = "Usage: FractalDrawer <file path>\n";
const char *invalidErr = "Invalid input\n";
const char *VALIDEXT = "csv";
const char EXTDOT = '.';
const char COMMA = ',';
const char MINTYPE = 1;
const char MAXTYPE = 3;
const int MINDIM = 1;
const int MAXDIM = 6;
const int NUBEROFARGS = 2;
int main(int argc, char *argv[])
{
    if (argc != NUBEROFARGS)
    {
        std::cerr << usgErr;
        std::exit(EXIT_FAILURE);
    }
    std::stack<Fractal *> resToPrint;
    std::string filepath = argv[1]; // Can be a relative/absolute path
    if (filepath.substr(filepath.find_last_of(EXTDOT) + 1) != VALIDEXT)
    {
        std::cerr << invalidErr;
        exit(EXIT_FAILURE);
    }
    std::stringstream ss; // Treat it as a buffer to parse each line
    std::string s; // Use it with 'ss' to convert char digit to int
    std::ifstream myFile; // Declare on a pointer to file
    myFile.open(filepath); // Open CSV file
    if (!myFile) // If failed to open the file
    {
        std::cerr << invalidErr;
        exit(EXIT_FAILURE);
    }
    int type = 0;
    int dim = 0;
    while (myFile.peek() != EOF)
    {
        getline(myFile, s, COMMA); // Read to comma - the kind of fractal, store it in s
        ss << s << WHITESPACE; // Save the number in ss delimited by ' ' to be able to perform the double assignment
        s.clear(); // We don't want to save this number in s anymore as we won't it to be assigned somewhere else
        getline(myFile, s, NEWLINE); // Read to NEWLINE - the dim of the fractal
        ss << s;
        ss >> type >> dim; // Double assignment
        s.clear(); // We don't want to save this number in s anymore as we won't it to be assigned somewhere else

        if (ss.peek() != EOF || type < MINTYPE || type > MAXTYPE || dim < MINDIM || dim > MAXDIM) 
        {
            std::cerr << invalidErr;
            std::exit(EXIT_FAILURE);
        }

        resToPrint.push(FractalFactory::factoryMethod(type, dim));
        ss.clear(); // Clear the buffer to update new values of the next line at the next iteration
    }

    while (!resToPrint.empty())
    {
        std::cout << *(resToPrint.top()) << std::endl;
        resToPrint.pop();
    }

    myFile.close();

    return 0;
}

`while (myFile.peek() != EOF)` is a bad idea. Many things can go wrong while reading the file before reaching EOF. Prefer `while (getline(myFile, s, COMMA))`. For more details read [Why is iostream::eof inside a loop condition (i.e. `while (!stream.eof())`) considered wrong?](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-i-e-while-stream-eof-cons) — user4581301, Jan 04 '20 at 23:06
@user4581301 Thanks for the remark - I have changed it. But it doesn't help me overcome the above problems actually... — CCPPSup, Jan 04 '20 at 23:10
Does this answer your question? [How can I read and parse CSV files in C++?](https://stackoverflow.com/questions/1120140/how-can-i-read-and-parse-csv-files-in-c) — Retired Ninja, Jan 04 '20 at 23:13
@RetiredNinja Actually no. I have tried using this topic while struggling to use Boost features to my needs, but I'm really new to C++ and I got complicated by that... — CCPPSup, Jan 04 '20 at 23:17
Consider working line-based and with a state machine. Use `getline` to read the whole line, then split up the line with a stringstream. — user4581301, Jan 04 '20 at 23:18
By the way, what was giving you problems with Boost? Boost's a big tool, lots of things could go wrong, but it's a freaking Swiss Army knife. Good to have it in your pocket when you need it. Might be worth investigating solving your problem with it. — user4581301, Jan 04 '20 at 23:19
@user4581301 Not sure how it helps me overcome those problems. What do you mean also by a 'state machine'? — CCPPSup, Jan 04 '20 at 23:19
If you work line-based it's really easy to tell an empty line from one with data on it. It also ties in well with @Ted's comment. You have a function that handles the line. It's much easier to test a bunch of functions that each do one thing than a whole program. When the functions all work alone, you're a lot closer to a program that works. — user4581301, Jan 04 '20 at 23:20
I barely understand the syntax of its features, the use of iterators; as I said before, I'm a beginner to C++ and iterators are maybe our next topic in the class. I have tried copy-paste some 'useful' lines, but it doesn't work well for me. — CCPPSup, Jan 04 '20 at 23:22
Understood. If it was something like "I can't build it!" We could walk you through that, but it's not easy to wrangle and I've seen many fail simple tasks because using Boost to do the job was more complicated than the job. — user4581301, Jan 04 '20 at 23:24
A state machine is a program broken up into a bunch of states such as "looking for fractal" or "looking for blank line". Each state has simple rules defining what it does with inputs and which inputs change the state. For example, if you're looking for fractals and you get a blank line, then you don't do anything with the line, but you change to the "looking for blank line" state because you need to have two blank lines in a row. If you are in the "looking for blank line" state and you find a fractal, you log the error. If you find another blank line, no error. Go back to looking for fractals. — user4581301, Jan 04 '20 at 23:32
This keeps all the logic in nice, compartmentalized chunks, each chunk responsible for one part of the job and easily testable without the other chunks of code getting in the way. — user4581301, Jan 04 '20 at 23:35
@user4581301 I can get the whole line and see if it's an empty one or not. If it's - continue the loop. If it's not - parse it accordingly. But there is nothing intrinsically different in this approach, to my understanding. If it's indeed empty - I have to do nothing, but if the current one is empty and the next one isn't - it's an invalid input... — CCPPSup, Jan 04 '20 at 23:36
You output *"Invalid input"* in multiple places (i.e. file error, parse error). Try to isolate which part actually gives you problems. It might just be that you're passing the wrong filename. — SacrificerXY, Jan 04 '20 at 23:57
Also, `ss.clear()` doesn't clear the stream https://stackoverflow.com/questions/20731/how-do-you-clear-a-stringstream-variable You can just move it inside the loop so every iteration, an empty one is created. — SacrificerXY, Jan 04 '20 at 23:59
@FrankMancini C++ has many safer and better tools than `strtok` — user4581301, Jan 05 '20 at 00:03
@SacrificerXY Thanks for the remark regarding the clear - I have tried replacing each clear with str(std::string) but the compiler doesn't find the method member called str() actually. And I know where the bug is - it's the if condition in the while loop. When we encounter an empty line - the values we actually assign to dim and type are 0, so, this if tells us there is an invalid input - type and dim are supposed to be strictly positive values. But in case of an empty line - we don't actually have values for dim and type at all. — CCPPSup, Jan 05 '20 at 00:10
You're missing parentheses: `ss.str(std::string());` *(assign a new empty string)* — SacrificerXY, Jan 05 '20 at 00:29

score 3 · Answer 1 · answered Jan 05 '20 at 03:01

You do not need anything special to parse .csv files; the C++ standard library from C++11 on provides all the tools necessary to parse virtually any .csv file. You do not need to know the number of values per row beforehand, though you will need to know the type of value you are reading from the .csv in order to apply the proper conversion. You do not need any third-party library like Boost either.

There are many ways to store the values parsed from a .csv file. The basic "handle any type" approach is to store the values in a std::vector<std::vector<type>> (which essentially provides a vector of vectors holding the values parsed from each line). You can specialize the storage as needed depending on the type you are reading and how you need to convert and store the values. Your base storage can be struct/class, std::pair, std::set, or just a basic type like int. Whatever fits your data.

In your case you have basic int values in your file. The only caveat to a basic .csv parse is the fact you may have blank lines in between the lines of values. That's easily handled by any number of tests. For instance, you can check whether the .length() of the line read is zero or, for a bit more flexibility (in handling lines containing only whitespace or other non-value characters), you can use .find_first_of() to find the first wanted value in the line to determine whether it is a line to parse.

For example, in your case, your read loop for your lines of value can simply read each line and check whether the line contains a digit. It can be as simple as:

    ...
    std::string line;       /* string to hold each line read from file  */
    std::vector<std::vector<int>> values {};    /* vector vector of int */
    std::ifstream f (argv[1]);                  /* file stream to read  */

    while (getline (f, line)) { /* read each line into line */
        /* if no digits in line - get next */
        if (line.find_first_of("0123456789") == std::string::npos)
            continue;
        ...
    }

Above, each line is read into line and then line is checked on whether or not it contains digits. If so, parse it. If not, go get the next line and try again.

If it is a line containing values, then you can create a std::stringstream from the line and read integer values from the stringstream into a temporary int value and add the value to a temporary vector of int, consume the comma with getline and the delimiter ',', and when you run out of values to read from the line, add the temporary vector of int to your final storage. (Repeat until all lines are read).

Your complete read loop could be:

    while (getline (f, line)) { /* read each line into line */
        /* if no digits in line - get next */
        if (line.find_first_of("0123456789") == std::string::npos)
            continue;
        int itmp;                               /* temporary int */
        std::vector<int> tmp;                   /* temporary vector<int> */
        std::stringstream ss (line);            /* stringstream from line */
        while (ss >> itmp) {                    /* read int from stringstream */
            std::string tmpstr;                 /* temporary string to ',' */
            tmp.push_back(itmp);                /* add int to tmp */
            if (!getline (ss, tmpstr, ','))     /* read to ',' w/tmpstr */
                break;                          /* done if no more ',' */
        } 
        values.push_back (tmp);     /* add tmp vector to values */
    }

There is no limit on the number of values read per line, or the number of lines of values read per file (up to the limits of your virtual memory for storage).

Putting the above together in a short example, you could do something similar to the following which just reads your input file and then outputs the collected integers when done:

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

int main (int argc, char **argv) {

    if (argc < 2) { /* validate at least 1 argument given for filename */
        std::cerr << "error: insufficient input.\nusage: ./prog <filename>\n";
        return 1;
    }

    std::string line;       /* string to hold each line read from file  */
    std::vector<std::vector<int>> values {};    /* vector vector of int */
    std::ifstream f (argv[1]);                  /* file stream to read  */

    while (getline (f, line)) { /* read each line into line */
        /* if no digits in line - get next */
        if (line.find_first_of("0123456789") == std::string::npos)
            continue;
        int itmp;                               /* temporary int */
        std::vector<int> tmp;                   /* temporary vector<int> */
        std::stringstream ss (line);            /* stringstream from line */
        while (ss >> itmp) {                    /* read int from stringstream */
            std::string tmpstr;                 /* temporary string to ',' */
            tmp.push_back(itmp);                /* add int to tmp */
            if (!getline (ss, tmpstr, ','))     /* read to ',' w/tmpstr */
                break;                          /* done if no more ',' */
        } 
        values.push_back (tmp);     /* add tmp vector to values */
    }

    for (auto row : values) {       /* output collected values */
        for (auto col : row)
            std::cout << "  " << col;
        std::cout << '\n';
    }
}

Example Input File

Using an input file with miscellaneous blank lines and two-integers per-line on the lines containing values as you describe in your question:

$ cat dat/csvspaces.csv
1,1


2,2
3,3

4,4



5,5
6,6

7,7

8,8


9,9

Example Use/Output

The resulting parse:

$ ./bin/parsecsv dat/csvspaces.csv
  1  1
  2  2
  3  3
  4  4
  5  5
  6  6
  7  7
  8  8
  9  9

Example Input Unknown/Uneven No. of Columns

You don't need to know the number of values per-line in the .csv or the number of lines of values in the file. The STL containers handle the memory allocation needs automatically allowing you to parse whatever you need. Now you may want to enforce some fixed number of values per-row, or rows per-file, but that is simply up to you to add simple counters and checks to your read/parse routine to limit the values stored as needed.

Without any changes to the code above, it will handle any number of comma-separated-values per-line. For example, changing your data file to:

$ cat dat/csvspaces2.csv
1


2,2
3,3,3

4,4,4,4



5,5,5,5,5
6,6,6,6,6,6

7,7,7,7,7,7,7

8,8,8,8,8,8,8,8


9,9,9,9,9,9,9,9,9

Example Use/Output

Results in the expected parse of each value from each line, e.g.:

$ ./bin/parsecsv dat/csvspaces2.csv
  1
  2  2
  3  3  3
  4  4  4  4
  5  5  5  5  5
  6  6  6  6  6  6
  7  7  7  7  7  7  7
  8  8  8  8  8  8  8  8
  9  9  9  9  9  9  9  9  9

Let me know if you have questions that I didn't cover or if you have additional questions about something I did and I'm happy to help further.

It's a really constructive and helpful answer, thanks. Some remarks: I have to validate that there is nothing else in a line but 2 digits separated by a comma. If we have some other char/string in the line, it's invalid. I also have to validate that the path (relative/absolute) received is of a CSV format. We can have empty lines only at the end of the file; that is, if there is an empty line and after it some non-empty line, the input is invalid. **The real question I have is**: What's going on in memory in this process, in this loop of your code: `while (ss >> itmp)`? — CCPPSup, Jan 05 '20 at 12:13
Since each line that contains digits is used to initialize a `std::stringstream ss (line);` object, the loop validates that one integer value was read from the stringstream `ss` (i.e. the line). The successful reading of an integer from the stringstream is the loop condition, so if for any reason an integer is not read into `itmp` the loop ceases at that point. Later in the loop `getline (ss, tmpstr, ',')` is used to read the comma between the number to prepare for reading the next number from the stringstream into `itmp` on the next iteration. — David C. Rankin, Jan 05 '20 at 12:35
When you invoke this one `ss >> itmp` maybe `ss` contains multiple comma-separated digits. So, the right-most digit is inserted into `itmp`, or what?... And then, you insert the right-most comma in `ss` into `itmpstr`? — CCPPSup, Jan 05 '20 at 13:32
BTW, I have also to validate that the path (can be relative/absolute) received is of a CSV format. My validation here is sufficient? `std::string filepath = argv[1]; if (filepath.substr(filepath.find_last_of(EXTDOT) + 1) != VALIDEXT) { std::cerr << invalidErr; exit(EXIT_FAILURE); } std::ifstream in(filepath);` — CCPPSup, Jan 05 '20 at 13:44
If you have `EXTDOT` defined as `"."` and `VALIDEXT` defined as `"csv"`, that should work fine. — David C. Rankin, Jan 05 '20 at 23:16
Or like: `std::string fname = argv[1]; if (fname.substr(fname.find_last_of(".") + 1) != "csv") { std::cerr << "error: invalid format - not '.csv' file.\n"; ...` Also, on the csv read loop, it is just (1) read `int` into `itmp` as loop condition, (2) declare `tmpstr`; (3) add `int` to temporary vector, (4) read comma that is before next `int` into `tmpstr` -- repeat until no more `int`. (or no more comma) — David C. Rankin, Jan 05 '20 at 23:50
Thank you so much. **I have one more problem** at the final loop in my original code in the thread. I have to destruct the Fractals but Fractal class is an abstract base class so I can't invoke just delete on each cell. Do you have a solution for that? — CCPPSup, Jan 06 '20 at 13:13
That is probably better asked as a new question itself where you can provide the underlying class structure. That said, there was discussion about that in [C++ abstract class destructor](https://stackoverflow.com/questions/24316700/c-abstract-class-destructor) that may solve your problem. — David C. Rankin, Jan 06 '20 at 16:30

A M · Answer 2 · 2020-01-05T15:39:28.480

I will not update your code. I looked at your title, Parsing a CSV file - C++, and would like to show you how to read CSV files in a more modern way. Unfortunately you are still on C++14. With C++20 or the ranges library it would be ultra simple using getlines and split.

And in C++17 we could use CTAD, if with initializer, and so on.

But what we do not need is Boost. C++'s standard library is sufficient. And we never use scanf and old stuff like that.

And in my very humble opinion the link to the 10-year-old question How can I read and parse CSV files in C++? should not be given any longer. It is the year 2020 now, and the more modern language elements that are now available should be used. But as said, everybody is free to do what they want.

In C++ we can use the std::sregex_token_iterator, and its usage is ultra simple. It will also not slow down your program dramatically. A double std::getline would also be OK, although it is not that flexible: the number of columns must be known for that. The std::sregex_token_iterator does not care about the number of columns.

Please see the following example code. In it, we create a tiny proxy class and overload its extractor operator. Then we use the std::istream_iterator to read and parse the whole CSV file in a small one-liner.

#include <algorithm>
#include <fstream>
#include <iostream>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Define Alias for easier Reading
// using Columns = std::vector<std::string>;
using Columns = std::vector<int>;

// The delimiter
const std::regex re(",");

// Proxy for the input Iterator
struct ColumnProxy {
    // Overload extractor. Read a complete line
    friend std::istream& operator>>(std::istream& is, ColumnProxy& cp) {
        // Read a line
        std::string line;
        cp.columns.clear();
        if(std::getline(is, line) && !line.empty()) {
            // Split values and copy into resulting vector
            std::transform(
                std::sregex_token_iterator(line.begin(), line.end(), re, -1), {},
                std::back_inserter(cp.columns),
                [](const std::string& s) { return std::stoi(s); });
        }
        return is;
    }
    // Conversion operator: lets a ColumnProxy convert implicitly to
    // 'Columns' (here std::vector<int>)
    operator Columns() const { return columns; }

protected:
    // Temporary to hold the read vector
    Columns columns{};
};

int main() {
    std::ifstream myFile("r:\\log.txt");
    if(myFile) {
        // Read the complete file, parse everything, and store the result in a vector
        std::vector<Columns> values(std::istream_iterator<ColumnProxy>(myFile), {});

        // Show complete csv data
        std::for_each(values.begin(), values.end(), [](const Columns& c) {
            std::copy(c.begin(), c.end(),
                      std::ostream_iterator<int>(std::cout, " "));
            std::cout << "\n";
        });
    }
    return 0;
}

Please note: There are tons of other possible solutions. Please feel free to use whatever you want.

EDIT

Because I see a lot of complicated code here, I would like to show a second example of how to parse a CSV file in C++.

Basically, you do not need more than 2 statements in the code. You first define a regex for digits. Then you use a C++ library element that has been designed exactly for the purpose of tokenizing strings into substrings: the std::sregex_token_iterator. And because such a well-fitting element has been available in C++ for years, it may be worth considering. You could do the task in basically 2 lines instead of 10 or more, and it is easy to understand.

But of course, there are thousands of possible solutions; some like to continue in C style and others prefer more modern C++ features. That's up to everybody's personal decision.

The code below reads the CSV file as specified, regardless of how many rows (lines) it contains and how many columns there are in each row. Even foreign characters can be in it. An empty row will be an empty entry in the csv vector. This can also easily be prevented with an "if !empty" before the emplace_back.

But some like it one way and others another. Whatever people want.

Please see a general example:

#include <algorithm>
#include <iterator>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Test data. Can of course also be taken from a file stream.
std::stringstream testFile{ R"(1,2
3, a, 4 
5 , 6  b ,  7

abc def
8 , 9
11 12 13 14 15 16 17)" };

std::regex digits{R"((\d+))"};

using Row = std::vector<std::string>;

int main() {
    // Here we will store all the data from the CSV as std::vector<std::vector<std::string>>
    std::vector<Row> csv{};


    // This extremely simple 2 lines will read the complete CSV and parse the data
    for (std::string line{}; std::getline(testFile, line);  ) 
        csv.emplace_back(Row(std::sregex_token_iterator(line.begin(), line.end(), digits, 1), {}));


    // Now, you can do with the data, whatever you want. For example: Print double the value
    std::for_each(csv.begin(), csv.end(), [](const Row& r) {
        if (!r.empty()) {
            std::transform(r.begin(), r.end(),
                           std::ostream_iterator<int>(std::cout, " "),
                           [](const std::string& s) { return std::stoi(s) * 2; });
            std::cout << "\n";
        }
    });

    return 0;
}

So, now, you may get the idea, you may like it, or you do not like it. Whatever. Feel free to do whatever you want.

Personally, I love regex when programming perl - it's sort of mandatory. Though I rarely find a use-case for it in C++. It might be my mindset when programming that limits me. What made me a little more open and curious about regex's in C++ in the future was a [talk about constexpr regex](https://www.youtube.com/watch?v=g51_HYn_CqE) by Hana Dusíková. — Ted Lyngmo, Jan 05 '20 at 00:57
... and her library: https://github.com/hanickadot/compile-time-regular-expressions — Ted Lyngmo, Jan 05 '20 at 01:11
@Ted Lyngmo: Thank you very much for posting the links. I viewed the talk and checked the lib. Very impressive! I hope it will make it into the standard. Thanks, Armin — A M, Jan 05 '20 at 15:37
@ArminMontigny (*the link to the 10 years old question... It is the year 2020 now...*); I just want to say that I disagree with you, simply because the main accepted answer is kept updated. look at : **Now that we are in 2020 lets add a CSVRange object**:...CSVIterator begin().... — ibra, Feb 18 '21 at 12:55
Thank you very much for the hint. Please note: I gave my answer in Jan 20. Martin York edited in Aug 20. Additionally, I repeat my last sentence from the post: "...you may like it, or you do not like it. Whatever. Feel free to do whatever you want." And before that I wrote "Please note: There are tons of other possible solutions. Please feel free to use whatever you want." **Personally I** would not use a CSV iterator. Far too complicated. I would always overload the extractor operator of the related class. That's more OO. But as said, anybody can do what he wants . . . — A M, Feb 18 '21 at 16:05
