
I need to read data from a CSV file and store it in some sort of data structure. The CSV data looks roughly like this:

year,position,MVP,entity
INT,STRING,BOOL,STRING
2020,FORWARD,TRUE,Lionel Messi
2020,MIDFIELDER,FALSE,Jordan Henderson
2020,GOALKEEPER,FALSE,David De Gea
2020,DEFENDER,FALSE,Virgil van Dijk

The first two rows give the names of the attributes and their types.

I know how to read data from a CSV file, but I don't know what the best data structure is for storing that data when the number of columns and the attribute types (bool, int, etc.) can vary.

Originally I thought a table represented by a vector of Row objects would work, but that only works when I know exactly how many attributes there are, what their types are, what their names are, etc.

I'm thinking I could somehow store it based on the metadata, like the number of attributes, the attribute types, the location of each row, etc., but I don't really know how to expand on this idea yet.

Any help would be appreciated!

edit:

My program has to work with CSV files that are similar in structure to the one I posted above, but each CSV file could have a different number of columns, different attribute types, etc.

One CSV file can look like the example above, and another can look like this:

startYear,job,entity
INT,STRING,STRING
2001,SALES ASSOCIATE,Jackie Cruz
1992,GENERAL MANAGER,Jorge Almandra
2004,CUSTODIAN,Jeffrey Howie
2018,ELECTRICIAN,Katie Moody

I still need to be able to store the data into some sort of data structure even though the number of columns and their types differ.

agaiuqer
  • If the columns aren't fixed, how would you do anything meaningful with the data? – stark Feb 03 '20 at 18:01
  • I would expect each record (row) in a single file would be the same? Do you have different kind of records mixed in a single file? Then you can still have targeted structure for each record type (which is really what I recommend anyway). – Some programmer dude Feb 03 '20 at 18:01
  • Any type that has a text representation can be stored in a `std::string`; then it depends on what you want to do with those values. – 463035818_is_not_an_ai Feb 03 '20 at 18:02
  • It can be `std::vector<std::vector<std::string>>` where `std::vector<std::string>` represents a single line of the CSV. – drescherjm Feb 03 '20 at 18:06
  • @stark that's just the problem that was presented to me, it's not going to be used in the real world – agaiuqer Feb 03 '20 at 18:12
  • Your edit did not help at all. – drescherjm Feb 03 '20 at 18:13
  • @Someprogrammerdude different kinds of records are not mixed in one file, I just need to adapt to whatever file is handed to me – agaiuqer Feb 03 '20 at 18:13
  • @drescherjm explain where you're confused then and i'll make a more helpful edit – agaiuqer Feb 03 '20 at 18:13
  • I already understood what your data looks like but not how you want to use the data. I have already told you one way to store the data but that seems to have been ignored. Can't you use `std::vector<std::string>` to store a row of your data and make a vector of that to store the whole file? – drescherjm Feb 03 '20 at 18:15
  • Basically I am talking about this: [https://stackoverflow.com/a/1120224/487892](https://stackoverflow.com/a/1120224/487892) – drescherjm Feb 03 '20 at 18:18
  • @drescherjm Ah, my bad. So basically some stuff I want to do with the data are things like returning the value based on the attribute name. So say I have a function that looks like this getValue(string attributeName), if I call getValue("MVP") on a specific row, I'd want to get the boolean value that MVP returns for that row. – agaiuqer Feb 03 '20 at 18:20
  • You'll have to use something like `std::variant` with all possible types in it. Then you parse the first row to get the column names and the second to get the column types, and then you just read row by row and fill a vector of values (with type verification). – Michael Nastenko Feb 03 '20 at 22:25
  • I was going to suggest that next but was waiting on the OP's opinion of using `std::vector<std::string>` for a row. – drescherjm Feb 04 '20 at 02:32
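
The `std::variant` suggestion from the comments can be sketched roughly as follows. This is only a minimal illustration, assuming the three type names that appear in the question (INT, BOOL, STRING) and no quoted fields; `Value`, `Table`, `splitLine`, `parseCell` and `readTable` are hypothetical names, not something from the question or the comments.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>
#include <vector>

// One cell: any of the types that the second CSV line may name (assumed set).
using Value = std::variant<int, bool, std::string>;

struct Table {
    std::vector<std::string> names;                      // first CSV line
    std::vector<std::string> types;                      // second CSV line
    std::unordered_map<std::string, std::size_t> index;  // attribute name -> column position
    std::vector<std::vector<Value>> rows;                // converted cells

    // Look up a cell by row number and attribute name; throws if either is unknown.
    const Value& getValue(std::size_t row, const std::string& name) const {
        return rows.at(row).at(index.at(name));
    }
};

// Split one line on commas (quoted fields are not supported in this sketch).
inline std::vector<std::string> splitLine(const std::string& line) {
    std::vector<std::string> cells;
    std::istringstream ss(line);
    for (std::string cell; std::getline(ss, cell, ',');) cells.push_back(cell);
    return cells;
}

// Convert one cell according to the type name of its column.
inline Value parseCell(const std::string& type, const std::string& text) {
    if (type == "INT")    return std::stoi(text);
    if (type == "BOOL")   return text == "TRUE";
    if (type == "STRING") return text;
    throw std::invalid_argument("unknown column type: " + type);
}

// Read names, types, then all data rows, verifying the column count of each row.
inline Table readTable(std::istream& in) {
    Table t;
    std::string line;
    if (!std::getline(in, line)) throw std::runtime_error("missing header line");
    t.names = splitLine(line);
    if (!std::getline(in, line)) throw std::runtime_error("missing types line");
    t.types = splitLine(line);
    if (t.names.size() != t.types.size()) throw std::runtime_error("header/types mismatch");
    for (std::size_t i = 0; i < t.names.size(); ++i) t.index[t.names[i]] = i;

    while (std::getline(in, line)) {
        if (line.empty()) continue;
        const auto cells = splitLine(line);
        if (cells.size() != t.types.size()) throw std::runtime_error("row has wrong column count");
        std::vector<Value> row;
        for (std::size_t i = 0; i < cells.size(); ++i) row.push_back(parseCell(t.types[i], cells[i]));
        t.rows.push_back(std::move(row));
    }
    return t;
}
```

With this layout, something like `std::get<bool>(table.getValue(0, "MVP"))` would return the boolean for the first data row, and `std::get` throws `std::bad_variant_access` if the caller asks for the wrong type.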

1 Answer


Here is one possible solution.

And I hate it. I would never do something like that, because the design idea or the requirement is already nonsense.

Either we use types and we know which column has what type, or we simply use a fits-all type for the required context. In this case, that is simply a std::string.
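
As a rough illustration of the fits-all option, every cell can stay a std::string and be converted only where a caller needs a number or a flag. The sketch below assumes plain comma-separated lines without quoting; `StringTable`, `splitCsvLine` and `readStringTable` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

struct StringTable {
    std::vector<std::string> header;             // first line: attribute names
    std::vector<std::string> types;              // second line: kept only as metadata
    std::vector<std::vector<std::string>> rows;  // every cell stays a string

    // Return the raw text of one cell; throws std::out_of_range on a bad row or name.
    const std::string& get(std::size_t row, const std::string& attributeName) const {
        const auto it = std::find(header.begin(), header.end(), attributeName);
        if (it == header.end()) throw std::out_of_range("unknown attribute: " + attributeName);
        return rows.at(row).at(static_cast<std::size_t>(it - header.begin()));
    }
};

// Split on commas; quoted fields are not handled (an assumption of this sketch).
inline std::vector<std::string> splitCsvLine(const std::string& line) {
    std::vector<std::string> cells;
    std::istringstream ss(line);
    for (std::string cell; std::getline(ss, cell, ',');) cells.push_back(cell);
    return cells;
}

// Read the header line, the types line, and then all non-empty data lines.
inline StringTable readStringTable(std::istream& in) {
    StringTable t;
    std::string line;
    if (std::getline(in, line)) t.header = splitCsvLine(line);
    if (std::getline(in, line)) t.types = splitCsvLine(line);
    while (std::getline(in, line)) {
        if (!line.empty()) t.rows.push_back(splitCsvLine(line));
    }
    return t;
}
```

A caller that needs the year as a number would then write something like `std::stoi(table.get(1, "startYear"))`, so the conversion happens at the point of use instead of at load time.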

But doing this dynamically will result in really ugly and unmaintainable code.

The solution here is std::any. But maybe a class hierarchy would be even better. I will try later.
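
The class hierarchy idea is not worked out in this answer. One possible reading of it, purely as a sketch, is an abstract column base class with one subclass per type name, so the table is stored column by column. All names here (`Column`, `IntColumn`, `BoolColumn`, `StringColumn`, `makeColumn`) are hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Abstract column: knows how to parse a cell and how to print it back.
struct Column {
    virtual ~Column() = default;
    virtual void append(const std::string& text) = 0;         // parse and store one cell
    virtual std::string toString(std::size_t row) const = 0;  // text form of one cell
};

struct IntColumn : Column {
    std::vector<int> values;
    void append(const std::string& text) override { values.push_back(std::stoi(text)); }
    std::string toString(std::size_t row) const override { return std::to_string(values.at(row)); }
};

struct BoolColumn : Column {
    std::vector<bool> values;
    void append(const std::string& text) override { values.push_back(text == "TRUE"); }
    std::string toString(std::size_t row) const override { return values.at(row) ? "TRUE" : "FALSE"; }
};

struct StringColumn : Column {
    std::vector<std::string> values;
    void append(const std::string& text) override { values.push_back(text); }
    std::string toString(std::size_t row) const override { return values.at(row); }
};

// Factory: map a type name from the second CSV line to a column object.
inline std::unique_ptr<Column> makeColumn(const std::string& type) {
    if (type == "INT")    return std::make_unique<IntColumn>();
    if (type == "BOOL")   return std::make_unique<BoolColumn>();
    if (type == "STRING") return std::make_unique<StringColumn>();
    throw std::invalid_argument("unknown column type: " + type);
}
```

A reader would call `makeColumn` once per entry of the types line and then `append` each cell to the column at the same position. The type dispatch then happens once per column in the factory, not once per cell.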

Please see this ugly code:

#include <iostream>
#include <sstream>
#include <vector>
#include <regex>
#include <string>
#include <iterator>
#include <algorithm>
#include <utility>
#include <any>
#include <map>
#include <tuple>

// the delimiter for the csv
const std::regex re(",");

// One DataRow from the csv file
struct DataRow {
    std::vector<std::string> columns{};

    friend std::istream& operator >> (std::istream& is, DataRow& dr) {

        // Read one complete line
        if (std::string line{}; std::getline(is, line)) {

            // Split the string, containing the complete line into parts
            dr.columns.clear();
            std::copy(std::sregex_token_iterator(line.begin(), line.end(), re, -1), {}, std::back_inserter(dr.columns));
        }
        return is;
    }
};

struct CSV {

protected:
    // Conversion functions
    std::any stringToAnySTRING(const std::string& s) { return s; }
    std::any stringToAnyBOOL(const std::string& s) { bool result{ false }; if (s == "TRUE") result = true; return result; }
    std::any stringToAnyINT(const std::string& s) { int result = std::stoi(s); return result; }
    std::any stringToAnyLONG(const std::string& s) { long result = std::stol(s); return result; }

    // Making Reading easier
    using ConvertToAny = std::any(CSV::*)(const std::string&);

    // Map conversion functions to type strings
    std::map<std::string, ConvertToAny> converter{
        {"STRING", &CSV::stringToAnySTRING},
        {"BOOL", &CSV::stringToAnyBOOL},
        {"INT", &CSV::stringToAnyINT},
        {"LONG", &CSV::stringToAnyLONG}
    };

public:
    // Header, Types and data as std::any
    std::vector<std::string> header{};
    std::vector<std::string> types{};
    std::vector<std::vector<std::any>> data{};

    // Extractor operator
    friend std::istream& operator >> (std::istream& is, CSV& c) {
        // Read header line
        if (std::string line{}; std::getline(is, line)) {

            // Split header line into sub strings
            c.header.clear();
            std::copy(std::sregex_token_iterator(line.begin(), line.end(), re, -1), {}, std::back_inserter(c.header));

            // Read types line
            if (std::getline(is, line)) {

                // Split types into sub strings
                c.types.clear();
                std::copy(std::sregex_token_iterator(line.begin(), line.end(), re, -1), {}, std::back_inserter(c.types));

                // Read all data, so all lines, split them and convert them to the desired data type
                c.data.clear();

                // This will read all lines and split them into columns
                std::vector<DataRow> drs(std::istream_iterator<DataRow>(is), {});

                // Plausibility check: every row must have as many columns as the header and the types line.
                // std::all_of is also safe when there are no data rows at all
                const bool sizesMatch = c.header.size() == c.types.size() &&
                    std::all_of(drs.begin(), drs.end(), [&c](const DataRow& dr) { return dr.columns.size() == c.types.size(); });
                if (sizesMatch) {

                    // Now convert all columns into the type denoted by the read type array and store them as any data
                    // Double transform because of 2 dimensional array
                    std::transform(drs.begin(), drs.end(), std::back_inserter(c.data), [&c](const DataRow& dr) {

                        std::vector<std::any> va{};
                        // This is the conversion into a type defined by the types array
                        // Anybody who understands this transform will get the Nobel Prize for Obfuscation
                        // at() throws std::out_of_range for an unknown type name instead of calling a null pointer
                        std::transform(dr.columns.begin(), dr.columns.end(), std::back_inserter(va),
                            [&c, i = 0U](const std::string& s) mutable {return (c.*(c.converter.at(c.types[i++])))(s); });
                        return va; });
                }
            }
        }
        return is;
    }

    // Inserter operator
    friend std::ostream& operator << (std::ostream& os, const CSV& c) {

        // Write header
        os << "Header: ";
        std::copy(c.header.begin(), c.header.end(), std::ostream_iterator<std::string>(os, "  "));

        // And the type names
        os << "\nTypes:  ";
        std::copy(c.types.begin(), c.types.end(), std::ostream_iterator<std::string>(os, "  "));
        os << "\n\nData:\n";

        // And the data, cast back according to the type names. Arrgh. How ugly
        std::for_each(c.data.begin(), c.data.end(), [&c,&os](const std::vector<std::any>& va) {
            for (size_t i = 0U; i < va.size(); ++i) {
                if (c.types[i] == "INT") { int v = std::any_cast<int>(va[i]); os << v << " "; }
                else if (c.types[i] == "LONG") { long v = std::any_cast<long>(va[i]); os << v << " "; }
                else if (c.types[i] == "STRING") { std::string v = std::any_cast<std::string>(va[i]); os << v << " "; }
                else if (c.types[i] == "BOOL") { bool v = std::any_cast<bool>(va[i]); os << v << " "; }
            }
            os << "\n";
        });
        return os;
    }
};

// The data. It does not matter whether it comes from a file or a stringstream
std::istringstream csvFile{ R"(year,category,winner,entity
INT,STRING,BOOL,STRING
2015,CHEF OF THE YEAR,FALSE,John Doe
2015,CHEF OF THE YEAR,FALSE,Bob Brown
2015,CHEF OF THE YEAR,TRUE,William Thorton
2015,CHEF OF THE YEAR,FALSE,Jacob Smith)" };


int main() {

    // Define variable of type CSV
    CSV csv{};

    // Read from somewhere
    csvFile >> csv;

    // Show some debug output
    std::cout << csv;

    return 0;
}
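
The question's comments ask for a lookup like getValue("MVP") on a specific row. With the header vector and the rows of std::any used above, a typed lookup could look roughly like the sketch below. The free function `getValue` and its parameter list are hypothetical and not part of the CSV struct.

```cpp
#include <algorithm>
#include <any>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: typed lookup by attribute name over the same layout
// as above (a header vector plus rows of std::any).
template <typename T>
T getValue(const std::vector<std::string>& header,
           const std::vector<std::vector<std::any>>& data,
           std::size_t row, const std::string& attributeName) {
    // Find the column position that belongs to the attribute name
    const auto it = std::find(header.begin(), header.end(), attributeName);
    if (it == header.end()) throw std::out_of_range("unknown attribute: " + attributeName);
    const auto col = static_cast<std::size_t>(std::distance(header.begin(), it));

    // std::any_cast throws std::bad_any_cast if T is not the stored type
    return std::any_cast<T>(data.at(row).at(col));
}
```

After reading a file, a call such as `getValue<bool>(csv.header, csv.data, 0, "winner")` would return the flag for the first data row. The caller still has to know the type, which is the same problem the output operator above has.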
A M