Filtering CSV data using C++

Question

Sorry for asking a question that many may think has already been asked.

I have a very long CSV data file (data.csv) with 5 columns. I have another short CSV file (filter.csv) with 1 column.

Now, I only need to extract the rows from data.csv whose column 1 matches a value in column 1 of filter.csv.

I would usually do this in BASH using sed/awk. However, for some other reasons I need to do this within a C++ file. Can you suggest an efficient way to do this?

Sample Data:

data.csv

ID,Name,CountryCode,District,Population

3793,NewYork,USA,NewYork,8008278
3794,LosAngeles,USA,California,3694820
3795,Chicago,USA,Illinois,2896016
3796,Houston,USA,Texas,1953631
3797,Philadelphia,USA,Pennsylvania,1517550
3798,Phoenix,USA ,Arizona,1321045
3799,SanDiego,USA,California,1223400
3800,Dallas,USA,Texas,1188580
3801,SanAntonio,USA,Texas,1144646

filter.csv

3793
3797
3798

See http://stackoverflow.com/q/1120140/10077 How does your question differ from this? — Fred Larson, Feb 16 '14 at 01:36

score 8 · Answer 1 · answered Apr 03 '14 at 23:22

This .csv filtering library might help:

http://www.partow.net/programming/dsvfilter/index.html

You could merge the columns of both tables into one larger table, and then query the new table for rows where column 1 of table A matches column 1 of table B. Or maybe that library has functions for comparing tables.

Alex Hall

I have just downloaded the library. However, I cannot compile it. It seems that the parser of the Expression Toolkit library does not contain the two functions cache_symbols and expression_symbols. Do you have the same problem? – thd Dec 29 '14 at 21:18
Sorry I missed your reply long ago :p and I'll look into whether and how I ended up using that library, and get back to you. (Have you found another working solution?) – Alex Hall Apr 14 '15 at 04:34

David G · Accepted Answer · 2014-02-16T16:46:50.657

Here are some tips:

The stream from which you're reading the data needs to ignore the commas, so it should classify the comma character as whitespace using a std::ctype<char> facet imbued in its locale. Here's an example of modifying the classification table:

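#include <locale>
#include <vector>

// A ctype facet that classifies ',' as whitespace in addition to the
// characters the classic "C" locale already treats as whitespace.
struct ctype : std::ctype<char>
{
private:
    static mask* get_table()
    {
        // Start from a copy of the classic classification table.
        static std::vector<mask> v(classic_table(),
                                   classic_table() + table_size);

        v[','] |= space;  // set (not clear) the space bit for the comma
        return &v[0];
    }
public:
    ctype() : std::ctype<char>(get_table()) { }
};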
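
To see what the facet does to a single row, here is a minimal self-contained sketch. The names comma_space_ctype and split_fields are made up for this illustration and are not part of the answer's code. It assumes fields contain no spaces or quoted commas; note also that empty fields (as in "a,,b") collapse, because consecutive whitespace is skipped.

```cpp
#include <iterator>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical name: a ctype facet that treats ',' as whitespace.
struct comma_space_ctype : std::ctype<char>
{
    comma_space_ctype() : std::ctype<char>(get_table()) { }

private:
    static const mask* get_table()
    {
        // Copy the classic "C" table, then add the space bit for ','.
        static std::vector<mask> table(classic_table(),
                                       classic_table() + table_size);
        table[','] |= space;
        return &table[0];
    }
};

// Hypothetical helper: split one CSV line into fields using the facet.
std::vector<std::string> split_fields(const std::string& line)
{
    std::istringstream iss(line);
    // The locale takes ownership of the facet and deletes it when done.
    iss.imbue(std::locale(iss.getloc(), new comma_space_ctype));

    typedef std::istream_iterator<std::string> It;
    return std::vector<std::string>(It(iss), It());
}
```

Calling split_fields on the first sample row should yield the five fields, with the ID as the first element.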

Read the first .csv file line by line (using std::getline()). Extract the first word of each line and compare it with the values extracted from the second .csv file. Continue until you reach the end of the first file:

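#include <algorithm>
#include <fstream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Manipulator: imbue the stream with a locale that uses the custom ctype.
std::istream& comma_whitespace(std::istream& is)
{
    is.imbue(std::locale(is.getloc(), new ctype));
    return is;
}

int main()
{
    std::ifstream in1("test1.csv");
    std::ifstream in2("test2.csv");

    typedef std::istream_iterator<std::string> It;

    in2 >> comma_whitespace;

    // Extra parentheses keep this from being parsed as a function declaration.
    std::vector<std::string> in2_content((It(in2)), It());
    std::vector<std::string> matches;

    std::string line;
    while (std::getline(in1, line))
    {
        // The per-line stream does the tokenizing, so it needs the locale.
        std::istringstream iss(line);
        iss >> comma_whitespace;
        It beg(iss);

        // Skip blank lines, then look the first field up in the filter list.
        if (beg != It() &&
            std::find(in2_content.begin(),
                      in2_content.end(), *beg) != in2_content.end())
        {
            matches.push_back(line);
        }
    }
}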

// After the above, the vector matches should hold all the rows that
// have the same ID number as in the second csv file

comma_whitespace is a manipulator which changes the locale to the custom ctype defined above.

Disclaimer: I haven't tested this code.

Thanks 0x499602D2 for taking the time to answer. Really appreciated. — hashb, Feb 16 '14 at 11:58
