I'm trying to parse a CSV file (with a header line) using Boost Spirit. The CSV is not in a fixed format: sometimes there are extra columns, or the columns appear in a different order. I'm interested in only a few columns, whose header names are well known.

For instance my CSV may look like:

Name,Surname,Age
John,Doe,32

Or:

Age,Name
32,John

I want to parse only the content of Name and Age (N.B. Age is an integer). At the moment I have come up with a very ugly solution where Spirit parses the first line and creates a vector that contains an enum value at the positions I'm interested in. Then I have to parse the terminal symbols by hand...

enum LineItems {
    NAME, AGE, UNUSED
};

struct CsvLine {
    string name;
    int age;
};

using Column = std::string;
using CsvFile = std::vector<CsvLine>;

template<typename It>
struct CsvGrammar: qi::grammar<It, CsvFile(), qi::locals<std::vector<LineItems>>, qi::blank_type> {
    CsvGrammar() :
            CsvGrammar::base_type(start) {
        using namespace qi;

        static const char colsep = ',';

        start = qi::omit[header[qi::_a = qi::_1]] >> eol >> line(_a) % eol;
        header = (lit("Name")[phx::push_back(phx::ref(qi::_val), LineItems::NAME)]
                | lit("Age")[phx::push_back(phx::ref(qi::_val), LineItems::AGE)]
                | column[phx::push_back(phx::ref(qi::_val), LineItems::UNUSED)]) % colsep;
        line = (column % colsep)[phx::bind(&CsvGrammar<It>::convertFunc, this, qi::_1, qi::_r1,
                qi::_val)];
        column = quoted | *~char_(",\n");
        quoted = '"' >> *("\"\"" | ~char_("\"\n")) >> '"';
    }

    void convertFunc(std::vector<string>& columns, std::vector<LineItems>& positions, CsvLine &csvLine) {
       //terminal symbol parsing here, and assign to csvLine struct.
       ...
    }
private:
    qi::rule<It, CsvFile(), qi::locals<std::vector<LineItems>>, qi::blank_type> start;
    qi::rule<It, std::vector<LineItems>(), qi::blank_type> header;
    qi::rule<It, CsvLine(std::vector<LineItems>), qi::blank_type> line;
    qi::rule<It, Column(), qi::blank_type> column;
    qi::rule<It, std::string()> quoted;
    qi::rule<It, qi::blank_type> empty;

};

Here is the full source.

What if the header parser could prepare a vector<rule<...>*> and the "line parser" just used this vector to parse itself? A sort of advanced Nabialek trick (I've been trying, but I couldn't make it work).

Or is there a better way to parse this kind of CSV with Spirit? (Any help is appreciated, thank you in advance.)

Gab

1 Answer

I'd go with the concept that you have; I think it's plenty elegant (the qi locals even allow reentrant use of this).

To reduce the cruft in the rules (Boost Spirit: "Semantic actions are evil"?) you could move the "conversion function" off into attribute transformation customization points.

Oops. As commented that was too simple. However, you can still reduce the cruftiness quite a bit. With two simple tweaks, the grammar reads:

item.add("Name", NAME)("Age", AGE);
start  = omit[ header[_a=_1] ] >> eol >> line(_a) % eol;

header = (item | omit[column] >> attr(UNUSED)) % colsep;
line   = (column % colsep) [convert];

column = quoted | *~char_(",\n");
quoted = '"' >> *("\"\"" | ~char_("\"\n")) >> '"';

The tweaks:

  • using qi::symbols to map from header to LineItem
  • using a raw semantic action ([convert]) which directly accesses the context (see boost spirit semantic action parameters):

    struct final {
        using Ctx = typename decltype(line)::context_type;
    
        void operator()(Columns const& columns, Ctx &ctx, bool &pass) const {
            auto& csvLine   = boost::fusion::at_c<0>(ctx.attributes);
            auto& positions = boost::fusion::at_c<1>(ctx.attributes);
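            std::size_t i = 0;
    
            for (LineItems position : positions) {
                // A line shorter than the header cannot supply this column
                if (position != UNUSED && i >= columns.size()) {
                    pass = false; // setting pass to false fails the `line` rule
                    return;
                }
                switch (position) {
                    case NAME: csvLine.name = columns[i];              break;
                    case AGE:  csvLine.age = atoi(columns[i].c_str()); break;
                    default:   break;
                }
                i++;
            }
    
            pass = true; // setting this to false instead would fail the `line` rule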
        }
    } convert;
    

Arguably the upshot is akin to doing auto convert = phx::bind(&CsvGrammar<It>::convertFunc, this, qi::_1, qi::_r1, qi::_val) but using auto with Proto/Phoenix/Spirit expressions is notoriously error prone (UB due to dangling refs to temporaries from the expression template), so I'd certainly prefer the way shown above.
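
The `pass` flag also gives a place to reject bad data: `atoi` returns 0 for non-numeric input, so a malformed Age would go unnoticed. A minimal sketch of a stricter conversion, assuming a hypothetical helper `parse_int` (not part of the code in this answer) built on `std::strtol`:

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper, for illustration only: converts the whole column to an
// int and reports failure instead of silently yielding 0 like atoi does.
// Note that strtol itself skips leading whitespace.
inline bool parse_int(std::string const& text, int& out) {
    if (text.empty()) return false;

    errno = 0;
    char* end = nullptr;
    long value = std::strtol(text.c_str(), &end, 10);

    if (errno == ERANGE) return false;                    // does not fit in long
    if (end != text.c_str() + text.size()) return false;  // no digits, or trailing junk
    if (value < INT_MIN || value > INT_MAX) return false; // does not fit in int

    out = static_cast<int>(value);
    return true;
}
```

Inside the functor, the `AGE` case could then read `if (!parse_int(columns[i], csvLine.age)) { pass = false; return; }`, so that a non-numeric Age fails the `line` rule instead of being stored as 0.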

Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#define BOOST_SPIRIT_USE_PHOENIX_V3
#include <iostream>
#include <boost/fusion/include/at_c.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;

using std::string;

enum LineItems { NAME, AGE, UNUSED };

struct CsvLine {
    string name;
    int age;
};

using Column  = std::string;
using Columns = std::vector<Column>;
using CsvFile = std::vector<CsvLine>;

template<typename It>
struct CsvGrammar: qi::grammar<It, CsvFile(), qi::locals<std::vector<LineItems>>, qi::blank_type> {
    CsvGrammar() : CsvGrammar::base_type(start) {
        using namespace qi;
        static const char colsep = ',';

        item.add("Name", NAME)("Age", AGE);
        start  = qi::omit[ header[_a=_1] ] >> eol >> line(_a) % eol;

        header = (item | omit[column] >> attr(UNUSED)) % colsep;
        line   = (column % colsep) [convert];

        column = quoted | *~char_(",\n");
        quoted = '"' >> *("\"\"" | ~char_("\"\n")) >> '"';

        BOOST_SPIRIT_DEBUG_NODES((header)(column)(quoted));
    }

private:
    qi::rule<It, std::vector<LineItems>(),                      qi::blank_type> header;
    qi::rule<It, CsvFile(), qi::locals<std::vector<LineItems>>, qi::blank_type> start;
    qi::rule<It, CsvLine(std::vector<LineItems> const&),        qi::blank_type> line;

    qi::rule<It, Column(), qi::blank_type> column;
    qi::rule<It, std::string()> quoted;
    qi::rule<It, qi::blank_type> empty;

    qi::symbols<char, LineItems> item;

    struct final {
        using Ctx = typename decltype(line)::context_type;

        void operator()(Columns const& columns, Ctx &ctx, bool &pass) const {
            auto& csvLine   = boost::fusion::at_c<0>(ctx.attributes);
            auto& positions = boost::fusion::at_c<1>(ctx.attributes);
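            std::size_t i = 0;

            for (LineItems position : positions) {
                // A line shorter than the header cannot supply this column
                if (position != UNUSED && i >= columns.size()) {
                    pass = false; // setting pass to false fails the `line` rule
                    return;
                }
                switch (position) {
                    case NAME: csvLine.name = columns[i];              break;
                    case AGE:  csvLine.age = atoi(columns[i].c_str()); break;
                    default:   break;
                }
                i++;
            }

            pass = true; // setting this to false instead would fail the `line` rule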
        }
    } convert;
};

int main() {
    const std::string s = "Surname,Name,Age,\nJohn,Doe,32\nMark,Smith,43";

    auto f(begin(s)), l(end(s));
    CsvGrammar<std::string::const_iterator> p;

    CsvFile parsed;
    bool ok = qi::phrase_parse(f, l, p, qi::blank, parsed);

    if (ok) {
        for (CsvLine line : parsed) {
            std::cout << '[' << line.name << ']' << '[' << line.age << ']';
            std::cout << std::endl;
        }
    } else {
        std::cout << "Parse failed\n";
    }

    if (f != l)
        std::cout << "Remaining unparsed: '" << std::string(f, l) << "'\n";
}

Prints

[Doe][32]
[Smith][43]
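
For comparison, the job done by the `header` rule (known names map to `NAME` or `AGE`, everything else to `UNUSED`) can be mimicked in plain C++. This is only a sketch to show what the positions vector contains: the helper name `classify_header` is hypothetical, and unlike the grammar it ignores quoting, skips no blanks, and drops a trailing empty cell.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

enum LineItems { NAME, AGE, UNUSED };

// Hypothetical helper, for illustration only: splits the header on ',' and
// classifies each cell, playing the role of `item | omit[column] >> attr(UNUSED)`.
inline std::vector<LineItems> classify_header(std::string const& header) {
    static const std::map<std::string, LineItems> known = {
        {"Name", NAME}, {"Age", AGE}};

    std::vector<LineItems> positions;
    std::istringstream cells(header);
    std::string cell;
    while (std::getline(cells, cell, ',')) {
        auto it = known.find(cell);
        positions.push_back(it == known.end() ? UNUSED : it->second);
    }
    return positions;
}
```

Calling `classify_header("Surname,Name,Age")` yields `UNUSED, NAME, AGE`, which is the information the `convert` functor walks through for every data line.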
sehe
  • Sorry, but I can't find a way to pass an extra parameter to ``assign_to_attribute_from_value``. In order to successfully parse the ``vector`` into the structure I must also know the position of the field within the line (the ``std::vector& positions`` parameter). – Gab Jan 15 '15 at 17:20
  • You can bind a functor. I'll show you later. Sadly no time for at least the coming 6 hours (it's already very gratifying that you try, and you've come this far. I'm happy to help) – sehe Jan 15 '15 at 17:21
  • Okay I'm back. I just realized I made a thinko indeed (forgetting that the transformation traits are purely static). You could sugar the `phx::bind` somewhat with a "raw semantic actor", but of course you will always need the rule context. I'll post in a bit. – sehe Jan 16 '15 at 00:44
  • See update. Sorry for raising too many expectations earlier :S – sehe Jan 16 '15 at 00:53
  • Thank you very much for your answer and for the links. It is an intriguing use of a semantic action. Though I'll continue to research whether Spirit can be used to parse a file where a (complex) header describes the grammar to be used in the remaining part of the file. My inability to handle that situation is a point where the project manager is hitting hard on Spirit. For your curiosity, I'm writing a converter between languages for numerical simulation (Abaqus Nastran Samcef Code Aster). We'll have to wait a bit more for the "Sehe trick". :) – Gab Jan 16 '15 at 09:38
  • Nah. I don't think it's really worth it to be honest. Spirit is /just/ not designed to do runtime parser generation. The whole EDSL thing is invented to do _static_ parser generation! And indeed, I would point out here that the grammar **is** static. If you properly separate responsibilities, you'd easily convince your boss that Spirit is excellent for the /parsing/ responsibility, and you just need to separate your /interpretation/ responsibilities (basically, like you already do). Nothing about the grammar (currently) changes. Just the semantic handling of it. – sehe Jan 16 '15 at 09:43
  • @Gab Anyways, have you seen **[this](http://stackoverflow.com/a/18366335/85371)**? – sehe Jan 16 '15 at 09:43
  • Yes, it was the example I used as a starting point. I'm not interested in dynamic parser generation. I'm in a gray zone. If the Nabialek trick allows parsing ``KWD >> rule``, where KWD and rule are static, I'm often in a situation like: ``KWD1 >> KWD2 >> ..>>KWDN >> rule1 >> rule2 .. >> rulen``. But a comment is too short to explain and I don't want to hijack the original question. I'll come back with a proper thread (or with a post on the spirit mailing list) as soon as I have time to address this. Thanks again! – Gab Jan 16 '15 at 10:24
  • @Gab I think I got that. It borders on dynamic generation, because Spirit includes attribute propagation in the core. Anyhoops, I agree I'm often looking for that "magic" `fusion_auto` parser that JustWorks. I've been thinking of creating such a thing for generic serialization. However I keep coming to the conclusion that separating parsing and transformation is the way to go (even though things *could* be more efficient). I'll probably revisit once I get hands-on with Spirit X3 – sehe Jan 16 '15 at 10:30
  • @sehe - I tried to compile the example in VS 2015 and there seems to be some issue with the line "using Ctx = typename decltype(line)::context_type;" any ideas? VS indicates: error C2146: syntax error: missing ';' before identifier 'context_type' – johnco3 Dec 04 '15 at 21:27