
I have this CSV line:

std::string s = R"(1997,Ford,E350,"ac, abs, moon","some "rusty" parts",3000.00)";

I can parse it using boost::tokenizer:

typedef boost::tokenizer< boost::escaped_list_separator<char> , std::string::const_iterator, std::string> Tokenizer;
boost::escaped_list_separator<char> seps('\\', ',', '\"');
Tokenizer tok(s, seps);
for (auto i : tok)
{
    std::cout << i << std::endl;
}

It gets everything right except the fifth token: the double quotes around "rusty" should be kept, but they are stripped:

some rusty parts

Here is my attempt to use boost::spirit:

boost::spirit::classic::rule<> list_csv_item = !(boost::spirit::classic::confix_p('\"', *boost::spirit::classic::c_escape_ch_p, '\"') | boost::spirit::classic::longest_d[boost::spirit::classic::real_p | boost::spirit::classic::int_p]);
std::vector<std::string> vec_item;
std::vector<std::string>  vec_list;
boost::spirit::classic::rule<> list_csv = boost::spirit::classic::list_p(list_csv_item[boost::spirit::classic::push_back_a(vec_item)],',')[boost::spirit::classic::push_back_a(vec_list)];
boost::spirit::classic::parse_info<> result = parse(s.c_str(), list_csv);
if (result.hit)
{
    for (auto i : vec_item)
    {
        std::cout << i << std::endl;
    }
}

Problems:

  1. It does not work: it prints only the first token.

  2. Why boost::spirit::classic? I can't find examples using Spirit V2.

  3. The setup is brutal, but I can live with that.

** I really want to use boost::spirit because it tends to be pretty fast

Expected output:

1997
Ford
E350
ac, abs, moon
some "rusty" parts
3000.00
Claudiu Cruceanu
user841550
  • I don't see how you would treat `""rusty""` as valid input. If quoted strings are ok, then I'd expect `"embedded ""quotes"" like this"`, but not unexpected `""` (empty string) occurring inside a field. – sehe Aug 21 '13 at 19:37
  • I have edited the string input so that the double quotes make better sense, I hope. – user841550 Aug 21 '13 at 22:59
  • I don't think it does make more sense now. The number of quotes is unbalanced. Why don't you provide the _expected output_? – sehe Aug 21 '13 at 23:00
  • Just posted expected output – user841550 Aug 21 '13 at 23:12
  • I don't think there is a sane way to interpret that input in that way. The 'embedded' quotes _will_ have to be escaped (`""` or e.g. `\"`) one way or another, otherwise the scanning couldn't possibly decide whether the end of a string was reached? I don't think any CSV engine treats it this way. – sehe Aug 22 '13 at 00:30
  • If you have MS Excel, replace the separators with tabs, copy the line, and paste it into an Excel sheet. It is parsed correctly. – user841550 Aug 22 '13 at 00:42

2 Answers


For a background on parsing (optionally) quoted delimited fields, including different quoting characters (', "), see here:

For a very, very, very complete example complete with support for partially quoted values and a

splitInto(input, output, ' ');

method that takes 'arbitrary' output containers and delimiter expressions, see here:

Addressing your exact question, assuming either quoted or unquoted fields (no partial quotes inside field values), using Spirit V2:

Let's take the simplest 'abstract datatype' that could possibly work:

using Column  = std::string;
using Columns = std::vector<Column>;
using CsvLine = Columns;
using CsvFile = std::vector<CsvLine>;

Assuming the usual semantics where a repeated double quote ("") inside a quoted field escapes a single double quote (as I pointed out in the comment), you should be able to use something like:

static const char colsep = ',';

start  = -line % eol;
line   = column % colsep;
column = quoted | *~char_(colsep);
quoted = '"' >> *("\"\"" | ~char_('"')) >> '"';

The following complete test program prints

[1997][Ford][E350][ac, abs, moon][rusty][3001.00]

(Note the BOOST_SPIRIT_DEBUG define for easy debugging). See it Live on Coliru

Full Demo

//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

using Column  = std::string;
using Columns = std::vector<Column>;
using CsvLine = Columns;
using CsvFile = std::vector<CsvLine>;

template <typename It>
struct CsvGrammar : qi::grammar<It, CsvFile(), qi::blank_type>
{
    CsvGrammar() : CsvGrammar::base_type(start)
    {
        using namespace qi;

        static const char colsep = ',';

        start  = -line % eol;
        line   = column % colsep;
        column = quoted | *~char_(colsep);
        quoted = '"' >> *("\"\"" | ~char_('"')) >> '"';

        BOOST_SPIRIT_DEBUG_NODES((start)(line)(column)(quoted));
    }
  private:
    qi::rule<It, CsvFile(), qi::blank_type> start;
    qi::rule<It, CsvLine(), qi::blank_type> line;
    qi::rule<It, Column(),  qi::blank_type> column;
    qi::rule<It, std::string()> quoted;
};

int main()
{
    const std::string s = R"(1997,Ford,E350,"ac, abs, moon","""rusty""",3001.00)";

    auto f(begin(s)), l(end(s));
    CsvGrammar<std::string::const_iterator> p;

    CsvFile parsed;
    bool ok = qi::phrase_parse(f,l,p,qi::blank,parsed);

    if (ok)
    {
        for(auto& line : parsed) {
            for(auto& col : line)
                std::cout << '[' << col << ']';
            std::cout << std::endl;
        }
    } else
    {
        std::cout << "Parse failed\n";
    }

    if (f!=l)
        std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
}
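
As a cross-check of the quoting rule the grammar relies on, here is a minimal hand-written sketch in plain standard C++ (no Spirit). It is not the grammar above: `parse_csv` is a hypothetical helper, and it makes two assumptions that differ from the demo. It keeps one `"` for every `""` found inside a quoted field (the demo matches `""` but does not add it to the result), and it treats an unquoted newline as the end of a row, which is the issue raised in the comments below.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using CsvLine = std::vector<std::string>;
using CsvFile = std::vector<CsvLine>;

// Hypothetical helper, not part of Boost.
// Assumed rules: inside a quoted field, "" yields one literal " and commas
// and newlines are ordinary data; outside quotes, ',' ends the column and
// '\n' ends the row. A quote in the middle of a field simply toggles quoting.
CsvFile parse_csv(const std::string& text)
{
    CsvFile rows;
    CsvLine row;
    std::string field;
    bool in_quotes = false;

    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        if (in_quotes) {
            if (c == '"' && i + 1 < text.size() && text[i + 1] == '"') {
                field += '"';      // escaped quote: keep one, skip the second
                ++i;
            } else if (c == '"') {
                in_quotes = false; // closing quote
            } else {
                field += c;        // anything else is data, including ',' and '\n'
            }
        } else if (c == '"') {
            in_quotes = true;      // opening quote
        } else if (c == ',') {
            row.push_back(field);  // column separator
            field.clear();
        } else if (c == '\n') {
            row.push_back(field);  // row terminator
            field.clear();
            rows.push_back(row);
            row.clear();
        } else {
            field += c;
        }
    }

    // Flush the last row when the text does not end with a newline.
    if (!field.empty() || !row.empty()) {
        row.push_back(field);
        rows.push_back(row);
    }
    return rows;
}
```

With these assumptions, the demo's input line yields a fifth column of `"rusty"` including the quotes, and `parse_csv("a,b,c\n1,2,3")` yields two rows of three columns each.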
Community
sehe
  • I am having trouble compiling your code on my machine. VC12 64-bit Windows 7. It crashed the compiler several times. But I see that it works fine in Coliru so it must be my environment – user841550 Aug 21 '13 at 20:38
  • Perhaps see the sample linked from the [background answer](http://stackoverflow.com/questions/7436481/how-to-make-my-split-work-only-on-one-real-line-and-be-capable-to-skeep-quoted-p/7462539#7462539) (or [on github](https://gist.github.com/bcfbe2b5f071c7d153a0)). I recall testing this on a wide variety of compilers/boost versions. – sehe Aug 21 '13 at 20:49
  • Also, here's a version of the sample in this answer that removes all use of [tag:c++11] features: **[http://ideone.com/VVVTYe](http://ideone.com/VVVTYe)**. I _bet_ it compiles. (My best guess is MSVC doesn't like the `using` clauses) – sehe Aug 21 '13 at 20:55
  • Yes, MSVC didn't like the using clause and I had that changed already, but it still doesn't like the declarations in the CsvGrammar constructor, e.g. 'eol' : undeclared identifier – user841550 Aug 21 '13 at 21:07
  • @user841550 You probably have an _evil_ header (***burninate it!***) containing a `#define` for `eol`? Try with explicit qualifications like `qi::eol`: http://ideone.com/VVVTYe. (I hate how I have to psychic debug compiler/library issues here) – sehe Aug 21 '13 at 22:01
  • I have decided to compile the code using MinGW. No issues here. The code works fine, except the double quotes around "rusty" are being stripped just like boost::tokenizer is doing. Also similar to shart's answer, it runs slower than boost::tokenizer, which kind of surprised me – user841550 Aug 21 '13 at 23:02
  • @user841550 again, have a look at the linked answer... http://stackoverflow.com/questions/7436481/how-to-make-my-split-work-only-on-one-real-line-and-be-capable-to-skeep-quoted-p/7462539#7462539 It does exactly what you want. Plus it has the `splitInto` function interface ready made. Good night – sehe Aug 21 '13 at 23:03
  • This wasn't working for me on multiple lines and I thought it was Windows/wchar_t, but there is actually a small mistake in the grammar: `*~char_(colsep)` should be something like `*~char_(",\n")` so that columns don't consume the newline! – Jeremy W. Murphy Jul 31 '14 at 11:31
  • @JeremyW.Murphy Well, I'm not aware of any CSV format that allows actual newline characters _inside_ quoted column values. But yeah, your suggestion is close if you really want to support that: `*(char_ - eol - ',')` would be cleaner – sehe Jul 31 '14 at 12:08
  • @sehe That's not what I meant although that is a reality with the CSV data we parse: sometimes there are embedded newlines, yay! What I meant is that because Kleene * is greedy, I think the grammar matches newlines as a column character rather than a line terminator/separator. – Jeremy W. Murphy Aug 01 '14 at 05:43
  • @sehe love the grammar use but I cannot get it to work with multiple lines "a,b,c\n1,2,3" The column rule appears to be greedy and eating the newline and 1 until it sees the next comma delimiter. – John Mar 19 '18 at 20:37
  • @sehe, 6 years later :) How about X3 version? :) – kreuzerkrieg Jul 25 '19 at 12:25
  • @kreuzerkrieg is this good for you? https://stackoverflow.com/questions/50821925/spirit-x3-parser-with-internal-state/50824603#comment88656748_50824603 – sehe Jul 30 '19 at 21:20

Sehe's post looks a fair bit cleaner than mine, but I was putting this together for a bit, so here it is anyways:

#include <boost/tokenizer.hpp>
#include <boost/spirit/include/qi.hpp>
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

int main() {
    const std::string s = R"(1997,Ford,E350,"ac, abs, moon",""rusty"",3000.00)";

    // Tokenizer
    typedef boost::tokenizer< boost::escaped_list_separator<char> , std::string::const_iterator, std::string> Tokenizer;
    boost::escaped_list_separator<char> seps('\\', ',', '\"');
    Tokenizer tok(s, seps);
    for (auto i : tok)
        std::cout << i << "\n";
    std::cout << "\n";

    // Boost Spirit Qi
    qi::rule<std::string::const_iterator, std::string()> quoted_string = '"' >> *(qi::char_ - '"') >> '"';
    qi::rule<std::string::const_iterator, std::string()> valid_characters = qi::char_ - '"' - ',';
    qi::rule<std::string::const_iterator, std::string()> item = *(quoted_string | valid_characters );
    qi::rule<std::string::const_iterator, std::vector<std::string>()> csv_parser = item % ',';

    std::string::const_iterator s_begin = s.begin();
    std::string::const_iterator s_end = s.end();
    std::vector<std::string> result;

    bool r = boost::spirit::qi::parse(s_begin, s_end, csv_parser, result);
    assert(r == true);
    assert(s_begin == s_end);

    for (auto i : result)
        std::cout << i << std::endl;
    std::cout << "\n";
}   

And this outputs:

1997
Ford
E350
ac, abs, moon
rusty
3000.00

1997
Ford
E350
ac, abs, moon
rusty
3000.00

Something Worth Noting: This doesn't implement a full CSV parser. You'd also want to look into escape characters or whatever else is required for your implementation.

Also: if you're reading the documentation, note that inside a Qi parser expression, 'a' is equivalent to boost::spirit::qi::lit('a') and "abc" is equivalent to boost::spirit::qi::lit("abc").

On double quotes: As Sehe notes in a comment above, it's not clear what a "" in the input text should mean. If you want every "" that is not within a quoted string to be converted to a ", then something like the following would work.

qi::rule<std::string::const_iterator, std::string()> double_quote_char = "\"\"" >> qi::attr('"');
qi::rule<std::string::const_iterator, std::string()> item = *(double_quote_char | quoted_string | valid_characters );
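
To make the order of those alternatives concrete, here is a rough sketch of the same idea in plain standard C++, without Spirit. `split_line` is a hypothetical helper written for illustration only. It tries the three cases in the same order as the rule (`""` first, then a quoted string, then an ordinary character), and stopping at an unterminated quote is an assumption of this sketch, not a statement about Qi's behaviour.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost. It mirrors the ordered alternatives
//   item = *(double_quote_char | quoted_string | valid_characters)
// with items separated by ','.
std::vector<std::string> split_line(const std::string& s)
{
    std::vector<std::string> items(1);   // an empty line still yields one empty item
    std::size_t i = 0;

    while (i < s.size()) {
        if (s.compare(i, 2, "\"\"") == 0) {
            // double_quote_char: "" becomes a single literal "
            items.back() += '"';
            i += 2;
        } else if (s[i] == '"') {
            // quoted_string: copy everything up to the next quote, dropping both quotes
            const std::size_t close = s.find('"', i + 1);
            if (close == std::string::npos)
                break;                    // assumption: stop at an unterminated quote
            items.back().append(s, i + 1, close - i - 1);
            i = close + 1;
        } else if (s[i] == ',') {
            items.emplace_back();         // separator: start the next item
            ++i;
        } else {
            items.back() += s[i];         // valid_characters: any other character
            ++i;
        }
    }
    return items;
}
```

For the input line used in this answer, the fifth item comes out as `"rusty"` with its quotes, and the quoted `"ac, abs, moon"` still loses its surrounding quotes. Because `""` is tried first, an empty quoted field can no longer be expressed, which is one reason this rule may not fit a larger CSV ruleset.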
Bill Lynch
  • This is excellent. I was expecting double quotes around the token "rusty". – user841550 Aug 21 '13 at 20:04
  • I've added a note on that. While it does give you the correct result for this input, I'm not sure that it is correct according to some larger ruleset that I'm not aware of. – Bill Lynch Aug 21 '13 at 20:14
  • Also, note that there are other interesting concerns when building a CSV parser. What should the empty string result in? This code will produce a vector that looks like `{''}`, but others might expect an empty vector `{}`. – Bill Lynch Aug 21 '13 at 20:20
  • Your suggestion on double quotes works perfectly. I am surprised though that, on this input at least, boost::tokenizer is faster than boost::spirit::qi. Usually the latter is faster than anything I have tested it against – user841550 Aug 21 '13 at 20:52
  • You can parse CSV with regular expressions (which boost::tokenizer may be doing) which will be faster than spirit. – Bill Lynch Aug 21 '13 at 22:43
  • Can you take a look at my edited input string? Your code drops the quotes – user841550 Aug 21 '13 at 23:07
  • This is probably a good opportunity for you to try and extend this code yourself. Give it a try first, you'll probably be able to figure something out. – Bill Lynch Aug 22 '13 at 05:39