There is no formal standard for CSV format, but let's note at the outset
that the ugly column you have cited:
"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",
does not conform to what are deemed to be the Basic Rules of CSV,
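because two of those are:-
1) Fields with embedded commas must be enclosed in double-quotes.
2) Each embedded double-quote must be represented by a pair of double-quotes.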
If the problem column obeys rule 1) then it doesn't obey rule 2). But we can construe it
so as to obey rule 1) - so we can say where it ends - if we balance the double-quotes as, e.g.
[abc, defghijk. [Lmnopqrs, ]tuv,[] wxyz.],
The balanced outermost quotes enclose the column. The balanced internal quotes
need no other indication of being internal: the balancing itself makes them
internal.
We'd like a rule that will parse this text as one column,
consistently with rule 1), and that will also parse columns that
obey rule 2). The balancing just exhibited suggests this
can be done, because columns that obey both rules are necessarily
balanceable too.
The suggested rule is:
- A column runs to the first comma that is preceded by 0 double-quotes or
is preceded by the last of an even number of double-quotes.
If there is any even number of double-quotes up to the comma, then we know
we can balance enclosing quotes and balance the rest in at least one way.
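To make the rule concrete, here is a minimal sketch that applies it to an in-memory string rather than a stream. The name terminating_comma is hypothetical, chosen only for illustration; it assumes the column starts at the first character of the string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the index of the comma that terminates the
// column starting at text[0] under the suggested rule, or std::string::npos
// if no comma qualifies.
std::size_t terminating_comma(const std::string& text)
{
    unsigned quotes = 0; // double-quotes seen so far in this column
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '"') {
            ++quotes;
        } else if (text[i] == ',') {
            // A comma ends the column if no quotes precede it, or if it
            // immediately follows the last of an even number of quotes.
            bool after_quote = i > 0 && text[i - 1] == '"';
            if (quotes == 0 || (after_quote && quotes % 2 == 0)) {
                return i;
            }
        }
    }
    return std::string::npos;
}
```

For the problem column, the commas after abc, Lmnopqrs and tuv are all rejected (the quote count is non-zero and the preceding character is not a quote), so only the final comma, which follows the sixth quote, qualifies.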
The simpler rule that you are considering:
After running into a quote, should I read the quoted junk character-by-character until I find ", in sequence?
will fail if it meets with certain columns that do obey rule 2), e.g.
"Super, ""luxurious"", truck",
The simpler rule will terminate the column after ""luxurious"". But since
this column conforms to rule 2), adjacent double-quotes are "escaped"
double-quotes, with no delimiting significance. On the other hand, the
suggested rule still parses the column correctly, terminating it after truck".
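The failure of the simpler rule can be reproduced with a short sketch. The function name quoted_column_by_simple_rule is hypothetical, and the sketch assumes the text begins with the opening quote of a quoted column.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper illustrating the simpler rule for a quoted column:
// starting after the opening quote, stop at the first `",` sequence.
// Returns the column including its terminal comma, or the whole text if
// the sequence never occurs.
std::string quoted_column_by_simple_rule(const std::string& text)
{
    std::size_t pos = text.find("\",", 1); // skip the opening quote
    if (pos == std::string::npos) {
        return text;
    }
    return text.substr(0, pos + 2); // keep the quote and the comma
}
```

Given the text "Super, ""luxurious"", truck",3000.00, this returns "Super, ""luxurious"", because the second quote of the escaped pair happens to be followed by a comma.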
Here is a demo program in which the function get_csv_column parses columns
by the suggested rule:
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <string>
using namespace std;
/*
    Assume `in` is positioned at start of column.
    Accumulates chars from `in` as long as `in` is good
    until either:-
    - Have consumed a comma preceded by 0 quotes, or
    - Have consumed a comma immediately preceded by
      the last of an even number of quotes.
*/
std::string get_csv_column(ifstream & in)
{
    std::string col;
    unsigned quotes = 0; // double-quotes seen so far in this column
    char prev = 0;       // previously consumed char
    bool finis = false;
    for (int ch; !finis && (ch = in.get()) != EOF; ) {
        switch(ch) {
        case '"':
            ++quotes;
            break;
        case ',':
            if (quotes == 0 || (prev == '"' && (quotes & 1) == 0)) {
                finis = true;
            }
            break;
        default:;
        }
        col += prev = ch;
    }
    return col;
}
int main()
{
    ifstream in("csv.txt");
    if (!in) {
        cout << "Open error :(" << endl;
        exit(EXIT_FAILURE);
    }
    for (std::string col; in; ) {
        col = get_csv_column(in);
        cout << "<[" << col << "]>" << std::endl;
    }
    if (!in && !in.eof()) {
        cout << "Read error :(" << endl;
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}
It encloses each column in <[...]>, not discounting newlines, and
including the terminal ',' with each column.
The file csv.txt is:
...,"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",...,
",","",
Year,Make,Model,Description,Price,
1997,Ford,E350,"Super, ""luxurious"", truck",
1997,Ford,E350,"Super, ""luxurious"" truck",
1997,Ford,E350,"ac, abs, moon",3000.00,
1999,Chevy,"Venture ""Extended Edition""","",4900.00,
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00,
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00,
The output is:
<[...,]>
<["abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",]>
<[...,]>
<[
",",]>
<["",]>
<[
Year,]>
<[Make,]>
<[Model,]>
<[Description,]>
<[Price,]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"", truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"" truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["ac, abs, moon",]>
<[3000.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition""",]>
<["",]>
<[4900.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition, Very Large""",]>
<[,]>
<[5000.00,]>
<[
1996,]>
<[Jeep,]>
<[Grand Cherokee,]>
<["MUST SELL!
air, moon roof, loaded",]>
<[4799.00]>
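The columns above are printed raw, with their terminal comma, any leading line break, and their quoting intact. If you want the field value instead, something like the following sketch may serve. The function field_value is hypothetical and not part of the demo; it assumes the column obeys rule 2), so it is not meaningful for the problem column, whose internal quotes are merely balanced rather than doubled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the demo above: turn a raw column, as
// printed between <[ and ]>, into its field value. Assumes that embedded
// double-quotes are doubled, as rule 2) requires.
std::string field_value(std::string raw)
{
    // Drop the terminal comma, if any.
    if (!raw.empty() && raw.back() == ',') {
        raw.pop_back();
    }
    // Drop a leading line break carried over from the end of the previous row.
    while (!raw.empty() && (raw.front() == '\n' || raw.front() == '\r')) {
        raw.erase(0, 1);
    }
    // An unquoted column is its own value.
    if (raw.size() < 2 || raw.front() != '"' || raw.back() != '"') {
        return raw;
    }
    // Strip the enclosing quotes and collapse each doubled quote to one.
    std::string value;
    for (std::size_t i = 1; i + 1 < raw.size(); ++i) {
        value += raw[i];
        if (raw[i] == '"' && i + 2 < raw.size() && raw[i + 1] == '"') {
            ++i; // skip the second quote of an escaped pair
        }
    }
    return value;
}
```

For example, the raw column "Super, ""luxurious"", truck", becomes Super, "luxurious", truck, and the raw column "", becomes the empty string.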