I have the following code for a CSV parser:

string input = wholeFile;
IList<string> wholeFileArray = new List<string>();
int start = 0;
bool inQuotes = false;
for (int current = 0; current < input.Length; current++)
{
    // test each character before and after to determine if it is a valid quote, or a quote within a quote.
    int test_backward = (current == 0 ? 1 : current) - 1;
    int test_forward = (current == input.Length - 1 ? input.Length - 2 : current) + 1;
    bool valid_quote = input[test_backward] == ',' || input[test_forward] == ',' || input[test_forward] == '\r';
    if (input[current] == '\"') // toggle state
    {
        inQuotes = !inQuotes;
    }
    bool atLastChar = (current == input.Length - 1);
    if (atLastChar)
    {
        wholeFileArray.Add(input.Substring(start));
    }
    else if (input[current] == ',' && !inQuotes)
    {
        wholeFileArray.Add(input.Substring(start, current - start));
        start = current + 1;
    }
}

It takes a string and splits it on `,` as long as the `,` is not inside a double-quoted string such as "something,foobar".

My problem is that a rogue " in my string is messing up my whole process.

Example: "bla bla","bla bla2",3,4,"5","bla"bla","End"

Result:

  • "bla bla"
  • "bla bla2"
  • 3
  • 4
  • "5"
  • "bla"bla","End"

How do I change my code to allow for the rogue "?

A 'valid' close quote is always followed by a comma (,) or a carriage return/line feed (CRLF).

Added: this seems to fix it:

// test each character before and after to determine if it is a valid quote, or a quote within a quote.
int test_backward = (current == 0 ? 1 : current) - 1;
int test_forward = (current == input.Length - 1 ? input.Length - 2 : current) + 1;
bool valid_quote = input[test_backward] == ',' || input[test_forward] == ',' || input[test_forward] == '\r';
Josef Van Zyl
  • As fun as trying to determine what colour quotes are when they're presented in black and white, I decided to correct the spelling. – Damien_The_Unbeliever Aug 08 '13 at 08:18
  • The only reliable pattern in your example is that a 'valid' close quote is always followed by a comma (`,`). You might be able to get it working by checking for that – musefan Aug 08 '13 at 08:21
  • @musefan I should probably mention that this is a csv parser, so it needs to match on end of line as well – Josef Van Zyl Aug 08 '13 at 08:22
  • @Josefvz: The problem is that the input just isn't valid. Nobody can expect a parser to *just work* with invalid data. Inner quotes should be escaped. The best you can do is, like I said, after each potential close quote, look ahead at a few characters and work out if you are still in a string or not, i.e. if all you have between the potential close quote and the next quote (or line end) is a comma or whitespace, then it was a valid close quote. If you find any other characters, assume you are still in the string. – musefan Aug 08 '13 at 08:28
  • @Josefvz: After seeing your edit, just do what I said and check if the next character (`input[current+1]`) is a comma or linefeed – musefan Aug 08 '13 at 08:30
  • Tomorrow you'll return to us saying... I have double rogue: `"bla",bla"`... What can I do? – xanatos Aug 08 '13 at 08:31
  • @musefan Yes, but can a "rogue" quote _ever_ be followed by a comma? – Grant Thomas Aug 08 '13 at 08:36
  • @GrantThomas: You can't cater for everything. But as I said before, you can check for other 'upcoming' characters (until either next quote or line end) and make a best guess – musefan Aug 08 '13 at 08:48
  • @musefan Yes, the issue is the input, but I can't change the input so I have to work around it. – Josef Van Zyl Aug 08 '13 at 08:54
  • This question is almost the same as [How do I robustly parse malformed CSV?](http://stackoverflow.com/q/11733076/7586) (which I answered, but the answer is incomplete). – Kobi Aug 08 '13 at 09:13
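
The simpler rule suggested in the comments above (treat a quote as a closing quote only when it is followed by a comma, a line break, or the end of the input) could be sketched roughly as follows. `QuoteRule` and `IsClosingQuote` are hypothetical names that do not appear in the original code, and this is only a heuristic for malformed input, not a general CSV rule.

```csharp
using System;

public static class QuoteRule
{
    // Hypothetical helper (not from the original post): returns true when the
    // character at index is a quote that is followed by a comma, a line break,
    // or the end of the input. Intended to be called only while inside quotes.
    public static bool IsClosingQuote(string input, int index)
    {
        if (input[index] != '"') return false;
        if (index + 1 == input.Length) return true;   // quote is the last character
        char next = input[index + 1];
        return next == ',' || next == '\r' || next == '\n';
    }
}
```

As the comments point out, a rogue quote that happens to be followed by a comma (as in `"bla",bla"`) would still be taken for a closing quote under this rule.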

3 Answers

Try something like this:

if (input[current] == '"' && // 1
    (!inQuotes || // 2
    current + 1 == input.Length || // 3
    input[current + 1] == '\r' || // 4
    input[current + 1] == '\n' || // 5
        (input[current + 1] == ',' && // 6
            (current + 2 == input.Length || // 7
            input[current + 2] == '\r' || // 8
            input[current + 2] == '\n' || // 9
            input[current + 2] == '"' || // 10
                (input[current + 2] >= '0' && input[current + 2] <= '9'))))) // 11
{
    inQuotes = !inQuotes; // toggle state
}

But note that what you want to do is wrong on various conceptual levels.

A correct quote (1) is an opening quote (2), or a quote that is the last character of the string (3), or that is followed by a \r (4) or by a \n (5), or that is followed by a , (6) that in turn is the last character of the string (7) or is followed by a \r (8), by a \n (9), by a quote " (10) or by a digit (11).
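
To show how that condition might slot into the splitting loop from the question, here is a minimal sketch for a single record. `LenientCsv`, `IsStructuralQuote` and `SplitLine` are hypothetical names introduced only for illustration; the quote test is the same eleven-part condition, unrolled into early returns.

```csharp
using System;
using System.Collections.Generic;

public static class LenientCsv
{
    // Hypothetical helper: the numbered condition from above, one check per line.
    static bool IsStructuralQuote(string input, int current, bool inQuotes)
    {
        if (input[current] != '"') return false;                       // 1
        if (!inQuotes) return true;                                    // 2
        int next = current + 1;
        if (next == input.Length) return true;                         // 3
        if (input[next] == '\r' || input[next] == '\n') return true;   // 4, 5
        if (input[next] != ',') return false;                          // 6
        int after = current + 2;
        if (after == input.Length) return true;                        // 7
        char c = input[after];
        return c == '\r' || c == '\n' || c == '"'                      // 8, 9, 10
            || (c >= '0' && c <= '9');                                 // 11
    }

    // Splits one record on commas that are outside quotes (quotes are kept).
    public static List<string> SplitLine(string input)
    {
        var fields = new List<string>();
        int start = 0;
        bool inQuotes = false;
        for (int current = 0; current < input.Length; current++)
        {
            if (IsStructuralQuote(input, current, inQuotes))
            {
                inQuotes = !inQuotes; // toggle state
            }
            else if (input[current] == ',' && !inQuotes)
            {
                fields.Add(input.Substring(start, current - start));
                start = current + 1;
            }
        }
        fields.Add(input.Substring(start)); // last field
        return fields;
    }
}
```

On the sample line from the question this should yield seven fields, with `"bla"bla"` kept together as the sixth. It remains a best guess: a rogue quote followed by a comma and then a quote or digit would still be misread.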

xanatos
  • Thanks, I'll try this. I know it is wrong on various conceptual levels, the file I get is EVIL, and the guys "CAN'T" change it. So I have to work around it – Josef Van Zyl Aug 08 '13 at 08:51

In case you have the option of doing this based on BNF, it's a rather simple grammar. The below is what it might look like using fsyacc (which in turn can be used from C#):

start: lines
lines: line lines {$1::$2}
     | {[]}

line: val vals {$1::$2}
    |  {[]}

val : QUOTE STR QUOTE COMMA {$2}
    | QUOTE STR QUOTE STR QUOTE COMMA { $2 + "\"" + $4 }
    | QUOTE STR QUOTE EOL {$2}
    | QUOTE STR QUOTE STR QUOTE EOL { $2 + "\"" + $4 }
    | QUOTE STR QUOTE EOF {$2}
    | QUOTE STR QUOTE STR QUOTE EOF { $2 + "\"" + $4 }

The production val also shows that this is an unclean grammar, because you need the next token to determine what to do. If it were possible to require that each line ends with a newline (including the last), val could be simplified to four alternatives instead of six, and requiring that each line ends with a comma would get it down to two. Quite a lot of grammars can be simplified this way (by requiring that every statement ends with a specific character), which is why C++ uses ;

Rune FS

As an alternative, as long as you're not going to have a , inside the quotes, you might look into the Microsoft.VisualBasic.FileIO.TextFieldParser.

For example:

using Microsoft.VisualBasic.FileIO;

using (TextFieldParser parser = new TextFieldParser(fileName))
{
    // Split each line on commas.
    parser.Delimiters = new string[] { "," };

    while (!parser.EndOfData)
    {
        // One array of fields per line of the file.
        string[] fields = parser.ReadFields();
    }
}

The above code snippet produces an array with your sample line as follows:

"bla bla"
"bla bla2"
3
4
5
"bla"bla"
"End"

Obviously this will need to be adapted to your code, and it's not an optimal solution (especially if you have , between the quotes), but it might be easier than trying to handle any number of "rogue" quotes.

Tim