Parsing CSV data using a finite state machine

Question

I want to read a file containing comma-separated values, so have written a finite state machine:

private IList<string> Split(string line)
{
    List<string> values = new List<string>();
    string value = string.Empty;
    ParseState state = ParseState.Initial;
    foreach (char c in line)
    {
        switch (state)
        {
            case ParseState.Initial:
                switch (c)
                {
                    case COMMA:
                        values.Add(string.Empty);
                        break;
                    case QUOTE:
                        state = ParseState.Quote;
                        break;
                    default:
                        value += c;
                        state = ParseState.Data;
                        break;
                }
                break;
            case ParseState.Data:
                switch (c)
                {
                    case COMMA:
                        values.Add(value);
                        value = string.Empty;
                        state = ParseState.Initial;
                        break;
                    case QUOTE:
                        throw new InvalidDataException("Improper quotes");
                    default:
                        value += c;
                        break;
                }
                break;
            case ParseState.Quote:
                switch (c)
                {
                    case QUOTE:
                        state = ParseState.QuoteInQuote;
                        break;
                    default:
                        value += c;
                        break;
                }
                break;
            case ParseState.QuoteInQuote:
                switch (c)
                {
                    case COMMA:
                        values.Add(value);
                        value = string.Empty;
                        state = ParseState.Initial;
                        break;
                    case QUOTE:
                        value += c;
                        state = ParseState.Quote;
                        break;
                    default:
                        throw new InvalidDataException("Unpaired quotes");
                }
                break;
        }
    }

    switch (state)
    {
        case ParseState.Initial:
        case ParseState.Data:
        case ParseState.QuoteInQuote:
            values.Add(value);
            break;
        case ParseState.Quote:
            throw new InvalidDataException("Unclosed quotes");
    }

    return values;
}

Yes, I know the advice about CSV parsers is "don't write your own", but

I needed it quickly and
our download policy at work would take several days to allow me to get open source off the 'net.

Hey, at least I didn't start with string.Split() or, worse, try using a Regex!

And yes, I know it could be improved by using a StringBuilder, and it's restrictive on quotes in the data, but

performance is not an issue and
this is only to generate well-defined test data in-house,

so I don't care about those.

What I do care about is the apparent trailing block at the end for mopping up all the data after the final comma, and the way that it's starting to look like some sort of an anti-pattern down there, which was exactly the sort of thing that "good" patterns such as a FSM were supposed to avoid.

So my question is this: is this block at the end some sort of anti-pattern, and is it something that's going to come back to bite me in the future?

Did you try [TextFieldParser](http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx) which comes with .net framework ? I'm not sure that will help you, I've not used it. — Sriram Sakthivel, Jul 19 '14 at 10:06
Would they be capable of restricting their answers to just my question, without embarking on a wholesale review of the entire method? — ClickRick, Jul 19 '14 at 10:25
Just to bump @SriramSakthivel: TextFieldParser is a really good, simple soltution to this. Yes, you need to add a reference to the VB assembly that is part of .Net, which feels unnatural but is entirely fine; I have a large production app using that approach. If you do ever manage to get around the download restrictions at work, CsvHelper (https://github.com/JoshClose/CsvHelper) is amazing - it is now our standard tool for anything CSV related. — flytzen, Jul 19 '14 at 10:54
@Frans CsvHelper would have been my first choice, and indeed I already have a request into tech services to let me download it, but procedures are procedures... — ClickRick, Jul 19 '14 at 13:06

score 1 · Accepted Answer · edited May 23 '17 at 12:09

All of the FSMs I've ever seen (not that I go hunting for them, mind you) all have some kind of "mopping up" step, simply due to the nature of enumeration.

In an FSM you're always acting upon the current state, and then resetting the 'current state' for the next iteration, so once you've hit the end of your iterations you have to do one last operation to act upon the 'current state'. (Might be better to think about it as acting upon the 'previous state' and setting the 'current state').

Therefore, I would consider that what you've done is part of the pattern.

But why didn't you try some of the other answers on SO?

Split CSV String (specifically this answer)
How to properly split a CSV using C# split() function? (specifically this answer)

Adapted solution, still an FSM:

public IEnumerable<string> fsm(string s)
{
    int i, a = 0, l = s.Length;
    var q = true;
    for (i = 0; i < l; i++)
    {
        switch (s[i])
        {
            case ',':
                if (q)
                {
                    yield return s.Substring(a, i - a).Trim();
                    a = i + 1;
                }
                break;

            // pick your flavor
            case '"':
            //case '\'':
                q = !q;
                break;
        }
    }
    yield return s.Substring(a).Trim();
}

// === usage ===
var row = fsm(csvLine).ToList();
foreach(var column in fsm(csvLine)) { ... }

score 1 · Answer 2 · answered May 30 '19 at 18:33

In a FSM you identify which states are the permitted halting states. So in a typical implementation, when you come out of the loop you need to at least check to make sure that your last state is one of the permitting halting states or throw a jam error. So having that one last state check outside of the loop is part of the pattern.

score 0 · Answer 3 · answered Apr 27 '17 at 20:33

The source of the problem, if you want to call it that, is the absence of an end-of-line marker in your input data. Add a newline character, for example, at the end of your input string and you will be able to get rid of the "trailing block" that seems to annoy you so much.

As far as I'm concerned, your code is correct and, no, there is no reason why this implementation will come back to bite you in the future!

score 0 · Answer 4 · answered Sep 25 '20 at 20:16

I had a similiar issue, but i was parsing a text file character by character. I didnt like this big clean-up-switch-block after the while loop. To solve this, I made a wrapper for the streamreader. The wrapper checked when streamreader had no characters left. In this case, the wrapper would return an EOT-ascii character once (EOT is equal to EOF). This way my state machine could react to the EOF depending on the state it was in at that moment.

Parsing CSV data using a finite state machine

4 Answers4