0

I'm trying to write a Regex that that will extract individual fields from a CSV file.

For example, if given the following line in a CSV file:

123,    Bob    ,Bob, " Foo Bar ", "a, ""b"", c"

Should give the following results (without the single quotes):

'123'
'Bob'
'Bob'
' Foo Bar '
'a, "b", c'

Note that leading and trailing white spaces should be trimmed unless they are within quotes.

I'm not worried about invalid CSV lines such as open quotes without matching closing quotes. You can safely assume that the CSV file is perfectly valid according to the rules above.

I'm also fine with using multiple Regexes if a single one is difficult. But I like to avoid using standard C# operations unless they are simple and short. (I don't want to end up with writing lots of code.)

So, any suggestions?

SF Lee
  • 1,767
  • 4
  • 17
  • 32
  • Why regex? Do you have to use it only? – gdoron Jun 29 '12 at 12:51
  • 1
    If the CSV is perfectly valid you have have line breaks in quoted fields. – Joey Jun 29 '12 at 12:51
  • I think that "split" method is more appropriate for this case. – Hamza_L Jun 29 '12 at 12:55
  • I'd argue that your concept of "generic" with regards to CSV is misguided. IMO "Generic" would hand you back the quotes too. – spender Jun 29 '12 at 12:56
  • You are trying to solve a problem that has been solved - again and again - and more elegantly. Some of the major problems you are having is that you do not understand what "valid CSV" is, and that indeed a valid CSV dataset can be multiline. – ty812 Jun 29 '12 at 12:56
  • I guess what I'm after is to extract the fields in a way that I want (ie. without quotes, and trimmed). Also, when I say "valid CSV", I'm basing my rules on the requirements mentioned above. Granted, there are different kind of CSVs out there, so my CSV is not that 'generic', but it is generic enough for what I want to do. – SF Lee Jun 29 '12 at 13:15
  • Take a look at my [JavaScript CSV regex Solution](http://stackoverflow.com/a/8497474/433790). This should get you started. If you have trouble I could help translate it to C# for you. p.s. You do need to specify the CSV rules a bit more clearly - e.g. what about empty values? Single quotes? Do backslashes escape chars? etc. – ridgerunner Jun 29 '12 at 15:00

4 Answers4

5

i would not try to write my own csv parser there many out there that do the job for you.

Peter
  • 37,042
  • 39
  • 142
  • 198
  • I prefer not having to install third party libraries if there is a simple solution available. But I will look into the built-in Microsoft.VisualBasic.FileIO.TextFieldParser that is mentioned in your second link. – SF Lee Jun 29 '12 at 13:18
1

Well there are many gotchas and error possiable with Regexes... try following code it did trick for me and it is sweet and simple...

Using Reader As New Microsoft.VisualBasic.FileIO.TextFieldParser("C:\MyFile.csv")

Reader.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited

Dim MyDelimeters(0 To 0) As String
Reader.HasFieldsEnclosedInQuotes = False
Reader.SetDelimiters(","c)

Dim currentRow As String()
While Not Reader.EndOfData
    Try
        currentRow = Reader.ReadFields()
        Dim currentField As String
        For Each currentField In currentRow
            MsgBox(currentField)
        Next
    Catch ex As Microsoft.VisualBasic.FileIO.MalformedLineException
        MsgBox("Line " & ex.Message &
        "is not valid and will be skipped.")
    End Try
End While
End Using

Mark as answer if found handy ...;)

Please see the same implementation here,,,

Jigar Pandya
  • 6,004
  • 2
  • 27
  • 45
  • I'm using C#, and I can't seem to find the Microsoft.VisualBasic.FileIO namespace. Is there a reference that I need to import? – SF Lee Jun 29 '12 at 13:26
  • C# you should just add "using System.IO;" This will give you readers and writers and file access. – marcmiller2007 Jun 29 '12 at 13:43
  • System.IO does not have TextFieldParser class. I'm asking what reference I need to get that class. – SF Lee Jun 29 '12 at 13:55
  • Yoo man i got your problem... updated my answer for the same. – Jigar Pandya Jun 29 '12 at 14:04
  • @Jigar Thanks for your help. The TextFieldParser has everything I need... except one thing: If used against my example data above, the field " Foo Bar " returns 'Foo Bar' rather than ' Foo Bar '. But if I set the TrimWhiteSpace to false, then other fields like Bob will not have their spaces trimmed... Any ideas? – SF Lee Jun 29 '12 at 14:13
1

I agree that regex is not the "right" answer, but it is what the question asked for and I like a good regex challenge.

The pattern below is a modified version of my standard CSV parsing regex that removes the spaces and assumes the CSV is perfect as you requested. The only part of your question not addressed by it is that it will not remove escaped/doubled quotes. Examples for unescaping the quotes are given after the patterns.


When one or more lines/records of a CSV file/stream are passed to the regular expression below it will return a match for each non-empty line/record. Each match will contain a capture group named Value that contains the captured values in that line/record.


Here's the commented pattern (test it on Regexstorm.net):

(?<=\r|\n|^)(?!\r|\n|$)                       // Records start at the beginning of line (line must not be empty)
(?:                                           // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
  [^\S\r\n]*                                  // Removes leading spaces
  (?:                                         // Group for matching one of the value formats before a comma or EOL
    "(?<Value>(?:[^"]|"")*)"|                 // Quoted value -or-
    (?<Value>[^,\r\n]+)|                      // Unquoted/open ended quoted value -or-
    (?<Value>)                                // Empty value before comma (before EOL is excluded by "+?" quantifier later)
  )
  [^\S\r\n]*                                  // Removes trailing spaces
  (?:,|(?=\r|\n|$))                           // The value format matched must be followed by a comma or EOL
)+?                                           // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))?                         // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:\r\n|\r|\n|$)                              // Records end at EOL


Here's the raw pattern without all the comments or whitespace.

(?<=\r|\n|^)(?!\r|\n|$)(?:[^\S\r\n]*(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>[^,\r\n]+)|(?<Value>))[^\S\r\n]*(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)


And, here's the C# escaped version.

String CSVPattern=
    @"(?<=\r|\n|^)(?!\r|\n|$)" +
    @"(?:" +
        @"[^\S\r\n]*" +
        @"(?:" +
            @"""(?<Value>(?:[^""]|"""")*)""|" +
            @"(?<Value>[^,\r\n]+)|" +
            @"(?<Value>)" +
        @")" +
        @"[^\S\r\n]*" +
        @"(?:,|(?=\r|\n|$))" +
    @")+?" +
    @"(?:(?<=,)(?<Value>))?" +
    @"(?:\r\n|\r|\n|$)";


Examples on how to use the regex pattern (well, the original pattern which can be replaced with this pattern) can be found on my answer to a similar question here, or on C# pad here, or here.

NOTE: The examples above contain the logic for unescaping/undoubling quotes as seen below:

if (Capture.Length == 0 || Capture.Index == Record.Index || Record.Value[Capture.Index - Record.Index - 1] != '\"')
{
    // No need to unescape/undouble quotes if the value is empty, the value starts
    // at the beginning of the record, or the character before the value is not a
    // quote (not a quoted value)
    Console.WriteLine(Capture.Value);
}
else
{
    // The character preceding this value is a quote
    // so we need to unescape/undouble any embedded quotes
    Console.WriteLine(Capture.Value.Replace("\"\"", "\""));
}
Community
  • 1
  • 1
David Woodward
  • 1,265
  • 11
  • 20
  • 1
    I'll give you +1 for the amount of work you put into this massive regex! :P – SF Lee Oct 12 '16 at 01:08
  • @SF Lee Thanks! I've actually been dreading this answer getting any attention out of fear that it'd be down voted into oblivion just for being a regex. At least now it can withstand five such votes without going negative on reputation. :) – David Woodward Oct 12 '16 at 01:42
0

You can use the TextFieldParser class which is inbuilt in .NET framework.

In order to use that class in your C# application you would need to add reference of Microsoft.VisualBasic.dll at the following location(assuming you did a default settings installation)

C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETFramework\v4.0\Microsoft.VisualBasic.dll

Now in your C# class have the below using statement:

using Microsoft.VisualBasic.FileIO
Rajesh
  • 7,766
  • 5
  • 22
  • 35