5

I'm trying to parse csv file with VB.NET.

csv files contains value like 0,"1,2,3",4 which splits in 5 instead of 3. There are many examples with other languages in Stockoverflow but I can't implement it in VB.NET. Here is my code so far but it doesn't work...

 Dim t As String() = Regex.Split(str(i), ",(?=([^\""]*\""[^\""]*\"")*[^\""]*$)")
shinya
  • 403
  • 2
  • 9
  • 23

4 Answers4

8

Assuming your csv is well-formed (ie no " besides those used to delimit string fields, or besides ones escaped like \"), you can split on a comma that's followed by an even number of non-escaped "-marks. (If you're inside a set of "" there's only an odd number left in the line).

Your regex you've tried looks like you're almost there.

The following looks for a comma followed by an even number of any sort of quote marks:

,(?=([^"]*"[^"]*")*[^"]*$)

To modify it to look for an even number of non-escaped quote marks (assuming quote marks are escaped with backslash like \"), I replace each [^"] with ([^"\\]|\\.). This means "match a character that isn't a " and isn't a blackslash, OR match a backslash and the character immediately following it".

,(?=(([^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)

See it in action here. (The reason the backslash is doubled is I want to match a literal backslash).

Now to get it into vb.net you just need to double all your quote marks:

splitRegex = ",(?=(([^""\\]|\\.)*""([^""\\]|\\.)*"")*([^""\\]|\\.)*$)"
mathematical.coffee
  • 55,977
  • 11
  • 154
  • 194
  • Hmm, the regex works in the link I posted on the input you gave in your question, so perhaps it is another problem. Are you sure `str(i)` is what you expect it to be? – mathematical.coffee Feb 07 '12 at 01:30
6

Instead of a regular expression, try using the TextFieldParser class for reading .csv files. It handles your situation exactly.

TextFieldParserClass

Especially look at the HasFieldsEnclosedInQuotes property.

Example:

Note: I used a string instead of a file, but the result would be the same.

    Dim theString As String = "1,""2,3,4"",5"

    Using rdr As New StringReader(theString)
        Using parser As New TextFieldParser(rdr)
            parser.TextFieldType = FieldType.Delimited
            parser.Delimiters = New String() {","}
            parser.HasFieldsEnclosedInQuotes = True
            Dim fields() As String = parser.ReadFields()

            For i As Integer = 0 To fields.Length - 1
                Console.WriteLine("Field {0}: {1}", i, fields(i))
            Next
        End Using
    End Using

Output:

Field 0: 1
Field 1: 2,3,4
Field 2: 5
Chris Dunaway
  • 10,974
  • 4
  • 36
  • 48
0

This worked great for parsing a Shipping Notice .csv file we receive. Thanks for keeping this solution here.

This is my version of the code:

Try
Using rdr As New IO.StringReader(Row.FlatFile)
    Using parser As New FileIO.TextFieldParser(rdr)
        parser.TextFieldType = FileIO.FieldType.Delimited
        parser.Delimiters = New String() {","}
        parser.HasFieldsEnclosedInQuotes = True
        Dim fields() As String = parser.ReadFields()

        Row.Account = fields(0).ToString().Trim()
        Row.AccountName = fields.GetValue(1).ToString().Trim()
        Row.Status = fields.GetValue(2).ToString().Trim()
        Row.PONumber = fields.GetValue(3).ToString().Trim()
        Row.ErrorMessage = ""

    End Using
End Using

Catch ex As Exception
    Row.ErrorMessage = ex.Message
End Try
Doug
  • 1
0

It's possible to do it with regex VB.NET in the following way:

,(?=(?:[^"]*"[^"]*")*[^"]*$)

The positive lookahead ((?= ... )) ensures that there is an even number of quotes ahead of the comma to split on (i.e. either they occur in pairs, or there are none).

[^"]* matches non-quote characters.

Given below is a VB.NET example to apply the regex.

Imports System
Imports System.Text.RegularExpressions

Public Class Test
    Public Shared Sub Main()
        Dim theString As String = "1,""2,3,4"",5"
        Dim theStringArray As String() = Regex.Split(theString, ",(?=(?:[^""\\]*""[^""\\]*"")*[^""\\]*$)")
        For i As Integer = 0 To theStringArray.Length - 1
            Console.WriteLine("theStringArray {0}: {1}", i, theStringArray(i))
        Next
    End Sub
End Class

'Output:
'theStringArray 0: 1
'theStringArray 1: "2,3,4"
'theStringArray 2: 5