
I'm trying to parse a single line of a CSV file. Currently this is done with an online regex tester, but in the end it has to be implemented in C# (added in response to a question in the comments).

I have read a lot of other articles here on SO to figure it out myself, but I'm stuck.

My test line for the regex looks like this (UPDATE: the second version has the quotes inside quoted strings escaped):

;;"test123;weiterer Text";;"Test mit " Zeichen im Spaltenwert";nächste Spalte mit " Begrenzungszeichen;"4711";irgendwas 123,4;1222;"foo"test"

;;"test123;weiterer Text";;"Test mit "" Zeichen im Spaltenwert";nächste Spalte mit "" Begrenzungszeichen;"4711";irgendwas 123,4;1222;"foo""test"
  • ; is the delimiter
  • " is the sign for quoted columns

Problem:

  • the line may contain empty columns (semicolon followed by semicolon without any text)
  • quoted strings may contain the quote sign, like here "Test mit " Zeichen im Spaltenwert"
  • the column delimiter may also occur inside quoted strings, like here: "test123;weiterer Text"

What I have so far, after some googling and with my limited understanding of regular expressions, is this expression:

(?<=^|;)(\".\"|[^;]*)|[^;]+

This gives the following result:

        [0] => 
        [1] => 
        [2] => "test123
        [3] => weiterer Text"
        [4] => 
        [5] => "Test mit " Zeichen im Spaltenwert"
        [6] => nächste Spalte mit " Begrenzungszeichen
        [7] => "4711"
        [8] => irgendwas 123,4
        [9] => 1222
        [10] => "foo"test"

Tested with https://www.myregextester.com/

The problem I have now is with elements 2 and 3. This text

"test123;weiterer Text"

has to be one column but gets split at the semicolon inside the quoted string, although I thought I had told the expression to match everything inside quotation marks.

Any help here is highly appreciated. Thanks in advance.

Uwe Keim
Dom84
  • What do you mean by "what is your regex flavor"? I don't understand. Using a CSV parser may be an option in the future, but not currently, because I have to fix this in the existing implementation. – Dom84 Jul 31 '17 at 10:23
  • Currently with the tool at the mentioned URL, myregextester.com. But in the end with C#. – Dom84 Jul 31 '17 at 10:25
  • If a quoted part can also contain an unescaped quote, there is no way to solve your problem. – Casimir et Hippolyte Jul 31 '17 at 10:26
  • @CasimiretHippolyte OK, that is a good hint. That was my mistake in the test data. Thanks so far. But if the quotes in the quoted string are escaped, how could it then be solved? – Dom84 Jul 31 '17 at 10:36
  • Is it a valid CSV string input? If yes, do not use regex, use the built-in CSV parser. – Wiktor Stribiżew Jul 31 '17 at 11:37
  • @WiktorStribiżew Could you give me an example? This is the first time I have heard of a built-in CSV parser. – Dom84 Jul 31 '17 at 12:11
  • See [How to split csv whose columns may contain ,](http://stackoverflow.com/questions/6542996/how-to-split-csv-whose-columns-may-contain/6543418#6543418). – Wiktor Stribiżew Jul 31 '17 at 12:16
  • Thank you. After some quick googling it seems that TextFieldParser is not an option when performance matters. (But I hadn't mentioned performance in my question.) https://www.dotnetperls.com/textfieldparser – Dom84 Jul 31 '17 at 12:54

2 Answers


Assuming a proper CSV that uses doubled quotes ("") for escaping and is read line by line, you can use

"(?:[^"]+|"")*"|[^;]+|(?<=;|^)(?=;|$)

Basically three different ways to match a column:

  • "(?:[^"]+|"")*" starting and closing quote with non-quotes or double quotes between
  • [^;]+ a series of non-semicolons
  • (?<=;|^)(?=;|$) an empty field between semicolons, or between a semicolon and the start/end of the line
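As a rough illustration of how the three alternatives work together, here is a minimal sketch. It is written in Python only to keep it short (the question targets C#), and the helper name `split_line` is made up for this example. Python's `re` module rejects the variable-width lookbehind `(?<=;|^)`, so it is rewritten as `(?:^|(?<=;))`; the .NET engine accepts the original form. The sketch assumes the line carries no trailing newline.

```python
import re

# Pattern from the answer. Only change: (?<=;|^) is written as (?:^|(?<=;)),
# because Python's re module requires fixed-width lookbehinds.
FIELD = re.compile(r'"(?:[^"]+|"")*"|[^;]+|(?:^|(?<=;))(?=;|$)')


def split_line(line):
    """Return the raw columns of one CSV line (surrounding quotes are kept)."""
    return [m.group(0) for m in FIELD.finditer(line)]


# The escaped test line from the question.
line = (';;"test123;weiterer Text";;"Test mit "" Zeichen im Spaltenwert";'
        'nächste Spalte mit "" Begrenzungszeichen;"4711";irgendwas 123,4;1222;"foo""test"')

for index, column in enumerate(split_line(line)):
    print(index, repr(column))
```

With the escaped test line this should yield ten columns, with "test123;weiterer Text" kept as one. The matches are raw: quoted columns still carry their surrounding quotes and the doubled "" inside them.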

Note:

  • if you want to use this in a multiline context, you have to add \n to the negated character classes
  • it doesn't handle leading or trailing spaces next to quoted fields
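To see what the second note means in practice, here is a small sketch, again in Python purely for brevity (the lookbehind is rewritten as `(?:^|(?<=;))` because Python's `re` needs fixed-width lookbehinds; `split_line` is a hypothetical helper name). A quoted column that has a space before its opening quote is no longer recognised as quoted, so the semicolon inside it splits the column.

```python
import re

# The answer's pattern, with the lookbehind adapted for Python's re module.
FIELD = re.compile(r'"(?:[^"]+|"")*"|[^;]+|(?:^|(?<=;))(?=;|$)')


def split_line(line):
    """Return every raw column match found in a single line."""
    return [m.group(0) for m in FIELD.finditer(line)]


tight = split_line('a;"b;c";d')      # quote directly after the delimiter
padded = split_line('a; "b;c" ;d')   # same column, padded with spaces

print(tight)   # the quoted column stays in one piece
print(padded)  # the padded column is cut at the inner semicolon
```

If such padding can occur in the real data, the pattern would need to be extended, or the input normalised first; which of the two fits better depends on the existing implementation.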

See https://regex101.com/r/twKZVN/1

(While regex101 tests a PCRE pattern, all features used are also available in a .NET pattern.)

Sebastian Proske
(?<=^|;)(\"[^"]*\";|\".\"|[^;]*)|[^;]+

Add this part to merge elements 2 and 3: \"[^"]*\";

[0] => Array
    (
        [0] => 
        [1] => 
        [2] => "test123;weiterer Text";
        [3] => 
        [4] => "Test mit " Zeichen im Spaltenwert"
        [5] => nächste Spalte mit " Begrenzungszeichen
        [6] => "4711";
        [7] => irgendwas 123,4
        [8] => 1222
        [9] => "foo"test"
    )
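A quick way to probe this pattern is to run it on a column that contains both an escaped quote and a semicolon. The sketch below uses Python purely for illustration (the question targets C#). Because Python's `re` only allows fixed-width lookbehinds, `(?<=^|;)` is rewritten as `(?:^|(?<=;))`, and the helper name `split_line` is hypothetical.

```python
import re

# Pattern from this answer, with the lookbehind adapted for Python's re module.
FIELD = re.compile(r'(?:^|(?<=;))("[^"]*";|"."|[^;]*)|[^;]+')


def split_line(line):
    """Return every raw match of the pattern in a single line."""
    return [m.group(0) for m in FIELD.finditer(line)]


simple = split_line('"test123;weiterer Text";x')  # delimiter inside quotes only
mixed = split_line('"foo""te;st"')                # escaped quote plus delimiter

print(simple)  # quoted column stays whole, but the match includes the ';'
print(mixed)   # column is still split at the inner ';'
```

Two things stand out: the delimiter after a quoted column becomes part of the match, and because `[^"]*` cannot step over an inner quote, a column with both a quote and a semicolon still falls apart.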
Kerwin
  • Do you have an idea how a column like this could also be matched? Currently it isn't: "foo"te;st" (quotation mark AND semicolon in the same column) – Dom84 Jul 31 '17 at 10:54