1

I've got some pipe-delimited data that looks like this:

    field|field|field|field|another "field"|field
    field|field|field|"another" field|field|field
    field|"fie|ld"|field|field|field|field

And the problem that I'm having is that the double-quotes aren't properly formed for a delimited field (fields containing the delimiter, or the quotes, should itself be wrapped with double-quotes, and the double-quotes should be excaped with another double-quote). This is what the output should be:

    field|field|field|field|"another ""field"""|field
    field|field|field|"""another"" field"|field|field
    field|"fie|ld"|field|field|field|field

Anyone know if there is an easy way to do this with a regular expression?

m0meni
  • 16,006
  • 16
  • 82
  • 141
user1399233
  • 73
  • 1
  • 8
  • What programming language are you using? This is important because different languages use different regex syntax. – 4castle Jun 13 '16 at 20:48
  • I don't have a language constraint, but I would probably run it under bash in a *nix environment. Or Perl. We're unfortunately mostly SQL Server/SSIS here FYI. Thx. – user1399233 Jun 13 '16 at 20:56
  • 1
    You have to solve this riddle before you should even try this. Which comes first, the _pipe_ or the _quote_ ? And what if your dirty data contains a single quote `...|..."...|...` ? By _single quote_ I am referring to the _unbalanced_ condition where there is an _odd_ number of quotes. The latter condition nullifies any possible attempt you might be considering. If it is, or can never be unbalanced, it has to be the first thing validated before even trying to do this. –  Jun 13 '16 at 21:46
  • @sln The data seems to already be valid according to their test case. It just needs some additional formatting for the quotes within fields. – 4castle Jun 13 '16 at 22:08
  • @sin the pipe always comes first. A single quote would still be escaped by another quote, AND that field would be wrapped with quotes as well. – user1399233 Jun 14 '16 at 00:45

1 Answers1

1

Only some of this operation is suitable for regex. Other parts are better suited for regular program flow. Regex can still be an integral part of the operation, but use built-in string manipulation where possible.

Using PCRE:

  1. Read in one line of the input string.
  2. Split the line at any match to "[^"]*"(*SKIP)(*F)|\| to get a list of fields.
  3. For each field:
    1. If it contains a match to (?<!^)"(?!$), surround the field in quotes.
    2. Afterwards, replace every match to (?<!^)"(?!$) with "".
  4. Rejoin the list of fields and output it as a line of the new CSV.
  5. Go back to step #1 until all lines have been processed.

Regex #1 Test
Regex #2 Test (Stage 1)
Regex #2 Test (Stage 2)

My attempt at translating this to PERL

4castle
  • 32,613
  • 11
  • 69
  • 106
  • If you don't use `[^|"]`, it opens the door for an unbalanced condition, since quotes are used as in-field delimiters as well. –  Jun 13 '16 at 21:50
  • @sln I built the regex so that it when it encounters a quote, it ignores everything until it finds the next quote. You should try it out. This is assuming that there aren't any mismatched quotes (but if that was a concern it would be specified by the OP). – 4castle Jun 13 '16 at 22:11
  • Thanks, @4castle. I will be trying this approach out tonight. I'm not sure exactly how to implement it but thank you very much for the info. --fc – user1399233 Jun 13 '16 at 22:30
  • 4castle, this is a good start but it does get hung up on [`" "field" | "field" "`](https://regex101.com/r/aG4bC3/1) – Ro Yo Mi Jun 13 '16 at 22:52
  • @RoYoMi Good catch. I'm switching over to my alternative regex which splits the string. – 4castle Jun 13 '16 at 23:01
  • `if ($string =~ m/(?:"[^"]*"|[^|]+)*/) { if ($string =~ m/(?<!^)"(?!$)/) { $string = '"' . $string . '"'; $string =~ s/(?<!^)"(?!$)/""/g; } }` – user1399233 Jun 13 '16 at 23:02
  • @user1399233 I don't know PERL, but it looks like that code is missing a loop? Also, I changed my algorithm slightly so that it's more reliable. It now uses the first regex to split the string into a list of fields which you can iterate over. – 4castle Jun 13 '16 at 23:10
  • I updated my answer with a link to my attempt at writing PERL. It's probably not very good, but it's my first time. – 4castle Jun 13 '16 at 23:47
  • Just a recap here - I was able to wrap the regular expressions above (actually, I used 4castle's Perl script as a basis) into a command line Perl script and scrub my files. It worked successfully, except where there is a double-quote at the beginning of a field, and I manually edited those records. Again, thanks to everyone who helped me out here, your suggestions have saved me a lot of manual editing. fc – user1399233 Jun 14 '16 at 19:09