1

I have a .csv file with this error that I want to correct with regular expression, some fields contain line breaks, example:

"abc

de
f 123",123,456

Should be in one line, I want to remove line-breaks only but to keep the text

"abcdef 123",123,456

I tried to isolate within double quotes, but this removes only first line-break:

^(?:"[0-9a-zA-Z])\r?\n(?=",)

(Text can include numbers and other characters, so I tried to include them, I only want to remove line-breaks and keep all other characters, hope it is clear)

Dan
  • 15
  • 3
  • Sounds like you're missing the multiline flag which is required since you are using the `^` anchor – MonkeyZeus Sep 04 '19 at 13:20
  • If you cannot define left and right hand contexts for the open/close quotes, you can't achieve it with one regex (because the left and right hand delimiters are identical). The best approach is to read the file with some CSV parser and format as you see fit. – Wiktor Stribiżew Sep 04 '19 at 13:42
  • Any example? Tried `^(?:"[0-9a-zA-Z])?m\r?\n(?=",)` with no result... – Dan Sep 04 '19 at 13:51
  • Can you assume that the first line will always start with a double quote ? If so you could try `\r?\n(?!")` and replace it with nothing : remove all line breaks, except when there's a double quote at the start of the next line – LogicalKip Sep 04 '19 at 15:06
  • @WiktorStribiżew I define left with `"` and right with `",` – Dan Sep 04 '19 at 16:00
  • @LogicalKip Tried it, removes all line breaks in entire file. – Dan Sep 04 '19 at 16:01
  • Try a PCRE `(?:\G(?!\A)|")[^",\r\n]*(?:"(?!,)[^",\r\n]*)*\K\R` pattern and replace with empty string. – Wiktor Stribiżew Sep 04 '19 at 16:16
  • @Dan Could you provide a counter example ? I put your example in a file, then on the second line `"test", 456, 789`, used my editor's search and replace, replaced my regex with [nothing], and it certainly didn't remove all line breaks, it gave me 2 lines : `"abcdef 123",123,456` and `"test", 456, 789`. The only problem I see happening is when the problematic line break is just before the second quote, but maybe in your case that never happens. – LogicalKip Sep 05 '19 at 06:19

1 Answers1

0

It's not possible to capture multiple newlines in a single query if you're using a repeated capture group; the regex engine can only grab the last match. That being said, if you're using powergrep (or some other search-and-replace that can selectively replace capture groups, not the whole match), do you really need a one-liner?

^"(?:[^"\n]|(\n+))*", will look between " and ", and match any amount of text that doesn't include a ", keeping it within your quoted statement - or, alternatively, will capture the last group of newlines it finds. If your tool can remove/replace text only in the capture group, why not use this regex a few times in a row? It'll leave your typo-free rows alone, but will remove one chunk of whitespace from your erroneous rows each time it's run. (Try it here! Note that this matches all lines, but only captures whitespace for the badly formatted ones)

How will you know when you're done? Try using ^(?=.*\n.*)"[^"]*", - it'll match any lines in your csv file that still have newlines, but will ignore properly formatted lines. When this regex returns no matches, you can be confident your file is typo-free. (Try it here!)

It's not a very elegant solution, but if you run it enough times, you'll get rid of all the whitespace.

Nick Reed
  • 4,989
  • 4
  • 17
  • 37