Regex expression to replace newlines between quotation marks, but only if there's no tabs in-between

Question

We are trying to process a tsv file. The fools who made it allowed newlines in some columns, which cause issues now. Luckily these column values with the newlines in them are always contained in double quotes so it is possible to replace them (the newline characters) with spaces using regex. We found this post with the perfect solution to do this, as such:

python: re.sub(r'"[^"]*(?:""[^"]*)*"', lambda m: m.group(0).replace("\n", ""), s)

However, in some unfortunate cases, some other fool entered a unpaired double quote (i.e. not closed) into some other data field and this throws the whole thing off.

We thought about it and we believe everything can be solved if we just refine the regex condition so that the 'selected' text (in which new line characters are to be replaced) so that it may not include any tab characters.

In other words: this "[^"]*(?:""[^"]*)*" needs to exclude substrings that contain a tab character.

How can we do that?

If you do not want to match TAB chars with the above regex, just add it to the negated character classes: `"[^"\t]*(?:""[^"\t]*)*"`. Note it might not work as expected by matching between unwanted quotes. Then use `"[^"]*(?:""[^"]*)*"` and check for a TAB in the lambda, `re.sub(r'"[^"]*(?:""[^"]*)*"', lambda m: m.group().replace("\n", "") if "\t" not in m.group() else m.group(), s)` — Wiktor Stribiżew, Aug 23 '22 at 12:14

score 1 · Accepted Answer · answered Aug 28 '22 at 19:53

If you do not want to match TAB chars with your regex, you just need to add \t to the negated character classes:

"[^"\t]*(?:""[^"\t]*)*"

See the regex demo.

Details:

" - a double quote
[^"\t]* - zero or more chars other than double quotes and tab chars
(?:""[^"\t]*)* - zero or more sequences of
- "" - two double quotes
- [^"\t]* - zero or more chars other than double quotes and tab chars
" - a double quote.

NOTE it might not work as expected by matching between unwanted quotes. Then use your "[^"]*(?:""[^"]*)*" regex and check for a tab char in the lambda expression, e.g.

re.sub(r'"[^"]*(?:""[^"]*)*"', lambda m: m.group().replace("\n", "") if "\t" not in m.group() else m.group(), s)

Or something similar.

score -1 · Answer 2 · answered Aug 28 '22 at 14:21

-1

This was easier than we thought. The search expression can just be refined to exclude any tab characters:

"[^"]*(?:""[^"]*)*"  ---> "[^"^\t]*(?:""[^"]*)*"

answered Aug 28 '22 at 14:21

levraininjaneer

1,157
2
18
38

t is not done like this here, if my suggestion worked let me explain in the answer, please remove yours. – Wiktor Stribiżew Aug 28 '22 at 19:45

Regex expression to replace newlines between quotation marks, but only if there's no tabs in-between

2 Answers2

Linked