We are trying to process a tsv file. The fools who made it allowed newlines in some columns, which cause issues now. Luckily these column values with the newlines in them are always contained in double quotes so it is possible to replace them (the newline characters) with spaces using regex. We found this post with the perfect solution to do this, as such:
python: re.sub(r'"[^"]*(?:""[^"]*)*"', lambda m: m.group(0).replace("\n", ""), s)
However, in some unfortunate cases, some other fool entered a unpaired double quote (i.e. not closed) into some other data field and this throws the whole thing off.
We thought about it and we believe everything can be solved if we just refine the regex condition so that the 'selected' text (in which new line characters are to be replaced) so that it may not include any tab characters.
In other words: this "[^"]*(?:""[^"]*)*"
needs to exclude substrings that contain a tab character.
How can we do that?