
I tried different awk methods to achieve this, but since I don't really understand how awk works, I didn't succeed.

So, I have a (large) CSV file that contains multi-line entries such as this:

"99999";"xyz";"text

that has

multiple newlines";"fdx";"xyz"

I need to get rid of those extra newlines in between the quotes.

Since every line ends with a double quote followed by a newline, I thought I could create a command that replaces all newlines except those preceded by a double quote.

How would I do that?

tripleee
Swissdude
    What did you search for, and what did you find? What did you try, and how did it fail? This may be easier with a tool such as Python which has a robust and well-tested CSV parser, rather than building one again from first principles yourself. – tripleee Dec 29 '21 at 13:05
  • But essentially, if you have unpaired quotes, collect more lines until you have even quotes again, then post-process the data you collected. – tripleee Dec 29 '21 at 13:05
  • Try this `sed` command: `sed '/^$/D' infile | sed -e :a -e '$!N;s/[^"]\n"/"/;ta' -e 'P;D'` – fpmurphy Dec 29 '21 at 13:26
  • Thanks guys, I found the regex that would be necessary to achieve what I want: `(?<!")\n` but sadly awk doesn't support lookbehind/lookahead. – Swissdude Dec 29 '21 at 13:51
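The quote-counting approach tripleee describes above can be sketched in plain awk: keep accumulating physical lines until the buffered record contains an even number of double quotes, then print it. This is only a sketch, and it assumes quotes are never escaped inside fields; `file.csv` is a placeholder name.

```shell
awk '
{
    # Join this physical line onto the buffered record (a blank as the glue)
    buf = (buf == "" ? $0 : buf " " $0)
    # gsub() returns the number of substitutions made, so replacing each
    # double quote with itself counts the quotes without changing buf
    if (gsub(/"/, "\"", buf) % 2 == 0) {
        print buf          # all quoted fields are closed: record complete
        buf = ""
    }
}
' file.csv
```

On the sample input above this emits the whole entry as one line, with each embedded line break replaced by a blank.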

1 Answer


Chances are all you need is this, using GNU awk for multi-char RS:

awk -v RS='\r\n' '{gsub(/\n/," ")}1' file

since your input is probably a CSV exported from a Windows tool like Excel and so has \r\n "line" endings but individual \ns for newlines within fields.
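If you want to verify that assumption before committing to `RS='\r\n'`, one quick sanity check is to count the lines whose last character is a carriage return (a sketch, with `file.csv` again standing in for the real file):

```shell
# Count CRLF-terminated lines; if the count equals the total line count,
# the file uses \r\n endings and RS='\r\n' is the right record separator.
cr=$(printf '\r')                    # a literal carriage-return character
crlf=$(grep -c "$cr\$" file.csv)     # lines whose last character is \r
total=$(wc -l < file.csv)
[ "$crlf" -eq "$total" ] && echo "CRLF line endings throughout"
```

If the two counts differ, only some lines are CRLF-terminated and the `\r\n` record separator won't split the file where you expect.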

Alternatively, again using GNU awk for multi-char RS and RT:

$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file
"99999";"xyz";"text  that has  multiple newlines";"fdx";"xyz"

or if you want all the chains of newlines compressed to single blanks:

$ awk -v RS='"[^"]+"' -v ORS= '{gsub(/\n+/," ",RT); print $0 RT}' file
"99999";"xyz";"text that has multiple newlines";"fdx";"xyz"

If you need anything else, including being able to identify and use the individual fields on each input "line", see What's the most robust way to efficiently parse CSV using awk?.
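As a taste of what that linked question deals with, here is a minimal, deliberately non-robust sketch of pulling the individual fields out of one already-joined record with POSIX awk's `match()`. It treats a field as either a quoted string or a run of non-`;` characters, and assumes no escaped quotes or `;`s inside unquoted fields:

```shell
echo '"99999";"xyz";"text that has multiple newlines";"fdx";"xyz"' |
awk '{
    n = 0
    # Peel off one field at a time: a quoted string or a bare ;-free run
    while (match($0, /"[^"]*"|[^;]+/)) {
        print ++n ": " substr($0, RSTART, RLENGTH)
        # Drop the field just printed plus the ";" that follows it
        $0 = substr($0, RSTART + RLENGTH + 1)
    }
}'
```

which prints `1: "99999"` through `5: "xyz"`, one field per line. The robust version in the linked question exists precisely because real CSVs break these assumptions.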

Ed Morton
  • Tried this already. But unfortunately it's not an Excel export; the file comes from a CRM that I don't have control of. – Swissdude Dec 29 '21 at 13:50
  • OK, then it's a dup of the existing question I linked which handles such files even when the fields contain escaped quotes or separator chars (`;`s in your case) which you have to assume can happen since you have no control of the input. – Ed Morton Dec 29 '21 at 13:53
  • OMG - that looks awfully complicated. I'll give it a try, though. Thanks! – Swissdude Dec 29 '21 at 13:57
  • Yup, it's a complicated problem to solve robustly and efficiently for the general case. Alternatively you can use a tool that specifically supports parsing CSVs. You're welcome. – Ed Morton Dec 29 '21 at 13:59
  • If awk supported lookbehind, it'd be a breeze, since this `awk -v RS='(?<!")\n' -v ORS= '{gsub(/\n/, " ", RT); print $0 RT}' file` would do it… :) but lookarounds aren't supported, so this produces an error – Swissdude Dec 29 '21 at 14:02
  • But then it'd be using a far less efficient regexp engine and it still wouldn't solve the general problem of parsing CSVs which can contain `;`s (in your case) and idk if there's other cases it wouldn't solve as I don't speak PCRE :-). You **can** solve just the problem of newlines within quoted fields with far less code than shown in the linked question, but that's not the only problem with CSVs with quoted fields. – Ed Morton Dec 29 '21 at 14:05
  • @Swissdude I updated my answer to show a way to solve JUST the newlines-in-quotes problem you asked about by using an ERE in the way I think you're suggesting using a PCRE. – Ed Morton Dec 29 '21 at 14:33