I have CSV files with newlines within fields. I would like to remove those newlines without removing the newline at the end of each row.

Each row ends with a closing double quote, like so:

...;"25.33"\n

So in order to remove the newlines within the fields I try to remove every newline that is not preceded by a double quote. The regular expression for that would be: [^"]\n

And in sed:

sed -i -E "s/[^"]\n/ /g" *.csv # a newline not following a double quote

I get a complaint in bash:

➜ sed -i -E "s/[^"]\n/ /g" *.csv
dquote>

Obviously I have to escape the quote within the brackets:

sed -i -E "s/[^\"]\n/ /g" *.csv

But that won't work either:

➜  csv_working_copy1 sed -i -E "s/[^\"]\n/ /g" *.csv
sed: RE error: illegal byte sequence

What am I missing?


Example

This is an example row

"2019-03-17";"Comment \n
with newline within it";"23.88"\n

I would like to have this output

"2019-03-17";"Comment with newline within it";"23.88"\n
Ugur

3 Answers

Use single quotes around the sed expression instead of the outer double quotes:

sed -i -E 's/[^"]\n/ /g' *.csv
monok
  • Tried that as well, but I get the same error: `➜ sed -i -E 's/[^\"]\n/ /g' *.csv sed: RE error: illegal byte sequence` Maybe I need to add that I use a Mac + iTerm? – Ugur Mar 10 '19 at 09:07
  • Don't escape the " with \", please see my example. – monok Mar 10 '19 at 09:12
  • I am sorry. You are right. But still the same error :( `sed -i -E 's/[^"]\n/ /g' *.csv sed: RE error: illegal byte sequence` – Ugur Mar 10 '19 at 09:15
  • Obviously the error resulted from the content of the file. It's a Mac-specific error. I should have googled "sed: RE error: illegal byte sequence". This would have led me to https://stackoverflow.com/questions/19242275/re-error-illegal-byte-sequence-on-mac-os-x – Ugur Mar 10 '19 at 11:00
  • No need for -i here; the command does nothing. sed never sees \n! – ctac_ Mar 10 '19 at 11:48
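
The last two comments point at the two things the original command was missing. sed reads its input one line at a time, so the pattern space never contains \n unless the script pulls in the next line with N; and the "illegal byte sequence" error usually goes away when sed is told to treat the input as raw bytes via LC_ALL=C. A minimal sketch, assuming (as stated in the question) that every complete row ends with a closing double quote; the function name join_broken_rows and the sample file are made up for illustration:

```shell
# Hypothetical helper: keep appending the next line until the
# pattern space ends with a double quote, dropping the embedded newline.
join_broken_rows() {
    # LC_ALL=C makes sed process bytes instead of multibyte characters.
    LC_ALL=C sed -e ':a' -e '/"$/!{' -e '$!{' -e 'N' -e 's/\n//' -e 'ba' -e '}' -e '}' "$@"
}

# Sample row from the question, split across two physical lines.
sample=$(mktemp)
printf '%s\n' '"2019-03-17";"Comment ' 'with newline within it";"23.88"' > "$sample"

join_broken_rows "$sample"
rm -f "$sample"
```

The sample row already has a space before the line break, so the newline is simply deleted; use s/\n/ / instead if a separator is needed. To edit files in place, -i.bak (suffix attached directly) should work with both GNU and BSD sed, whereas on macOS a bare -i followed by -E is likely to be read as "use -E as the backup suffix". Files with Windows line endings would need the carriage returns stripped first, since their rows end in "\r rather than ".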

Here is an awk that should handle it:

$ awk -v RS="^$" '{            # slurp the whole file as one record (gawk idiom; other awks may differ)
    for(i=1;i<=length;i++) {   # iterate over the file one char at a time
        c=substr($0,i,1)       # read char
        if(c=="\"")            # if it's a quote
            f=!f               # ... flag up, or down if already up
        if(c=="\n" && f)       # if it's a newline and the flag is up, i.e. within quotes
            c=""               # replace newline with the empty string
        printf "%s",c          # print char
    }
}' file

Output with the sample:

"2019-03-17";"Comment \nwith newline within it";"23.88"\n

More records:

$ awk ... file file file
"2019-03-17";"Comment \nwith newline within it";"23.88"\n
"2019-03-17";"Comment \nwith newline within it";"23.88"\n
"2019-03-17";"Comment \nwith newline within it";"23.88"\n

Naturally, it assumes the quotes are balanced; a stray quote will throw it off.

Update: another, shorter solution:

$ awk '{if((c+=gsub(/"/,"&"))%2==0)print;else printf "%s",$0}' file

Explained:

$ awk '{
    if((c+=gsub(/"/,"&"))%2==0)  # keep count of quotes, if count is even:
        print                    # print with newline
    else                         # else
        printf "%s",$0           # omit newline
}'
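
The quote-counting idea can be tried on the sample row from the question without touching any real files. The sketch below wraps the same logic in a shell function; the name join_by_quote_parity is an assumption made for this example, not part of the answer:

```shell
# Hypothetical wrapper around the quote-counting one-liner.
join_by_quote_parity() {
    awk '{
        c += gsub(/"/, "&")      # running total of double quotes seen so far
        if (c % 2 == 0)
            print                # even total: the record is complete
        else
            printf "%s", $0      # odd total: still inside a quoted field
    }' "$@"
}

# Sample row from the question, split across two physical lines.
printf '%s\n' '"2019-03-17";"Comment ' 'with newline within it";"23.88"' |
    join_by_quote_parity
```

Because a doubled quote ("") inside a field adds two to the count, CSV-style escaped quotes should not disturb the parity; a single unbalanced quote anywhere in the file will.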
James Brown

Another awk:

awk '!($0~"\"$"){a=a$0;next}{$0=a $0;a=""}1' infile
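
Spelled out, the one-liner buffers every line that does not end with a double quote and prints the buffer together with the first line that does. The sketch below is an expanded reading of it, not the author's exact code: the function name join_until_closing_quote and the variable name buf are made up here, and the END block is an addition so that a trailing, unterminated record is printed rather than silently dropped.

```shell
# Hypothetical expanded version of the one-liner above.
join_until_closing_quote() {
    awk '
        # Line does not end with a double quote: stash it and read on.
        !/"$/ { buf = buf $0; next }

        # Line ends the record: print stashed text plus this line, reset.
        { print buf $0; buf = "" }

        # Addition (not in the original): flush an unterminated last record.
        END { if (buf != "") print buf }
    ' "$@"
}

# Sample row from the question, split across two physical lines.
printf '%s\n' '"2019-03-17";"Comment ' 'with newline within it";"23.88"' |
    join_until_closing_quote
```

Like the sed approach in the question, this relies on every complete row ending with a closing double quote; a line break that falls directly after a quote inside a field would be mistaken for the end of a row.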
ctac_