csv file - remove line feed in the middle of string

Question

I have a csv file like below. Sometimes there is line feed in the middle column. is there any way to replace line feed in unix with space using sed, awk or anything else? -

COLA,COLB,COLC,COLD
test,test2,"test3",test4
test5,"test6
test666
test67","test7",test8
test9,test10,test11,test12

in above data the second record is actually one record -

test5,"test6
test666
test67","test7",test8

while I am a big awk fan, this is the point where I say ... don't use awk! There is no easy way to parse such multi-line-fields in CSV files. A true CSV parser is needed for this. I advise you to use tools such as perl or python for this who have extensive libraries to handle this easy and correct. GNU awk might have CSV extensions, but so far they are not stable yet. — kvantour, Feb 11 '20 at 21:54
@kvantour a great tool to parse a CSV is Miller. My reply https://stackoverflow.com/a/60178030/757714 — aborruso, Feb 11 '20 at 22:07
See [whats-the-most-robust-way-to-efficiently-parse-csv-using-awk](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) — Ed Morton, Feb 11 '20 at 22:16

score 3 · Answer 1 · answered Feb 11 '20 at 22:01

Using Miller (https://github.com/johnkerl/miller) and running

mlr --csv put '
  for (k in $*) {
    $[k] = gsub($[k], "\n", " ");
  }
' input.csv >output.csv

you will have

COLA,COLB,COLC,COLD
test,test2,test3,test4
test5,test6 test666 test67,test7,test8
test9,test10,test11,test12

score 2 · Answer 2 · answered Feb 11 '20 at 21:44

Perl to the rescue!

perl -MText::CSV_XS=csv -we 'csv( in    => "file.csv",
                                  on_in => sub { $_[1][1] =~ s/\n/ /g } )'

on_in specifies a callback to handle input; $_[1] contains the current row, so $_[1][1] corresponds to column 1 (0-based, i.e. the 2nd column).

See Text::CSV_XS.

ghoti · Answer 3 · 2020-02-11T21:54:53.400

The standard command-line tools, even awk, don't fully understand CSV. You may be able to fashion something that works properly with a more advanced language like python/perl/ruby/GO/etc.

If all you want to do is collapse split lines, the following hack might do:

awk -F, 'NF==4{print;next} NF<4{x=(x~/./?x" ":"") $0; if(split(x,a,FS)>=4){print x;x=""}}' input.csv

Or for easier reading:

BEGIN {
 FS=OFS=","
}

NF==4 {
  print
  next
}

NF<4 {
  x=(x ~ /./ ? x" " : "") $0
  if (split(x,a,FS)>=4) {
    print x
    x=""
  }
}

This will print-and-move-on if a line has 4 fields. For any smaller field count, it will begin concatenating lines until it sees at least 4 fields, then will print what it's collected. Note that the return value of the split() function should be the item count of the array created by the split.

potong · Answer 4 · 2020-02-12T14:07:05.070

0

This might work for you (GNU sed):

sed -E ':a;/^(([^",]*,)*("[^"]*",)*)*"[^"]*$/{N;s/\n/,/;ta}' file

If a line contains an unmatched ", append the next line, replace the newline by , and repeat.

edited Feb 12 '20 at 14:07

answered Feb 12 '20 at 12:37

potong

55,640
6
51
83

csv file - remove line feed in the middle of string

4 Answers4