1

Hi, I have the following CSV input data, which contains several newline and carriage return characters. I am trying to clean up the file with sed:

"Data1","This<LF>
Is<LF>
Foobar"<CR><LF>
"Data2","Additional<LF>
Data<CR><LF>
With Inline CR LF<CR><LF>
End of Data."<CR><LF>

Note: <CR> and <LF> stand for actual \r and \n characters here.

I want to replace every linefeed that is not preceded by a double quote ("); the double quote character is important to consider here. I managed to filter out all linefeeds, but I do not know how to tell sed to ignore those that follow a specific pattern.

Output is expected to look like this:

"Data1","This Is Foobar"
"Data2","Additional Data With Inline CR LF End of Data."

Any ideas?

anubhava
skrskrskr
  • 1
    Post the expected output – sjsam Jul 14 '16 at 07:35
  • Do you have literal `<CR>` and `<LF>`, or actual `\n`, `\r` etc.? – anubhava Jul 14 '16 at 07:50
  • We are talking about actual linefeeds and carriage returns: \n \r – skrskrskr Jul 14 '16 at 07:54
  • sed reads one line at a time and strips the newline before putting what's left into the pattern space, so it's not really well suited to replacing newlines. [It seems possible though](http://stackoverflow.com/questions/1251999/how-can-i-replace-a-newline-n-using-sed). Awk is better for this, but sadly awk doesn't seem to support negative lookbehind. Even without lookbehind, though, it could be done via a capture group: change `([^\"])\r?\n` to `\1 `. – LukStorms Jul 14 '16 at 08:06
  • 1
    @EdMorton Wait, is that not what they wanted... I don't think I understood the requirements. – 123 Jul 14 '16 at 10:27
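
The capture-group idea in the comment above (replace a newline only when the character before it is not a double quote) can also be expressed line by line: keep joining lines until one ends in `"`, ignoring a trailing CR. The following is only a minimal sketch in POSIX awk, not something posted in the thread; the function name `join_until_closing_quote` is hypothetical, and it assumes that the last field of every record is quoted and that no line inside a field happens to end in a double quote.

```shell
# Hypothetical helper: join physical lines until one ends in a double quote.
join_until_closing_quote() {
    awk '
    {
        sub(/\r$/, "")                     # drop a trailing CR, if present
        buf = (pending ? buf " " : "") $0  # continue the pending record or start a new one
        pending = ($0 !~ /"$/)             # still open unless this line ends in a quote
        if (!pending) print buf
    }
    END { if (pending) print buf }         # flush an unterminated record
    ' "$@"
}
```

Usage would be `join_until_closing_quote file`, or piping the data in on standard input.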

2 Answers

1

You can use this GNU awk command, since your data contains actual \r and \n characters rather than the literal <CR> and <LF> shown in the question:

awk -v BINMODE=3 -v RS='"\r\n"' 's!=""{printf "%s\"\n\"", s} {
   s = $0; gsub(/\r?\n/, " ", s)} END{print s}' file

"Data1","This Is Foobar"
"Data2","Additional Data With Inline CR LF End of Data."
anubhava
  • 1
    Thanks @EdMorton for the tips. I've corrected the awk command now to take care of `"foo","bar"` input as well – anubhava Jul 14 '16 at 09:52
0

Using GNU awk for multi-char RS and RT:

$ cat tst.awk
BEGIN { RS="\"[^\"]*\"" }
RT != "" {
    gsub(/\r/,"")
    gsub(/[\r\n]+/," ",RT)
    printf "%s%s", $0, RT
}
END { print "" }

$ awk -f tst.awk file
"Data1","This Is Foobar"
"Data2","Additional Data With Inline CR LF End of Data."
Ed Morton
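
Both answers above rely on GNU awk features (a multi-character `RS`, and `RT` in the second one). Where only a POSIX awk is available, one possible alternative, which is neither answerer's method, is to count double quotes: a record is complete once the number of quotes seen so far is even. This is a minimal sketch under that assumption (quotes inside fields are either absent or doubled, as in standard CSV); the function name `join_quoted_lines` is hypothetical.

```shell
# Hypothetical helper: join physical lines until the double quotes balance.
join_quoted_lines() {
    awk '
    {
        sub(/\r$/, "")                       # drop a trailing CR, if present
        buf = (buf == "" ? $0 : buf " " $0)  # append this line to the pending record
        n = gsub(/"/, "&", buf)              # count the quotes without changing buf
        if (n % 2 == 0) {                    # even count: every quoted field is closed
            print buf
            buf = ""
        }
    }
    END { if (buf != "") print buf }         # flush an unterminated record
    ' "$@"
}
```

It can be called as `join_quoted_lines file` or fed through a pipe. Unlike checking only the end of each line, the quote count also copes with records whose last field is unquoted.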