1

I have a large csv file with about 100 double quoted "text fields" per line. Many of the lines have a \r\n embedded in a double quoted field. The \r\n pair is also used for line termination.

How can the \r\n pairs be removed from the double quoted fields and not impact the \r\n line terminations.

I have tried creating individual sed scripts to identify the particular embedded sequences. That sort of worked, but the number of scripts became unmanageable. I have also tried using a 'tr -d '\r' command, that did not work.

dan sawyer
  • 193
  • 1
  • 1
  • 12

1 Answers1

1

For this purpose you can use perl.

perl -0pe 's/"(.*?)([[:space:]]+)(.*?)"/"\1 \3"/g' input.csv > output.csv

perl -pe 's/../../g' file is analogous to sed -e 's/../../g' file but using Perl compatible regular expressions (PCRE). See this or this for more details.

With the option zero perl -0 we are considering the null character as input record separator (see xargs -0 ... or find . -print0 ...).

All versions of sed supports Basic Regular Expressions (BRE) and some versions, as GNU sed (using sed -E), supports Extended Regular Expressions (ERE), a superset of BRE. Both in BRE and ERE the quantifiers * and + are always greedy, they match as many characters as possible. In some contexts, however, we need to match as few characters as possible. With PCRE we can do the quantifiers * and + lazy or minimal or reluctant using *? and +? respectively.

In order to remove \r\n inside double quotes preserving any other line break, we use lazy matching: "(.*?)([[:space:]]+)(.*?)". We also use backreferences "\1 \3" omitting the second matched parentheses which contains any sequence of white spaces (characters \t,\r, \n, \v or \f), in particular sequences of \r\n.

This approach works in CSV files in which "lines have a \r\n embedded in a double quoted field" as is your case. If input.csv contains cells with three or more paragraphs separated by \r\n, we must modify the regular expression.

lezambranof
  • 141
  • 1
  • 5