For this purpose you can use perl
.
perl -0pe 's/"(.*?)([[:space:]]+)(.*?)"/"\1 \3"/g' input.csv > output.csv
perl -pe 's/../../g' file
is analogous to sed -e 's/../../g' file
but using Perl compatible regular expressions (PCRE). See this or this for more details.
With the option zero perl -0
we are considering the null character as input record separator (see xargs -0 ...
or find . -print0 ...
).
All versions of sed
supports Basic Regular Expressions (BRE) and some versions, as GNU sed (using sed -E
), supports Extended Regular Expressions (ERE), a superset of BRE. Both in BRE and ERE the quantifiers *
and +
are always greedy, they match as many characters as possible. In some contexts, however, we need to match as few characters as possible. With PCRE we can do the quantifiers *
and +
lazy or minimal or reluctant using *?
and +?
respectively.
In order to remove \r\n
inside double quotes preserving any other line break, we use lazy matching: "(.*?)([[:space:]]+)(.*?)"
. We also use backreferences "\1 \3"
omitting the second matched parentheses which contains any sequence of white spaces (characters \t
,\r
, \n
, \v
or \f
), in particular sequences of \r\n
.
This approach works in CSV files in which "lines have a \r\n
embedded in a double quoted field" as is your case. If input.csv
contains cells with three or more paragraphs separated by \r\n
, we must modify the regular expression.