Would you please try a perl solution:
perl -0777 -pe 's/\r?\n(?=,)//g; s/(?<=,)\r?\n//g; s/\r//g' test.csv
Output:
234,aa,bb,cc,30,dd
22,cc,ff,dd,40,gg
pxy,aa,,cc,40,dd
- The -0777 option tells perl to slurp the whole file, line endings included, as a single string.
- The -pe option combines -p, which runs the script on each input record (here the whole file) and prints the result, with -e, which takes the next argument as the perl script.
- The regex \r?\n(?=,) matches an optional CR character followed by a NL character, with a positive lookahead for a comma.
- The substitution s/\r?\n(?=,)//g then removes the line endings that match the condition above. The following comma is not removed, because lookaround assertions do not consume characters.
- The substitution s/(?<=,)\r?\n//g is the mirrored version: it uses a lookbehind to remove the line endings that follow a comma.
- The final s/\r//g removes any remaining CR characters.
[Edit]
As the perl script above slurps the whole file into memory, it may be slow if the file is huge. Here is an alternative which processes the input line by line using a state machine.
awk -v ORS="" '                 # empty the output record separator
    /^\r?$/ {next}              # skip blank lines
    f && !/^,/ {print "\n"}     # break the line if the flag is set and the line does not start with a comma
    {
        sub(/\r$/, "")          # remove trailing CR character
        print                   # print current line (w/o newline)
        if ($0 ~ /,$/) f = 0    # if the line has a trailing comma, clear the flag
        else f = 1              # if the line properly ends, set the flag
    }
    END {
        print "\n"              # append the newline to the last line
    }
' test.csv
BTW, if you want blank lines between the records, as in the posted expected output, which looks like:
234,aa,bb,cc,30,dd

22,cc,ff,dd,40,gg

pxy,aa,,cc,40,dd
then append another \n to the first print statement:
f && !/^,/ {print "\n\n"}
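The line-by-line approach and the blank-line variant can be checked side by side with a small sketch. The join_records wrapper, its sep parameter, and the sample content are all assumptions made for this illustration; the awk logic is the same state machine, with the if/else on the flag folded into one assignment.

```shell
# Hypothetical wrapper: $1 is the separator printed between records,
# $2 is the input file. awk expands the \n escapes in the -v value.
join_records() {
  awk -v ORS="" -v sep="$1" '
    /^\r?$/ {next}              # skip blank lines
    f && !/^,/ {print sep}      # previous line was complete and this one starts a record
    {
      sub(/\r$/, "")            # remove trailing CR character
      print                     # print current line (w/o newline)
      f = ($0 !~ /,$/)          # flag is set unless the line ends with a comma
    }
    END {print "\n"}            # terminate the last record
  ' "$2"
}

# Made-up sample: CRLF endings, blank lines, breaks before and after a comma.
sample=$(mktemp)
printf '234,aa,bb,cc\r\n\r\n,30,dd\r\n22,cc,ff,dd,\r\n\r\n40,gg\r\npxy,aa,,cc,40,dd\r\n' > "$sample"

single=$(join_records '\n' "$sample")     # one record per line
double=$(join_records '\n\n' "$sample")   # blank line between records

printf '%s\n' "$single"
printf '%s\n' '--'
printf '%s\n' "$double"
rm -f "$sample"
```

Passing the separator as a variable is only a convenience for comparing the two outputs; editing the print statement directly, as described above, has the same effect.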