1

I'm trying to use sed to find and replace globally in a csv file where every field is delimited by " and separated by , but where some content of a field can also contain ". I am trying to find the occurrences where the last character in a field is " and insert a space after that so that the field ends with a space instead of a ".
Note that there could be multiple fields in a row where the last character of a field could be ".

As an example, here is some file content (4 rows)...

"123","def","","",""
"456","seven eight "nine" ten","","",""
"789"."twenty thirty sixty "seven"","","",""
"303030","one two "three" "four"","five "six"","",""

and it should become...

"123","def","","",""
"456","seven eight "nine" ten","","",""
"789"."twenty thirty sixty "seven" ","","",""
"303030","one two "three" "four" ","five "six" ","",""

i.e. 3 places where the space was inserted: once in the 3rd row and twice in the fourth row.

Currently I got as far as:

1,$ s/[^,]"",/" ",/g

so it finds all the occurrences but does not preserve the character before the match, so I get the result...

"123","def","","",""
"456","seven eight "nine" ten","","",""
"789"."twenty thirty sixty "seve" ","","",""
"303030","one two "three" "fou" ","five "si" ","",""

How to get the desired output with sed? Or maybe with awk?

Thanks.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 2
    Having an unescaped double quote within a double quote delimited field is invalid per the RFC for CSVs and per Excel (defacto CSV standard). Within double quotes a double quote should be escaped by doubling it `"this is ""one"" way"` or far less commonly preceding with a backslash `"This is \"the other\" way"`. Fix whatever tool is generating your invalid CSV or at least fix it in the output of the new tool you're asking for help to write rather than introducing yet another non-standard oddity to your CSV and then see https://stackoverflow.com/q/45420535/1745001 for how to parse it with awk. – Ed Morton Apr 11 '18 at 14:11
  • Unfortunately I am receiving the files from a customer who is not willing to adjust the format for us the supplier, so I have to work around the issue. – user5568662 Apr 11 '18 at 14:25
  • OK then you should at least fix it in the output of the tool you're currently writing so other tools can work with it. – Ed Morton Apr 11 '18 at 15:44

1 Answers1

1

You need to make a capture group and use back-reference in replacement:

sed -E 's/([^,"])""/\1" "/g' file

"123","def","","",""
"456","seven eight "nine" ten","","",""
"789"."twenty thirty sixty "seven" ","","",""
"303030","one two "three" "four" ","five "six" ","",""

To save changes inline use:

sed -i.bak -E 's/([^,"])""/\1" "/g' file
anubhava
  • 761,203
  • 64
  • 569
  • 643