
I have a file like this:

col1×col2×col3
12×"Some field with "quotes" inside it"×"Some field without quotes inside but with new lines \n"

And I would like to replace the interior double quotes with single quotes so the result will look like this:

col1×col2×col3
12×"Some field with 'quotes' inside it"×"Some field without quotes inside but with new lines \n"

I guess this can be done with sed, awk or ex, but I haven't been able to figure out a clean and quick way of doing it. The real CSV files are on the order of millions of lines.

The preferred solution would be a one-liner using the aforementioned programs.

Mark Rotteveel
  • Real CSV files have some edge cases with respect to parsing, for example when the field separator appears inside a quoted string. You need to use a proper CSV parser. General-purpose scripting languages (Perl, Python, Ruby) come with CSV libraries. – glenn jackman Jun 04 '18 at 10:34
  • If a field is surrounded by quotes, any quotes inside it should be doubled. [That's how the specs say it](https://tools.ietf.org/html/rfc4180#section-2). – Nyerguds Jun 04 '18 at 10:45
  • @Nyerguds Note that this applies only to double quotes, not single quotes! – kvantour Jun 04 '18 at 12:19
  • @glennjackman I need this as a pre-processing step. The files are then read with the pandas.read_csv function. I am able to read the file without errors by just removing all quotes or selecting another quoting character. However, I would like to keep the quoting character as (") since some fields also contain newline characters (\n), which cause problems when reading the file. – Rafael Perez-Torro Jun 04 '18 at 13:05
  • Is it possible for you to fix the process that generates this file, so that it already contains valid CSV? – glenn jackman Jun 04 '18 at 13:16
  • @glennjackman No. The files are given to me in that format by an external company. – Rafael Perez-Torro Jun 04 '18 at 14:08

1 Answer


A simple workaround using sed, based on your field separator ×, could be:

 sed -E "s/([^×])\"([^×])/\1'\2/g" file

This replaces each " that is both preceded and followed by a character other than × with '.

Note that sed does not support lookaround assertions (lookbehind or lookahead), so we have to capture the neighbouring characters in groups and reinsert them.
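For comparison, a regex engine with lookaround can express the same rule without capturing and reinserting the neighbours. The following is a minimal Python sketch, not part of the original answer: the name `fix_inner_quotes` is hypothetical, and it assumes × is the separator, lines end in `\n`, and no field contains an embedded newline. Because lookarounds do not consume the neighbouring characters, it also replaces quotes that are separated by a single character (as in `"a"b"c"`), where the sed expression would replace only the first of them.

```python
import re

# A quote counts as "interior" when the characters on both sides exist and
# are neither the separator (assumed to be "×") nor a newline.
INNER_QUOTE = re.compile(r'(?<=[^×\n])"(?=[^×\n])')


def fix_inner_quotes(line: str) -> str:
    """Replace interior double quotes in one line with single quotes."""
    return INNER_QUOTE.sub("'", line)


row = '12×"Some field with "quotes" inside it"×"Some field without quotes"'
print(fix_inner_quotes(row))
```

For a large file this could be applied line by line while streaming, so the whole file never has to be held in memory.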

Hazzard17
  • You are right, thanks! Also, there is no need to check for `not at start nor end of line` with the special characters `^` and `$`; requiring one character before and after `"` is enough. I edited the answer. – Hazzard17 Jun 04 '18 at 14:02
  • You're welcome. The OP dropped some extremely important information in a comment though (`some fields also contain new line characters`) so this is now just an academic discussion :-). – Ed Morton Jun 04 '18 at 14:05
  • That was exactly the kind of solution I was looking for. Many thanks! – Rafael Perez-Torro Jun 04 '18 at 14:28
  • @RafaelPèreziTorró you said your input contains newlines in some fields. This sed script won't robustly handle newlines within fields - try replacing the blank char between `"quotes"` and `inside` with a newline. – Ed Morton Jun 04 '18 at 21:48
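
The limitation described in the last comment can be reproduced with a short Python sketch. This is an illustration only: the pattern mirrors the sed rule, and the sample record is made up for the demonstration. When an interior quote is immediately followed by a newline, a line-oriented rule cannot tell it apart from a closing quote at the end of a record, so that quote is left unchanged while its partner is replaced.

```python
import re

# Same rule as the sed command: a quote with a non-separator character on
# both sides, applied one physical line at a time.
INNER_QUOTE = re.compile(r'(?<=[^×\n])"(?=[^×\n])')

# Hypothetical record: the blank after "quotes" is replaced by a newline,
# so the second field spans two physical lines.
record = '12×"Some field with "quotes"\ninside it"×"last field"\n'

fixed = "".join(
    INNER_QUOTE.sub("'", line) for line in record.splitlines(keepends=True)
)
print(fixed)
```

The quote before `quotes` becomes `'`, but the one after it stays `"`, which leaves the field with mismatched quoting.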