How to replace quotes inside a quoted field of a non-standard CSV file using a one-liner bash command?

Question

I have a file like this:

col1×col2×col3
12×"Some field with "quotes" inside it"×"Some field without quotes inside but with new lines \n"

And I would like to replace the interior double quotes with single quotes so the result will look like this:

col1×col2×col3
12×"Some field with 'quotes' inside it"×"Some field without quotes inside but with new lines \n"

I guess this can be done with sed, awk or ex but I haven't been able to figure out a clean and quick way of doing it. Real CSV files are of the order of millions of lines.

The preferred solution would be a one-liner using the aforementioned programs.

Real CSV files have some edge cases with respect to parsing. For example, when the field separator appears inside a quoted string. You need to use a proper CSV parser. General purpose scripting languages (perl, python, ruby) will come with CSV libraries — glenn jackman, Jun 04 '18 at 10:34
If a field is surrounded by quotes, any quotes inside it should be doubled. [That's how the specs say it](https://tools.ietf.org/html/rfc4180#section-2). — Nyerguds, Jun 04 '18 at 10:45
@Nyerguds Note that this applies only to double quotes, not single quotes! — kvantour, Jun 04 '18 at 12:19
@glennjackman I need this as a pre-processing step. The files are then read with the pandas.read_csv function. I am able to read the file without errors just by removing all quotes or selecting another quoting character. However, I would like to keep (") as the quoting character, since some fields also contain newline characters (\n), which cause problems when reading the file. — Rafael Perez-Torro, Jun 04 '18 at 13:05
Is it possible for you to fix the process that generates this file, so that it already contains valid CSV? — glenn jackman, Jun 04 '18 at 13:16
@glennjackman No. The files are given to me in that format by an external company. — Rafael Perez-Torro, Jun 04 '18 at 14:08

Hazzard17 · Accepted Answer · 2018-06-04T14:02:12.027

A simple workaround using sed, based on your field separator ×, could be:

 sed -E "s/([^×])\"([^×])/\1'\2/g" file

This replaces each " that is both preceded and followed by a character other than × with '.

Note that sed does not support lookahead or lookbehind, so we have to capture the neighbouring characters in groups and reinsert them.
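
As a quick check, the command can be run against the two-line sample from the question. This is only a sketch: the scratch file created with `mktemp` and the variable names `sample` and `result` are assumptions for illustration, and how `[^×]` behaves can depend on the locale and on the sed implementation (`-E` is assumed to be available, as it is in GNU and BSD sed).

```shell
# Recreate the sample from the question in a scratch file (hypothetical name).
# printf '%s\n' writes each argument literally, so \n stays a backslash + n.
sample=$(mktemp)
printf '%s\n' \
  'col1×col2×col3' \
  '12×"Some field with "quotes" inside it"×"Some field without quotes inside but with new lines \n"' > "$sample"

# Same substitution as above: a " with a non-× character on both sides becomes '.
result=$(sed -E "s/([^×])\"([^×])/\1'\2/g" "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

With GNU sed this should print the desired output from the question: the quotes at the field boundaries always have × or a line boundary on one side, so only the two interior quotes are rewritten.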

edited Jun 04 '18 at 14:02

answered Jun 04 '18 at 13:14


You are right, thanks! Also, there is no need to check for `not at start nor end of line` with the special characters `^` and `$`; requiring one character before and after `"` is enough. I edited the answer. – Hazzard17 Jun 04 '18 at 14:02
You're welcome. The OP dropped some extremely important information in a comment though (`some fields also contain new line characters`) so this is now just an academic discussion :-). – Ed Morton Jun 04 '18 at 14:05
That was exactly the kind of solution I was looking for. Many thanks! – Rafael Perez-Torro Jun 04 '18 at 14:28
@RafaelPèreziTorró you said your input contains newlines in some fields. This sed script won't robustly handle newlines within fields - try replacing the blank char between `"quotes"` and `inside` with a newline. – Ed Morton Jun 04 '18 at 21:48
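
The failure mode described in this last comment can be reproduced with a small sketch. It assumes the same sed command as in the answer and a hypothetical scratch file in which the blank between `"quotes"` and `inside` has been replaced by a real newline; `sample` and `result` are illustrative names.

```shell
# Sample record whose second field is split across two physical lines.
sample=$(mktemp)
printf '%s\n' \
  'col1×col2×col3' \
  '12×"Some field with "quotes"' \
  'inside it"×"Some field without quotes inside but with new lines \n"' > "$sample"

# sed works line by line, so each physical line is matched on its own.
result=$(sed -E "s/([^×])\"([^×])/\1'\2/g" "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

Because the `"` after `quotes` now sits at the end of a line, it has no following character for `([^×])` to match, so it is expected to remain a double quote while its partner is turned into `'`.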
