Replace column value in csv file with (g)awk with delimiter containing strings

Question

I'm using gawk 4.0.1 and I know how to replace a column value in a CSV file, for example:

> ROW='1,2,3,4,5,6'
> echo $ROW | gawk -F, -vOFS=, '$2="X"'
1,X,3,4,5,6

However, I am dealing with a file which has strings containing the delimiter. Reading a column goes just fine, but when replacing a value, an extra delimiter is inserted:

> ROW='1,"2,3",4,5,6'
> echo $ROW | gawk -vOFS=, -vFPAT='[^,]*|"[^"]*"' '{print $2}'
"2,3"
> echo $ROW | gawk -vOFS=, -vFPAT='[^,]*|"[^"]*"' '$2="X"'
1,X,,4,5,6

This is what I was expecting:

> echo $ROW | gawk -vOFS=, -vFPAT='[^,]*|"[^"]*"' '$2="X"'
1,X,4,5,6

The value '"2,3"' is replaced by 'X,'. How can I solve this?

EDIT: I did not include that I have empty fields as well. A better row example would therefore be:

ROW='1,,"2,3",4,5,6'

EDIT 2: From Dawg's answer I gather it's not possible in pure awk. Although I agree a solution with python is better, the only solution with awk is to include some pre- and post processing to deal with empty fields.

#/bin/bash
ROW='1,,"2,3",4,"",5'
for col in {1..6}; do 
    echo $ROW |\ 
        sed 's:,,:, ,:' |\ 
        gawk -v c=$col -v OFS=, -v FPAT='([^,]+)|("[^\"]*")' '$c="X"' |\
        sed 's:, ,:,,:g'
done

Output:

X,,"2,3",4,"",5
1,X,"2,3",4,"",5
1,,X,4,"",5
1,,"2,3",X,"",5
1,,"2,3",4,X,5
1,,"2,3",4,"",X

It's very weird, because `echo $ROW | gawk -vOFS=, -vFPAT='[^,]*|"[^"]*"' '{print $1,"X",$3}'` makes it. I also tried doing `$0=$0` to have the string reinterpreted, but no luck. — fedorqui, Sep 25 '14 at 09:50
@gospes why you rely on awk? Did you want answers from other tools? — Avinash Raj, Sep 25 '14 at 09:51
According to [the fine manual](https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html) you should use grouping on either side of the alternative operator `|`. That is (always from the previous source), `FPAT = "([^,]+)|(\"[^\"]+\")"` --- hth — gboffi, Sep 25 '14 at 10:27
@Jidder: that does not work either, or at least, if I have empty fields. — gospes, Sep 25 '14 at 16:05

score 2 · Answer 1 · edited Oct 07 '21 at 06:06

$ echo $ROW | awk -vOFS=, -vFPAT="([^,]+)|(\"[^\"]+\")" '$2="X"'
1,X,4,5,6

I used the pattern from GNU Awk Manual 4.7 Defining Fields By Content

Compare with the * in the same pattern:

$ echo $ROW | awk -vOFS=, -vFPAT="([^,]*)|(\"[^\"]*\")" '$2="X"'
1,X,,4,5,6

So answer is -- (to this limited example) -- use -vFPAT="([^,]+)|(\"[^\"]+\")", but then that does not work with empty fields like 1,"2,3",4,,"","should be 6th field"

Here is the result with both kinds of empty fields (,, and ""):

$ echo $ROW2 | awk -vOFS=, -vFPAT="([^,]+)|(\"[^\"]+\")" '$2="X"'
1,X,4,"","should be 6th field"
      ^^                    - missing the ',,' field
            ^^^             - now the 5th field  -- BUG!

By convention, ROW2 should be treated as having 6 fields with the blank fields ,, and "" each counting as 1 field. If you do not count blank fields as fields, you will loose count of which field is which after the blanks. Add to the list of complications of CSV with an awk regex.

Know that CSV is surprisingly complicated and to handle the many possibilities is not trivial with awk or a regex alone.

Another solution for CSV is to use Perl or Python with the more sophisticated and standardized CSV libraries they can use. In the case of Python, it is part of the standard distribution of Python.

Here is a Python solution that would be completely compatible with RFC 4180

$ echo $ROW | python -c '
> import csv, fileinput
> for line in csv.reader(fileinput.input()):
> print ",".join(e if i!=1 else "X" for i, e in enumerate(line))'
1,X,4,5,6

Which allows more complicated CSV to be handled easily.

Here is 4 record by 5 field CSV with CRLF's in the quoted fields, escaped quotes in the quoted fields, and both kinds of blank fields (,, and "").

1,"2,3",4,5,6
"11,12",13,14,15,16
21,"22,
23",24,25,"26
27"
31,,"33\"not 32\"","",35

With that same script (using repr to see the complete field values, but you would probably use str in normal circumstances) all these cases are handled correctly according to RFC 4180:

$ cat /tmp/3.csv | python -c '
import csv, fileinput
for line in csv.reader(fileinput.input()):
   print ",".join(repr(e) if i!=1 else "X" for i, e in enumerate(line))'
'1',X,'4','5','6'
'11,12',X,'14','15','16'
'21',X,'24','25','26\n27'
'31',X,'33\\not 32\\""','','35'

This is difficult with awk since \n defines each record, we do not handle empty fields correctly, and do not treat escaped quotes correctly:

$ cat /tmp/3.csv | awk -vOFS=, -vFPAT='[^,]+|"[^"]*"' '$2="X"'
1,X,4,5,6
"11,12",X,14,15,16
21,X
23",X,25,"26
27",X
31,X,"",35

Now you would need to redefine RS to a regex that finds quotes around CR and read multiples lines with awk... Add support for escaped quotes... Do a more sophisticated regex to split the fields... Complicated... Good luck!

score 0 · Answer 2 · edited Sep 25 '14 at 15:47

0

The output is for

$ ROW='1,"2,3",4,5,6' 
$ echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"].*"' '$2="X"'
1,X,4,5,6

Both of these commands are working fine. In the second command, * was missed while pasting here.

perl:

$var='1,"2,3",4,5,6';
$var=~s/\".*\"/X/g;
print $var;

edited Sep 25 '14 at 15:47

fedorqui

275,237
103
548
598

answered Sep 25 '14 at 12:16

karthick Sundaram

33
6

1

Could you elaborate a bit more and provide an example, please? Are you saying that you added a comma as in "X,"? – kkuilla Sep 25 '14 at 12:23
Yes. Adding a comma with X while substituting as the whole double quote is going into $2. – karthick Sundaram Sep 25 '14 at 12:40
1

Could you update your answer and show that it actually works with the data the asker provides, please. Otherwise, it's comment, not real answer and will probably be deleted. – kkuilla Sep 25 '14 at 12:57
1

@karthickSundaram How could this get rid of the extraneous comma by adding one more ? – Tensibai Sep 25 '14 at 12:58
echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"].*"' '$2="X"' * was missed while pasting here. – karthick Sundaram Sep 25 '14 at 15:28
$ echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"]*.*"' '$2="X"' This *.* after [^"] is also working fine. – karthick Sundaram Sep 25 '14 at 15:47

score 0 · Answer 3 · answered Sep 25 '14 at 15:50

0

$ echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"]."' '$2="X"'

. should be after [^"]

echo $ROW | gawk -vOFS=, -vFPAT='[^,]+|"[^"].*"' '$2="X"'

These 2 answers produce the output of 1,x,4,5,6 for ROW='1,"2,3",4,5,6'

answered Sep 25 '14 at 15:50

karthick Sundaram

33
6

Replace column value in csv file with (g)awk with delimiter containing strings

3 Answers3