$ echo $ROW | awk -vOFS=, -vFPAT="([^,]+)|(\"[^\"]+\")" '$2="X"'
1,X,4,5,6
I used the pattern from GNU Awk Manual 4.7 Defining Fields By Content
Compare with the *
in the same pattern:
$ echo $ROW | awk -vOFS=, -vFPAT="([^,]*)|(\"[^\"]*\")" '$2="X"'
1,X,,4,5,6
So answer is -- (to this limited example) -- use -vFPAT="([^,]+)|(\"[^\"]+\")"
, but then that does not work with empty fields like 1,"2,3",4,,"","should be 6th field"
Here is the result with both kinds of empty fields (,,
and ""
):
$ echo $ROW2 | awk -vOFS=, -vFPAT="([^,]+)|(\"[^\"]+\")" '$2="X"'
1,X,4,"","should be 6th field"
^^ - missing the ',,' field
^^^ - now the 5th field -- BUG!
By convention, ROW2
should be treated as having 6 fields with the blank fields ,,
and ""
each counting as 1 field. If you do not count blank fields as fields, you will loose count of which field is which after the blanks. Add to the list of complications of CSV with an awk regex.
Know that CSV is surprisingly complicated and to handle the many possibilities is not trivial with awk or a regex alone.
Another solution for CSV is to use Perl or Python with the more sophisticated and standardized CSV libraries they can use. In the case of Python, it is part of the standard distribution of Python.
Here is a Python solution that would be completely compatible with RFC 4180
$ echo $ROW | python -c '
> import csv, fileinput
> for line in csv.reader(fileinput.input()):
> print ",".join(e if i!=1 else "X" for i, e in enumerate(line))'
1,X,4,5,6
Which allows more complicated CSV to be handled easily.
Here is 4 record by 5 field CSV with CRLF
's in the quoted fields, escaped quotes in the quoted fields, and both kinds of blank fields (,,
and ""
).
1,"2,3",4,5,6
"11,12",13,14,15,16
21,"22,
23",24,25,"26
27"
31,,"33\"not 32\"","",35
With that same script (using repr
to see the complete field values, but you would probably use str
in normal circumstances) all these cases are handled correctly according to RFC 4180:
$ cat /tmp/3.csv | python -c '
import csv, fileinput
for line in csv.reader(fileinput.input()):
print ",".join(repr(e) if i!=1 else "X" for i, e in enumerate(line))'
'1',X,'4','5','6'
'11,12',X,'14','15','16'
'21',X,'24','25','26\n27'
'31',X,'33\\not 32\\""','','35'
This is difficult with awk since \n
defines each record, we do not handle empty fields correctly, and do not treat escaped quotes correctly:
$ cat /tmp/3.csv | awk -vOFS=, -vFPAT='[^,]+|"[^"]*"' '$2="X"'
1,X,4,5,6
"11,12",X,14,15,16
21,X
23",X,25,"26
27",X
31,X,"",35
Now you would need to redefine RS to a regex that finds quotes around CR and read multiples lines with awk... Add support for escaped quotes... Do a more sophisticated regex to split the fields... Complicated... Good luck!