So, you want to parse a CSV file with `awk` and modify only a subset of columns?
First of all, parsing CSV fields is not as simple as splitting on a separator (`,`, or in your case `;`), since you must avoid splitting when a value is quoted. The `awk` recipe for this is given in an excellent answer by @EdMorton, and if you use GNU `awk`, the most elegant approach is with `FPAT`:
awk -v FPAT='[^;]*|"[^"]+"' -v OFS=';' '...'
(For other `awk`s and some special cases, see the cited answer.)
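To see why the quoting matters, here is a small sketch comparing a plain `-F';'` split with the `FPAT` approach on a line whose third field contains a quoted `;`. It assumes GNU awk is installed under the name `gawk`; the sample line and the two function names are made up for illustration.

```shell
# Sample record (made up): the third field contains a quoted semicolon.
line='1;2;"a;b";4'

# Plain split on ";" treats the quoted semicolon as a separator too.
count_fields_fs() {
    printf '%s\n' "$1" | gawk -F';' '{ print NF }'
}

# FPAT describes what a field looks like, so the quoted field stays whole.
count_fields_fpat() {
    printf '%s\n' "$1" | gawk -v FPAT='[^;]*|"[^"]+"' '{ print NF }'
}

count_fields_fs "$line"     # 5
count_fields_fpat "$line"   # 4
```

In an `awk` without `FPAT` support the variable is just an ordinary user variable, so the second command would silently fall back to whitespace splitting rather than fail.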
Now back to your program. The proper syntax of the `gsub` ERE argument is either `/pattern/` or `"pattern"`, but not both (e.g. `"/pattern/"`).
That means you'll have to replace as follows:
gsub("/\&amp\;/","\&",$3) --> gsub(/&amp;/, "\\&", $3)
gsub("/\&middot\;/", " ",$3) --> gsub(/&middot;/, " ", $3)
gsub("/\&acirc\;/", "a",$3) --> gsub(/&acirc;/, "a", $3)
gsub("/\&eacute\;/", "e",$3) --> gsub(/&eacute;/, "e", $3)
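Also note that in the ERE regexp part, `&` and `;` don't have to be escaped, but in the replacement string `&` does, because an unescaped `&` there stands for the matched text. It is escaped with `\`, which itself needs to be escaped inside the string, hence `"\\&"`.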
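The difference between these forms can be checked on a one-line input. This is a minimal sketch that should work in any POSIX `awk`; the function names and the sample text are invented for the demonstration.

```shell
# Regex literal plus escaped "&" in the replacement:
# the entity becomes a plain "&".
unescape_amp() {
    printf '%s\n' "$1" | awk '{ gsub(/&amp;/, "\\&"); print }'
}

# Unescaped "&" in the replacement stands for the matched text,
# so the entity is replaced by itself and the line is unchanged.
unescaped_replacement() {
    printf '%s\n' "$1" | awk '{ gsub(/&amp;/, "&"); print }'
}

# Slashes inside a string are literal pattern characters, so "/&amp;/"
# only matches input that really contains those slashes.
slashes_in_string() {
    printf '%s\n' "$1" | awk '{ gsub("/&amp;/", "and"); print }'
}

unescape_amp 'Wet &amp; Dry'            # Wet & Dry
unescaped_replacement 'Wet &amp; Dry'   # Wet &amp; Dry
slashes_in_string 'Wet &amp; Dry'       # Wet &amp; Dry
```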
Additionally, to modify only the column `$3`, you don't need the `for` loop. But if you really want to modify a range of columns, starting with `$3` and ending with the last one, `$NF`, you'll need to use `$i` in each `gsub` call instead of `$3`.
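For the single-column case, a sketch without the loop could look like this. It assumes GNU awk is available as `gawk`; the function name and the sample line are hypothetical, and only the `&amp;` entity is handled to keep it short.

```shell
# Rewrite the entity in field 3 only; all other fields pass through untouched.
fix_third_column() {
    gawk -v FPAT='[^;]*|"[^"]+"' -v OFS=';' '{
        gsub(/&amp;/, "\\&", $3)   # no loop: only $3 is modified
        print
    }'
}

printf '%s\n' '1;"a &amp; b";"c &amp; d";4' | fix_third_column
# 1;"a &amp; b";"c & d";4
```

Assigning to `$3` makes `awk` rebuild the record with `OFS`, which is why `OFS` is set to `;` here.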
Fixed, your `awk` program looks like:
awk -v FPAT='[^;]*|"[^"]+"' -v OFS=';' '{
    for (i=3; i<=NF; i++) {
        gsub(/&amp;/, "\\&", $i)
        gsub(/&middot;/, " ", $i)
        gsub(/&acirc;/, "a", $i)
        gsub(/&eacute;/, "e", $i)
        gsub(/#/, " ", $i)
    }
    print
}' file.csv
(The `print` at the end ensures each line gets printed.)
Applied to your example (and converted to a one-liner):
$ echo '32602;1;"Wet &amp; Dry 5029";2663,2662' | awk -v FPAT='[^;]*|"[^"]+"' -v OFS=';' '{for (i=3;i<=NF;i++) {gsub(/&amp;/,"\\&",$i); gsub(/&middot;/," ",$i); gsub(/&acirc;/,"a",$i); gsub(/&eacute;/,"e",$i); gsub(/#/," ",$i)}; print}'
32602;1;"Wet & Dry 5029";2663,2662
After additional troubleshooting in the comments, it seems the solution to your problem was not to replace those HTML entities in one specific column, but to replace them in the whole file: your CSV file appears to be malformed, so the subsequent processor fails to parse it (probably due to unquoted `;`s).
You can replace all the HTML entities you specified with a simple `sed` command like:
sed -e 's/&amp;/\&/g' -e 's/&middot;/ /g' -e 's/&acirc;/a/g' -e 's/&eacute;/e/g' -e 's/#/ /g' file
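As a quick check of the whole-file approach, the sketch below runs such a `sed` filter on a made-up line with unquoted entities. The entity names (`&amp;`, `&middot;`, `&acirc;`, `&eacute;`) are assumed from the discussion, and the function name and sample data are hypothetical.

```shell
# Replace a few named HTML entities on every line of standard input.
clean_entities() {
    sed -e 's/&amp;/\&/g' \
        -e 's/&middot;/ /g' \
        -e 's/&acirc;/a/g' \
        -e 's/&eacute;/e/g'
}

# The unquoted entities add two stray ";" to this three-field record.
printf '%s\n' '1;Caf&eacute; &amp; Bar;3' | clean_entities
# 1;Cafe & Bar;3
```

As in `awk`, an unescaped `&` in the `sed` replacement means the matched text, which is why the first expression uses `\&`.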