1

I have a csv file which contains thousand of lines.

I need to replace some characters in a specific column

&acirc ; ---> a
&amp ; ---> &
&eacute ; ---> é

I tried with this command but it doesn't work

awk 'BEGIN{FS=OFS=";"} {for (i=3;i<=NF;i++) gsub("/\&amp\;/","\&",$3); gsub("/\&middot\;/", " ",$3); gsub("/\&acirc\;/", "a",$3); gsub("/\&eacute\;/", "e",$3); gsub(/\#/, " ",$3)}' file.csv

Sample input:

32602;1;"Wet &amp; Dry 5029";2663,2662

Expected output:

32602;1;"Wet & Dry 5029";2663,2662
randomir
  • 17,989
  • 1
  • 40
  • 55
Jack Dreamond
  • 13
  • 1
  • 3

1 Answers1

1

So, you want to parse a CSV file with awk and modify only a subset of columns?

First of all, parsing CSV fields is not as simple as splitting on a separator (,, or in your case ;), since you must avoid splitting when value is quoted. The awk recipe for this is given in an excellent answer by @EdMorton, and if you use GNU awk, the most elegant approach is with FPAT:

awk -v FPAT='[^;]*|"[^"]+"' -v OFS=';' '...'

(For other awks and some special cases, see the cited answer.)

Now back to your program. The proper syntax of gsub ERE argument is either /pattern/ or "pattern", but not both (e.g. "/pattern/").

That means you'll have to replace as follows:

gsub("/\&amp\;/","\&",$3)     -->  gsub(/&amp;/, "\\&", $3)
gsub("/\&middot\;/", " ",$3)  -->  gsub(/&middot;/, " ", $3)
gsub("/\&acirc\;/", "a",$3)   -->  gsub(/&acirc;/, "a", $3)
gsub("/\&eacute\;/", "e",$3)  -->  gsub(/&eacute;/, "e", $3)

Also note that in the ERE regexp part, & and ; don't have to be escaped, but in the replacement string & does (with \ which also needs to be escaped).

Additionally, to modify only the column $3, you don't need the for loop. But, if you really want to modify a range of columns starting with $3 and ending with the last $NF, you'll need to use $i in each gsub call, instead of $3.

Fixed, your awk program looks like:

awk -v FPAT='[^;]*|"[^"]+"' -v OFS=';' '{
    for (i=3; i<=NF; i++) {
        gsub(/&amp;/, "\\&", $i)
        gsub(/&middot;/, " ", $i)
        gsub(/&acirc;/, "a", $i)
        gsub(/&eacute;/, "e", $i)
        gsub(/#/, " ", $i)
    }
    print
 }' file.csv

(The print at the end ensures each line get printed.)

Applied to your example (and converted to a one-liner):

$ echo '32602;1;"Wet &amp; Dry 5029";2663,2662' | awk -v FPAT='[^;]*|"[^"]+"' -v OFS=';' '{for (i=3;i<=NF;i++) {gsub(/&amp;/,"\\&",$i); gsub(/&middot;/," ",$i); gsub(/&acirc;/,"a",$i); gsub(/&eacute;/,"e",$i); gsub(/#/," ",$i)}; print}'
32602;1;"Wet & Dry 5029";2663,2662

After additional troubleshooting in comments, seems like the solution to your problem was not to replace those HTML entities in some specific column, but rather to replace them in the complete file, since your CSV file seems to be malformed, so that the subsequent processor fails to parse it (probably due to unquoted ;s).

You can replace all HTML entities you specified with a simple sed command like:

sed -e 's/&amp;/\&/g' -e 's/&middot;/ /g' -e 's/&acirc;/a/g' -e 's/&eacute;/e/g' -e 's/#/ /g' file
randomir
  • 17,989
  • 1
  • 40
  • 55
  • Thank you for the answer but doesn't work yet. Here is a sample line of my csv file --- > 32602;1;"Wet & Dry 5029";2663,2662 When I applied to the csv file I still have & it's not replaced in &. Any idea why? – Jack Dreamond Oct 17 '17 at 03:29
  • Have a look now, the problem was with splitting on `;`, so you really need a proper CSV parsing. – randomir Oct 17 '17 at 11:46
  • Thank you for really clear information. But actually, still have a problem. The csv is not well encoded that's why I would like to replace these characters which contain special characters ";" and "#" with awk as it provokes problem when I import the csv to the database. When I apply the new code, I now get in many lines of the column 3 a ";" instead of space. – Jack Dreamond Oct 17 '17 at 16:44
  • In that case, may I suggest you try with a simple `sed -e 's/&/\&/g' -e 's/·/ /g' -e 's/â/a/g' -e 's/é/e/g' -e 's/#/ /g' file > newfile` instead of with `awk`? It will replace those sequences no matter where they are in the file, but it seems that'll do for you. – randomir Oct 18 '17 at 10:29
  • Added the final solution to my answer. – randomir Oct 18 '17 at 15:21
  • One thing that is interfering with applying this solution to my data is that some of my fields contains newlines. – avatarofhope2 Jun 05 '20 at 18:20