0

I am working with a large number of CSV files, and in one of the columns, the field itself contains commas. Unfortunately, this column hasn't been enclosed in quotes, so it's causing an issue with loading the CSV files into external applications.

My CSV files look like this:

col1, col2, col3, co,,,l4, col5, col6
col1, col2, col3, co,,,,,l4, col5, col6
col1, col2, col3, co,,l4, col5, col6

I need to remove all the commas in this particular column, but I'm unsure of how to go about doing it. Unfortunately, rewriting the files with the problematic column properly enclosed in quotes isn't an option.

These problematic commas always occur between the third and second-last commas, but I don't have enough bash know-how to write a script that removes them.

Input file:

col1, col2, col3, co,,,l4, col5, col6
col1, col2, col3, co,,,,,l4, col5, col6
col1, col2, col3, co,,l4, col5, col6

Expected output:

col1, col2, col3, col4, col5, col6
col1, col2, col3, col4, col5, col6
col1, col2, col3, col4, col5, col6
kvantour
  • 25,269
  • 4
  • 47
  • 72

2 Answers2

0

I would make the following suggestion:

awk '{ match($0,/^[^,]*,[^,]*,[^,],/); p1=RLENGTH+1
       match($0,/,[^,]*,[^,]*$/);    ; p2=RSTART
       s=substr($0,p1,p2-p1); gsub(/,/,"",s)
       print substr($0,1,p1-1) s substr($0,p2)
     }' file.csv

or

awk 'BEGIN{FS=OFS=","}
     { s=""; for(i=4;i<NF-1;++i) s=s $i }
     { print $1,$2,$3,s,$(NF-1),$NF }' file.csv

These solutions assume that no , appear in col1,col2,col3,col5 and col6.

If you have a comma in the other columns, but those columns are properly quoted according to the CSV-standard, then you can use a similar method based on What's the most robust way to efficiently parse CSV using awk?

awk -v FPAT='[^,]*|"[^"]+"' 'BEGIN{OFS=","}
     { s=""; for(i=4;i<NF-1;++i) s=s $i }
     { print $1,$2,$3,s,$(NF-1),$NF }' file.csv

More generically, to answer the title question:

How can I replace all instances of a character between the Nth and Kth last instance of that character?

Assume c is the character:

awk 'BEGIN{FS=OFS="c"; n=3; k=}
     { s=""; for(i=1; i <= n   ;++i) s = $i OFS 
             for(   ; i <= NF-k;++i) s=s $i 
             for(   ; i <= NF  ;++i) s = OFS $i }
     { print s }' file.csv
kvantour
  • 25,269
  • 4
  • 47
  • 72
  • Ah, so it turns out that some `,` do actually appear in the other columns, and they need to remain. Can these solutions be easily adapted so that only column 4 is affected? thanks – Guy Incognito Jun 13 '19 at 14:26
  • @isaaclangley if there are in the other columns, then the answer is no, unless that amount of commas is always the same in every line. – kvantour Jun 13 '19 at 14:28
  • @isaaclangley are the other commas between double-quotes? – kvantour Jun 13 '19 at 14:29
  • @isaaclangley I have added a solution that might solve it assuming that the other commas are quoted. – kvantour Jun 13 '19 at 14:34
  • Alas no, the commas in the other columns are not quoted, and there are variable numbers of them per line. – Guy Incognito Jun 13 '19 at 14:40
  • So if the commas are not quoted, how can you identify column 4 from column 5? – kvantour Jun 13 '19 at 14:44
0

If you REALLY just want to remove the commas in that field then with GNU awk for the 3rd arg to match():

awk 'match($0,/(([^,]*,){3})(.*)((,[^,]*){2})/,a){gsub(/,/,"",a[3]); $0=a[1] a[3] a[4]} 1' file
col1, col2, col3, col4, col5, col6
col1, col2, col3, col4, col5, col6
col1, col2, col3, col4, col5, col6

but otherwise I'd just wrap the troublesome field in double quotes and then treat it like any other CSV (e.g. see What's the most robust way to efficiently parse CSV using awk?):

$ awk 'match($0,/(([^,]*,){3})(.*)((,[^,]*){2})/,a){$0=a[1] "\"" a[3] "\"" a[4]} 1' file
col1, col2, col3," co,,,l4", col5, col6
col1, col2, col3," co,,,,,l4", col5, col6
col1, col2, col3," co,,l4", col5, col6

$ awk '
    BEGIN { FPAT="[^,]*|\"[^\"]+\"" }
    match($0,/(([^,]*,){3})(.*)((,[^,]*){2})/,a) { $0=a[1] "\"" a[3] "\"" a[4] }
    { for (i=1; i<=NF; i++) print NR, NF, i, $i }
' file
1 6 1 col1
1 6 2  col2
1 6 3  col3
1 6 4 " co,,,l4"
1 6 5  col5
1 6 6  col6
2 6 1 col1
2 6 2  col2
2 6 3  col3
2 6 4 " co,,,,,l4"
2 6 5  col5
2 6 6  col6
3 6 1 col1
3 6 2  col2
3 6 3  col3
3 6 4 " co,,l4"
3 6 5  col5
3 6 6  col6

or just to do the quoting part with sed:

$ sed -E 's/(([^,]*,){3})(.*)((,[^,]*){2})/\1"\3"\4/' file
col1, col2, col3," co,,,l4", col5, col6
col1, col2, col3," co,,,,,l4", col5, col6
col1, col2, col3," co,,l4", col5, col6

The above requires GNU or BSD/OSX sed for -E. With any POSIX sed it'd be:

$ sed 's/\(\([^,]*,\)\{3\}\)\(.*\)\(\(,[^,]*\)\{2\}\)/\1"\3"\4/' file
col1, col2, col3," co,,,l4", col5, col6
col1, col2, col3," co,,,,,l4", col5, col6
col1, col2, col3," co,,l4", col5, col6
Ed Morton
  • 188,023
  • 17
  • 78
  • 185