0

My CSV file format. Notice that the 6th row has multiple comma separated values for the same column.

a,b,c,d,e,f
g,h,i,j,k,l
m,n,o,p,q,r
s,t,u,v,w,x
y,z,ab,bc,cd,de
"a,b,c",d,e,f,g,"h,i,j"

Now, If I fire awk -F, '{print $1}', then for the 6th row, I get incorrect results.

$ awk -F, '{print $1}' test.csv
a
g
m
s
y
"a

How can I handle these multiple comma separated values for the same column with awk or with any other utility (sed, cut etc)?

jonrsharpe
  • 115,751
  • 26
  • 228
  • 437
Praveen B K
  • 21
  • 1
  • 3

1 Answers1

0

You can use GNU- with a regex that captures between ""'s:

➜  /tmp cat test.csv
a,b,c,d,e,f
g,h,i,j,k,l
m,n,o,p,q,r
s,t,u,v,w,x
y,z,ab,bc,cd,de
"a,b,c",d,e,f,g,"h,i,j"
➜  /tmp
➜  /tmp
➜  /tmp awk 'BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")"; } { print $1 }' test.csv
a
g
m
s
y
"a,b,c"
➜  /tmp

The FPAT value is a regex that describes the content of each field (col).

So in your CSV example, the 'content of each field' is any value that is (optionally) wrapped in quotes ("").


For more information, refer to the gawk manual:

0stone0
  • 34,288
  • 4
  • 39
  • 64
  • I've added some explanation, for more info, please check the official manual with the provided link. – 0stone0 May 02 '23 at 15:08