
I know it may sound as if there are 2000 answers to this question online, but I found none for this specific case (e.g. the -v FPAT approach from this and other answers) because I need to do it with split. I have to split a CSV file with awk in which some values may be inside double quotes. I need to tell the split function to ignore a , when it sits inside "" so that I get an array of the elements.

Here is what I tried, based on other answers:

cat try.txt

Hi,I,"am,your",father
maybe,you,knew,it
but,"I,wanted",to,"be,sure"


cat tst.awk

BEGIN {}
{
    n_a = split($0,a,/([^,]*)|("[^"]+")/);
    for (i=1; i<=n_a; i++) {
        collecter[NR][i]=a[i];
    }
}
END {
    for (i=1; i<=length(collecter); i++)
    {
        for (z=1; z<=length(collecter[i]);z++)
        {
            printf "%s\n", collecter[i][z];
        }
    }
}

but no luck:

awk -f tst.awk try.txt 

,
,
,


,
,
,


,
,
,

I tried other regular expressions based on similar answers, but none works for this particular case.

Please note: double-quoted fields may or may not be present, there may be more than one, and they have no fixed position or length!

Thanks in advance for any help!

cccnrc
  • Is Python an alternative? – accdias Feb 07 '20 at 00:42
  • Does this answer your question? [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) – David C. Rankin Feb 07 '20 at 02:32

2 Answers


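GNU awk has a function called patsplit() that splits a string using an FPAT-style pattern, i.e. a regex that describes the fields themselves, whereas the regex passed to split() describes the separators: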

$ awk '{ print "RECORD " NR ":"; n=patsplit($0, a, "([^,]*)|(\"[^\"]+\")"); for (i=1;i<=n;++i) {print i, "|" a[i] "|"}}' file
RECORD 1:
1 |Hi|
2 |I|
3 |"am,your"|
4 |father|
RECORD 2:
1 |maybe|
2 |you|
3 |knew|
4 |it|
RECORD 3:
1 |but|
2 |"I,wanted"|
3 |to|
4 |"be,sure"|
jas

If Python is an alternative, here is a solution:

try.txt:

Hi,I,"am,your",father
maybe,you,knew,it
but,"I,wanted",to,"be,sure"

Python snippet:

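import csv

# newline='' is what the csv module documentation recommends for file objects
with open('try.txt', newline='') as f:
    reader = csv.reader(f, quoting=csv.QUOTE_ALL)
    # each row is a list of fields with the surrounding quotes removed
    for row in reader:
        print(row)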

The code snippet above will result in:

['Hi', 'I', 'am,your', 'father']
['maybe', 'you', 'knew', 'it']
['but', 'I,wanted', 'to', 'be,sure']
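
If the goal is the same nested array that the question builds in collecter, a minimal sketch could keep every parsed row in a list instead of printing it straight away. The collect_rows helper and the inline SAMPLE string are assumptions made for illustration; SAMPLE only repeats the contents of try.txt so that the snippet runs without the file.

```python
import csv
import io

# Assumption: SAMPLE repeats the three lines of try.txt from the question.
SAMPLE = (
    'Hi,I,"am,your",father\n'
    'maybe,you,knew,it\n'
    'but,"I,wanted",to,"be,sure"\n'
)


def collect_rows(stream):
    """Hypothetical helper: return a list of rows, each row a list of fields."""
    return list(csv.reader(stream))


# io.StringIO stands in for open('try.txt', newline='').
collecter = collect_rows(io.StringIO(SAMPLE))

# Walk the nested list with 1-based counters, like collecter[NR][i] in awk.
for record_number, row in enumerate(collecter, start=1):
    for field_number, field in enumerate(row, start=1):
        print(record_number, field_number, field)
```

Unlike the awk arrays, the Python lists are 0-based, so the third field of the first record is collecter[0][2].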
accdias