I have been happily using gawk with FPAT. Here is the script I use for my examples:

#!/usr/bin/gawk -f

BEGIN {
    FPAT="([^,]*)|(\"[^\"]+\")"
}

{
    for (i=1; i<=NF; i++) {
        printf "Record #%s, field #%s: %s\n", NR, i, $i
    }
}

Simple, no quotes

Works well.

$ echo 'a,b,c,d' | ./test.awk 
Record #1, field #1: a
Record #1, field #2: b
Record #1, field #3: c
Record #1, field #4: d

With quotes

Works well.

$ echo '"a","b",c,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: c
Record #1, field #4: d

With empty columns and quotes

Works well.

$ echo '"a","b",,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With escaped quotes, empty columns and quotes

Works well.

$ echo '"""a"": aaa","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With a column containing escaped quotes and ending with a comma

Fails.

$ echo '"""a"": aaa,","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa
Record #1, field #2: ","
Record #1, field #3: b"
Record #1, field #4: 
Record #1, field #5: d

Expected output:

$ echo '"""a"": aaa,","b",,d' | ./test_that_would_be_working.awk 
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

Is there a regex for FPAT that would make this work, or is this just not supported by awk?

The pattern would be a " followed by anything except a single, unescaped ". A bracket expression matches one character at a time, so on its own it cannot tell a lone " from an escaped "".

I think there may be an option with lookaround, but I'm not good enough with it to make it work.

Marc Lambrichs
Benoit Duffez
  • @RomanPerekhrest, it's four fields. Using `|` as a field separator: `a||"b" b,|c` – jas Nov 02 '17 at 16:47
  • @BenoitDuffez, can you accept another working alternative solution? simple one-liner – RomanPerekhrest Nov 02 '17 at 16:53
  • @BenoitDuffez, unfortunately, the question is marked as duplicate and I can't add an answer. If you still want to get it - post the question with `python` tag without mentioning `awk`. And let me know – RomanPerekhrest Nov 02 '17 at 21:40
  • @BenoitDuffez, what input is prefered for you: a single string or a file? – RomanPerekhrest Nov 02 '17 at 23:00
  • please fix the second example, it's obviously wrong, and explain what "somewhat parses" means. Add the exact output you want for each case. Now it is fuzzy, i.e. second example says that field `"""b"" b"` needs further parsing (why?) and third one says it would be ok. – thanasisp Nov 02 '17 at 23:31
  • @thanasisp: sorry some of my edits/comments were made on mobile and I didn't thoroughly check what was written. I have updated the whole question with actual snippets and outputs taken from my machine running gawk 4.1.3. – Benoit Duffez Nov 03 '17 at 09:14
  • @RomanPerekhrest: sorry I didn't see that you would post something using python. I know that other tools/languages would be more suitable for the job, however I wanted to do it with awk and see if it was possible without too much overhead or wheel inventing. So to reply to your question, with awk, it doesn't matter if it's a file or stdin. – Benoit Duffez Nov 03 '17 at 09:16

1 Answer

Because awk's regular expressions (and therefore FPAT) don't support lookarounds, you need to spell out every case in the pattern. This one will do:

FPAT="[^,\"]*|\"([^\"]|\"\")*\""

Explanation:

[^,\"]*             # match 0 or more characters that are neither ',' nor '"'
|                   # OR
\"                  # match an opening '"'
  ([^\"]            #   followed by either any single character except '"'
   |                #   OR
   \"\"             #   an escaped quote '""'
  )*                #   repeated 0 or more times
\"                  # ending with a closing '"'
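
To see why the original pattern splits the failing line while this one does not, it may help to check how far each alternative matches at the start of the input. The following is only a sketch: the `probe` function name and the labels are made up for illustration, and the regexes are the individual alternatives anchored with `^`. RLENGTH is -1 when match() finds nothing.

```shell
#!/bin/sh
# Sketch: length of the match each alternative finds at the start of a line.
# `probe` is a hypothetical helper name, not part of the question or answer.
probe() {
    gawk '{
        # unquoted alternative from the question: stops at the first comma
        match($0, /^[^,]*/);        print "old unquoted alternative: " RLENGTH
        # quoted alternative from the question: cannot cross a doubled quote
        match($0, /^"[^"]+"/);      print "old quoted alternative: " RLENGTH
        # quoted alternative from this answer: accepts "" inside the quotes
        match($0, /^"([^"]|"")*"/); print "new quoted alternative: " RLENGTH
    }'
}

printf '%s\n' '"""a"": aaa,","b",,d' | probe
```

On the failing line this should report 11, -1 and 13: the old quoted alternative never matches, so the old unquoted one wins and cuts the field at the comma, whereas the new quoted alternative covers the whole first field, comma included.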

Now testing it:

$ cat tst.awk
BEGIN {
    FPAT="[^,\"]*|\"([^\"]|\"\")*\""
}
{ 
   for (i=1; i<=NF; i++){ printf "Record #%s, field #%s: %s\n", NR, i, $i }
}


$ echo '"""a"": aaa,","b",,d' | awk -f tst.awk
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3:
Record #1, field #4: d
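
As a quick regression check, the five sample lines from the question can be run through the same FPAT in one go. This is a minimal sketch, assuming gawk is installed (FPAT is a gawk extension); the `count_fields` function name is hypothetical. It prints the field count in front of each input line.

```shell
#!/bin/sh
# Sketch: count the fields gawk finds on each sample line from the question.
# `count_fields` is a hypothetical helper name used only for this example.
count_fields() {
    gawk 'BEGIN { FPAT = "[^,\"]*|\"([^\"]|\"\")*\"" }
          { print NF ": " $0 }'
}

count_fields <<'EOF'
a,b,c,d
"a","b",c,d
"a","b",,d
"""a"": aaa","b",,d
"""a"": aaa,","b",,d
EOF
```

Every line should be reported with 4 fields, including the last one, which the original pattern split into 5.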
Marc Lambrichs