How to tell awk to ignore the delimiter inside double quotation marks

e.g.:

line='test,t2,t3,"t5,"'
echo "$line" | awk -F ',' '{print $4}'

The expected value is "t5," but the actual value is "t5".

How can I get "t5,"?
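To see what is going wrong, here is a quick sketch of the default splitting: with -F ',' every comma is a field separator, including the one inside the quotes, so the quoted field is broken across $4 and $5:

```shell
# Demo of the failure mode: splitting on every comma breaks the quoted
# field, so $4 holds only the opening quote plus "t5" and NF becomes 5.
echo 'test,t2,t3,"t5,"' | awk -F ',' '{print NF; print $4; print $5}'
```

This prints 5, then "t5 (without the trailing comma or closing quote), then the lone closing quote as a fifth field.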

seanchann
  • I would suggest, if possible, running sed to convert the data "t5," into something like "t5." before running the awk command, and having awk convert the . back to , when it outputs the data. This of course depends on the format of the data and whether . exists anywhere else in the data though – Raman Sailopal Aug 07 '17 at 10:51
  • @RamanSailopal that's terrible advice. You never need sed when you're using awk and that approach would be incredibly fragile. – Ed Morton Aug 07 '17 at 12:27
  • 1
    @EdMorton Thanks for your advice – seanchann Aug 07 '17 at 23:50

4 Answers

4

With GNU awk for FPAT, all you need for your case is:

$ line='test,t2,t3,"t5,"'
$ echo "$line" | awk -v FPAT='([^,]*)|("[^"]*")' '{print $4}'
"t5,"

and if your fields can contain newlines and escaped quotes then see What's the most robust way to efficiently parse CSV using awk?.
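As a sketch of how the same FPAT pattern scales beyond the OP's line, here it is applied to a line with several quoted fields containing commas (the sample line is borrowed from the comments further down; this requires GNU awk):

```shell
# FPAT defines what a field looks like (either a run of non-commas or a
# quoted string) instead of what the separator looks like, so the commas
# inside the quotes no longer split fields. Requires GNU awk (gawk).
echo '13245,"The lion, the witch, and the wardrobe","Lewis, C.S., Mr."' |
  gawk -v FPAT='([^,]*)|("[^"]*")' '{print $2}'
```

This prints "The lion, the witch, and the wardrobe" (quotes included), regardless of how many commas the title contains.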

Ed Morton
  • Great solution. I would redirect the contents of line into the awk statement as opposed to using echo though. This would negate the need to pipe. – Raman Sailopal Aug 07 '17 at 13:09
  • Thanks. Just trying to help the OP with her specific problem and introducing more differences than necessary between the answer and the question would obfuscate the solution plus for all we know the `echo` in her question really represents a pipeline output from some other command. – Ed Morton Aug 07 '17 at 13:11
  • 1
    Great solution, It solved my problem perfectly. thank you very much – seanchann Aug 07 '17 at 23:59
-1

You could validate your arbitrary input, or, if you know where your input is not well formatted, use substr() starting at index 2 of column 4 to drop the leading quote.

$ echo 'test,t2,t3,"t5,"' | awk -F, '{printf "%s,\n", substr($4,2) }'
t5,
John Goofy
-1

Perhaps this is better.

echo 'test,t2,t3,"t5,"' | awk -F, '{print $(NF-1),$NF}' OFS=,

"t5,"
Claes Wikner
  • Please tell me why! – Claes Wikner Aug 07 '17 at 12:11
  • 1
    I didn't downvote but surely you can see this is nothing like what the OP is trying to do. The question isn't "how to I print from characters 12 to 17 of a string" it's "how do I print a specific field of a CSV when the fields might contain commas". – Ed Morton Aug 07 '17 at 12:24
  • 1
    I just saw your edit - no, it's not better because you'd need to know the value of the fields (i.e. how many commas are present in each field and it'd need to be the same number for every row of input) before you write your script to print the field. Imagine your CSV is books with an ISBN and a title and the author's name with lines like: `13245,"The lion, the witch, and the wardrobe","Lewis, C.S., Mr."` and `51622,"The Bible","Multiple"` - how are you going to print the title of each row with your current approach? – Ed Morton Aug 07 '17 at 12:40
  • Thank you for your forbearance. I don't quite manage it, but I will keep it in mind. – Claes Wikner Aug 07 '17 at 14:00
-1

In the general case, you can't. You need a full parser to remember a tag, change state, then go back to the prior state when it encounters the matching tag. You can't do it with a regular expression unless you make a lot of assumptions about the shape of your data--and since I see you're parsing CSV, those assumptions will not hold true.

If you like awk, I suggest trying perl for this problem. You can either use somebody else's CSV parsing library (search here), or you can write your own. Of course, there's no reason you can't write a CSV parser in pure awk, so long as you understand that this is not what awk is good at. You need to parse character by character (don't separate records by newlines), remember the current state (is the line quoted?) and remember the previous character to see whether it was a backslash (for treating a quote as a literal quote or a comma as a literal comma). You need to remember the previous quote so you can parse "" as an escaped quote instead of a malformed field. It's kind of fun, and it's a bitch. Use somebody else's library if you like. I wouldn't choose awk to write any parser where the records don't have an obvious separator.
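A minimal sketch of the state machine described above, in plain POSIX awk: walk each line character by character, track whether we are currently inside quotes, and treat "" inside a quoted field as a literal quote. This is only the field-splitting part under the assumption that records contain no embedded newlines; handling those would also need record-level work, as noted above.

```shell
# Hand-rolled CSV field splitter (sketch): one boolean of state (inq),
# one lookahead character to recognize "" as an escaped quote.
echo 'a,"b,""c""",d' | awk '
{
    nf = 0; field = ""; inq = 0
    for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (inq) {
            if (c == "\"") {
                # "" inside a quoted field is an escaped literal quote
                if (substr($0, i + 1, 1) == "\"") { field = field "\""; i++ }
                else inq = 0            # closing quote: leave quoted state
            } else field = field c      # commas here are ordinary characters
        } else if (c == "\"") inq = 1   # opening quote: enter quoted state
        else if (c == ",") { f[++nf] = field; field = "" }
        else field = field c
    }
    f[++nf] = field                     # last field has no trailing comma
    for (j = 1; j <= nf; j++) print j ": " f[j]
}'
```

For the sample line a,"b,""c""",d this yields three fields: a, then b,"c" (comma and quotes preserved), then d.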

Edit: Ed Morton actually did write a full CSV parser for Gawk, which he linked to in his answer. I helped him break it, and he quickly fixed the problem case. His script will be useful, though it will be somewhat unwieldy to adapt to real-world uses.

piojo
  • @EdMorton I'm glad that feature helped with the OP's scenario. However, I stand by my answer. In one minute, I found the Gawk canonical example of CSV parsing[1] can't handle embedded newlines. Their code gets confused by properly escaped quotation marks (""). If you think these cases aren't important, you haven't seen a CSV generated from a google doc that was written by normal humans who use copy/paste on a variety of platforms. [1]: https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html – piojo Aug 08 '17 at 03:18
  • @EdMorton I updated my answer with a bit more of a reply, but suffice it to say that the script you claimed can handle embedded newlines and `""` actually cannot, based on a simple one-record three-field test containing a quoteless field, a quoted field with a comma and newline, and a quoted field containing a comma and an escaped quote. – piojo Aug 08 '17 at 03:40
  • @piojo I didn't say using FPAT alone would handle all CSV cases, just the case the OP has to deal with. See https://stackoverflow.com/q/45420535/1745001 as referenced in my answer for awk handling embedded newlines and escaped quotation marks just fine. Feel free to post an example that conforms to the CSV [RFC 4180](https://tools.ietf.org/html/rfc4180) or can be generated by Excel that the script I posted there cannot handle as if such a thing exists I'd want to make that script more robust to handle it. If you have something that some other common tool generates I'd take a look at that too. – Ed Morton Aug 08 '17 at 11:16
  • @EdMorton That's true. And I never said Awk couldn't do this--just that you need to write a parser. You can't simply suck in the fields and records and expect it to work. I'm looking at my test case now and double checking my claims... – piojo Aug 09 '17 at 03:28
  • 1
    Sounds good and since this would be a discussion about robustly parsing CSV please post the example at https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk so we can take a look at it in the right context there. – Ed Morton Aug 09 '17 at 03:33