37

I am using awk to parse my data with "," as separator as the input is a csv file. However, there are "," within the data which is escaped by double quotes ("...").

Example

filed1,filed2,field3,"field4,FOO,BAR",field5

How can i ignore the comma "," within the the double quote so that I can parse the output correctly using awk? I know we can do this in excel, but how do we do it in awk?

smci
  • 32,567
  • 20
  • 113
  • 146
joomanji
  • 473
  • 1
  • 4
  • 6
  • For a very comprehensive description look here: >[http://backreference.org/2010/04/17/csv-parsing-with-awk/](http://web.archive.org/web/20120531065332/http://backreference.org/2010/04/17/csv-parsing-with-awk/) (archive.org version) But this is not parsing, but cheating with regexp. For anything more complicated than the examples on the page you should use Perl/Python with a parser library like "csv" for Python. – Chris Oct 18 '11 at 09:07
  • Great articles on Text:CSV parsing using perl: http://perlmeme.org/tutorials/parsing_csv.html – joomanji Oct 18 '11 at 11:56
  • See https://stackoverflow.com/q/45420535/1745001 for parsing CSV with awk. – Ed Morton Sep 26 '17 at 22:07

3 Answers3

29

It's easy, with GNU awk 4:

zsh-4.3.12[t]% awk '{ 
 for (i = 0; ++i <= NF;)
   printf "field %d => %s\n", i, $i
 }' FPAT='([^,]+)|("[^"]+")' infile
field 1 => filed1
field 2 => filed2
field 3 => field3
field 4 => "field4,FOO,BAR"
field 5 => field5

Adding some comments as per OP requirement.

From the GNU awk manual on "Defining fields by content:

The value of FPAT should be a string that provides a regular expression. This regular expression describes the contents of each field. In the case of CSV data as presented above, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant, we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:

FPAT = "([^,]+)|(\"[^\"]+\")"

Using + twice, this does not work properly for empty fields, but it can be fixed as well:

As written, the regexp used for FPAT requires that each field contain at least one character. A straightforward modification (changing the first ‘+’ to ‘*’) allows fields to be empty:

FPAT = "([^,]*)|(\"[^\"]+\")"

Jasper
  • 3,939
  • 1
  • 18
  • 35
Dimitre Radoulov
  • 27,252
  • 4
  • 40
  • 48
  • 1
    This is brilliant @DimitreRadoulov. You have a very thorough knowledge of `gawk` :-). I have used your suggestion over [here](http://stackoverflow.com/a/8949920/970195) and have made a reference to this answer. Hope thats ok. +1 – jaypal singh Jan 21 '12 at 01:26
  • 1
    Jumping across from [here](http://stackoverflow.com/a/8949920/970195), this solution doesn't appear to work for my data. In fact it splits the fields based on the spaces (default value of FS is a space), rather than the regex specified by FPAT. Adding `FS=","` seems to make awk ignore `FPAT` altogether, as it doesn't escape the quoted field with embedded comma – chrisbunney Jan 23 '12 at 10:37
  • Hi @chrisbunney, could you please confirm that you're using *GNU awk 4*? Why are you setting FS? – Dimitre Radoulov Jan 23 '12 at 12:16
  • Apparently I'm not using GNU awk 4: `gawk -W version: GNU Awk 3.1.7`. Although you specified GNU awk 4, I didn't realise this functionality might not be present in earlier versions. I was fiddling around, which is why I set FS, to see what happened – chrisbunney Jan 23 '12 at 12:31
  • 2
    The builtin variable *FPAT* was introduced in *GNU awk 4*. I've just posted a *Perl* solution in your original thread. – Dimitre Radoulov Jan 23 '12 at 12:37
  • Mac users can do `brew install gawk` and use `gawk` in place of `awk`. – scottgwald Mar 04 '16 at 20:04
  • Doe this solution work for mawk ? I tried in gawk and it's working but not working in mawk. – Data Origin Aug 13 '19 at 10:46
12

FPAT works when there are newlines and commas inside the quoted fields, but not when there are double quotes, like this:

field1,"field,2","but this field has ""escaped"" quotes"

You can use a simple wrapper program I wrote called csvquote to make data easy for awk to interpret, and then restore the problematic special characters, like this:

csvquote inputfile.csv | awk -F, '{print $4}' | csvquote -u

See https://github.com/dbro/csvquote for code and docs

D Bro
  • 543
  • 6
  • 10
  • 3
    Hi @DBro Double-doublequotes can be accounted for expanding the FPAT regexp a little bit:`BEGIN { FPAT = "(\"([^\"]|\"\")*\")|([^,\"]*)" }` – colemar Mar 11 '16 at 16:50
0

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

Suppose you only want to print the 4th field:

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){ @f=$csv->fields(); print "\"$f[3]\"" }' file

The input line is split into array @f
Field 4 is $f[3] since Perl starts indexing at 0

I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk

Community
  • 1
  • 1
Chris Koknat
  • 3,305
  • 2
  • 29
  • 30