14

How do you parse a CSV file using gawk? Simply setting FS="," is not enough, as a quoted field with a comma inside will be treated as multiple fields.

Example using FS="," which does not work:

file contents:

one,two,"three, four",five
"six, seven",eight,"nine"

gawk script:

BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) printf "field #%d: %s\n", i, $(i)
  printf "---------------------------\n"
}

bad output:

field #1: one
field #2: two
field #3: "three
field #4:  four"
field #5: five
---------------------------
field #1: "six
field #2:  seven"
field #3: eight
field #4: "nine"
---------------------------

desired output:

field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------
MCS

  • see also: https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk – Sundeep Nov 28 '17 at 10:10
  • Possible duplicate of [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) – miken32 Dec 28 '18 at 00:47

9 Answers

16

The gawk version 4 manual says to use FPAT = "([^,]*)|(\"[^\"]+\")"

When FPAT is defined, it disables FS and specifies fields by content instead of by separator.
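For example, the question's script needs only that one change. A minimal sketch, assuming gawk >= 4.0 and no doubled quotes inside quoted fields:

BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }
{
  for (i = 1; i <= NF; i++) printf "field #%d: %s\n", i, $i
  printf "---------------------------\n"
}

Run it as gawk -f script.awk input.csv; on the sample input it should print the desired output from the question, quotes included.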

BCoates

  • The FPAT concept is interesting. The regex quoted though does not allow for double quotes within the quoted string. That requires a more complex regex, such as: `FPAT="([^,]*)|(\"([^\"]|\"\")+\"[^,]*)"`. The final `[^,]*` allows for malformed fields which start with quotes, such as `"abc"def,`; it treats the `def` as part of the field. Within the double quotes, two consecutive double quotes are accepted. This stuff is _nasty_, which is why CSV-specific modules are usually the best way to deal with CSV data unless the CSV data is clean and simple. – Jonathan Leffler Jul 03 '12 at 22:45
  • FPAT requires gawk 4. Took me some time... ;) – Richard Kiefer Dec 14 '12 at 18:03
  • I've set up an alias with gawk to make it easy to run gawk on CSVS: `alias awkcsv="gawk -v FPAT='([^,]+)|(\"[^\"]+\")'"` – DeegC Jan 04 '16 at 18:12
14

The short answer is "I wouldn't use gawk to parse CSV if the CSV contains awkward data", where 'awkward' means things like commas in the CSV field data.

The next question is "What other processing are you going to be doing", since that will influence what alternatives you use.

I'd probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer - hence the a2p and s2p programs still distributed with Perl which convert awk and sed scripts (respectively) into Perl.

Jonathan Leffler

  • See also the [`csvfix`](http://code.google.com/p/csvfix/) program. Of course, Python (and Ruby, and Tcl, and most other extensible scripting languages) can be used instead of Perl; that becomes a question of personal taste or company-mandated (Hobson's) choice. – Jonathan Leffler Jul 03 '12 at 21:54
  • I went down the `gawk` `FPAT` path for the past hour. It cannot cope with many real-world situations like empty fields and double quotes within fields, as pointed out. Time to switch to something more powerful! – pedz Mar 20 '22 at 17:50
4

You can use a simple wrapper function called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and end, and everything should work out ok:

before:

gawk -f myprogram.awk input.csv

after:

csvquote input.csv | gawk -f myprogram.awk | csvquote -u

See https://github.com/dbro/csvquote for code and documentation.
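Under the hood, csvquote temporarily replaces the commas (and newlines) that occur inside quoted fields with nonprinting characters, so the awk stage sees exactly one comma per field boundary; -u restores the original characters. As a sketch, the naive script from the question then works unchanged inside the sandwich (assuming csvquote is installed and on your PATH):

csvquote input.csv | gawk 'BEGIN { FS = "," }
{
  for (i = 1; i <= NF; i++) printf "field #%d: %s\n", i, $i
  printf "---------------------------\n"
}' | csvquote -u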

D Bro
4

If permissible, I would use the Python csv module, paying special attention to the dialect used and formatting parameters required, to parse the CSV file you have.

ayaz
2

csv2delim.awk

# csv2delim.awk converts comma-delimited files with optional quotes to a delim-separated file
#     delim can be any character, defaults to tab
# assumes no repl characters in the text; any delim already in a line is converted to repl
#     repl can be any character, defaults to ~
# changes two consecutive quotes within quotes to '

# usage: gawk -f csv2delim.awk [-v delim=d] [-v repl=r] input-file > output-file
#       -v delim    delimiter, defaults to tab
#       -v repl     replacement char, defaults to ~

# e.g. gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > test.txt

# abe 2-28-7
# abe 8-8-8 1.0 fixed empty fields, added replacement option
# abe 8-27-8 1.1 used split
# abe 8-27-8 1.2 inline rpl and "" = '
# abe 8-27-8 1.3 revert to 1.0 as it is much faster, split most of the time
# abe 8-29-8 1.4 better message if delim present

BEGIN {
    if (delim == "") delim = "\t"
    if (repl == "") repl = "~"
    print "csv2delim.awk v.m 1.4 run at " strftime() > "/dev/stderr" ###########################################
}

{
    #if ($0 ~ repl) {
    #   print "Replacement character " repl " is on line " FNR ":" lineIn ";" > "/dev/stderr"
    #}
    if ($0 ~ delim) {
        print "Temp delimiter character " delim " is on line " FNR ":" lineIn ";" > "/dev/stderr"
        print "    replaced by " repl > "/dev/stderr"
    }
    gsub(delim, repl)

    $0 = gensub(/([^,])\"\"/, "\\1'", "g")
#   $0 = gensub(/\"\"([^,])/, "'\\1", "g")  # not needed above covers all cases

    out = ""
    #for (i = 1;  i <= length($0);  i++)
    n = length($0)
    for (i = 1;  i <= n;  i++)
        if ((ch = substr($0, i, 1)) == "\"")
            inString = (inString) ? 0 : 1 # toggle inString
        else
            out = out ((ch == "," && ! inString) ? delim : ch)
    print out
}

END {
    print NR " records processed from " FILENAME " at " strftime() > "/dev/stderr"
}

test.csv

"first","second","third"
"fir,st","second","third"
"first","sec""ond","third"
" first ",sec   ond,"third"
"first" , "second","th  ird"
"first","sec;ond","third"
"first","second","th;ird"
1,2,3
,2,3
1,2,
,2,
1,,2
1,"2",3
"1",2,"3"
"1",,"3"
1,"",3
"","",""
"","""aiyn","oh"""
"""","""",""""
11,2~2,3

test.bat

rem test csv2delim
rem default is: -v delim={tab} -v repl=~
gawk                      -f csv2delim.awk test.csv > test.txt
gawk -v delim=;           -f csv2delim.awk test.csv > testd.txt
gawk -v delim=; -v repl=` -f csv2delim.awk test.csv > testdr.txt
gawk            -v repl=` -f csv2delim.awk test.csv > testr.txt
Jonathan Leffler
1

I am not exactly sure whether this is the right way to do things. I would rather work on a CSV file in which either all values are quoted or none are. By the way, awk allows a regex to be the field separator; check whether that is useful. A sketch of the idea follows.
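A minimal sketch of that regex-FS idea for the all-quoted case (it assumes every field on every line is quoted and no field contains an embedded quote):

BEGIN { FS = "\",\"" }    # the text between two quoted fields is ","
{
  gsub(/^"|"$/, "")       # strip the outer quotes; modifying $0 re-splits it on FS
  for (i = 1; i <= NF; i++) printf "field #%d: %s\n", i, $i
}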

Vijay Dev
  • I would also go for the regexp approach and try to let it match something like this: ^"|","|"$ (this is a quick shot; you of course have to escape the ", I want to keep it simple) – flolo Nov 24 '08 at 15:23
1
{
  ColumnCount = 0
  $0 = $0 ","                           # Assures all fields end with comma
  while($0)                             # Get fields by pattern, not by delimiter
  {
    match($0, / *"[^"]*" *,|[^,]*,/)    # Find a field with its delimiter suffix
    Field = substr($0, RSTART, RLENGTH) # Get the located field with its delimiter
    gsub(/^ *"?|"? *,$/, "", Field)     # Strip delimiter text: comma/space/quote
    Column[++ColumnCount] = Field       # Save field without delimiter in an array
    $0 = substr($0, RLENGTH + 1)        # Remove processed text from the raw data
  }
}

Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements in Column[] that were found. If not all rows contain the same number of columns, Column[] contains extra data after Column[ColumnCount] when processing the shorter rows.

This implementation is slow, but it appears to emulate the FPAT/patsplit() feature found in gawk >= 4.0.0 mentioned in a previous answer.
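For comparison, a rough sketch of the same loop written with the gawk >= 4.0 built-in, using a field pattern that mirrors the default FPAT suggestion. Note that patsplit() clears Column[] first, so the stale-data caveat above does not apply, but it keeps the quotes that this answer's gsub() strips:

{
    ColumnCount = patsplit($0, Column, /([^,]*)|("[^"]+")/)
    for (i = 1; i <= ColumnCount; i++)
        printf "field #%d: %s\n", i, Column[i]
}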


kbulgrien
0

Here's what I came up with. Any comments and/or better solutions would be appreciated.

BEGIN { FS="," }
{
  for (i=1; i<=NF; i++) {
    f[++n] = $i
    if (substr(f[n],1,1)=="\"") {
      while (substr(f[n], length(f[n]))!="\"" || substr(f[n], length(f[n])-1, 1)=="\\") {
        f[n] = sprintf("%s,%s", f[n], $(++i))
      }
    }
  }
  for (i=1; i<=n; i++) printf "field #%d: %s\n", i, f[i]
  print "----------------------------------\n"
}

The basic idea is that I loop through the fields, and any field which starts with a quote but does not end with a quote gets the next field appended to it.

MCS
  • This looks more C-like.. Are we using the right tool for the right job? I am a novice in awk and can't think of any straight forward solutions though.. – Vijay Dev Nov 24 '08 at 15:17
  • @Vijay Dev, novice means beginner, not expert. – Robert Gamble Nov 24 '08 at 20:12
  • Ah My English !! I wanted to say - 'I am a novice and so I can't think of any straight forward solutions' – Vijay Dev Nov 25 '08 at 13:30
  • FYI, this works, but you need "n=0" as a final line for it to function on a multi-line file properly. – GoldenBoy Oct 28 '10 at 19:45
  • Note that a valid field could be: `"""Jump"", he said!"`. This will be split at the comma, but the character before is a double quote. The script splits at the comma, even though it shouldn't as the comma is embedded inside a quoted field. An odd number of double quotes preceding the comma indicates end-of-field; an even number does not. – Jonathan Leffler Sep 11 '15 at 01:27
0

Perl has the Text::CSV_XS module, which is purpose-built to handle the quoted-comma weirdness.
Alternatively, try the Text::CSV module.

perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv

Produces this output:

field #0: one
field #1: two
field #2: three, four
field #3: five
---
field #0: six, seven
field #1: eight
field #2: nine
---

Here's a human-readable version.
Save it as parsecsv, chmod +x, and run it as "parsecsv file.csv"

#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
open(my $data, '<', $ARGV[0]) or die "Could not open '$ARGV[0]' $!\n";
while (my $line = <$data>) {
    if ($csv->parse($line)) {
        my @f = $csv->fields();
        for my $n (0..$#f) {
            print "field #$n: $f[$n]\n";
        }
        print "---\n";
    }
}

You may need to point to a different version of perl on your machine, since the Text::CSV_XS module may not be installed on your default version of perl.

Can't locate Text/CSV_XS.pm in @INC (@INC contains: /home/gnu/lib/perl5/5.6.1/i686-linux /home/gnu/lib/perl5/5.6.1 /home/gnu/lib/perl5/site_perl/5.6.1/i686-linux /home/gnu/lib/perl5/site_perl/5.6.1 /home/gnu/lib/perl5/site_perl .).
BEGIN failed--compilation aborted.

If none of your versions of Perl have Text::CSV_XS installed, you'll need to:
sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS

Chris Koknat