
The intent of this question is to provide a canonical answer.

Given a CSV file, as might be generated by Excel or other tools, with embedded newlines, double quotes, and/or commas in fields, and with empty fields, like:

$ cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1

fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",

What's the most robust and efficient way, using awk, to identify the separate records and fields:

Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3,fld2">
    $3=<>
----

so that the rest of the awk script can operate on those records and fields internally.

A valid CSV would be one that conforms to RFC 4180 or can be generated by MS-Excel.

The solution must tolerate records that end with just LF (\n), as is typical for UNIX files, rather than the CRLF (\r\n) that the standard requires and that Excel or other Windows tools would generate. It must also tolerate unquoted fields mixed with quoted fields. It specifically does not need to tolerate escaping "s with a preceding backslash (i.e. \" instead of "") as some other CSV formats allow: if you have that, then adding a gsub(/\\"/,"\"\"") up front would handle it, and trying to handle both escaping mechanisms automatically in one script would make the script unnecessarily fragile and complicated.
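
For illustration, that preprocessing might look like this minimal sketch (the file names are hypothetical, and it assumes \" is the only escaping mechanism present in the input, so every \" can safely become ""):

$ awk '{ gsub(/\\"/,"\"\"") } 1' backslash.csv > doubled.csv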

Ed Morton

7 Answers


If your CSV cannot contain newlines then all you need is (with GNU awk for FPAT):

$ echo 'foo,"field,""with"",commas",bar' |
    awk -v FPAT='[^,]*|("([^"]|"")*")' '{for (i=1; i<=NF;i++) print i " <" $i ">"}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>

or the equivalent using any awk:

$ echo 'foo,"field,""with"",commas",bar' |
    awk -v fpat='[^,]*|("([^"]|"")*")' -v OFS=',' '{
        rec = $0
        $0 = ""
        i = 0
        while ( (rec!="") && match(rec,fpat) ) {
            $(++i) = substr(rec,RSTART,RLENGTH)
            rec = substr(rec,RSTART+RLENGTH+1)
        }
        for (i=1; i<=NF;i++) print i " <" $i ">"
    }'
1 <foo>
2 <"field,""with"",commas">
3 <bar>

See https://www.gnu.org/software/gawk/manual/gawk.html#More-CSV for info on the specific FPAT setting I use above.

If all you actually want to do is convert your CSV to individual lines by, say, replacing newlines with blanks and commas with semicolons inside quoted fields, then all you need is this, again using GNU awk for multi-char RS and RT:

$ awk -v RS='"([^"]|"")*"' -v ORS= '{gsub(/\n/," ",RT); gsub(/,/,";",RT); print $0 RT}' file.csv
"rec1; fld1",,"rec1"";""fld3.1 ""; fld3.2","rec1 fld4"
"rec2; fld1.1  fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3;fld2""",

Otherwise, though, the general, robust, portable solution to identify the fields that will work with any modern awk* is:

$ cat decsv.awk
function buildRec(      fpat,fldNr,fldStr,done) {
    CurrRec = CurrRec $0
    if ( gsub(/"/,"&",CurrRec) % 2 ) {
        # The string built so far in CurrRec has an odd number
        # of "s and so is not yet a complete record.
        CurrRec = CurrRec RS
        done = 0
    }
    else {
        # If CurrRec ended with a null field we would exit the
        # loop below before handling it so ensure that cannot happen.
        # We use a regexp comparison using a bracket expression here
        # and in fpat so it will work even if FS is a regexp metachar
        # or a multi-char string like "\\\\" for \-separated fields.
        CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
        $0 = ""
        fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
        while ( (CurrRec != "") && match(CurrRec,fpat) ) {
            fldStr = substr(CurrRec,RSTART,RLENGTH)
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            if ( gsub(/^"|"$/,"",fldStr) ) {
                gsub(/""/, "\"", fldStr)
            }
            $(++fldNr) = fldStr
            CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
        }
        CurrRec = ""
        done = 1
    }
    return done
}

# If your input has \-separated fields, use FS="\\\\"; OFS="\\"
BEGIN { FS=OFS="," }
!buildRec() { next }
{
    printf "Record %d:\n", ++recNr
    for (i=1;i<=NF;i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf "    $%d=<%s>\n", i, $i
    }
    print "----"
}


$ awk -f decsv.awk file.csv
Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3,fld2">
    $3=<>
----
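
To use the records and fields for your own processing rather than just printing them, keep buildRec() and the BEGIN block from decsv.awk as-is and swap in your own main block. A minimal sketch (the "report records with an empty 2nd field" task is purely illustrative):

!buildRec() { next }
{ recNr++ }
$2 == "" { printf "record %d has an empty 2nd field\n", recNr }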

The above assumes UNIX line endings of \n. With Windows \r\n line endings it's much simpler, as the "newlines" within each field will actually just be line feeds (i.e. \ns), so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings.
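
For example, with true \r\n record endings the whole job reduces to something like this sketch (GNU awk only; windows.csv is a hypothetical CRLF-terminated file, and the FPAT value is the one from the gawk manual section linked above):

$ gawk -v RS='\r\n' -v FPAT='([^,]*)|("([^"]|"")*")' '{
    printf "Record %d:\n", NR
    for (i=1; i<=NF; i++) {
        # strip enclosing quotes, then undouble any embedded ""s
        if ( gsub(/^"|"$/,"",$i) ) gsub(/""/,"\"",$i)
        printf "    $%d=<%s>\n", i, $i
    }
    print "----"
}' windows.csv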

It works by simply counting how many "s are present so far in the current record whenever it encounters the RS. If that count is odd, the RS (presumably \n, but it doesn't have to be) is mid-field, so we keep building the current record; if it's even, it's the end of the current record, and the rest of the script can continue processing the now-complete record.
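
You can see that parity test in isolation on the first and last lines of file.csv from the question - gsub(/"/,"&") replaces every " with itself and returns the number of replacements, i.e. the count of "s:

$ echo '"rec1, fld1",,"rec1"",""fld3.1' |
    awk '{ print (gsub(/"/,"&") % 2 ? "incomplete" : "complete") }'
incomplete

$ echo '"""""","""rec3,fld2""",' |
    awk '{ print (gsub(/"/,"&") % 2 ? "incomplete" : "complete") }'
complete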

*I say "modern awk" above because there are apparently extremely old (circa 2000) versions of tawk and mawk1 still around which have bugs in their gsub() implementation such that gsub(/^"|"$/,"",fldStr) would not remove the start/end "s from fldStr. If you're using one of those then get a new awk, preferably gawk, as there could be other issues with them too, but if that's not an option then I expect you can work around that particular bug by changing this:

        if ( gsub(/^"|"$/,"",fldStr) ) {

to this:

        if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {

Thanks to the following people for identifying and suggesting solutions to the stated issues with the original version of this answer:

  1. @mosvy for escaped double quotes within fields.
  2. @datatraveller1 for multiple contiguous pairs of escaped quotes in a field and null fields at the end of records.

Related: also see How do I use awk under cygwin to print fields from an excel spreadsheet? for how to generate CSVs from Excel spreadsheets.

Ed Morton

An improvement upon @EdMorton's FPAT solution, which should be able to handle double quotes (") escaped by doubling ("" - as allowed by the CSV standard).

gawk -v FPAT='[^,]*|("[^"]*")+' ...

This STILL

  1. isn't able to handle newlines inside quoted fields, which are perfectly legit in standard CSV files.

  2. assumes GNU awk (gawk); a standard awk won't do.

Example:

$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v OFS='|' -v FPAT='[^,]*|("[^"]*")+' '{$1=$1}1'
a||""|"y""ck"|"""x,y,z"|" "|12

$ echo 'a,,"","y""ck","""x,y,z"," ",12' |
gawk -v FPAT='[^,]*|("[^"]*")+' '{
  for(i=1; i<=NF;i++){
    if($i~/"/){ $i = substr($i, 2, length($i)-2); gsub(/""/,"\"", $i) }
    print "<"$i">"
  }
}'
<a>
<>
<>
<y"ck>
<"x,y,z>
< >
<12>
  • That's a nice FPAT, I can't imagine any invalid case it'd allow - you should suggest the gawk folks update [their FPAT documentation](https://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content) to use it instead of the `FPAT = "([^,]*)|(\"[^\"]+\")"` documented at the bottom of that section and that I used. – Ed Morton Mar 13 '20 at 20:02
  • The gawk manual has been updated with [a new section](https://www.gnu.org/software/gawk/manual/gawk.html#More-CSV) addressing this and now provides a different `FPAT` that can also handle fields that contain both commas and escaped quotes, so I've updated my answer to use that, `FPAT='([^,]*)|("([^"]|"")*")'`, for consistency with the documentation. Thanks. – Ed Morton Feb 20 '22 at 16:30

This is exactly what csvquote is for - it makes things simple for awk and other command line data processing tools.

Some things are difficult to express in awk. Instead of running a single awk command and trying to get awk to handle the quoted fields with embedded commas and newlines, the data gets prepared for awk by csvquote, so that awk can always interpret the commas and newlines it finds as field separators and record separators. This makes the awk part of the pipeline simpler. Once awk is done with the data, it goes back through csvquote -u to restore the embedded commas and newlines inside quoted fields.

csvquote file.csv | awk -f my_awk_script | csvquote -u

EDIT:

For a complete description of csvquote, see: How it works. This also explains the nonprinting placeholder characters which are shown in places where there was a carriage return.
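
If you're curious what those placeholder bytes look like, piping the csvquote output through od (or cat -v) makes them visible; a quick sketch:

$ csvquote file.csv | od -c | head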

csvquote file.csv | awk -f decsv.awk | csvquote -u

(for the source of decsv.awk, see the answer from Ed Morton above.) Output:

Record 1:
    $1=<rec1 fld1>
    $2=<>
    $3=<rec1","fld3.1",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3fld2">
    $3=<>
----
D Bro
  • Please [edit] your answer to include the output of `csvquote file.csv` so we can see what exactly it'd do to the CSV from my question. Also add the output of `csvquote file.csv | csvquote -u` so we can see if it reproduces the input unchanged. – Ed Morton Jan 31 '22 at 15:22

I have found csvkit a really useful toolkit for handling CSV files on the command line.

line='test,t2,t3,"t5,"'
echo "$line" | csvcut -c 4
"t5,"
echo 'foo,"field,""with"",commas",bar'  | csvcut -c 3
bar

It also contains tools such as csvstat and csvstack, which are also very handy.

cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1

fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",

csvcut -c 1 file.csv

"rec1, fld1"
"rec2, fld1.1

fld1.2"
""""""

csvcut -c 3 file.csv
"rec1"",""fld3.1
"",
fld3.2"
""
""
chunyang.wen
  • There's lots of ways with lots of tools to handle CSVs but this question is how to do so **with awk**. If you decide to post an answer using some other tool, though, then please [edit] your answer to show the complete command you'd use to handle the CSV in my question and produce the posted expected output from that input, not just parse 1 line of text with no embedded newlines. – Ed Morton Apr 20 '22 at 11:45

Awk (gawk) actually provides extensions, one of which is CSV processing; in my opinion that is the most robust way to handle CSV with gawk. The extension takes care of many gotchas and parses the CSV for you.

Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.

Assuming test.csv contains the following:

Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123

The following will print all lines where the Phone (aka the second field) is equal to 123:

gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print a[1]}'

The output is:

James T. Kirk

How does it work?

  • -l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
  • csvsplit($0, a) splits the current line, and stores each field into a new array named a
  • && a[2] == 123 checks that the second field is 123
  • if both conditions are true, it runs { print a[1] }, i.e. prints the first CSV field of the line (a sketch that iterates over all fields follows below).
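
Here's a sketch that iterates over every field instead (this assumes csvsplit() returns the number of fields, like split() does - check the gawkextlib documentation for your version before relying on that):

gawk -l csv '{
    n = csvsplit($0, a)   # assumed to return the field count, as split() does
    for (i = 1; i <= n; i++)
        print NR, i, a[i]
}' test.csv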
Jiehong
  • The CSV extension isn't released yet, see http://gawkextlib.sourceforge.net/ where under `Available Extensions` it says `csv (not yet released)`. It's been that way for at least a couple of years and last time I discussed the implementation with the person creating it, Manuel, in mid-2021,(see https://mail.gnu.org/archive/html/bug-gawk/2021-05/msg00014.html) there were still implementation and support discussions being held and I haven't heard of any plans to release it since so YMMV with using it. – Ed Morton Jul 26 '22 at 13:24

If you're using one of the common AWK interpreters (Gawk, onetrueawk, mawk), the other solutions are your best bet. However, if you're able to use a different interpreter, frawk and GoAWK have proper CSV support built-in.

frawk is a very fast AWK implementation written in Rust. Use -i csv to process input in CSV mode. Note that frawk is not quite POSIX compatible (see differences).

GoAWK is a POSIX-compatible AWK implementation written in Go. Also supports -i csv mode, as well as -H (parse header row) with @"named_field" syntax (read more). Disclaimer: I'm the author of GoAWK.
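
For example, a named-field sketch of the same Name/Phone query as in the gawk-extension answer above (assuming the @"field" header syntax described in the GoAWK docs linked above, and that same test.csv):

$ goawk -i csv -H '@"Phone" == 123 { print @"Name" }' test.csv
James T. Kirk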

With file.csv as per the question, you can simply use an AWK script with a regular for loop over the fields as follows:

$ cat records.awk
{
    printf "Record %d:\n", NR
    for (i=1; i<=NF; i++)
        printf "    $%d=<%s>\n", i, $i
    print "----"
}

Then use either frawk -i csv or goawk -i csv to get the expected output. For example:

$ frawk -i csv -f records.awk file.csv
Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3,fld2">
    $3=<>
----

$ goawk -i csv -f records.awk file.csv 
Record 1:
... same as above ...
----
Ben Hoyt

Once you get into complicated CSV, it is really easier to use a proper parser. While the gawk scripts using FPAT are amazing, they are not a complete RFC 4180 solution.

My go-to for CSV parsing at the command prompt is Ruby. There are others (Perl and Python are wonderful) but Ruby is closest to Awk in syntax.

Here is a minimal Ruby script to parse your example:

ruby -r csv -e '
CSV.parse($<.read).
    each_with_index{|record, idx| 
        puts "Record #{idx+1}:"
        record.each_with_index{|fld, i| 
            puts "\t$#{i+1}=<#{fld}>"
        }
        puts "----"
    }

' file

With your example input, prints:

Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3,fld2">
    $3=<>
----

Note that nothing at all needed to be done to the records or fields since they were properly parsed with the default settings. (Well, you need the +1 since Ruby arrays are 0-based.)

dawg