40

I am trying to parse a CSV containing potentially 100k+ lines. Here are the criteria I have:

  1. The index of the identifier
  2. The identifier value

I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).

Any ideas, taking performance into special consideration?

ib.
tinkertime

12 Answers

59

As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:

$ csvtool -t ',' col "$index" - < csvfile | grep "$value"

According to the docs, it handles escaping, quoting, etc.
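For example, given a file with a quoted field containing an embedded comma (the sample file and expected output here are my illustration, not from the original answer):

$ cat sample.csv
Name,Phone
"Woo, John",425-555-1212
$ csvtool -t ',' col 1 - < sample.csv
Name
"Woo, John"

A plain cut -d, -f1 would have split the quoted name in two; csvtool keeps it intact.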

tripleee
Andrey Vlasovskikh
  • I think csvtool is my new best friend. Crazy to think I was trying to parse .csv files in bash. Replaced a large number of bash lines to handle parsing and escaping quotes, embedded commas, etc. with a single call to csvtool! – Stéphane Oct 05 '16 at 12:03
  • Awesome solution, but it requires that the user has `csvtool` installed. This can be a problem when you need to use standard tools. – BugHunterUK Apr 04 '17 at 12:18
  • To get this thing: `sudo apt-get install csvtool` on Ubuntu etc., then `csvtool --help`, as the man page is, erm, sparse. – ErichBSchulz Aug 30 '18 at 09:47
43

See this YouTube video: BASH scripting lesson 10 working with CSV files

CSV file:

Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME

Bash script:

#!/bin/bash
OLDIFS=$IFS
IFS=";"
# read -r keeps backslashes literal; "$1" is quoted in case the path has spaces
while read -r user job uid location
do
    echo -e "$user \
    ======================\n\
    Role :\t $job\n\
    ID :\t $uid\n\
    SITE :\t $location\n"
done < "$1"
IFS=$OLDIFS

Output:

Bob Brown     ======================
    Role :   Manager
    ID :     16581
    SITE :   Main

Sally Seaforth     ======================
    Role :   Director
    ID :     4678
    SITE :   HOME
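
To adapt this loop to the original question (print only the lines whose Nth field equals a given value), here is a minimal sketch, assuming comma-delimited data with no quoted fields; the argument handling is my own:

#!/bin/bash
# Usage: ./filter.sh file.csv index value   (index is 1-based)
file=$1 index=$2 value=$3
while IFS= read -r line; do
    # Split the raw line on commas into an array, then compare the requested field.
    IFS=, read -r -a fields <<< "$line"
    [[ ${fields[index-1]} == "$value" ]] && printf '%s\n' "$line"
done < "$file"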
FRV
29

First prototype using plain old grep and cut:

grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"

If that's fast enough and gives the proper output, you're done.
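
If the prototype matches too much (grep finds the value anywhere on the line, not just in the chosen field), one option is to keep grep as a cheap pre-filter in front of an exact field comparison. A sketch, still assuming comma-delimited data with no quoting:

grep -F -- "${VALUE}" inputfile.csv | awk -F, -v i="${INDEX}" -v v="${VALUE}" '$i == v'

The grep -F pass cheaply discards most of the 100k+ lines before awk does the exact, per-field check.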

John Kugelman
unwind
  • +1. This pipeline doesn't handle comma escaping (`\,`) or string quoting (`"foo, bar"`). But it is a good and simple way of solving the problem. – Andrey Vlasovskikh Oct 13 '09 at 13:59
  • There's no need to use 2 tools across a pipe. I would recommend using awk. – ghostdog74 Oct 13 '09 at 14:09
  • @ghostdog: I don't know awk, and looking at e.g. Nate Kohl's awk reply, I think this qualifies as being simpler, at least. – unwind Oct 13 '09 at 14:10
  • Although the answer is correct for some CSV files, imho it's more damaging than helpful, because it encourages people on SO to prefer "one-liner" commands and happily adopt them without realizing the problems associated with them (which the answer didn't warn about either). In short, you parse a given file format with a parser for that file format. Just like you don't use regex to validate HTML, but instead use an HTML parser/validator. The fact that such "one-liners" work only for some special cases of those file formats should be in bold/underlined letters, always. – Mladen B. Feb 10 '21 at 13:25
14

CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.

So if your data are restricted enough to get away with simple comma-splitting, a shell script can do that easily. If, on the other hand, you need to parse CSV ‘properly’, bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
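
A minimal sketch of that approach, runnable from the shell (the 1-based $INDEX and $VALUE variables are my assumptions, not bobince's):

python3 -c '
import csv, sys
idx, val = int(sys.argv[1]) - 1, sys.argv[2]
out = csv.writer(sys.stdout, lineterminator="\n")
for row in csv.reader(sys.stdin):
    if len(row) > idx and row[idx] == val:
        out.writerow(row)
' "$INDEX" "$VALUE" < inputfile.csv

Unlike the comma-splitting one-liners, csv.reader copes with quoted fields and embedded commas.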

bobince
11

In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:

Name,Phone
"Woo, John",425-555-1212

You really need a library package that offers robust CSV support instead of relying on the comma as a field separator. I know that scripting languages such as Python have such support. However, I am comfortable with the Tcl scripting language, so that is what I use. Here is a simple Tcl script which does what you are asking for:

#!/usr/bin/env tclsh

package require csv 
package require Tclx

# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue

# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1

for_file line $fileName {
    set columns [csv::split $line]
    set columnValue [lindex $columns $columnNumber]
    if {$columnValue eq $expectedValue} {
        puts $line
    }   
}

Save this script to a file called csv.tcl and invoke it as:

$ tclsh csv.tcl filename indexNumber expectedValue

Explanation

The script reads the CSV file line by line, storing each line in the variable $line; it then splits each line into a list of columns (variable $columns). Next, it picks out the specified column and assigns it to the $columnValue variable. If there is a match, it prints out the original line.

Hai Vu
9

Using awk:

export INDEX=2
export VALUE=bar

awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv

Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch. One catch: index is the name of a built-in awk function, so the variable needs a different name:

awk -F, -v idx="$INDEX" -v val="$VALUE" '$idx == val {print}' inputfile.csv

Jeez...with variables, and everything, awk is almost a real programming language...

sondra.kinsey
Nate Kohl
  • The exports are likely unnecessary. And you should use awk's variable-passing feature, otherwise the quoting can get hairy: `awk -F, -v idx=$INDEX -v val=$VALUE '$idx == val {print}' inputfile.csv` – Dennis Williamson Oct 13 '09 at 17:59
  • This doesn't handle nontrivial CSV files with quoted fields which might contain newlines. – tripleee Dec 23 '19 at 09:52
5

For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.

If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.

You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):

csvquote inputfile.csv | awk -F, -v idx="$INDEX" -v val="$VALUE" '$idx == val {print}' | csvquote -u

This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
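
A small round-trip illustration (the sample data is mine): the embedded comma inside the quoted name no longer throws off awk's field count.

$ printf '%s\n' 'Name,Phone' '"Woo, John",425-555-1212' > sample.csv
$ csvquote sample.csv | awk -F, '{print $1}' | csvquote -u
Name
"Woo, John"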

arulmr
D Bro
3

index=1
value=2
awk -F"," -v i="$index" -v v="$value" '$i == v' file
ghostdog74
2

I was looking for an elegant solution that supports quoting and wouldn't require installing anything fancy on my VMware vMA appliance. It turns out this simple Python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values - TSV.)

#!/usr/bin/env python3

import sys
import csv

# Read CSV from stdin and re-emit it as tab-separated values,
# so downstream tools like cut and read can split the fields safely.
for row in csv.reader(sys.stdin):
    print('\t'.join(row))

Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:

> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,

In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
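
That line-by-line pattern might look something like the following sketch (the field position and variable name are mine, reusing the cut -f12 example above):

esxcli -h "$VI_HOST" --formatter=csv network vswitch standard list | csv2tsv.py | tail -n +2 |
while IFS= read -r line; do
    uplinks=$(printf '%s\n' "$line" | cut -f12)   # 12th column, as above
    echo "Uplinks: $uplinks"
done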

Thomas Guyot-Sionnest
2

Parsing CSV with primitive text-processing tools will fail on many types of CSV input.

xsv is a lovely and fast tool for doing this properly. To search for all records that contain the string "foo" in the third column:

cat file.csv | xsv search -s 3 foo
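
Note that xsv search treats the pattern as a regular expression matched anywhere in the field, so anchoring it gives the exact-match semantics the question asks for:

cat file.csv | xsv search -s 3 '^foo$'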
jbg
1

A sed or awk solution would probably be shorter, but here's one for Perl:

perl -F/,/ -ane 'print if $F[<INDEX>] eq "<VALUE>"' inputfile.csv

where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.)
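
A concrete, hedged variant that takes the question's 1-based index and the value as arguments (the BEGIN-block argument handling and the -l flag are my additions):

perl -F, -lane 'BEGIN { $i = shift; $v = shift } print if $F[$i-1] eq $v' "$INDEX" "$VALUE" inputfile.csv

The -l flag chomps the trailing newline before the comparison, which matters when the match is against the last column.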

mob
0

Awk (gawk) actually provides extensions, one of which is CSV processing.

Assuming that extension is installed, you can use awk to show all lines where a specific CSV field matches 123.

Assuming test.csv contains the following:

Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123

The following will print all lines where the Phone (aka the second field) is equal to 123:

gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print $0}' test.csv

The output is:

"James T. Kirk",123

How does it work?

  • -l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
  • csvsplit($0, a) splits the current line, and stores each field into a new array named a
  • && a[2] == 123 checks that the second field is 123
  • if both conditions are true, { print $0 } prints the full line, as requested.
Jiehong