23

Say I have the following csv file:

 id,message,time
 123,"Sorry, This message
 has commas and newlines",2016-03-28T20:26:39
 456,"It makes the problem non-trivial",2016-03-28T20:26:41

I want to write a bash command that will return only the time column, i.e.

time
2016-03-28T20:26:39
2016-03-28T20:26:41

What is the most straightforward way to do this? You can assume the availability of standard Unix utilities such as awk, gawk, cut, grep, etc.

Note the double quotes, which escape the embedded commas and newlines and make trivial attempts such as

cut -d , -f 3 file.csv

futile.

Jacob Horbulyk

11 Answers

16

As chepner said, you are encouraged to use a programming language that can parse CSV.

Here is an example in Python:

import csv

with open('a.csv', newline='') as csvfile:  # newline='' lets the csv module handle embedded newlines
    reader = csv.reader(csvfile, quotechar='"')
    for row in reader:
        print(row[-1]) # row[-1] gives the last column
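If a separate script feels heavy, the same stdlib parser can be run straight from the shell. This is a sketch, assuming python3 is on PATH; it first recreates the sample file from the question:

```shell
# Recreate the question's sample file.
printf '%s\n' 'id,message,time' \
  '123,"Sorry, This message' \
  'has commas and newlines",2016-03-28T20:26:39' \
  '456,"It makes the problem non-trivial",2016-03-28T20:26:41' > a.csv

# Let the csv module do the parsing; print the last column of each row.
python3 -c '
import csv, sys
for row in csv.reader(sys.stdin):
    print(row[-1])
' < a.csv
```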
hek2mgl
6

As said here, to handle specifically those newlines that are inside double-quoted strings, while leaving those outside them alone, use GNU awk (for RT):

gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' file.csv

This works by splitting the file along " characters and removing newlines in every other block. Then pipe into awk to split the columns and display the last one:

gawk -v RS='"' 'NR % 2 == 0 { gsub(/\n/, "") } { printf("%s%s", $0, RT) }' file.csv \
 | awk -F, '{print $NF}'

Output

time
2016-03-28T20:26:39
2016-03-28T20:26:41

Srini V
5

CSV is a format that needs a proper parser (i.e. it can't be parsed reliably with regular expressions alone). If you have Python installed, use the csv module instead of plain Bash.

If not, consider csvkit which has a lot of powerful tools to process CSV files from the command line.


Aaron Digulla
  • FWIW, a csv probably could be parsed with regex, but it would definitely be a pain. – combinatorist Jun 24 '19 at 20:15
  • Part of the reason it's annoying is that csv is actually a loose family of dialects and it would be particularly hard to find a one size fits all regex for all the variations. – combinatorist Jun 24 '19 at 20:16
0

Another awk alternative, using FS:

$ awk -F'"' '!(NF%2){getline remainder;$0=$0 OFS remainder}
                NR>1{sub(/,/,"",$NF); print $NF}' file

2016-03-28T20:26:39
2016-03-28T20:26:41
karakfa
0

I ran into something similar when attempting to deal with lspci -m output, where the embedded newlines would need to be escaped first (though IFS=, should work here, since this abuses bash's quote evaluation). Here's an example:

f:13.3 "System peripheral" "Intel Corporation" "Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder" -r01 "Super Micro Computer Inc" "Device 0838"

And the only reasonable way I can find to bring that into bash is along the lines of:

# echo 'f:13.3 "System peripheral" "Intel Corporation" "Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder" -r01 "Super Micro Computer Inc" "Device 0838"' | { eval array=($(cat)); declare -p array; }
declare -a array='([0]="f:13.3" [1]="System peripheral" [2]="Intel Corporation" [3]="Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Memory Controller 0 - Channel Target Address Decoder" [4]="-r01" [5]="Super Micro Computer Inc" [6]="Device 0838")'
# 

Not a full answer, but might help!

Brian Chrisman
  • "Reasonable" only if you trust your data. A piece of hardware that reads out as `"Super 1337 $(rm -rf ~) Hey I Changed Your DMI Info"` would lead to bad times. – Charles Duffy Jun 15 '18 at 15:25
  • sure.. any bash comment with 'eval' in it should probably be flagged 'caveat emptor' – Brian Chrisman Jun 16 '18 at 15:55
0

Vanilla bash script

Save this code as parse_csv.sh and give it execute permission (chmod +x parse_csv.sh):

#!/bin/bash                             
# vim: ts=4 sw=4 hidden nowrap          
# @copyright Copyright © 2021 Carlos Barcellos <carlosbar at gmail.com>         
# @license https://www.gnu.org/licenses/lgpl-3.0.en.html
                                    
if [ "$1" = "-h" -o "$1" = "--help" -o "$1" = "-v" ]; then
    echo "parse csv 0.1"                    
    echo ""
    echo "parse_csv.sh [csv file] [delimiter]"
    echo "  csv file    csv file to parse; default stdin"                           
    echo "  delimiter   delimiter to use. default is comma"
    exit 0
fi                                                                              
delim=,
if [ $# -ge 1 ]; then
    [ -n "$1" ] && file="$1"
    [ -n "$2" -a "$2" != "\"" ] && delim="$2"
fi                                                                             
processLine() {
    if [[ ! "$1" =~ \" ]]; then
        (                                               
           IFS="$delim"; fields=($1)
           echo  "${fields[@]}"  
        )
        return 0
    fi
    under_scape=0
    fields=()
    acc=
    for (( x=0; x < ${#1}; x++ )); do
        if [ "${1:x:1}" = "${delim:0:1}" -o $((x+1)) -ge ${#1} ] && [ $under_scape -ne 1 ]; then
            [ "${1:x:1}" != "${delim:0:1}" ] && acc="${acc}${1:x:1}"
            fields+=("$acc")
            acc=
        elif [ "${1:x:1}" = "\"" ]; then
            if [ $under_scape -eq 1 ] && [ "${1:x+1:1}" = "\"" ]; then
                acc="${acc}${1:x:1}"
            else
                under_scape=$((!under_scape))                                           
            fi
            [ $((x+1)) -ge ${#1} ] && fields+=("$acc")
        else
            acc="${acc}${1:x:1}"                                                    
        fi
    done
    echo  "${fields[@]}"
    return 0
 } 
 while read -r line; do
     processLine "$line"
 done < ${file:-/dev/stdin}

Then use: ./parse_csv.sh "csv file". To print only the last column, change the echo "${fields[@]}" to echo "${fields[-1]}".

Carlos Barcellos
0

Perl to the rescue! Use the Text::CSV_XS module to handle CSV.

perl -MText::CSV_XS=csv -we 'csv(in => $ARGV[0],
                                 on_in => sub { $_[1] = [ $_[1][-1] ] })
                            ' -- file.csv
  • the csv subroutine processes the csv
  • in specifies the input file, $ARGV[0] contains the first command line argument, i.e. file.csv here
  • on_in specifies code to run. It gets the current row as the second argument, i.e. $_[1]. We just set the whole row to the contents of the last column.
choroba
0

I think you are overthinking it.

$: echo time; grep -Eo '[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}$' file
time
2016-03-28T20:26:39
2016-03-28T20:26:41

If you want to check for that comma just to be sure,

$: echo time; sed -En '/,[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}$/{ s/.*,//; p; }' file
time
2016-03-28T20:26:39
2016-03-28T20:26:41
Paul Hodges
0

csvquote is designed for exactly this kind of thing. It sanitizes the file (reversibly), so awk can rely on commas being field separators and newlines being record separators.
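csvquote may not be installed everywhere, so here is a sketch of the idea behind it using only awk and tr: replace the separators that occur inside quoted fields with nonprinting sentinel bytes, let the naive tools run, then restore. The sentinel bytes \037 and \036 are an assumption of this sketch, not necessarily what csvquote itself uses.

```shell
# Recreate the question's sample file.
printf '%s\n' 'id,message,time' \
  '123,"Sorry, This message' \
  'has commas and newlines",2016-03-28T20:26:39' \
  '456,"It makes the problem non-trivial",2016-03-28T20:26:41' > file.csv

# Inside every other RS='"'-delimited block (i.e. inside quotes), swap
# commas and newlines for sentinel bytes; the quote characters are dropped.
# Plain awk can then split on commas safely, and tr restores the sentinels.
awk -v RS='"' 'NR % 2 == 0 { gsub(/,/, "\037"); gsub(/\n/, "\036") }
               { printf "%s", $0 }' file.csv \
  | awk -F, '{print $NF}' \
  | tr '\037\036' ',\n'
```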

D Bro
0

csvcut from csvkit example

csvkit was mentioned at https://stackoverflow.com/a/36288388/895245, but here's a concrete example.

Install:

pip install csvkit

Sample CSV input file:

main.csv

a,"b
c",d
e,f

Get the first column:

csvcut -c 1 main.csv

which outputs:

a
e

or to get the second column:

csvcut -c 2 main.csv

which outputs the following valid CSV with a single column:

"b
c"
f

Or to swap the two columns around:

csvcut -c 2,1 main.csv

which outputs another valid CSV file:

"b
c",a
f,e

Tested on Ubuntu 23.04, csvkit==1.1.1.

Ciro Santilli OurBigBook.com
-3
awk -F, '!/This/{print $NF}' file

time
2016-03-28T20:26:39
2016-03-28T20:26:41
Claes Wikner