
I'm creating a bash script to generate some output from a CSV file (I have over 1000 entries and don't fancy doing it by hand...).

The content of the CSV file looks similar to this:

Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

I have some code that can separate the fields using the comma as a delimiter, but some values actually contain commas, such as "Adygeya, Republic". These values are surrounded by quotes to indicate that the characters within should be treated as part of the field, but I don't know how to parse the line to take this into account.

Currently I have this loop:

while IFS=, read province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < "$input"

which produces this output for the sample data given above:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]

As you can see, the third entry is parsed incorrectly. I want it to output:

[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
chrisbunney
  • See http://stackoverflow.com/questions/7804673/escaping-separator-within-double-quotes – Tom Whittock Jan 20 '12 at 11:06
  • Thanks @TomWhittock, I will investigate the link given by that answer, I've never used `awk` before, so will probably need to swot up on it (for everyone else's benefit, the link is: http://backreference.org/2010/04/17/csv-parsing-with-awk/) – chrisbunney Jan 20 '12 at 14:07
  • can't you just re-export the data with '|', tab, or some other character that doesn't appear in the input? Good luck. – shellter Jan 20 '12 at 14:09
  • @shellter Unfortunately, I don't have control over the export of the data – chrisbunney Jan 20 '12 at 14:33
  • also search on googles groups for comp.lang.awk. There was a 3 month discussion of processing CSVs back 10 years ago. Some very sophisticated solutions. Good luck. – shellter Jan 20 '12 at 14:47
  • Similar: http://stackoverflow.com/questions/314384/parse-csv-file-using-gawk – Dennis Williamson Jan 20 '12 at 15:40
  • @DennisWilliamson the accepted answer on that question is essentially "don't use awk to process difficult CSV data", that doesn't help me process my data, although it does help eliminate options – chrisbunney Jan 20 '12 at 16:12
  • That's not the only thing it says. It also says use a Python or Perl module that is made especially for processing csv files. – Dennis Williamson Jan 20 '12 at 17:14
  • Having never used Python/Perl, I was shying away from that, not a great reason, but my motivation for this was hacking together something to save me manually editing a few thousand lines into a specific format... – chrisbunney Jan 20 '12 at 19:02
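For reference, the module route mentioned in the comments is short enough to run straight from the shell. A minimal sketch using Python's standard csv module (assuming a python3 binary is on the PATH; infile stands in for your CSV file):

python3 -c '
import csv, sys
# csv.reader understands quoted fields, so embedded commas survive intact
for row in csv.reader(sys.stdin):
    print(" ".join("[%s]" % field for field in row))
' < infile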

6 Answers


If you want to do it all in awk (GNU awk 4 is required for this script to work as intended):

awk '{
  for (i = 0; ++i <= NF;) {
    substr($i, 1, 1) == "\"" &&
      $i = substr($i, 2, length($i) - 2)
    printf "[%s]%s", $i, (i < NF ? OFS : RS)
  }
}' FPAT='([^,]+)|("[^"]+")' infile

Sample output:

% cat infile
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
% awk '{
  for (i = 0; ++i <= NF;) {
    substr($i, 1, 1) == "\"" &&
      $i = substr($i, 2, length($i) - 2)
    printf "[%s]%s", $i, (i < NF ? OFS : RS)
  }
}' FPAT='([^,]+)|("[^"]+")' infile
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
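
Note that FPAT was introduced in gawk 4.0, so on an older awk the one-liner above will mis-split the quoted field. A quick way to check which version you have:

gawk --version | head -n 1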

With Perl:

perl -MText::ParseWords -lne'
  print join " ", map "[$_]",
    parse_line(",", 0, $_);
' infile
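
Text::ParseWords ships with Perl itself, so no extra installation should be needed. If you'd rather keep the surrounding quotes in the output, the second argument to parse_line is a "keep" flag (a small variation on the above):

perl -MText::ParseWords -lne'
  print join " ", map "[$_]",
    parse_line(",", 1, $_);
' infile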

This should work with your awk version (it's based on a comp.unix.shell post; this variant also removes the embedded commas, matching your desired output).

awk '{
  n = parse_csv($0, data)
  for (i = 0; ++i <= n;) {
    gsub(/,/, "", data[i])                      # drop the embedded commas
    printf "[%s]%s", data[i], (i < n ? OFS : RS)
  }
}
# parse_csv: split a CSV record into array[1..n]; handles quoted fields
# and doubled ("") quotes inside them; returns the field count
function parse_csv(str, array,   field, i) {
  split("", array)                              # empty the array
  str = str ","                                 # append a sentinel comma
  while (match(str, /[ \t]*("[^"]*(""[^"]*)*"|[^,]*)[ \t]*,/)) {
    field = substr(str, 1, RLENGTH)             # the matched field, delimiter included
    gsub(/^[ \t]*"?|"?[ \t]*,$/, "", field)     # strip quotes and the trailing comma
    gsub(/""/, "\"", field)                     # un-double embedded quotes
    array[++i] = field
    str = substr(str, RLENGTH + 1)              # consume what matched
  }
  return i
}' infile
Dimitre Radoulov

After looking at @Dimitre's solution above, you can do something like this:

#!/usr/local/bin/gawk -f

BEGIN {
    FS = ","
    # FPAT describes the fields rather than the separators: a field is
    # either a run of non-commas or a double-quoted string (gawk 4+)
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    for (i = 1; i <= NF; i++)
        printf ("[%s] ", $i)
    print ""
}

Test:

[jaypal:~/Temp] cat filename
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

[jaypal:~/Temp] ./script.awk  filename
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia] 
[Piaui] [BR-PI] [20100] [BR] [Brazil] 
["Adygeya, Republic"] [RU-AD] [21250] [RU] [Russian Federation] 

For removing the `"` characters, you can pipe the output to sed.

[jaypal:~/Temp] ./script.awk  filename | sed 's#\"##g'
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia] 
[Piaui] [BR-PI] [20100] [BR] [Brazil] 
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation] 
jaypal singh

After thinking about the problem, I realised that since the comma in the string isn't important to me, it'd be easier to simply remove it from the input before parsing.

To that end, I've concocted a sed command that matches strings surrounded by double quotes that contain a comma. The command then removes the bits you don't want from the matched string, by splitting the regex into remembered (captured) sections and keeping only some of them.

This solution only works where the string contains a single comma between double quotes.

The unescaped regex is

(")(.*)(,)(.*)(")

The first, third, and fifth pairs of parentheses capture the opening double quote, comma, and closing double quote respectively.

The second and fourth pairs of parentheses capture the actual content of the field, which we want to keep.

sed Command To Remove Comma:

echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\1\2\4\5/'

sed Command To Remove Comma and Double Quotes:

echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/'

Updated Code:

tmpFile="${input}Temp"
sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/' < "$input" > "$tmpFile"
while IFS=, read province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < "$tmpFile"
rm "$tmpFile"

Output:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
[Bío-Bío] [CL-BI] [20154] [CL] [Chile]
chrisbunney
  • There are a few specific cases where this may work and a lot of cases where it won't. One significant problem is that in `sed` matches such as `.*` are greedy. – Dennis Williamson Jan 20 '12 at 17:16
  • Thanks for the feedback. With my input, I believe this will work fine, but I'm interested in finding out how to improve the general solution. Would this be an improvement? `(")(^,*)(,)(^"*)(")` Apparently `sed` doesn't support lazy matches, but a negated character class may work. Escaped quotes would also cause issues, I expect – chrisbunney Jan 20 '12 at 18:57
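
A sketch of that negated-character-class idea from the comment above (hypothetical, untested beyond the sample data; it still assumes at most one comma per quoted field and no escaped quotes):

echo "$input" | sed 's/\(\"\)\([^",]*\)\(,\)\([^",]*\)\(\"\)/\2\4/'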

If you can tolerate having the surrounding quotes persist in the output, you can use a small script I wrote called csvquote to enable awk and cut (and other UNIX text tools) to properly handle quoted fields that contain commas. You wrap the command like this:

csvquote inputfile.csv | awk -F, '{print "["$1"] ["$2"] ["$3"] ["$4"] ["$5"]"}' | csvquote -u

See https://github.com/dbro/csvquote for the code and documentation.
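
If the leftover quotes do matter, combining csvquote with the sed trick from the answer above should strip them once the commas have been restored (a sketch, not taken from the csvquote docs):

csvquote inputfile.csv | awk -F, '{print "["$1"] ["$2"] ["$3"] ["$4"] ["$5"]"}' | csvquote -u | sed 's/"//g'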

D Bro

While using Dimitre's solution (thank you for that), I noticed that his program ignores empty fields.

Here is the fix:

awk '{
  for (i = 0; ++i <= NF;) {
    substr($i, 1, 1) == "\"" &&
      $i = substr($i, 2, length($i) - 2)
    printf "[%s]%s", $i, (i < NF ? OFS : RS)
  }
}' FPAT='([^,]*)|("[^"]+")' infile
Sven L.

Owing to the slightly outdated version of awk on my system and a personal preference to stick to a Bash script, I've arrived at a slightly different solution.

I've produced a utility script based on this blog post that parses the CSV file and replaces the delimiters with a delimiter of your choice, so that the output can be captured and used to easily process the data. The script respects quoted strings and embedded commas, but it strips the double quotes it finds and doesn't handle escaped double quotes within fields.

#!/bin/bash

input=$1
delimiter=$2

if [ -z "$input" ];
then
    echo "Input file must be passed as an argument!"
    exit 98
fi

if [ ! -f "$input" ];
then
    echo "Input file '$input' doesn't exist!"
    exit 99
fi

if [ -z "$delimiter" ];
then
    echo "Delimiter character must be passed as an argument!"
    exit 98
fi

gawk -v delim="$delimiter" '{
    c = 0
    $0 = $0 ","                             # add a trailing comma: yes, cheating
    while ($0) {
        sep = (c++ > 0) ? delim : ""        # no delimiter before the first field
        match($0, / *"[^"]*" *,|[^,]*,/)    # quoted field or plain field, plus comma
        s = substr($0, RSTART, RLENGTH)     # save what matched in s
        gsub(/^ *"?|"? *,$/, "", s)         # remove quotes and the trailing comma
        printf "%s%s", sep, s
        $0 = substr($0, RLENGTH + 1)        # "consume" what matched
    }
    printf "\n"
}' "$input"
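
Hypothetical usage, assuming the script is saved as csv2delim.sh, made executable, and run against the sample data with | as the new delimiter; it should print something like:

$ ./csv2delim.sh infile "|"
Australian Capital Territory|AU-ACT|20034|AU|Australia
Piaui|BR-PI|20100|BR|Brazil
Adygeya, Republic|RU-AD|21250|RU|Russian Federation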

Just posting it up in case someone else finds it useful.

chrisbunney