
I have a CSV file where each row defines a room in a given building. Along with the room, each row has a floor field. What I want to extract is all floors in all buildings.

My file looks like this...

"u_floor","u_room","name"
0,"00BDF","AIRPORT TEST            "
0,0,"BRICKER HALL, JOHN W    "
0,3,"BRICKER HALL, JOHN W    "
0,5,"BRICKER HALL, JOHN W    "
0,6,"BRICKER HALL, JOHN W    "
0,7,"BRICKER HALL, JOHN W    "
0,8,"BRICKER HALL, JOHN W    "
0,9,"BRICKER HALL, JOHN W    "
0,19,"BRICKER HALL, JOHN W    "
0,20,"BRICKER HALL, JOHN W    "
0,21,"BRICKER HALL, JOHN W    "
0,25,"BRICKER HALL, JOHN W    "
0,27,"BRICKER HALL, JOHN W    "
0,29,"BRICKER HALL, JOHN W    "
0,35,"BRICKER HALL, JOHN W    "
0,45,"BRICKER HALL, JOHN W    "
0,59,"BRICKER HALL, JOHN W    "
0,60,"BRICKER HALL, JOHN W    "
0,61,"BRICKER HALL, JOHN W    "
0,63,"BRICKER HALL, JOHN W    "
0,"0006M","BRICKER HALL, JOHN W    "
0,"0008A","BRICKER HALL, JOHN W    "
0,"0008B","BRICKER HALL, JOHN W    "
0,"0008C","BRICKER HALL, JOHN W    "
0,"0008D","BRICKER HALL, JOHN W    "
0,"0008E","BRICKER HALL, JOHN W    "
0,"0008F","BRICKER HALL, JOHN W    "
0,"0008G","BRICKER HALL, JOHN W    "
0,"0008H","BRICKER HALL, JOHN W    "

What I want is all floors in all buildings.

I am using cat, awk, sort, and uniq to obtain this list, but the comma in the building name field, such as "BRICKER HALL, JOHN W", breaks my field splitting and throws off the entire CSV generation.

cat Buildings.csv | awk -F, '{print $1","$2}' | sort | uniq > Floors.csv 

How can I get awk to split on commas but ignore the commas inside a double-quoted field? Alternatively, does someone have a better solution?

Based on the answer suggesting an awk CSV parser, I was able to get this solution:

cat Buildings.csv | awk -f csv.awk | awk -F" -> 2|"  '{print $2}' | awk -F"|" '{print $2","$3}' | sort | uniq > floors.csv 

The first awk runs the csv.awk parser. Its demo code prints the original line followed by " -> #", where # is the number of fields parsed from the CSV (i.e., the columns), and then the parsed fields themselves, delimited by "|" in place of the commas. Splitting on " -> 2|" (the 2 being the field count for my file) with print $2 therefore keeps only the parsed contents, and splitting that result on "|" recovers the individual fields. Then sort, uniq, pipe out to a file, and done!
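
For illustration, here is roughly what that intermediate csv.awk output looks like for the first data row, based on the demo loop shown in the accepted answer below (the original line, then " -> " with the field count, then the parsed fields between "|" delimiters, quotes stripped):

0,"00BDF","AIRPORT TEST            " -> 3|0|00BDF|AIRPORT TEST            |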

Thanks for the help.

  • Does this answer your question? [What's the most robust way to efficiently parse CSV using awk?](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) – kvantour May 26 '20 at 12:07

7 Answers

gawk -vFPAT='[^,]*|"[^"]*"' '{print $1 "," $3}' | sort | uniq

This is an awesome GNU Awk 4 extension, where you define a field pattern instead of a field-separator pattern. It does wonders for CSV (see "Splitting by Content" in the gawk manual).

ETA (thanks mitchus): To remove the surrounding quotes, use gsub("^\"|\"$","",$3); if there are more fields than just $3 to process that way, just loop through them.
Note this simple approach is not tolerant of malformed input, nor of some possible special characters between quotes; covering all of those would go beyond the scope of a neat one-liner.
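
A minimal sketch of that loop, applied to the question's Buildings.csv (the field numbers are assumed from the sample data; adjust them to your columns):

gawk -vFPAT='[^,]*|"[^"]*"' '{
    for (i = 1; i <= NF; i++)
        gsub(/^"|"$/, "", $i)   # strip the surrounding quotes from each field
    print $1 "," $3             # u_floor and the building name
}' Buildings.csv | sort -u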

– hemflit
  • This is a great find! Makes an external CSV lib unnecessary in many cases. – kermatt Sep 01 '13 at 02:04
  • Awesome!! Could it also be modified so that the quotes are stripped if present? I have an output that only has quotes if a comma is present in the field itself. – nwaltham Nov 26 '13 at 12:37
  • Just for other people who are using Macs: OS X doesn't come with gawk; it ships an awk from 2007. So basically you need to install it yourself (`brew install gawk`), and it really does wonders for CSV. – Anoosh Ravan Jan 31 '14 at 18:31
  • @nwaltham: `gsub("^\"|\"$","",$3)` where $3 is the field that might be under quotes. (If there's more than one, loop through them all.) Note this simple approach is not tolerant of newlines between quotes, quoted quotes within quotes, nor of malformed input with unbalanced quotes. Covering all of these goes beyond the scope of a neat one-liner. – hemflit Mar 11 '14 at 17:09
  • @hemflit: Very useful addendum, you could add it to the answer. – mitchus Jun 19 '14 at 11:40
  • This is a nice solution except it does not work if some of the CSV fields are empty, e.g. `A,,"C"`. I replaced the Kleene plus with a star like this `-vFPAT='[^,]*|"[^"]*"'` and it works in my use case though I may not have thought about some eventuality and it may bite back later. – Frigo Nov 10 '19 at 07:34
  • @Frigo, I'm not sure what you mean - you've recreated the same exact regex as in the answer. If your use case is longer-term and needs to work robustly or autonomously with future data that may use richer CSV (e.g. with newlines or quoted quotes inside cells) then Awk might not be the best tool anymore. (Though it's still doable in Awk.) – hemflit Nov 14 '19 at 12:05
  • You are right, I'm sorry. I had multiple pages open at once and commented on the wrong answer. The one I thought I was responding to was this one: https://stackoverflow.com/a/46627337/3737935 Sorry about that. – Frigo Nov 14 '19 at 12:55

The extra output you're getting from csv.awk is from demo code. It's intended that you use the functions within the script to do the parsing and then output it how you want.

At the end of csv.awk is the { ... } loop which demonstrates one of the functions. It's that code that's outputting the -> 2|.

Instead of most of that, just call the parsing function and then do print csv[1], csv[2].

That part of the code would then look like:

{
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
    if (num_fields < 0) {
        printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0;
    } else {
#        printf "%s -> ", $0;
#        printf "%s", num_fields;
#        for (i = 0;i < num_fields;i++) {
#            printf "|%s", csv[i];
#        }
#        printf "|\n";
        print csv[1], csv[2]
    }
}

Save it as your_script (for example).

Do chmod +x your_script.
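
Note that for ./your_script to run directly, the file's first line needs an awk shebang, something like the following (the interpreter path is an assumption; adjust it to wherever awk or gawk lives on your system):

#!/usr/bin/awk -f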

And cat is unnecessary. Also, you can do sort -u instead of sort | uniq.

Your command would then look like:

./your_script Buildings.csv | sort -u > floors.csv
– Dennis Williamson
  • This works great except "print csv[1], csv[2]" should actually be "print csv[0], csv[1]" Thanks! – Chris Nov 17 '10 at 16:50
  • Any idea how to get awk to get rid of the extra whitespace on fields and not use a fixed width? "AIRPORT TEST " I want to be "AIRPORT TEST" – Chris Nov 17 '10 at 16:51
  • @Chris: Is the whitespace a separate question, because if I `print csv[0], csv[1]` I get "0 00BDF" rather than "AIRPORT TEST "? – Dennis Williamson Nov 17 '10 at 16:56
  • Sorry, I did not realize I had modified the input file and removed a column. You are correct based on the original question. Cheers. Also, sed fixed my whitespace issue. – Chris Nov 17 '10 at 17:32
  • @Chris: Just do `sub(/ *$/, "", csv[0])` before you `print csv[0], csv[1]`. – Dennis Williamson Nov 17 '10 at 17:46

My workaround is to strip the commas out of the quoted fields with sed:

decommaize () {
  sed 's/"[^"]*"/"((&))"/g' "$1" \
    | sed 's/\(\"((\"\)\([^",]*\)\(,\)\([^",]*\)\(\"))\"\)/"\2\4"/g' \
    | sed 's/"(("/"/g' \
    | sed 's/"))"/"/g' > "$2"
}

That is: first wrap each quoted field with "((" and "))" markers, then rewrite "(("whatever,whatever"))" as "whateverwhatever" (dropping the comma), then change all remaining instances of "((" and "))" back to plain quotes.
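
Usage might then look like this (file names are placeholders; field numbers assumed from the question's sample data), after which plain comma splitting works because the quoted fields no longer contain commas:

decommaize Buildings.csv Cleaned.csv
awk -F, '{print $1","$3}' Cleaned.csv | sort -u > Floors.csv

One caveat: the middle substitution appears to only match fields containing exactly one comma, so a name with two or more commas would pass through unchanged.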


You could try this awk-based CSV parser:

http://lorance.freeshell.org/csv/

– Marcus Whybrow

Since the problem is really to distinguish between a comma inside a CSV field and the one that separates fields, we can replace the first kind of comma with something else so that it is easier to parse further, i.e., something like this:

0,"00BDF","AIRPORT TEST            "
0,0,"BRICKER HALL<comma> JOHN W    "

This gawk script (replace-comma.awk) does that:

BEGIN { RS = "(.)" }        # make every single character its own record
RT == "\"" { inside++ }     # count double quotes as they go by
{ if (inside % 2 && RT == ",") printf "<comma>"; else printf "%s", RT }

This uses a gawk feature that captures the text actually matched by the record separator in a variable called RT. It splits every character into its own record, and as we read through the records we count the double quotes; while the count is odd we are inside a quoted field, so any comma encountered there is replaced with <comma>.
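
Putting it together with the question's data, the whole extraction might look like this (field numbers assumed from the sample; the final sed puts the real commas back):

gawk -f replace-comma.awk Buildings.csv | awk -F, '{print $1 "," $3}' | sed 's/<comma>/,/g' | sort -u > Floors.csv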

The FPAT solution fails in one special case, where you have both escaped quotes and a comma inside the quotes, but this solution works in all cases, i.e.:

§ echo '"Adams, John ""Big Foot""",1' | gawk -vFPAT='[^,]*|"[^"]*"' '{ print $1 }'
"Adams, John "
§ echo '"Adams, John ""Big Foot""",1' | gawk -f replace-comma.awk | gawk -F, '{ print $1; }'
"Adams<comma> John ""Big Foot""",1

As a one-liner for easy copy-paste:

gawk 'BEGIN { RS = "(.)" } RT == "\"" { inside++ } { if (inside % 2 && RT == ",") printf "<comma>"; else printf "%s", RT }'
– Raghu Dodda

You can use a script I wrote called csvquote to let awk ignore the commas inside quoted fields. It temporarily replaces those commas with nonprinting characters, and the -u option restores them at the end of the pipeline. The command would then become:

csvquote Buildings.csv | awk -F, '{print $1","$2}' | sort | uniq | csvquote -u > Floors.csv

and cut might be a bit easier than awk for this:

csvquote Buildings.csv | cut -d, -f1,2 | sort | uniq | csvquote -u > Floors.csv

You can find the csvquote code here: https://github.com/dbro/csvquote

– D Bro

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){ @f=$csv->fields(); print "$f[0],$f[1]" }' file

The input line is split into the array @f.
Field 1 is $f[0], since Perl starts indexing at 0.

output:

u_floor,u_room
0,00BDF
0,0
0,3
0,5
0,6
0,7
0,8
0,9
0,19
0,20
0,21
0,25
0,27
0,29
0,35
0,45
0,59
0,60
0,61
0,63
0,0006M
0,0008A
0,0008B
0,0008C
0,0008D
0,0008E
0,0008F
0,0008G
0,0008H

I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk

– Chris Koknat