1

I'm trying to compare two very large csv files and compare them by a free text field in both. I've tried using awk, e.g.

awk -F, 'FNR==NR{a[$2]++;next} a[$2]' table1.csv table2.csv > duplicates.csv

but the output csv file isn't properly formatted, probably because the free text field contains commas. I'd like to save the entire duplicate row in table1 in the duplicate table. Thanks for your help.

Jotne
  • 40,548
  • 12
  • 51
  • 55
  • Parsing a CSV is hard with AWK. How are fields with commas quoted or escaped? – Gabe Nov 28 '13 at 07:15
  • They're simply quoted, and no spaces between commas and quotes between fields. I can read data in python/pandas just fine, but I was wondering if there was a nice one-liner out there. –  Nov 28 '13 at 07:19
  • Can you provide some sample input and output? It would be helpful. – Birei Nov 28 '13 at 11:28
  • You may have a look at this question: http://stackoverflow.com/questions/20220854/source-file-delimiter-issue/20227644#20227644 – Håkon Hægland Nov 28 '13 at 12:29
  • 2
    If you are using Gnu Awk version 4, you could use the `FPAT` variable.. See: http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content – Håkon Hægland Nov 28 '13 at 12:48
  • 1
    For earlier versions of awk, this page looks promising: http://backreference.org/2010/04/17/csv-parsing-with-awk/ – Håkon Hægland Nov 28 '13 at 13:08
  • Perl's `Text::CSV` or `Text::CSV_XS` would be more appropriate. –  Apr 16 '14 at 06:02

1 Answers1

0

Did I understood right you need something like:

$ more filea fileb
::::::::::::::
filea
::::::::::::::
1,a
2,b
3,c
::::::::::::::
fileb
::::::::::::::
1,d
x,e
3,f
$ awk 'BEGIN {
    while (getline a < "filea" && getline b < "fileb") {
        split(a, aa, ",")
        split(b, bb, ",")
        if (aa[1] == bb[1]) {
            print a, b
        }
    }
}'
1,a 1,d
3,c 3,f
kerolasa
  • 401
  • 2
  • 5