
I have two CSV files (A with 300,000 lines, B with 1,000,000 lines). The records are related by an ID column (a 1-to-1 relation), but the files do not have the same column order and are not sorted by ID. I need to get the 700,000 lines of B whose ID is absent from A.

A.csv

A.ID,A.Field01,A.Field02   
2,a,d                      
4,b,e                      
1,c,f          

B.csv

B.Field01,B.ID,B.Field02 
g,2,f
f,4,r
h,6,k
a,1,3

(I want to select h,6,k)

I guess I could solve it with a DBMS like SQLite, but I'm sure it can be done with something simpler and more efficient, like awk.

I wanted to adapt this command I found:

awk 'ARGIND==1 {x[$0]++; next} !x[$0]' B.csv A.csv

However, this compares entire lines, and I don't know how to restrict the comparison specifically to the A.ID and B.ID fields.

I'm also interested in any other efficient approach to solve this!

Cheers

Megamini

2 Answers

$ awk -F","  'FNR==NR{a[$1];next}!($2 in a)' a b
B.Field01,B.ID,B.Field02 
h,6,k
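
For readers new to the idiom, here is the same one-liner spelled out with comments; this is just an annotated restatement of the command above, using the file names from the question. Note that B.csv's header line is printed too, because its second field, "B.ID", never occurs in the first column of A.csv:

awk -F',' '
    FNR==NR  { a[$1]; next }   # 1st file (A.csv): remember every A.ID as an array key
    !($2 in a)                 # 2nd file (B.csv): print lines whose B.ID was never seen
' A.csv B.csv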
James Brown
  • Thank you! That's exactly what I was looking for. I also appreciated the other contributions and comments. All the best. – Megamini Jan 23 '17 at 17:57

Update:

I would have expected this to perform better than James Brown's awk solution, but my tests couldn't prove that.

I've tested with two files, both containing 700,000 lines; the awk solution was significantly faster.

The awk solution needs roughly as much memory as the ID column of A.csv (the file it reads first and stores). That's why, for a really (really!) big A.csv, the join solution might be the only feasible one. If A.csv and B.csv were already sorted, this would of course perform much better, since the sort commands take most of the time.
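
For illustration, if header-less copies of both files already existed sorted this way (the file names here are hypothetical), the join step alone would boil down to:

join -v2 -t, -11 -22 A.sorted.csv B.sorted.csv   # A.sorted.csv sorted on field 1, B.sorted.csv on field 2

The full command, including the sorting, follows below.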


Even though this requires sorting both input files, I guess the join command will perform pretty well in this case:

join -v2 -t, -11 -22 <(sed 1d A.csv|sort -t, -k1,1) <(sed 1d B.csv|sort -t, -k2,2)

Explanation:

join
----

-v2    only print lines whose join field is unique to file 2 (B.csv)
-t,    input/output field separator
-11    use field 1 of file 1 (A.csv) as the join field
-22    use field 2 of file 2 (B.csv) as the join field

sed
---

1d     delete line 1, i.e. remove the header

sort
----

-t,    field delimiter
-k1,1  sort on field 1 only (B.csv is sorted on field 2 with -k2,2 accordingly)
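
One caveat: by default join prints the join field first, so the unmatched line should come out as 6,h,k rather than h,6,k. If B's original column order matters, the -o option can restore it; in the output format, 2.1 means "field 1 of file 2":

join -v2 -t, -11 -22 -o 2.1,2.2,2.3 <(sed 1d A.csv|sort -t, -k1,1) <(sed 1d B.csv|sort -t, -k2,2)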
hek2mgl
  • The join solution should be more efficient than awk. The OP said 300k lines so it's a large file. +1 – NinjaGaiden Jan 23 '17 at 13:17
  • I often prefer sed to head/tail with "fancy" options. Here: `sed 1d file` – glenn jackman Jan 23 '17 at 14:25
  • @glennjackman For a portable shell script `sed` is the better choice, I agree. – hek2mgl Jan 23 '17 at 14:42
  • @NinjaGaiden I would have expected that as well, but I've tested with two files, both containing 700,000 lines. The awk solution above was ~50% faster. Sure, the `awk` solution needs as much memory as the files' sizes. For really big files the join solution might be better. – hek2mgl Jan 23 '17 at 14:54
  • @hek2mgl: ++; I suggest you add your last comment directly to your answer (but note that only the 1st column needs to be stored in memory as a whole in the `awk` solution). Incidentally, `tail -n +` _is_ POSIX-compliant - see http://pubs.opengroup.org/onlinepubs/9699919799/utilities/tail.html – mklement0 Jan 23 '17 at 14:59
  • The awk solution will be linear time. The sort|join solution needs at best `O(n log n)` – glenn jackman Jan 23 '17 at 15:08
  • @glennjackman Yes, but that's just the efficiency of the algorithms. When the `awk` solution spends more and more time in `malloc()`, or even requires swapping, I assume(d) that join would be the only feasible way, but I'm not sure anymore whether this advantage ever makes a difference on an average (modern) computer. – hek2mgl Jan 23 '17 at 15:13
  • Yes, the time/space tradeoff. Once you get huge datasets, using a proper database would be the way to go. – glenn jackman Jan 23 '17 at 15:14
  • My observation showed that the join solution makes much use of parallelism: the `user` time is considerably higher than the `total` time. I'm using an 8-core machine at work. – hek2mgl Jan 23 '17 at 16:06
  • @EdMorton: Good point re IDs. However, from what I can tell, neither `sort` nor process substitutions (necessarily) keep the entire input / output in memory: [`sort` uses temporary files on demand](http://stackoverflow.com/a/930051/45375) and [process substitutions use pipes](http://unix.stackexchange.com/a/92519/54804). – mklement0 Jan 23 '17 at 19:00
  • @mklement0 OK, deleted the comment, thanks for the correction. – Ed Morton Jan 23 '17 at 19:33