
I have two CSVs in which the rows can be matched by the value in one column (after some tweaking of this column). After the matching I want to take some values from both of them and make a new, combined row.

I thought of a simple script using csv.DictReader for both of them and then a double for-loop:

for row1 in csv1:
    for row2 in csv2:
        if row1['someID'] == row2['someID']:
            newdict = ... etc

However, one file has 9 million rows and the other 500k rows, so my code would take 4.5 * 10^12 iterations. Hence my question: what is a fast way to match them?

Important:

  • The 'someID' on which they are matched is not unique in either CSV.
  • I want additional rows for every match. So if a 'someID' appears
    twice in csv1 and 3 times in csv2, I expect 6 rows with this 'someID' in the final result.
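(For reference, the nested loop above can be avoided by indexing the smaller file by `someID` first. This is a minimal sketch using only the standard library; the column names and in-memory sample data are hypothetical stand-ins for the two files.)

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for the two CSV files.
csv1_text = "someID,val1\nA,1\nA,2\nB,3\n"
csv2_text = "someID,val2\nA,x\nA,y\nA,z\nB,w\n"

# Index the smaller file by someID once, instead of rescanning it per row.
index = defaultdict(list)
for row2 in csv.DictReader(io.StringIO(csv2_text)):
    index[row2['someID']].append(row2)

# Single pass over the larger file; emit one combined row per match,
# so duplicate keys naturally produce the cross product of matches.
combined = []
for row1 in csv.DictReader(io.StringIO(csv1_text)):
    for row2 in index.get(row1['someID'], []):
        combined.append({**row1, **row2})

# 'A' appears 2x in csv1 and 3x in csv2 -> 6 combined 'A' rows.
print(len([r for r in combined if r['someID'] == 'A']))  # 6
```

This turns the 4.5 * 10^12-iteration double loop into roughly one pass over each file, at the cost of holding the smaller file's rows in memory.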
CorneeldH
  • using `pandas` and `pd.merge` if your computer is decent enough, if not you can use `sort` followed by `join` in `bash` since that is not limited by your computer's RAM – gold_cy Aug 04 '17 at 11:45
  • if you are using pandas, you can follow this thread - https://stackoverflow.com/questions/14224172/equality-in-pandas-dataframes-column-order-matters – Clock Slave Aug 04 '17 at 11:48

1 Answer


The solution was indeed pandas and pd.merge. Matching 500k rows against 9 million took pandas just a matter of seconds :O Thanks for the comments!
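For anyone landing here later, the merge boils down to something like the following sketch. The column names and tiny in-memory frames are hypothetical; with the real files you would use `pd.read_csv` instead.

```python
import pandas as pd

# Hypothetical stand-in frames; real data would come from pd.read_csv(...).
df1 = pd.DataFrame({'someID': ['A', 'A', 'B'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'someID': ['A', 'A', 'A', 'B'], 'val2': ['x', 'y', 'z', 'w']})

# An inner merge on the non-unique key yields every pairing:
# 2 'A' rows x 3 'A' rows = 6 'A' rows, plus 1 'B' row.
merged = df1.merge(df2, on='someID', how='inner')
print(len(merged))  # 7
```

Because `someID` is not unique on either side, `merge` produces the cross product of matching rows, which is exactly the 2 x 3 = 6 behaviour asked for in the question.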

CorneeldH