
I have two CSVs in which the rows can be matched by the value in one column (after some tweaking of this column). After the matching I want to take some values from both of them and make a new, combined row.

I thought of a simple script using csv.DictReader for both of them and then a double for-loop:

for row1 in csv1:
    for row2 in csv2:
        if row1['someID'] == row2['someID']:
            newdict = ... etc

However, one file has 9 million rows and the other 500k rows, so my code would take 4.5 * 10^12 iterations. Hence my question: what is a fast way to match them?

Important:

  • The 'someID' on which they are matched is not unique in either CSV.
  • I want additional rows for every match. So if a 'someID' appears
    twice in csv1 and 3 times in csv2, I expect 6 rows with this 'someID' in the final result.
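(For reference, the nested loop above can be avoided by indexing the smaller file by `someID` first. This is a minimal sketch using only the standard library; the column names and in-memory sample data are hypothetical stand-ins for the two files.)

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data standing in for the two CSV files.
csv1_text = "someID,val1\nA,1\nA,2\nB,3\n"
csv2_text = "someID,val2\nA,x\nA,y\nA,z\nB,w\n"

# Index the smaller file by someID once, instead of rescanning it per row.
index = defaultdict(list)
for row2 in csv.DictReader(io.StringIO(csv2_text)):
    index[row2['someID']].append(row2)

# Single pass over the larger file; emit one combined row per match,
# so duplicate keys naturally produce the cross product of matches.
combined = []
for row1 in csv.DictReader(io.StringIO(csv1_text)):
    for row2 in index.get(row1['someID'], []):
        combined.append({**row1, **row2})

# 'A' appears 2x in csv1 and 3x in csv2 -> 6 combined 'A' rows.
print(len([r for r in combined if r['someID'] == 'A']))  # 6
```

This turns the 4.5 * 10^12-iteration double loop into roughly one pass over each file, at the cost of holding the smaller file's rows in memory.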
CorneeldH
  • using `pandas` and `pd.merge` if your computer is decent enough, if not you can use `sort` followed by `join` in `bash` since that is not limited by your computer's RAM – gold_cy Aug 04 '17 at 11:45
  • if you are using pandas, you can follow this thread - https://stackoverflow.com/questions/14224172/equality-in-pandas-dataframes-column-order-matters – Clock Slave Aug 04 '17 at 11:48

1 Answer


The solution was indeed pandas and pd.merge. Matching 500k rows against 9 million took pandas just a matter of seconds :O Thanks for the comments!
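For anyone landing here later, the merge boils down to something like the following sketch. The column names and tiny in-memory frames are hypothetical; with the real files you would use `pd.read_csv` instead.

```python
import pandas as pd

# Hypothetical stand-in frames; real data would come from pd.read_csv(...).
df1 = pd.DataFrame({'someID': ['A', 'A', 'B'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'someID': ['A', 'A', 'A', 'B'], 'val2': ['x', 'y', 'z', 'w']})

# An inner merge on the non-unique key yields every pairing:
# 2 'A' rows x 3 'A' rows = 6 'A' rows, plus 1 'B' row.
merged = df1.merge(df2, on='someID', how='inner')
print(len(merged))  # 7
```

Because `someID` is not unique on either side, `merge` produces the cross product of matching rows, which is exactly the 2 x 3 = 6 behaviour asked for in the question.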

CorneeldH