
I'm somewhat new to pandas and the Python Record Linkage Toolkit, so please forgive me if the answer is obvious. I'm trying to cross-reference one large dataset, "CSV_1", against another, "CSV_2", to create a third CSV consisting only of matches, with all columns from CSV_1 and CSV_2 concatenated regardless of overlap so that the original records are preserved, e.g.

CSV_1                               CSV_2
Name     City     Date              Name_of_thing     City_of_Origin     Time
Examp.   Bton     7/11              THE EXAMPLE, LLC  Bton, USA          7/11/2020 00:00
Nomatch  Cton     10/10             huh, inc.         Lton, AMERICA      9/8/2020 00:00

Would output

CSV_3
Name     City     Date    Name_of_thing     City_of_Origin     Time
Examp.   Bton     7/11    THE EXAMPLE, LLC  Bton, USA          7/11/2020 00:00

The data is not well structured, and CSV_2 has many more columns than CSV_1, which is why I have been attempting to find fuzzy matches based on the name column, with the city column as a blocking index. I'm having trouble getting the matching stage to execute at all, never mind efficiently, and I haven't even tackled the concatenation step. Any advice on how to approach this?
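
For concreteness, a minimal sketch of this block-on-city, fuzzy-compare-on-name approach with the Record Linkage Toolkit might look like the following; the column clean-up, the jarowinkler threshold of 0.85, and splitting City_of_Origin on the comma are all assumptions that would need adapting to the real data:

import pandas as pd
import recordlinkage

df1 = pd.read_csv("CSV_1.csv")
df2 = pd.read_csv("CSV_2.csv")

# Normalise the city columns so they can serve as a shared blocking key
df1["city_block"] = df1["City"].str.lower().str.strip()
df2["city_block"] = df2["City_of_Origin"].str.lower().str.split(",").str[0].str.strip()

# Only generate candidate pairs for records that share a city
indexer = recordlinkage.Index()
indexer.block("city_block")
pairs = indexer.index(df1, df2)

# Fuzzy-compare the name columns; the threshold turns the score into a 0/1 match flag
compare = recordlinkage.Compare()
compare.string("Name", "Name_of_thing", method="jarowinkler", threshold=0.85, label="name_match")
features = compare.compute(pairs, df1, df2)
matches = features[features["name_match"] == 1]

# Pull the matched rows from both frames and concatenate all columns side by side
left = df1.loc[matches.index.get_level_values(0)].drop(columns="city_block").reset_index(drop=True)
right = df2.loc[matches.index.get_level_values(1)].drop(columns="city_block").reset_index(drop=True)
pd.concat([left, right], axis=1).to_csv("CSV_3.csv", index=False)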

Edit: The files are each very large (both ~1M lines with 8-20 columns, 80-200 MB); even loading single columns with pandas is troublesome. For context, this is a data project for a job application that indicated a preference for a 'passing familiarity with Python or R'. Under normal circumstances this title requires no coding knowledge whatsoever, which is why I found it so strange that the company decided to assign this complex data problem. The parameters are: a single Python file running locally in a low-memory environment (think 2013 Dell Inspiron) without modification (i.e. no increasing the page file size).
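
Given those memory limits, one way to sketch a chunked variant of the above (keep CSV_2 in memory, stream CSV_1 with pandas' chunksize, and append each chunk's matches to CSV_3 so the full result never has to fit in RAM) might be the following; the file names, the 50,000-row chunksize and the matching details are placeholders:

import pandas as pd
import recordlinkage

def match_chunk(chunk, df2):
    # Same block-on-city / fuzzy-compare-on-name idea as above, applied to one chunk of CSV_1
    indexer = recordlinkage.Index()
    indexer.block("city_block")
    pairs = indexer.index(chunk, df2)
    compare = recordlinkage.Compare()
    compare.string("Name", "Name_of_thing", method="jarowinkler", threshold=0.85, label="name_match")
    features = compare.compute(pairs, chunk, df2)
    hits = features[features["name_match"] == 1]
    left = chunk.loc[hits.index.get_level_values(0)].reset_index(drop=True)
    right = df2.loc[hits.index.get_level_values(1)].reset_index(drop=True)
    return pd.concat([left, right], axis=1)

# CSV_2 is loaded once in full; CSV_1 is read 50,000 rows at a time
df2 = pd.read_csv("CSV_2.csv")
df2["city_block"] = df2["City_of_Origin"].str.lower().str.split(",").str[0].str.strip()

first = True
for chunk in pd.read_csv("CSV_1.csv", chunksize=50_000):
    chunk["city_block"] = chunk["City"].str.lower().str.strip()
    out = match_chunk(chunk, df2)
    # Write the header only for the first chunk, then append subsequent matches
    out.to_csv("CSV_3.csv", mode="w" if first else "a", header=first, index=False)
    first = False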

Mortrin
  • Have you looked at [this](https://pbpython.com/record-linking.html) from Practical Business Python? For matching you only need to load the target CSV_2 dataset in full; the first one you can process in chunks, writing out the resulting data-frame to file. – gherka Aug 13 '20 at 19:55
  • That's a great source and what I was using as my template, I just don't know how to cross-reference chunks so that all entries in CSV_1 are checked against all entries in CSV_2, and the CSV_2 matches are concatenated at the right CSV_1 line in the final result. – Mortrin Aug 13 '20 at 20:03
  • Just saw your edit, I'll give it a shot, thanks! – Mortrin Aug 13 '20 at 21:21

1 Answer


Given your problem statement and the size of the data involved, I recommend loading your data into a database. I would then use the following SQL to solve the problem and read the result back into a local Python environment / pandas dataframe:

select *
from csv_1
inner join csv_2
  on csv_1.city = csv_2.city_of_origin
where STRPOS(lower(csv_1.name), lower(csv_2.name_of_thing)) > 0
   or STRPOS(lower(csv_2.name_of_thing), lower(csv_1.name)) > 0
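
If it helps, a minimal sketch of that round trip using Python's built-in sqlite3 module and pandas could look like the code below; note that SQLite provides instr() rather than STRPOS, and the file, table and chunk-size values here are assumptions:

import sqlite3
import pandas as pd

con = sqlite3.connect("linkage.db")

# Load both CSVs into SQLite in chunks so neither has to fit in memory at once
for path, table in [("CSV_1.csv", "csv_1"), ("CSV_2.csv", "csv_2")]:
    for chunk in pd.read_csv(path, chunksize=50_000):
        chunk.to_sql(table, con, if_exists="append", index=False)

query = """
select *
from csv_1
inner join csv_2
  on csv_1.city = csv_2.city_of_origin
where instr(lower(csv_1.name), lower(csv_2.name_of_thing)) > 0
   or instr(lower(csv_2.name_of_thing), lower(csv_1.name)) > 0
"""

# The join result (matches only) is read back into a dataframe and written out
pd.read_sql_query(query, con).to_csv("CSV_3.csv", index=False)
con.close()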
barker
  • Unfortunately the use of databases is off the table; the solution needs to be entirely portable. (See the constraints added in the edit to the question.) – Mortrin Aug 13 '20 at 19:34