1

I would like to compare two columns between two CSVs, find partial string matches, then append all data from the matching line (including equivalent within columns) at the correct index in the first CSV, then output the compiled matches in a third CSV, without editing either of the first two CSVs. All lines in the first CSV must appear on the third, whether or not matches were found, but only matching lines from the second CSV. For example:

CSV_1                               CSV_2
Name     City     Date              Name_of_thing     City_of_Origin     Time
Examp.   Bton     7/11              huh, inc.         Lton, AMERICA  7/10/2020 00:00
Nomatch  Cton     10/10             THE EXAMPLE, LLC  Bton, USA        7/11/2020 00:00

Would appear as

CSV_3
Name     City     Date    Name_of_thing     City_of_Origin     Time
Examp.   Bton     7/11    THE EXAMPLE, LLC  Bton, USA          7/11/2020 00:00
Nomatch  Cton     10/10

I posted earlier here asking about a fuzzy matching approach (Record linking two large CSVs in Python?), but the solution ended up being very time-inefficient, so I'm hoping a simple regex search will be sufficient. These files are both very large and not the same length, so I'm wondering if there's a solution which allows all entries in CSV_1 to be compared against those in CSV_2 in an efficient way.

Unfortunately, the constraints of the project do not allow me to load the CSVs into a database, which would make this much easier, so I need to rely entirely upon Pandas for this.

Thanks!

Mortrin
  • 23
  • 3
  • Does this answer your question? [is it possible to do fuzzy match merge with python pandas?](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas) – AtanuCSE Aug 14 '20 at 23:07
  • Sorry, that's what I was trying to do before, I was looking for a simple partial string match – Mortrin Aug 14 '20 at 23:40

1 Answers1

0

Just a little leg work on making dates and city consistent. Then it's a simple left merge()

import pandas as pd
import io
import datetime as dt

CSV_1 = pd.read_csv(io.StringIO(                              
"""Name     City     Date              
Examp.   Bton     7/11              
Nomatch  Cton     10/10"""), sep="\s\s+", engine="python")             
    
CSV_2 = pd.read_csv(io.StringIO(
"""Name_of_thing     City_of_Origin     Time
huh, inc.         Lton, AMERICA  7/10/2020 00:00
THE EXAMPLE, LLC  Bton, USA        7/11/2020 00:00"""), sep="\s\s+", engine="python") 

# need to make dates consistent and joinable
# need to pull city out of City_of_origin
CSV_3 = CSV_1.assign(
    datekey=pd.to_datetime(CSV_1["Date"]+f"/{dt.date.today().year}")
).merge(
    CSV_2.assign(
        datekey=pd.to_datetime(CSV_2["Time"]),
        City=lambda dfa: dfa["City_of_Origin"].str.extract("([A-Za-z]*)")
    ),
    on=["datekey","City"],
    how="left"
).drop(columns="datekey")

print(CSV_3.to_string())

output

      Name  City   Date     Name_of_thing City_of_Origin             Time
0   Examp.  Bton   7/11  THE EXAMPLE, LLC      Bton, USA  7/11/2020 00:00
1  Nomatch  Cton  10/10               NaN            NaN              NaN
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
  • I found a way to make the fuzzy match work in a reasonable amount of time, this is a good alternative for this specific example, though. – Mortrin Aug 16 '20 at 02:07