I have read this post and would like to do something similar.
I have 2 dfs:
df1:
file_num | city | address_line |
---|---|---|
1 | Toronto | 123 Fake St |
2 | Montreal | 456 Sample Ave |
df2:
DB_Num | Address |
---|---|
AB1 | Toronto 123 Fake St |
AB3 | 789 Random Drive, Toronto |
I want to know which DB_Num in df2 matches the address_line and city in df1, and include which file_num the match came from.
My ideal output is:
file_num | city | address_line | DB_Num | Address |
---|---|---|---|---|
1 | Toronto | 123 Fake St | AB1 | Toronto 123 Fake St |
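
For anyone who wants to reproduce this, the sample frames above can be built like so (just a minimal sketch of the example data):

```python
import pandas as pd

# Minimal versions of the two frames shown above
df1 = pd.DataFrame({
    'file_num': [1, 2],
    'city': ['Toronto', 'Montreal'],
    'address_line': ['123 Fake St', '456 Sample Ave'],
})

df2 = pd.DataFrame({
    'DB_Num': ['AB1', 'AB3'],
    'Address': ['Toronto 123 Fake St', '789 Random Drive, Toronto'],
})
```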
Based on the above linked post, I have made a lookahead regex and am searching using insert and str.extract:
```python
# Build one lookahead pattern per df1 row, then combine them into a single alternation
df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
# insert modifies df2 in place (it returns None), so no assignment is needed
df2.insert(0, 'search_field', df2['Address'].str.extract("(" + pat + ")", expand=False))
```
Since the addresses in df2 are entered manually, the parts are sometimes out of order, which is why I am using the lookahead form of the regex. The lookahead pattern causes str.extract to not output any value (I assume because the lookaheads are zero-width, so the capture group only grabs an empty string), although I can still filter out nulls and keep only the correct matches.
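
To show what I mean, a quick check with plain re gives a match but an empty capture (a minimal illustration of the behaviour I am seeing):

```python
import re

# The lookaheads only assert, they do not consume any text,
# so the surrounding capture group grabs an empty string
m = re.search(r"((?=.*Toronto)(?=.*123 Fake St))", "Toronto 123 Fake St")
print(m is not None)     # True -> the address satisfies both lookaheads
print(repr(m.group(1)))  # ''   -> but nothing is captured for str.extract to return
```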
My main problem is that I have no way to join back to df1 to get the file_num.
I can do this with a for loop, iterating over each record to search (roughly the sketch shown below), but it takes too long: df1 actually has around 5,000 records and df2 has millions, so the loop takes over 2 hours to run. Is there a way to leverage vectorization for this problem?
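
For reference, the row-by-row version I am trying to avoid looks something like this (a simplified sketch, not my exact code):

```python
import re

# For every df1 record, scan every df2 Address for both the city and the address_line
matches = []
for _, r1 in df1.iterrows():
    pattern = re.compile(
        "(?=.*" + re.escape(r1['city']) + ")(?=.*" + re.escape(r1['address_line']) + ")"
    )
    for _, r2 in df2.iterrows():
        if pattern.search(r2['Address']):
            matches.append((r1['file_num'], r1['city'], r1['address_line'],
                            r2['DB_Num'], r2['Address']))

result = pd.DataFrame(matches,
                      columns=['file_num', 'city', 'address_line', 'DB_Num', 'Address'])
```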
Thanks!