I have read this post and would like to do something similar.
I have 2 dfs:
df1:
file_num | city | address_line |
---|---|---|
1 | Toronto | 123 Fake St |
2 | Montreal | 456 Sample Ave |
df2:
DB_Num | Address |
---|---|
AB1 | Toronto 123 Fake St |
AB3 | 789 Random Drive, Toronto |
I want to know which DB_Num in df2 matches the address_line and city in df1, and include which file_num the match came from.
My ideal output is:
file_num | city | address_line | DB_Num | Address |
---|---|---|---|---|
1 | Toronto | 123 Fake St | AB1 | Toronto 123 Fake St |
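
For anyone who wants to reproduce this, the sample frames above can be built like so (just a minimal sketch of the example data):

```python
import pandas as pd

# Minimal versions of the two frames shown above
df1 = pd.DataFrame({
    'file_num': [1, 2],
    'city': ['Toronto', 'Montreal'],
    'address_line': ['123 Fake St', '456 Sample Ave'],
})

df2 = pd.DataFrame({
    'DB_Num': ['AB1', 'AB3'],
    'Address': ['Toronto 123 Fake St', '789 Random Drive, Toronto'],
})
```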
Based on the above linked post, I have made a lookahead regex and am searching using insert and str.extract:
```python
# Build one lookahead pattern per df1 row, then combine them into a single alternation
df1['search_field'] = "(?=.*" + df1['city'] + ")(?=.*" + df1['address_line'] + ")"
pat = "|".join(df1['search_field'])
# insert modifies df2 in place (it returns None), so no assignment is needed
df2.insert(0, 'search_field', df2['Address'].str.extract("(" + pat + ")", expand=False))
```
Since the addresses in df2 are entered manually, the parts are sometimes out of order, which is why I am using the lookahead form of the regex. The lookahead pattern causes str.extract to not output any value (I assume because the lookaheads are zero-width, so the capture group only grabs an empty string), although I can still filter out nulls and keep only the correct matches.
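
To show what I mean, a quick check with plain re gives a match but an empty capture (a minimal illustration of the behaviour I am seeing):

```python
import re

# The lookaheads only assert, they do not consume any text,
# so the surrounding capture group grabs an empty string
m = re.search(r"((?=.*Toronto)(?=.*123 Fake St))", "Toronto 123 Fake St")
print(m is not None)     # True -> the address satisfies both lookaheads
print(repr(m.group(1)))  # ''   -> but nothing is captured for str.extract to return
```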
My main problem is that I have no way to join back to df1 to get the file_num.
I can do this with a for loop, iterating over each record to search (roughly the sketch shown below), but it takes too long: df1 actually has around 5,000 records and df2 has millions, so the loop takes over 2 hours to run. Is there a way to leverage vectorization for this problem?
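
For reference, the row-by-row version I am trying to avoid looks something like this (a simplified sketch, not my exact code):

```python
import re

# For every df1 record, scan every df2 Address for both the city and the address_line
matches = []
for _, r1 in df1.iterrows():
    pattern = re.compile(
        "(?=.*" + re.escape(r1['city']) + ")(?=.*" + re.escape(r1['address_line']) + ")"
    )
    for _, r2 in df2.iterrows():
        if pattern.search(r2['Address']):
            matches.append((r1['file_num'], r1['city'], r1['address_line'],
                            r2['DB_Num'], r2['Address']))

result = pd.DataFrame(matches,
                      columns=['file_num', 'city', 'address_line', 'DB_Num', 'Address'])
```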
Thanks!