1

i want to merge two dataframes by partial string match. I have two data frames to combine. First df1 consists of 130.000 rows like this:

id    text                        xc1       xc2
1     adidas men shoes            52465     220
2     vakko men suits             49220     224
3     burberry men shirt          78248     289
4     prada women shoes           45780     789
5     lcwaikiki men sunglasses    34788     745

and second df2 consists of 8000 rows like this:

id    keyword               abc1     abc2
1     men shoes             1000     11
2     men suits             2000     12
3     men shirt             3000     13
4     women socks           4000     14
5     men sunglasses        5000     15

After matching between keyword and text, outputshould look like this:

id    text                        xc1       xc2   keyword         abc1  abc2
1     adidas men shoes            52465     220   men shoes       1000  11
2     vakko men suits             49220     224   men suits       2000  12
3     burberry men shirt          78248     289   men shirt       3000  13
4     lcwaikiki men sunglasses    34788     745   men sunglasses  5000  15
Bushmaster
  • 4,196
  • 3
  • 8
  • 28
  • Is it possible the partial string match is at the middle of text in `df1` ? e.g. `vakko men suits blue` ? – SeaBean May 23 '21 at 08:02
  • @SeaBean Yes. There are such examples when I examine the dataset. – Bushmaster May 23 '21 at 08:12
  • Are the id are in same order? – Anurag Dabas May 23 '21 at 08:14
  • @AnuragDabas unfortunately no. the row numbers of both data frames are different – Bushmaster May 23 '21 at 08:18
  • https://stackoverflow.com/a/60908516/14066512 – Rajesh May 23 '21 at 08:50
  • One more complication would be `men suits` in `df2` should not match `abc women suits` in `df1`, right ? – SeaBean May 23 '21 at 08:51
  • @SeaBean yes I wouldn't want it to match like that. – Bushmaster May 24 '21 at 06:35
  • In such case, you cannot use most of the similar answers in StackOverflows that use simple substring test like `stringA in stringB`. Such kind of test will fail with false match of `men suits` in `keyword` with `women suits` in `text` since it returns `True` for test of `'men suits' in 'women suits'` We have to make use of regex with the consideration of word boundary to avoid the false match. – SeaBean May 24 '21 at 09:55

2 Answers2

2

Let's start by ordering the keywords longest-first, so that "women suits" matches "before "men suits"

lkeys = df2.keyword.reindex(df2.keyword.str.len().sort_values(ascending=False).index)

Now define a matching function; each text value from df1 will be passed as s to find a matching keyword:

def is_match(arr, s):
    for a in arr:
        if a in s:
            return a
    return None

Now we can extract the keyword from each text in df1, and add it to a new column:

df1['keyword'] = df1['text'].apply(lambda x: is_match(lkeys, x))

We now have everything we need for a standard merge:

pd.merge(df1, df2, on='keyword')
joao
  • 2,220
  • 2
  • 11
  • 15
  • thank you for the answer. I guess some values ​​contain int values. for example for example: "16 inch Bicycle Rim". therefore I get an error like this: `TypeError: argument of type 'int' is not iterable` – Bushmaster May 24 '21 at 06:29
2

Let's approach by cross join the 2 dataframes and then filter by matching string with substring, as follows:

df3 = df1.merge(df2, how='cross')    # for Pandas version >= 1.2.0 (released in Dec 2020)

import re
mask = df3.apply(lambda x: (re.search(rf"\b{x['keyword']}\b", str(x['text']))) != None, axis=1)
df_out = df3.loc[mask]

If your Pandas version is older than 1.2.0 (released in Dec 2020) and does not support merge with how='cross', you can replace the merge statement with:

# For Pandas version < 1.2.0
df3 = df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)   

After the cross join, we created a boolean mask to filter for the cases that keyword is found within text by using re.search within .apply().

We have to use re.search instead of simple Python substring test like stringA in stringB found in most of the similar answers in StackOverflow. Such kind of test will fail with false match of 'men suits' in keyword with 'women suits' in text since it returns True for test of 'men suits' in 'women suits'.

We use regex with a pair of word boundary \b meta-characters around the keyword (regex pattern: rf"\b{x['keyword']}\b") to ensure matching only for whole word match for text in df1, i.e. men suits in df2 would not match with women suits in df1 since the word women does not have a word boundary between the letters wo and men.

Result:

print(df_out)


    id_x                      text    xc1  xc2  id_y         keyword  abc1  abc2
0      1          adidas men shoes  52465  220     1       men shoes  1000    11
6      2           vakko men suits  49220  224     2       men suits  2000    12
12     3        burberry men shirt  78248  289     3       men shirt  3000    13
24     5  lcwaikiki men sunglasses  34788  745     5  men sunglasses  5000    15

Here, columns id_x and id_y are the original id column in df1 and df2 respectively. As seen from the comment, these are just row numbers of the dataframes that you may not care about. We can then remove these 2 columns and reset index to clean up the layout:

df_out = df_out.drop(['id_x', 'id_y'], axis=1).reset_index(drop=True)

Final outcome

print(df_out)


                       text    xc1  xc2         keyword  abc1  abc2
0          adidas men shoes  52465  220       men shoes  1000    11
1           vakko men suits  49220  224       men suits  2000    12
2        burberry men shirt  78248  289       men shirt  3000    13
3  lcwaikiki men sunglasses  34788  745  men sunglasses  5000    15
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • thank you for the answer. I get an error like this: `TypeError: expected string or bytes-like object` could some of the values ​​be because it is integer? – Bushmaster May 24 '21 at 06:34
  • 1
    @muratmert41 In such case, cast `x['text']` to string first, using `str(x['text'])`. You can refer to my edited codes above. – SeaBean May 24 '21 at 08:29
  • 1
    @muratmert41 Feel free to let me know if any further thing I can help! :-) – SeaBean May 25 '21 at 06:02
  • your answers really help me a lot. I'm not getting an error anymore. but there are some minor problems that prevent me from getting the right result. I'm working on them. – Bushmaster May 25 '21 at 15:18
  • @muratmert41 See if I can help you out if it's related to your question here. If not, hope you solve it soon! Anyway, please remember to [accept and upvote the solution](https://stackoverflow.com/help/someone-answers) ! Good luck! – SeaBean May 25 '21 at 15:24
  • I'm sorry I forgot that. thank you again. – Bushmaster May 25 '21 at 15:25
  • @muratmert41 Welcome! Good luck to solving your minor problems soon! – SeaBean May 25 '21 at 15:27