-1

I want to merge 2 dataframes on string columns with values containing wildcards as we can do with like in SQL.

Example :

import pandas as pd

df1 = pd.DataFrame({'A': ["He eat an apple in his office.", "There are many apples on the tree."], 'B': [1, 2]})
df2 = pd.DataFrame({'A': ["apple*tree", "apple*pie"], 'C': [4, 9]})

df1
                                    A  B
0      He eat an apple in his office.  1
1  There are many apples on the tree.  2

df2
            A  C
0  apple*tree  4
1   apple*pie  9


pd.merge(df1, df2, on = ['A']) 

# What it gives me :

Empty DataFrame
Columns: [A, B, C]
Index: []


# What I want:
                                    A  B  C
0  There are many apples on the tree.  2  4

I want to join the two dataframes and "apple*tree" of df2 has to match the sentence "There are many apples on the tree." of df1.

Can you help me to do this please?

I have found the function fnmatch.fnmatch(string, pattern) but can I use it in this case with a merge?

Ahmed K
  • 21
  • 2

1 Answers1

0

This can be done by using apply to search for df2's patterns in each row of df1. This will require runtime proportional to O(n*m), where n is the number of rows in df1, and m is the number of rows in df2. This is not very efficient, but that's fine for small dataframes.

Once we identify the matches between df1 and df2, we can merge the two dataframes. After that, we just need to clean up the dataframe and drop unneeded columns.

Code:

import pandas as pd
import fnmatch

df1 = pd.DataFrame({'A': ["He eat an apple in his office.", "There are many apples on the tree."], 'B': [1, 2]})
df2 = pd.DataFrame({'A': ["apple*tree", "apple*pie"], 'C': [4, 9]})

def wildcard_search(pattern):
    # Comment this line to require exact match
    pattern = "*" + pattern + "*"
    # Apply pattern to every A values within df1
    matching = df1['A'].apply(lambda x: fnmatch.fnmatch(x, pattern))
    # Get index of largest member
    first_match = matching.idxmax()
    # If we have all zeros, then first_match will refer to the first
    # zero. Check for this.
    if matching.loc[first_match] == 0:
        return None
    # print(first_match)
    return df1.loc[first_match, 'A']

# Using df2 patterns, search through df1. Record values found.
df2['merge_key'] = df2['A'].apply(wildcard_search)

# Merge two dataframes, on cols merge_key and A
res = df2.merge(
    df1,
    left_on='merge_key',
    right_on='A',
    suffixes=("_x", "")  # Don't add a suffix to df1's columns
)
# Reorder cols, drop unneeded
res = res[['A', 'B', 'C']]
print(res)

This answer is adapted from this post.

Nick ODell
  • 15,465
  • 3
  • 32
  • 66