0

I have two data frames I want to match partial strings by using str.contains function then merge them.

Here is an example:

data1

      email     is_mane         name           id
    hi@amal.com     1   there is rain          10
    hi2@amal.com    1   here is the food        9
    hi3@amal.com    1   let's go together       8
    hi4@amal.com    1   today is my birthday    6


data2

    id  name
    1   the rain is beautiful
    1   the food
    2   together
    4   my birthday
    3   your birthday

And here is the code I wrote:

data.loc[data.name.str.contains('|'.join(data2.name)),:]

and the output:

        email   is_mane     name               id
    hi2@amal.com    1   here is the food        9
    hi3@amal.com    1   let's go together       8
    hi4@amal.com    1   today is my birthday    6

As you can see it did not return "there is rain" even that rain word is contained in dara2: could it be because of space?

Also I want to merge data1 with data2 so that will help me to know what email has match.

I would like to have the following output:


        email   is_mane     name               id      id2       name2
    hi2@amal.com    1   here is the food        9       1       the food
    hi3@amal.com    1   let's go together       8       2       together
    hi4@amal.com    1   today is my birthday    6       4       my birthday
    hi4@amal.com    1   today is my birthday    6       3       your birthday

Is there is any way to do it?

Jean-Francois T.
  • 11,549
  • 7
  • 68
  • 107
Fatima
  • 497
  • 5
  • 21
  • 1
    Have a look at [fuzzy merging](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas) – Erfan Jan 26 '20 at 12:14
  • @ Erfan yes and it didn't work well so i want to use contains – Fatima Jan 26 '20 at 12:30

1 Answers1

2

If you're good with matching only full words you can do (so e.g. dog and dogs won't match)

data1["key"]=data1["name"].str.split(r"[^\w+]")
data2["key"]=data2["name"].str.split(r"[^\w+]")

data3=data1.explode("key").merge(data2.explode("key"), on="key", suffixes=["", "2"]).drop("key", axis=1).drop_duplicates()

Otherwise it's a matter of cross join, and applying str.contains(...) to filter out the ones, which aren't matching.

Grzegorz Skibinski
  • 12,624
  • 2
  • 11
  • 34