1

I have two dataframes, one with 300 names and one with 2000. I want to check if all of the words in each of the 300 names are contained in the 2000 in any iteration. For example:

Name 1: Mark, Alex, Smith,

Name 2: Mark, Joseph, Smith, Alex, the, first

Dataframe 1

Name 1
'Mark', 'Alex', 'Smith'

Dataframe 2

Name 2
'Mark', 'Joseph', 'Alex', 'Smith', 'the', First'

As you can see, the column in dataframe 2 contains all of the words from column in dataframe 1, but additional words in the name too.

My query should match here, because Name 2 contains all of the words from name 1 even though it is not an exact match. Each of the names is split into individual words in each cell.

Ideally, I would run a function across dataframe 2 which contains 2,000 names and see if any of those names have contain all of the words from dataframe 1.

Edit: Someone kindly pointed out in the comments that what I am trying to say, is can I find if Name 1 is a subset of Name 2.

work_python
  • 101
  • 6
  • Your question says check if any of the 300 words and later you show an example and state all of the words. Are you looking for any or all? Can you add code and sample dataframes too? – Trilokinath Modi Feb 24 '22 at 15:53
  • Sorry, I mean I want to see if all of the words exist in the second data frame which may potentially contain a lot more words in that one name. For example, the second name could contain three middle names too, while the first name only has a first and last name. – work_python Feb 24 '22 at 15:56
  • 1
    Please have a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and [edit] your question to include a [mcve] with sample inputs, expected output, and _code_ for what you've tried so far based on your own research, and a description of what went wrong with your attempts – G. Anderson Feb 24 '22 at 15:58
  • Thanks I have edited the question hopefully with more context. – work_python Feb 24 '22 at 16:36
  • Are you trying to find if Name 1 is a subset of Name 2? if not, then rephrase the question. Your question sounds like you are trying to see if all the 300 words in Name 1 exists in 2000 words in Name 2. Is this so? – Trilokinath Modi Feb 24 '22 at 22:01
  • That’s the perfect way of phrasing it! I would like to find if Name 1 is a subset of Name 2, because Name 2 may contain additional middle or last names. Thanks. – work_python Feb 25 '22 at 09:10

1 Answers1

0

Assuming that each of your dataframes have a column with list of strings:

>>> df1 = pd.DataFrame({
        "Name 1": [['Mark', 'Alex', 'Smith'], ['S1', 'S2', 'S3']],
    })
>>> df2 = pd.DataFrame({
        "Name 2": [['Mark', 'Joseph', 'Alex', 'Smith', 'the', 'First'], ['S3', 'S4', 'S5']],
    })

You could merge both dataframes first:

>>> df = pd.merge(df1, df2, left_index=True, right_index=True)
>>> print(df)
                Name 1                                   Name 2
0  [Mark, Alex, Smith]  [Mark, Joseph, Alex, Smith, the, First]
1         [S1, S2, S3]                             [S3, S4, S5]

And now apply this function to identify if the subset condition is True:

>>> df = df.apply(lambda row: set(row["Name 1"]).issubset(set(row["Name 2"])), axis=1)
>>> df
0     True
1    False
dtype: bool
aaossa
  • 3,763
  • 2
  • 21
  • 34
  • Thanks. When I apply this merge, it reduces dataframe two from ~38,000 to 300 or so. Is there a way to keep the entries from the entire dataframe? I tried 'how = outer', but then when I try run the function you provided it says float objects are not iterable. – work_python Feb 28 '22 at 09:32
  • That's because when using `how=outer` you're keeping every row, even if it doesn't have a match. In those cases, the rows with no match will be completed using `NaN`, a null value. You could replace those `NaN` with empty lists, like [this answer](https://stackoverflow.com/a/42958907/3281097) suggests: `isnull = df.ids.isnull()` and then `df.loc[isnull, 'ids'] = [ [[]] * isnull.sum() ]` – aaossa Feb 28 '22 at 12:54