0

I have a dataframe1 with a column that has cell entries like so (each cell is a set of comma separated words) - "are, boy, cat, dog, ear, far, gone".

Dataframe2 has a column with cell entries like so (each cell is a single letter or word) - "are" or "boy" or "gone".

I want to add a column to dataframe1 that will have a boolean entry if a word in each cell in dataframe1 has the word in dataframe2. For example,

DF1 = (are, boy, cat, dog, ear, far, gone), (home, guy, tall, egg), (cat, done, roof, grass), etc.....

DF2 = (are), (boy), (gone), etc....

New column cell value in dataframe1 = (1), (0), (1), etc....

eugenhu
  • 1,168
  • 13
  • 22
ramjfb
  • 1
  • It would be better if in the future you gave a small and *complete* example (the third value in the new column values should be 0 according to the visible words), preferably with the python code to generate the sample inputs and expected result. This will help others understand what you want and make it easier to write up a solution. – eugenhu Oct 13 '21 at 05:34

1 Answers1

0

Assuming these are your inputs,

import pandas as pd

df1 = pd.DataFrame(
    {'A': [
        'are, boy, cat, dog, ear, far, gone',
        'home, guy, tall, egg',
        'cat, done, roof, grass',
    ]}
)

df2 = pd.DataFrame({'A': ['are', 'boy', 'gone']})

print('%s\n%s' % (df1, df2))
#                                     A
# 0  are, boy, cat, dog, ear, far, gone
# 1                home, guy, tall, egg
# 2         cat, done, roof, grass, boy
#       A
# 0   are
# 1   boy
# 2  gone

We can use Series.str.split() to convert the comma-separated strings to lists:

df1['A'].str.split(r'\s*,\s*')
# 0    [are, boy, cat, dog, ear, far, gone]
# 1                  [home, guy, tall, egg]
# 2           [cat, done, roof, grass, boy]
# Name: A, dtype: object

The argument to split r'\s*,\s*' is a RegEx pattern matching commas and any whitespace between. The r prefix means to treat the string literal as a raw string.

Then we apply set.isdisjoint() to check which cells do not contain the values in df2:

df1['A'].str.split(r'\s*,\s*').apply(set(df2['A']).isdisjoint)
# 0    False
# 1     True
# 2     True
# Name: A, dtype: bool

And then finally negate this and assign to a new column 'B':

df1['B'] = ~df1['A'].str.split(r'\s*,\s*').apply(set(df2['A']).isdisjoint)
print(df1)
#                                     A      B
# 0  are, boy, cat, dog, ear, far, gone   True
# 1                home, guy, tall, egg  False
# 2              cat, done, roof, grass  False
eugenhu
  • 1,168
  • 13
  • 22