1

I have a pandas df with a column of lists of words, like so:

df = pd.DataFrame({'Text':[['cat','dog','mouse'], ['horse','dog','rabbit'], ['cow','pig','hawk']] })

and a second df with a list of words to match, like so:

df2 = pd.DataFrame({'FarmAnimals':['horse','cow','pig'] })

And I'm looking to get a new column with the intersection of df and df2, such as:

df3 = pd.DataFrame({'Text':[['cat','dog','mouse'], ['horse','dog','rabbit'], ['cow','pig','hawk'], 'Intersect':[],['horse'],['cow', 'pig'] })

This is my current code, but at this point I've practically tried everything

def match(df):

    def find_match(v):
        inter = set(v).intersection(list(df2['FarmAnimals']))
        return inter

    df['Intersect'] = df['Text'].apply(find_match)
    df = df[['Intersect']]
    return df

But it keeps transforming v into a list of characters and then returning single character matches. I've probably dropped 20+ hours into this single bug and have goner through Pandas multiple column intersection Finding the intersection between two series in Pandas using index How can I get intersection of two pandas series text column? etc etc etc but none have been remotely helpful. As an added note, df is massive in size (1mil rows) and df2 is also massice (100k rows) and the fxn is being run through ray python (apache arrow wrapper).

Edit: it might be a datatype bug, outputs of data going into the intersection apply below:

print(df.dtypes)
print(type(df['Text']))
print(df2.dtypes)
print(type(df2['FarmAnimals']))
print(type(df.iloc[0,0]))
print(df.iloc[0,0]))
print(type(df2.iloc[0,0]))

dtype: object
<class 'pandas.core.series.Series'>
dtype: object
<class 'pandas.core.series.Series'>
<class 'str'>
['cat', 'rat', 'rabbit']
<class 'str'>

Shouldn't the class of df['Text'] cells be list??

Solved: issue with pd.read_csv parsing a list col into a string. Treated the col right after csv read with:

from ast import literal_eval
text_data.raw_text = text_data.raw_text.apply(literal_eval)
Garglesoap
  • 565
  • 6
  • 18
  • Maybe: [Add ID found in list to new column in pandas dataframe](https://stackoverflow.com/questions/60989914/add-id-found-in-list-to-new-column-in-pandas-dataframe). Perhaps swap the list `bad_ids` for `df2['FarmAnimals']` – MDR Aug 08 '21 at 18:28
  • can you print `df2['FarmAnimals'].iloc[0]`? – mozway Aug 08 '21 at 18:59
  • just added a few of those outputs – Garglesoap Aug 08 '21 at 19:06

3 Answers3

2

You can use pandas.Series.apply and sets:

df['Intersect'] = (df['Text']
                     .apply(lambda x: list(set(x).intersection(set(df2['FarmAnimals'])))
                           )
                  )

output:

                   Text   Intersect
0     [cat, dog, mouse]          []
1  [horse, dog, rabbit]     [horse]
2      [cow, pig, hawk]  [cow, pig]
mozway
  • 194,879
  • 13
  • 39
  • 75
  • I realize, this actually looks very close to what you did… Are you sure you code it not running? – mozway Aug 08 '21 at 18:32
  • 1
    this is producing the same output of single characters. I might have a data type error, updating question... – Garglesoap Aug 08 '21 at 18:53
1

Use the numpy function numpy.intersect1d() to find the intersection for faster execution time (especially when your dataset is large):

import numpy as np

df['Intersect'] = df['Text'].map(lambda x: np.intersect1d(x, df2['FarmAnimals'].values))

Here, for each list of strings under column text, we use numpy.intersect1d() to find the intersection between the list and the column values of column FarmAnimals.

Result:

print(df)

                   Text   Intersect
0     [cat, dog, mouse]          []
1  [horse, dog, rabbit]     [horse]
2      [cow, pig, hawk]  [cow, pig]
SeaBean
  • 22,547
  • 3
  • 13
  • 25
0

Another way with series.str.findall:

import re
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, df2['FarmAnimals'])))
df['Intersect'] = df['Text'].astype(str).str.findall("("+p+")")

print(df)

                   Text   Intersect
0     [cat, dog, mouse]          []
1  [horse, dog, rabbit]     [horse]
2      [cow, pig, hawk]  [cow, pig]
anky
  • 74,114
  • 11
  • 41
  • 70