Compare list with data with a list in a python dataframe column

Question

I'm a graduate student working with Python to prepare my data set for my research. I'm not that confident with Python, so I would really appreciate your help.

Continuing on from a previously asked question (how to do re.compile() with a list in python)

I would like to apply this kind of word recognition to one column in a dataframe.

import pandas as pd
from itertools import compress

fruits = ['apple', 'banana', 'cherry']
df = pd.DataFrame({"a":['green apple and red cherry', 'blue', 'apple, banana and cherry', 'banana banana split'],"b":[0,2,0,1]},
                  index = [1,2,3,4])

# Create a list to store the data
grades = []
grades = list(compress(fruits, (f in df.a for f in fruits)))
df['grades'] = pd.Series(grades)

This doesn't work: the resulting data frame has NaN for every value in the 'grades' column.

Additionally, I would like to know whether this is also possible with a list of sentences instead of a list of words, and how it could be done.

Thank you in advance!

df.grades should be ['apple, cherry', 'NAN', 'apple, banana, cherry', 'banana banana'] — L. Scheipers, Feb 15 '18 at 14:46

score 0 · Accepted Answer · answered Feb 15 '18 at 15:19

This is one way:

df['grades'] = df['a'].apply(lambda x: ', '.join(i for i in x.split(' ') if i in fruits))

# 1     apple, cherry
# 2                  
# 3    banana, cherry
# 4    banana, banana
# Name: grades, dtype: object
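
Note that splitting on spaces leaves 'apple,' with its trailing comma, which is why row 3 misses 'apple'. A minimal sketch of one possible alternative, assuming a regular expression with word boundaries is acceptable here (the name `pattern` is only illustrative and not part of the answer above):

```python
import re
import pandas as pd

fruits = ['apple', 'banana', 'cherry']
df = pd.DataFrame(
    {"a": ['green apple and red cherry', 'blue',
           'apple, banana and cherry', 'banana banana split'],
     "b": [0, 2, 0, 1]},
    index=[1, 2, 3, 4])

# One alternation pattern for all words; \b marks word boundaries, so
# punctuation next to a word (as in 'apple,') does not block the match.
pattern = r'\b(?:' + '|'.join(re.escape(f) for f in fruits) + r')\b'

# str.findall returns a list of matches per row; str.join turns it into text.
df['grades'] = df['a'].str.findall(pattern).str.join(', ')
print(df['grades'])
```

Rows without any match end up as an empty string rather than NaN.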

answered Feb 15 '18 at 15:19

jpp


Thanks a lot! Many thanks! Imagine that I would like to find fruits = ['green apple', 'banana', 'cherry'], then the script does not work anymore for the combination green apple. What do I have to change in the script to let it work out? – L. Scheipers Feb 15 '18 at 17:09
This will be more expensive and the logic will be completely different. You might want to ask a separate question. – jpp Feb 15 '18 at 17:11
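
For the multi-word case raised in the comment above, here is a rough sketch of one way it might be approached. It searches each full string with a regular expression instead of splitting it into words. The helper name `find_phrases` is hypothetical and was not suggested by either commenter:

```python
import re
import pandas as pd

def find_phrases(series, phrases):
    """Return, per row, the phrases found in the text, joined by ', '.

    Hypothetical helper: phrases are tried longest first so that a longer
    phrase is preferred over a shorter one starting at the same position.
    """
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = r'\b(?:' + '|'.join(re.escape(p) for p in ordered) + r')\b'
    return series.str.findall(pattern).str.join(', ')

fruits = ['green apple', 'banana', 'cherry']
df = pd.DataFrame(
    {"a": ['green apple and red cherry', 'blue',
           'apple, banana and cherry', 'banana banana split'],
     "b": [0, 2, 0, 1]},
    index=[1, 2, 3, 4])

df['grades'] = find_phrases(df['a'], fruits)
print(df['grades'])
```

Because the search runs over the whole string, the same approach should also cover a list of short sentences, provided they appear verbatim in the column.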

score 0 · Answer 2 · answered Feb 15 '18 at 15:33

In [422]: t = df.a.str.split(expand=True)

In [423]: df['grades'] = t[t.isin(fruits)].T.agg(lambda x: x.dropna().str.cat(sep=', '))

In [424]: df
Out[424]:
                            a  b          grades
1  green apple and red cherry  0   apple, cherry
2                        blue  2
3    apple, banana and cherry  0  banana, cherry
4         banana banana split  1  banana, banana

Explanation:

In [425]: t
Out[425]:
        0       1      2       3       4
1   green   apple    and     red  cherry
2    blue    None   None    None    None
3  apple,  banana    and  cherry    None
4  banana  banana  split    None    None

In [426]: t.isin(fruits)
Out[426]:
       0      1      2      3      4
1  False   True  False  False   True
2  False  False  False  False  False
3  False   True  False   True  False
4   True   True  False  False  False

In [427]: t[t.isin(fruits)]
Out[427]:
        0       1    2       3       4
1     NaN   apple  NaN     NaN  cherry
2     NaN     NaN  NaN     NaN     NaN
3     NaN  banana  NaN  cherry     NaN
4  banana  banana  NaN     NaN     NaN
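
The last step, collapsing each masked row into one string, is not shown in the explanation above. A minimal sketch of that step, written here with a row-wise `apply` rather than the `.T.agg(...)` call from the answer (the variable name `masked` is an assumption for illustration):

```python
import pandas as pd

fruits = ['apple', 'banana', 'cherry']
df = pd.DataFrame(
    {"a": ['green apple and red cherry', 'blue',
           'apple, banana and cherry', 'banana banana split'],
     "b": [0, 2, 0, 1]},
    index=[1, 2, 3, 4])

# One column per whitespace-separated token.
t = df['a'].str.split(expand=True)

# Keep only the tokens that are in the list; everything else becomes missing.
masked = t[t.isin(fruits)]

# For each row, drop the missing cells and join what is left.
df['grades'] = masked.apply(lambda row: ', '.join(row.dropna()), axis=1)
print(df)
```

As in the answer, 'apple,' in row 3 keeps its comma after the split, so it is not matched.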
