How to subset a DataFrame according to terms in text column of the DataFrame

Question

I am trying to create a subset of my data according to certain terms in the text column of my DataFrame.

import pandas as pd

df = pd.DataFrame({'id': [123, 456, 789, 101, 402],
                   'text': [[{'the meeting was amazing'}, {'we should do it more often'}],
                            [{'start': '15', 'tag': 'Meeting'}],
                            [],
                            [{'Let this be the end of it'}],
                            [{'end': '164', 'tag': 'meetingno2'}]
                            ]
                   })

I want to get a subset with rows 1, 2, and 5 where the term 'meeting' appears in some form.

I have tried the following code:

df_sub = df[df['text'].isin(df['text'].str.findall(r'[Mm]eeting+'))]

But the resulting subset I get with this code only contains the rows where the text column is empty. However, when I try doing

df['text_2'] = df['text'].str.findall(r'[Mm]eeting+')

--it produces a new column in the df with the value 'meeting' for rows 1, 2, and 5. Therefore, I think it is picking up the text but not splitting it correctly. How can I get the desired output?

Use `df[df['text'].str.contains(r'[Mm]eeting+')]` – jezrael Apr 19 '22 at 08:42

mozway · Accepted Answer · 2022-04-19T08:51:42.297

1

The "in some form" is ambiguous, but one quick hack could be to convert each value to a string and test whether it contains the term. This will match anything (dictionary keys, values, set values, etc.) as long as it is present in the string representation of the Python object:

df[df['text'].astype(str).str.contains('meeting', case=False)]

output:

    id                                               text
0  123  [{the meeting was amazing}, {we should do it m...
1  456                [{'start': '15', 'tag': 'Meeting'}]
4  402              [{'end': '164', 'tag': 'meetingno2'}]
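
For reference, here is a self-contained sketch of the same approach that rebuilds the sample frame from the question and applies the string-representation filter. The names `mask` and `df_sub` are only illustrative, and the match is deliberately loose: it would also hit a dictionary key that happens to contain "meeting".

```python
import pandas as pd

# Sample data from the question: each cell is a list of sets or dicts.
df = pd.DataFrame({
    'id': [123, 456, 789, 101, 402],
    'text': [
        [{'the meeting was amazing'}, {'we should do it more often'}],
        [{'start': '15', 'tag': 'Meeting'}],
        [],
        [{'Let this be the end of it'}],
        [{'end': '164', 'tag': 'meetingno2'}],
    ],
})

# Turn every cell into its string representation, then do a
# case-insensitive substring test on that string.
mask = df['text'].astype(str).str.contains('meeting', case=False)

# Boolean indexing keeps only the rows where the mask is True.
df_sub = df[mask]
print(df_sub['id'].tolist())
```

Because `case=False` is used, 'Meeting' and 'meetingno2' are matched as well as the lowercase form.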

edited Apr 19 '22 at 08:51

answered Apr 19 '22 at 08:45


hmmm, if `df['text_2'] = df['text'].str.findall(r'[Mm]eeting+')` works for a new column, then OP has strings in the real data – jezrael Apr 19 '22 at 08:48
@jezrael not according to provided sample – mozway Apr 19 '22 at 08:49
yop, in sample data not working. – jezrael Apr 19 '22 at 08:49
the syntax was missing commas, I fixed it – mozway Apr 19 '22 at 08:50
Is using df[df['text'].str.contains(r'[Mm]eeting+', na=False)] safe for the said error? – Abir Apr 19 '22 at 09:46
You can either `fillna('')` before changing to string, or `fillna(False)` before slicing – mozway Apr 19 '22 at 09:53
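
The two options from the last comment could look roughly like this. It is a sketch only: the small frame `df_na` with a missing value is made up for illustration and is not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing cell in the text column.
df_na = pd.DataFrame({
    'id': [1, 2, 3],
    'text': [[{'tag': 'Meeting'}], np.nan, [{'tag': 'other'}]],
})

# Option 1: replace missing values with an empty string before
# converting to str, so they can never match the term.
mask_fill_first = (
    df_na['text'].fillna('').astype(str).str.contains('meeting', case=False)
)

# Option 2: build the mask first, then make sure any missing
# entries in the mask itself count as False before slicing.
mask_fill_last = (
    df_na['text'].astype(str).str.contains('meeting', case=False).fillna(False)
)

print(df_na[mask_fill_first]['id'].tolist())
print(df_na[mask_fill_last]['id'].tolist())
```

Both masks select the same rows here; filling before the conversion avoids relying on how a missing value is rendered as a string.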
