
I need to filter rows of a dataframe where a search item is found in a column (e.g. identifying all employees, by EmpId, with the searched expertise area):

import pandas as pd
data = [
    [123, ['abc', 'def', 'efg']],
    [124, ['abc', 'qwe', 'mno']],
    [124, ['abc', 'qwe', 'mno']],
    [126, ['wer', 'abc']],
]
expertise_df = pd.DataFrame(data, columns=['EmpId', 'areas'])

I am looking to filter for rows that contain 'abc' or 'qwe', etc., but without explicitly looping over all rows and checking for a match. I tried expertise_df['areas'].str.contains('abc'), but that doesn't return results because the column holds lists rather than strings. This will be matching around 1,000 search strings against 200,000 rows in expertise_df, so I am looking for time-efficient approaches; the brute-force approach takes over 4 hours and is not practical.

Also, if I need to perform union operations (contains either 'abc' or 'pqr', etc.), is there a way the results could be handled as sets to enable this?

raghu
3 Answers


To filter for rows which contain qwe as a list element in the second column of expertise_df, you can do:

expertise_df[
    expertise_df.explode('areas')['areas']
    .str.contains('qwe')
    .groupby(level=0)
    .any()
]

This will give you:

   EmpId            areas
1    124  [abc, qwe, mno]
2    124  [abc, qwe, mno]

If you want to find rows which contain any element of a list, i.e.:

searchfor = ['wer','def']

then, using How to filter rows containing a string pattern from a Pandas dataframe, do:

expertise_df[
    expertise_df.explode('areas')['areas']
    .str.contains('|'.join(searchfor))
    .groupby(level=0)
    .any()
]
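Note that str.contains does substring matching, so searching for 'abc' would also match an area named 'abcd'. If you need exact element matches instead, one option is isin on the exploded column; a sketch of that variant:

```python
import pandas as pd

expertise_df = pd.DataFrame(
    [[123, ['abc', 'def', 'efg']],
     [124, ['abc', 'qwe', 'mno']],
     [124, ['abc', 'qwe', 'mno']],
     [126, ['wer', 'abc']]],
    columns=['EmpId', 'areas'])

searchfor = ['wer', 'def']

# explode preserves the original index, so grouping on it (level=0)
# collapses the per-element matches back to one boolean per row
mask = (
    expertise_df.explode('areas')['areas']
    .isin(searchfor)
    .groupby(level=0)
    .any()
)
result = expertise_df[mask]
```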
zabop

If you want to check for multiple values (['def', 'wer'] for example), you could use apply with the negation of frozenset.isdisjoint.

mask = ~expertise_df['areas'].apply(frozenset(['def', 'wer']).isdisjoint)
res = expertise_df[mask]
print(res)

Output

   EmpId            areas
0    123  [abc, def, efg]
3    126       [wer, abc]
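The union case from the question falls out of this approach for free: all the search terms go into one frozenset, so checking 1,000 strings costs the same single pass over the frame as checking two. A sketch, with a hypothetical search_terms set standing in for the real search list:

```python
import pandas as pd

expertise_df = pd.DataFrame(
    [[123, ['abc', 'def', 'efg']],
     [124, ['abc', 'qwe', 'mno']],
     [124, ['abc', 'qwe', 'mno']],
     [126, ['wer', 'abc']]],
    columns=['EmpId', 'areas'])

# one set holding every search term; a row is kept if its areas share
# at least one element with it, i.e. the two sets are not disjoint
search_terms = frozenset(['def', 'wer', 'pqr'])

mask = ~expertise_df['areas'].apply(search_terms.isdisjoint)
res = expertise_df[mask]
```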
Dani Mesejo

You can use set intersection to filter the rows you want.

import pandas as pd

data = [
    [123, ['abc', 'def', 'efg']],
    [124, ['abc', 'qwe', 'mno']],
    [124, ['abc', 'qwe', 'mno']],
    [126, ['wer', 'abc']],
]
expertise_df = pd.DataFrame(data, columns=['EmpId', 'areas'])

filter_set = {'def','wer'}
filtered_df = expertise_df[expertise_df['areas']
    .apply(lambda el: bool(filter_set.intersection(el)))]

print(filtered_df)

Output:

   EmpId            areas
0    123  [abc, def, efg]
3    126       [wer, abc]
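If the same 200,000-row frame is filtered many times with different search sets, converting the list column to frozensets once and reusing them avoids rebuilding a set per row on every query. A minimal sketch, with a hypothetical filter_by helper:

```python
import pandas as pd

data = [
    [123, ['abc', 'def', 'efg']],
    [124, ['abc', 'qwe', 'mno']],
    [124, ['abc', 'qwe', 'mno']],
    [126, ['wer', 'abc']],
]
expertise_df = pd.DataFrame(data, columns=['EmpId', 'areas'])

# pay the per-row set construction cost once, up front
area_sets = expertise_df['areas'].apply(frozenset)

def filter_by(terms):
    """Rows whose areas intersect `terms` (set membership, not substrings)."""
    wanted = frozenset(terms)
    return expertise_df[area_sets.apply(lambda s: not s.isdisjoint(wanted))]

qwe_rows = filter_by({'qwe'})
union_rows = filter_by({'abc', 'pqr'})  # contains 'abc' OR 'pqr'
```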
LevB