
I need to filter rows of a dataframe where a search item is found in a column (e.g. identifying all employees, by EmpId, with the searched expertise area):

import pandas as pd
data = [
    [123, ['abc', 'def', 'efg']],
    [124, ['abc', 'qwe', 'mno']],
    [124, ['abc', 'qwe', 'mno']],
    [126, ['wer', 'abc']],
]
expertise_df = pd.DataFrame(data, columns=['EmpId', 'areas'])

I am looking to filter for rows that contain 'abc' or 'qwe', etc., but without explicitly looping over all rows and checking for a match. I tried expertise_df['areas'].str.contains('abc'), but that doesn't return results because the column holds lists rather than strings. This will be matching around 1,000 search strings against 200,000 rows in expertise_df, so I am looking for time-efficient approaches; the brute-force approach takes over 4 hours and is not practical.

Also, if I need to perform union operations (contains either 'abc' or 'pqr', etc.), is there a way the results could be handled as sets to enable this?

raghu
3 Answers


To filter for rows which contain qwe as a list element in the second column of expertise_df, you can do:

expertise_df[
    expertise_df.explode('areas')['areas']
    .str.contains('qwe')
    .groupby(level=0)
    .any()
]

This will give you:

   EmpId            areas
1    124  [abc, qwe, mno]
2    124  [abc, qwe, mno]

If you want to find rows which contain any element of a list, i.e.:

searchfor = ['wer','def']

then, using How to filter rows containing a string pattern from a Pandas dataframe, do:

expertise_df[
    expertise_df.explode('areas')['areas']
    .str.contains('|'.join(searchfor))
    .groupby(level=0)
    .any()
]
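Note that str.contains does substring matching, so searching for 'abc' would also match an area named 'abcd'. If you need exact element matches instead, one option is isin on the exploded column; a sketch of that variant:

```python
import pandas as pd

expertise_df = pd.DataFrame(
    [[123, ['abc', 'def', 'efg']],
     [124, ['abc', 'qwe', 'mno']],
     [124, ['abc', 'qwe', 'mno']],
     [126, ['wer', 'abc']]],
    columns=['EmpId', 'areas'])

searchfor = ['wer', 'def']

# explode preserves the original index, so grouping on it (level=0)
# collapses the per-element matches back to one boolean per row
mask = (
    expertise_df.explode('areas')['areas']
    .isin(searchfor)
    .groupby(level=0)
    .any()
)
result = expertise_df[mask]
```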
zabop

If you want to check for multiple values (['def', 'wer'] for example), you could use apply with the negation of frozenset.isdisjoint.

mask = ~expertise_df['areas'].apply(frozenset(['def', 'wer']).isdisjoint)
res = expertise_df[mask]
print(res)

Output

   EmpId            areas
0    123  [abc, def, efg]
3    126       [wer, abc]
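The union case from the question falls out of this approach for free: all the search terms go into one frozenset, so checking 1,000 strings costs the same single pass over the frame as checking two. A sketch, with a hypothetical search_terms set standing in for the real search list:

```python
import pandas as pd

expertise_df = pd.DataFrame(
    [[123, ['abc', 'def', 'efg']],
     [124, ['abc', 'qwe', 'mno']],
     [124, ['abc', 'qwe', 'mno']],
     [126, ['wer', 'abc']]],
    columns=['EmpId', 'areas'])

# one set holding every search term; a row is kept if its areas share
# at least one element with it, i.e. the two sets are not disjoint
search_terms = frozenset(['def', 'wer', 'pqr'])

mask = ~expertise_df['areas'].apply(search_terms.isdisjoint)
res = expertise_df[mask]
```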
Dani Mesejo

You can use set intersection to filter the rows you want.

import pandas as pd

data = [
    [123, ['abc', 'def', 'efg']],
    [124, ['abc', 'qwe', 'mno']],
    [124, ['abc', 'qwe', 'mno']],
    [126, ['wer', 'abc']],
]
expertise_df = pd.DataFrame(data, columns=['EmpId', 'areas'])

filter_set = {'def','wer'}
filtered_df = expertise_df[expertise_df['areas']
    .apply(lambda el: bool(filter_set.intersection(el)))]

print(filtered_df)

Output:

   EmpId            areas
0    123  [abc, def, efg]
3    126       [wer, abc]
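If the same 200,000-row frame is filtered many times with different search sets, converting the list column to frozensets once and reusing them avoids rebuilding a set per row on every query. A minimal sketch, with a hypothetical filter_by helper:

```python
import pandas as pd

data = [
    [123, ['abc', 'def', 'efg']],
    [124, ['abc', 'qwe', 'mno']],
    [124, ['abc', 'qwe', 'mno']],
    [126, ['wer', 'abc']],
]
expertise_df = pd.DataFrame(data, columns=['EmpId', 'areas'])

# pay the per-row set construction cost once, up front
area_sets = expertise_df['areas'].apply(frozenset)

def filter_by(terms):
    """Rows whose areas intersect `terms` (set membership, not substrings)."""
    wanted = frozenset(terms)
    return expertise_df[area_sets.apply(lambda s: not s.isdisjoint(wanted))]

qwe_rows = filter_by({'qwe'})
union_rows = filter_by({'abc', 'pqr'})  # contains 'abc' OR 'pqr'
```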
LevB