2

I have the following example pandas dataframe:

df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a1', 'a2', 'a4'],
                   'B': ['b1', 'b1', 'b2', 'b3', 'b6', 'b6']})

I want to find the values in A that have association with all of the values of an input list in B.

For example, for an input lst = ['b1', 'b6'], the desired output is ['a2']. I have tried df[df['B'].isin(lst)] from here but it is indeed not yet sufficient, or may not be even necessary.

Reveille
  • 4,359
  • 3
  • 23
  • 46

3 Answers3

1

We can do it groupby + filter with isin + all

df.groupby('A').filter(lambda x : pd.Series(lst).isin(x['B']).all())
    A   B
1  a2  b1
4  a2  b6

Or

s=pd.crosstab(df.A,df.B).ge(1)[lst].all(1) # you have the series 
s # s[s].index
A
a1    False
a2     True
a3    False
a4    False
dtype: bool
BENY
  • 317,841
  • 20
  • 164
  • 234
1

First filter df on each element of lst using a list comprehension

a_list = [ df.loc[df['B']==el,'A'].tolist() for el in lst]
# [['a1', 'a2'], ['a2', 'a4']]

The values you are looking for are the intersection of all lists in a_list

list(set.intersection(*[set(x) for x in a_list]))
# ['a2']
fmarm
  • 4,209
  • 1
  • 17
  • 29
  • Accepted this answer because it is the fastest on my actual dataframe with ~ 2M rows, 1M unique A values, and 60 unique B values. The other two answers also work perfectly, but this one takes < 1s to run, while the other two take > 10s. Depending on your data/preferences, one may use any of the three. – Reveille Apr 08 '20 at 17:52
1

Using set.issubset to check if one list is in the other:

m = df.groupby('A')['B'].apply(lambda x: set(lst).issubset(set(x)))
df[df['A'].map(m)]

    A   B
0  a1  b1
1  a2  b1
3  a1  b3
4  a2  b6
7  a2  b7
8  a1  b6
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • Thanks! The first one seem to be not correct as "a4" is also wrongly returned. The second one, seem to be working. I am checking for efficiency now on the actual large dataframe. – Reveille Apr 07 '20 at 23:18
  • Yes, you are right, I made an edit. First solution should be also quite efficient now – Erfan Apr 07 '20 at 23:23
  • It does not work on my actual data. I found this example now, for which, again it does not work for your reference: `df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a1', 'a2', 'a4', 'a5', 'a2', 'a1'], 'B': ['b1', 'b1', 'b2', 'b3', 'b6', 'b6', 'b7', 'b7', 'b6']})` the output should now include a1 and a2. But yours (first one) seem to return empty. The two other answers seem to be working, but I want to share this feedback with you. – Reveille Apr 08 '20 at 00:07