Find values in one column that have association with certain values in another column

Question

I have the following example pandas dataframe:

df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a1', 'a2', 'a4'],
                   'B': ['b1', 'b1', 'b2', 'b3', 'b6', 'b6']})

I want to find the values in A that have association with all of the values of an input list in B.

For example, for an input lst = ['b1', 'b6'], the desired output is ['a2']. I have tried df[df['B'].isin(lst)] from here but it is indeed not yet sufficient, or may not be even necessary.

score 1 · Answer 1 · answered Apr 07 '20 at 22:51

We can do it groupby + filter with isin + all

df.groupby('A').filter(lambda x : pd.Series(lst).isin(x['B']).all())
    A   B
1  a2  b1
4  a2  b6

Or

s=pd.crosstab(df.A,df.B).ge(1)[lst].all(1) # you have the series 
s # s[s].index
A
a1    False
a2     True
a3    False
a4    False
dtype: bool

score 1 · Accepted Answer · answered Apr 07 '20 at 22:55

1

First filter df on each element of lst using a list comprehension

a_list = [ df.loc[df['B']==el,'A'].tolist() for el in lst]
# [['a1', 'a2'], ['a2', 'a4']]

The values you are looking for are the intersection of all lists in a_list

list(set.intersection(*[set(x) for x in a_list]))
# ['a2']

answered Apr 07 '20 at 22:55

fmarm

4,209
1
17
29

Accepted this answer because it is the fastest on my actual dataframe with ~ 2M rows, 1M unique A values, and 60 unique B values. The other two answers also work perfectly, but this one takes < 1s to run, while the other two take > 10s. Depending on your data/preferences, one may use any of the three. – Reveille Apr 08 '20 at 17:52

Erfan · Answer 3 · 2020-04-08T08:21:24.103

1

Using set.issubset to check if one list is in the other:

m = df.groupby('A')['B'].apply(lambda x: set(lst).issubset(set(x)))
df[df['A'].map(m)]

    A   B
0  a1  b1
1  a2  b1
3  a1  b3
4  a2  b6
7  a2  b7
8  a1  b6

edited Apr 08 '20 at 08:21

answered Apr 07 '20 at 22:58

Erfan

40,971
8
66
78

Thanks! The first one seem to be not correct as "a4" is also wrongly returned. The second one, seem to be working. I am checking for efficiency now on the actual large dataframe. – Reveille Apr 07 '20 at 23:18
Yes, you are right, I made an edit. First solution should be also quite efficient now – Erfan Apr 07 '20 at 23:23
It does not work on my actual data. I found this example now, for which, again it does not work for your reference: `df = pd.DataFrame({'A': ['a1', 'a2', 'a3', 'a1', 'a2', 'a4', 'a5', 'a2', 'a1'], 'B': ['b1', 'b1', 'b2', 'b3', 'b6', 'b6', 'b7', 'b7', 'b6']})` the output should now include a1 and a2. But yours (first one) seem to return empty. The two other answers seem to be working, but I want to share this feedback with you. – Reveille Apr 08 '20 at 00:07

Find values in one column that have association with certain values in another column

3 Answers3