1

How can I use .isin for pandas where it will use values from each of the rows in the dataframe, and not static values.

For example lets say we have dataframe like:

import pandas as pd
import datetime

l = []

for i in range(100000):
    d = {'a':i,'b':{1,2,3},'c':0}
    l.append(d)

df = pd.DataFrame(l)

If I use .isin, it can only take 1 list of values (in this example {1,2,3}) and will be compared to each of the values in the column you want to compare (ie df['a'])

test = df['a'].isin({1,2,3})

If I want to compare each value of the column 'b' if values in 'a' is in df['b'] I can do the following below:

def check(a, b):
    return a in b

test = list(map(check, df['a'], df['b']))

Of course in this example all values in df['b'] is the same, but can pretend it is not.

Unfortunately this is about 5x slower than just using the .isin. My question is, is there a way to use .isin but for each of the values in df['b]? Or dont have to necessarily use .isin, but what would be a more efficient way to do it?

user1179317
  • 2,693
  • 3
  • 34
  • 62
  • Does this answer your question? [Check if a value in one column is in a list in another column](https://stackoverflow.com/questions/32233069/check-if-a-value-in-one-column-is-in-a-list-in-another-column) – G. Anderson Nov 05 '19 at 21:13
  • Somewhat a solution, but the .apply is much slower than the map() solution – user1179317 Nov 05 '19 at 21:18

1 Answers1

3

You can use DataFrame.apply with in here:

df.apply(lambda x: x['a'] in x['b'], axis=1)
0        False
1         True
2         True
3         True
4        False
         ...  
99995    False
99996    False
99997    False
99998    False
99999    False
Length: 100000, dtype: bool

Or list_comprehension with zip which is faster:

[a in b for a, b in zip(df['a'], df['b'])]
[False,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 ...]

Timings:

%%timeit
def check(a, b):
    return a in b

list(map(check, df['a'], df['b']))

28.6 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
[a in b for a, b in zip(df['a'], df['b'])]

22.5 ms ± 851 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.apply(lambda x: x['a'] in x['b'], axis=1)

2.27 s ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Erfan
  • 40,971
  • 8
  • 66
  • 78
  • 1
    The .apply solution is almost 100x slower than the map() solution. But the list comprehension solution is a bit faster than the map() one – user1179317 Nov 05 '19 at 21:17
  • Not faster but similar `df.b >= [{x} for x in df.a]` Then a little bit different `from operator import contains; [*map(contains, df.b, df.a)]` – piRSquared Nov 05 '19 at 22:49