I have a pandas DataFrame with a tuple column. I would like a mask identifying, for each row, whether any of the values in the tuple column matches any value in a predetermined tuple. My attempt is below:

import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': (2, 3, 4)}, {'a': 5, 'b': (6, 7, 8)}])
print(df)

codes = (3, 4, 20, 22)
mask = df.b.str.contains_any(codes)  # This line is incorrect

Desired output:

0     True
1    False

I was hopeful, based on https://stackoverflow.com/a/51689894/10499953, that the str functions would work for tuples, but I couldn't get that to work even for a single value from codes:

a = df['has_code'] = df['b'].str.contains(4)

gives

TypeError: first argument must be string or compiled pattern.
– Attila the Fun

4 Answers


Try this:

res = df['b'].apply(lambda x: any(val in x for val in codes))
print(res)

Output:

0     True
1    False
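
For reference, an explode-based variant (not part of the original answer) should give the same mask, assuming pandas >= 0.25, which introduced Series.explode; each tuple is unpacked into one row per element, tested with isin, then re-aggregated per original row:

res = df['b'].explode().isin(codes).groupby(level=0).any()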
– deadshot

Another option

df['b'].apply(lambda x: any(set(x).intersection(codes)))
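
One caveat with this spelling: any() iterates over the elements of the intersection, so if 0 were one of the codes, a row matching only 0 would come back False. Testing the set's truthiness directly avoids that edge case:

df['b'].apply(lambda x: bool(set(x).intersection(codes)))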
– Adam Zeldin

You can map the bound set.intersection method and cast the result with astype(bool); calling the bound method directly avoids the per-row overhead of a Python lambda:

code = set(codes)                         # build the set once
df.b.map(code.intersection).astype(bool)  # empty intersection -> False

0     True
1    False
Name: b, dtype: bool
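
As an aside (not from the original answers): to attach the mask as the has_code column the question sketches, any of these results can be assigned directly; set.isdisjoint is one more equivalent spelling, reusing the code set defined above:

df['has_code'] = ~df['b'].map(code.isdisjoint)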

Timeit analysis

# setup
import numpy as np
import pandas as pd

o = [np.random.randint(0, 10, (3,)) for _ in range(10_000)]
len(o)
# 10000

s = pd.Series(o)
s
0       [6, 2, 5]
1       [7, 4, 0]
2       [1, 8, 2]
3       [4, 8, 9]
4       [7, 3, 4]
          ...
9995    [3, 9, 4]
9996    [6, 2, 9]
9997    [2, 0, 5]
9998    [5, 0, 7]
9999    [7, 4, 2]
Length: 10000, dtype: object

# Adam's answer
In [38]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
19.1 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# deadshot's answer
In [39]: %timeit s.apply(lambda x: any(val in x for val in codes))
83.8 ms ± 974 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#My answer
In [42]: %%timeit
    ...: code = set(codes)
    ...: s.map(code.intersection).astype(bool)
    ...:
    ...:
15.5 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#wwnde's answer
In [74]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
19.5 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

For Series of size 1 million

bigger_o = np.repeat(o,100,axis=0)
bigger_o.shape
# (1000000, 3)
s = pd.Series(list(bigger_o))
s
0         [6, 2, 5]
1         [6, 2, 5]
2         [6, 2, 5]
3         [6, 2, 5]
4         [6, 2, 5]
            ...
999995    [7, 4, 2]
999996    [7, 4, 2]
999997    [7, 4, 2]
999998    [7, 4, 2]
999999    [7, 4, 2]
Length: 1000000, dtype: object

In [54]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
1.89 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: %timeit s.apply(lambda x: any(val in x for val in codes))
8.9 s ± 652 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [56]: %%timeit
    ...: code = set(codes)
    ...: s.map(code.intersection).astype(bool)
    ...:
    ...:
1.54 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [79]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
1.95 s ± 88.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
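
Since every row in this benchmark holds exactly three values, a fully vectorized NumPy variant is also possible (an aside, not one of the answers above); this sketch assumes equal-length rows, so it does not apply to ragged tuples:

arr = np.stack(s.to_numpy())                 # shape (1_000_000, 3)
res = pd.Series(np.isin(arr, codes).any(axis=1), index=s.index)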
– Ch3steR
df.b.apply(lambda x: len([*{*x} & {*codes}]) > 0)  # my preferred, speed-wise
# df.b.apply(lambda x: [*{*x} & {*codes}]).str.len() > 0  # works as well

0     True
1    False
Name: b, dtype: bool

%timeit df.b.apply(lambda x:len([*{*x}&{*codes}])>0)
220 µs ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
code = set(codes)
df.b.map(code.intersection).astype(bool)

364 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['b'].apply(lambda x: any(val in x for val in codes))
210 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['b'].apply(lambda x: any(set(x).intersection(codes)))
211 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
– wwnde
  • Added timeit results ;) – Ch3steR Jun 26 '20 at 07:03
  • done, just added – wwnde Jun 26 '20 at 07:06
  • You should always benchmark with larger dataframes; then you can see how each proposed solution behaves. For example, my solution is slower than all of the others on small data but faster on large data, while deadshot's solution works well on small data but is almost 8x slower on large data. – Ch3steR Jun 26 '20 at 07:18
  • interesting, gained no vote – wwnde Jun 26 '20 at 13:56