I have a pandas DataFrame with a tuple column. I would like a mask identifying, for each row, whether any of the values in the tuple column matches any value in a predetermined tuple. My attempt is below:

import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': (2, 3, 4)}, {'a': 5, 'b': (6, 7, 8)}])
print(df)

codes = (3, 4, 20, 22)
mask = df.b.str.contains_any(codes)  # This line is incorrect

Desired output:

0     True
1    False

I was hopeful, based on https://stackoverflow.com/a/51689894/10499953, that the str functions would work for tuples, but I couldn't get that to work even for a single value from codes:

a = df['has_code'] = df['b'].str.contains(4)

gives

TypeError: first argument must be string or compiled pattern.
– Attila the Fun

4 Answers


Try this:

res = df['b'].apply(lambda x: any(val in x for val in codes))
print(res)

Output:

0     True
1    False
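
For reference, an explode-based variant (not part of the original answer) should give the same mask, assuming pandas >= 0.25, which introduced Series.explode; each tuple is unpacked into one row per element, tested with isin, then re-aggregated per original row:

res = df['b'].explode().isin(codes).groupby(level=0).any()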
– deadshot

Another option

df['b'].apply(lambda x: any(set(x).intersection(codes)))
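
One caveat with this spelling: any() iterates over the elements of the intersection, so if 0 were one of the codes, a row matching only 0 would come back False. Testing the set's truthiness directly avoids that edge case:

df['b'].apply(lambda x: bool(set(x).intersection(codes)))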
– Adam Zeldin

You can map the bound set.intersection method and cast the result with astype(bool); calling the bound method directly avoids the per-row overhead of a Python lambda:

code = set(codes)                         # build the set once
df.b.map(code.intersection).astype(bool)  # empty intersection -> False

0     True
1    False
Name: b, dtype: bool
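
As an aside (not from the original answers): to attach the mask as the has_code column the question sketches, any of these results can be assigned directly; set.isdisjoint is one more equivalent spelling, reusing the code set defined above:

df['has_code'] = ~df['b'].map(code.isdisjoint)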

Timeit analysis

# setup
import numpy as np
import pandas as pd

o = [np.random.randint(0, 10, (3,)) for _ in range(10_000)]
len(o)
# 10000

s = pd.Series(o)
s
0       [6, 2, 5]
1       [7, 4, 0]
2       [1, 8, 2]
3       [4, 8, 9]
4       [7, 3, 4]
          ...
9995    [3, 9, 4]
9996    [6, 2, 9]
9997    [2, 0, 5]
9998    [5, 0, 7]
9999    [7, 4, 2]
Length: 10000, dtype: object

# Adam's answer
In [38]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
19.1 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# deadshot's answer
In [39]: %timeit s.apply(lambda x: any(val in x for val in codes))
83.8 ms ± 974 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

#My answer
In [42]: %%timeit
    ...: code = set(codes)
    ...: s.map(code.intersection).astype(bool)
    ...:
    ...:
15.5 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#wwnde's answer
In [74]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
19.5 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

For Series of size 1 million

bigger_o = np.repeat(o,100,axis=0)
bigger_o.shape
# (1000000, 3)
s = pd.Series(list(bigger_o))
s
0         [6, 2, 5]
1         [6, 2, 5]
2         [6, 2, 5]
3         [6, 2, 5]
4         [6, 2, 5]
            ...
999995    [7, 4, 2]
999996    [7, 4, 2]
999997    [7, 4, 2]
999998    [7, 4, 2]
999999    [7, 4, 2]
Length: 1000000, dtype: object

In [54]: %timeit s.apply(lambda x: any(set(x).intersection(codes)))
1.89 s ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [55]: %timeit s.apply(lambda x: any(val in x for val in codes))
8.9 s ± 652 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [56]: %%timeit
    ...: code = set(codes)
    ...: s.map(code.intersection).astype(bool)
    ...:
    ...:
1.54 s ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [79]: %timeit s.apply(lambda x:len([*{*x}&{*codes}])>0)
1.95 s ± 88.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
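
Since every row in this benchmark holds exactly three values, a fully vectorized NumPy variant is also possible (an aside, not one of the answers above); this sketch assumes equal-length rows, so it does not apply to ragged tuples:

arr = np.stack(s.to_numpy())                 # shape (1_000_000, 3)
res = pd.Series(np.isin(arr, codes).any(axis=1), index=s.index)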
– Ch3steR
df.b.apply(lambda x: len([*{*x} & {*codes}]) > 0)  # my preferred, speed-wise
# df.b.apply(lambda x: [*{*x} & {*codes}]).str.len() > 0  # works as well

0     True
1    False
Name: b, dtype: bool

%timeit df.b.apply(lambda x:len([*{*x}&{*codes}])>0)
220 µs ± 2.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
code = set(codes)
df.b.map(code.intersection).astype(bool)

364 µs ± 1.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['b'].apply(lambda x: any(val in x for val in codes))
210 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df['b'].apply(lambda x: any(set(x).intersection(codes)))
211 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
– wwnde
  • Added timeit results ;) – Ch3steR Jun 26 '20 at 07:03
  • done, just added – wwnde Jun 26 '20 at 07:06
  • You should always benchmark with larger dataframes; then you can see how each proposed solution behaves. For example, my solution is slower than all of the others on small data but faster on large data, while deadshot's solution works well on small data but is almost 8x slower on large data. – Ch3steR Jun 26 '20 at 07:18
  • interesting, gained no vote – wwnde Jun 26 '20 at 13:56