Vectorized version of pd.Series.str.contains

Question

Given two identically indexed pd.Series of strings, what's the most efficient way to check if each element of the first pd.Series is a substring of the corresponding element of the second pd.Series?

Example:

import pandas as pd

s1 = pd.Series(['cat', 'dog', 'ham'])
s2 = pd.Series(['catbird', 'frog', 'hamster'])

pd.Series([t[0] in t[1] for t in zip(s1, s2)], index=s1.index)

yields

0     True
1    False
2     True
dtype: bool

What you have done is good. Do you have any specific problem? Like, the code is slow for large series. BTW `str.contains` is a [vectorized string method](https://pandas.pydata.org/docs/getting_started/basics.html#vectorized-string-methods). — Vishnudev Krishnadas, Mar 11 '20 at 07:51
I think it's a good solution; a small modification would be `pd.Series([t in v for t, v in zip(s1, s2)], index=s1.index)` — jezrael, Mar 11 '20 at 07:58
@Vishnudev No, I just want to make sure I'm not overlooking a built-in or easier way. — Rations, Mar 11 '20 at 08:42

score 5 · Accepted Answer · answered Mar 11 '20 at 08:04

I think your solution is good, because the pandas .str functions also use loops internally (and additionally handle missing values), so they are sometimes slower.
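
To illustrate the missing-value point: the plain comprehension raises a TypeError if an element is NaN or None, because `in` is not defined for those values. The following is a minimal sketch of a variant that tolerates missing values, assuming a missing input should produce a missing result. The helper name `contains_elementwise` is made up for this example and is not part of pandas.

```python
import pandas as pd


def contains_elementwise(needles, haystacks):
    """Hypothetical helper: is needles[i] a substring of haystacks[i]?

    Rows where either value is not a string yield pd.NA instead of raising.
    Assumes both Series have the same length and ordering.
    """
    result = [
        (n in h) if isinstance(n, str) and isinstance(h, str) else pd.NA
        for n, h in zip(needles, haystacks)
    ]
    # The nullable "boolean" dtype can hold True, False and pd.NA together.
    return pd.Series(result, index=needles.index, dtype="boolean")


s1 = pd.Series(["cat", "dog", None])
s2 = pd.Series(["catbird", "frog", "hamster"])
out = contains_elementwise(s1, s2)
print(out)
```

The extra `isinstance` checks add some overhead per element, so this is only worth it when the data can actually contain missing values.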

I made a small modification to the solution, unpacking each tuple into the variables t and v; on the test data it is a bit faster:

import string

import numpy as np
import pandas as pd

np.random.seed(2020)

N = 10000
s1 = pd.Series(np.random.choice(list(string.ascii_letters), size=N))
s2 = pd.DataFrame(np.random.choice(list(string.ascii_letters), size=(N, 3))).sum(axis=1)

In [82]: %timeit (pd.Series([t[0] in t[1] for t in zip(s1, s2)], index=s1.index))
3.47 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [83]: %timeit (pd.Series([t in v for t, v in zip(s1, s2)], index=s1.index))
2.89 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
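
The `%timeit` lines above need IPython. As a rough sketch, the same comparison can be run as a plain script with the standard-library `timeit` module. The function names `by_index` and `by_unpacking` are made up here, and `s2` is built with `"".join` rather than `DataFrame.sum` (an assumption made to keep the script independent of how a given pandas version sums string columns). Absolute timings will differ by machine and library version.

```python
import string
import timeit

import numpy as np
import pandas as pd

np.random.seed(2020)

N = 10000
letters = list(string.ascii_letters)
s1 = pd.Series(np.random.choice(letters, size=N))
# Join three random letters per row into one 3-character string.
s2 = pd.Series(["".join(row) for row in np.random.choice(letters, size=(N, 3))])


def by_index():
    # Original form: index into each zipped tuple.
    return pd.Series([t[0] in t[1] for t in zip(s1, s2)], index=s1.index)


def by_unpacking():
    # Modified form: unpack each tuple into two names.
    return pd.Series([t in v for t, v in zip(s1, s2)], index=s1.index)


# Both variants must agree before their timings are worth comparing.
assert by_index().equals(by_unpacking())

for fn in (by_index, by_unpacking):
    # Take the best of 5 repeats, 20 calls each, and report the per-call time.
    best = min(timeit.repeat(fn, number=20, repeat=5)) / 20
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```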

That seems quite significant @jezrael. – Vishnudev Krishnadas Mar 11 '20 at 08:45

dzang · Answer 2 · 2020-03-11T11:01:06.333

score 0

Also:

import numpy as np
import pandas as pd
import string

np.random.seed(2020)

N = 10000
s1 = pd.Series(np.random.choice(list(string.ascii_letters), size=N))
s2 = pd.DataFrame(np.random.choice(list(string.ascii_letters), size=(N, 3))).sum(axis=1)

%%timeit
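# s1 is turned into a one-column DataFrame so that apply(axis=1) receives each row;
# x[0] is the row's string and x.name its index label, used to look up s2.
s1.to_frame().apply(lambda x: x[0] in s2.loc[x.name], axis=1)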

218 ms ± 8.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Maybe not the best approach :)
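
For comparison, a row-wise `apply` can also be written by putting both Series side by side in a DataFrame. This is a minimal sketch on the question's example data; the column names `needle` and `haystack` are made up for illustration. It returns the same result as the list comprehension, but `apply(axis=1)` builds a Series object for every row, which adds considerable per-row overhead.

```python
import pandas as pd

s1 = pd.Series(["cat", "dog", "ham"])
s2 = pd.Series(["catbird", "frog", "hamster"])

# Hypothetical column names, chosen only for readability.
df = pd.DataFrame({"needle": s1, "haystack": s2})

# axis=1 passes each row to the lambda as a Series.
via_apply = df.apply(lambda row: row["needle"] in row["haystack"], axis=1)

# The list comprehension from the accepted answer, for reference.
via_comprehension = pd.Series([t in v for t, v in zip(s1, s2)], index=s1.index)

assert via_apply.equals(via_comprehension)
```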

edited Mar 11 '20 at 11:01

answered Mar 11 '20 at 07:56


Have you measured it? See: https://stackoverflow.com/questions/47749018/why-is-pandas-apply-lambda-slower-than-loop-here – sophros Mar 11 '20 at 07:57
I got `TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')`. Does it work for you? – jezrael Mar 11 '20 at 08:06
@jezrael Sorry, I hadn't gotten around to testing it. It's working now. – dzang Mar 11 '20 at 11:02
