
Given two identically indexed pd.Series of strings, what's the most efficient way to check if each element of the first pd.Series is a substring of the corresponding element of the second pd.Series?

Example:

s1 = pd.Series(['cat', 'dog', 'ham'])
s2 = pd.Series(['catbird', 'frog', 'hamster'])  

pd.Series([t[0] in t[1] for t in zip(s1, s2)], index=s1.index)

yields

0     True
1    False
2     True
dtype: bool
Rations

2 Answers


I think your solution is good, because the pandas .str functions also loop internally (and additionally handle missing values), so they are sometimes slower.
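To illustrate the point about missing values: the plain `in` test raises a TypeError as soon as either side is NaN. A minimal sketch of a tolerant variant follows; the helper name `is_substring` and the choice to treat missing values as False are assumptions for illustration, not part of the question.

```python
import numpy as np
import pandas as pd

def is_substring(s1, s2):
    """Element-wise `a in b` for two identically indexed Series.

    Hypothetical helper: any pair containing a non-string (e.g. NaN)
    is reported as False instead of raising a TypeError.
    """
    return pd.Series(
        [isinstance(a, str) and isinstance(b, str) and a in b
         for a, b in zip(s1, s2)],
        index=s1.index,
    )

s1 = pd.Series(['cat', 'dog', np.nan])
s2 = pd.Series(['catbird', 'frog', 'hamster'])
result = is_substring(s1, s2)
```

The extra isinstance checks cost some time, so they are only worth adding when the data can actually contain missing values.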

I changed the solution slightly, unpacking each tuple into the variables t and v; on the test data this is a bit faster:

import string

import numpy as np
import pandas as pd

np.random.seed(2020)

N = 10000
s1 = pd.Series(np.random.choice(list(string.ascii_letters), size=N))
s2 = pd.DataFrame(np.random.choice(list(string.ascii_letters), size=(N, 3))).sum(axis=1)

In [82]: %timeit (pd.Series([t[0] in t[1] for t in zip(s1, s2)], index=s1.index))
3.47 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [83]: %timeit (pd.Series([t in v for t, v in zip(s1, s2)], index=s1.index))
2.89 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
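The comparison can also be reproduced outside IPython. The script below is a rough sketch using the standard `timeit` module; the function names `by_index` and `by_unpacking` are made up for illustration, and s2 is built with `''.join` rather than `DataFrame.sum`, which is a substitution and not the setup timed above. Timings will vary by machine and pandas version.

```python
import string
import timeit

import numpy as np
import pandas as pd

np.random.seed(2020)
N = 10000
letters = list(string.ascii_letters)

# One random letter per row in s1, a random three-letter string per row in s2.
s1 = pd.Series(np.random.choice(letters, size=N))
s2 = pd.Series([''.join(row) for row in np.random.choice(letters, size=(N, 3))])

def by_index():
    # Original form: index into each zipped tuple.
    return pd.Series([t[0] in t[1] for t in zip(s1, s2)], index=s1.index)

def by_unpacking():
    # Modified form: unpack each tuple into two names.
    return pd.Series([t in v for t, v in zip(s1, s2)], index=s1.index)

# Both variants must agree before their timings are worth comparing.
same = by_index().equals(by_unpacking())

for func in (by_index, by_unpacking):
    seconds = timeit.timeit(func, number=20) / 20
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```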
jezrael

Also:

import numpy as np
import pandas as pd
import string

np.random.seed(2020)

N = 10000
s1 = pd.Series(np.random.choice(list(string.ascii_letters), size=N))
s2 = pd.DataFrame(np.random.choice(list(string.ascii_letters), size=(N, 3))).sum(axis=1)

%%timeit
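# s1 is a Series, so convert it to a one-column frame for a row-wise apply;
# s2 is also a Series, so it is looked up by the row label only
s1.to_frame().apply(lambda x: x[0] in s2.loc[x.name], axis=1)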

218 ms ± 8.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Maybe not the best approach :)

dzang
  • Have you measured it? See: https://stackoverflow.com/questions/47749018/why-is-pandas-apply-lambda-slower-than-loop-here – sophros Mar 11 '20 at 07:57
  • I got `TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')`, for you working? – jezrael Mar 11 '20 at 08:06
  • @jezrael didn't get to test it sorry. now it's working. – dzang Mar 11 '20 at 11:02