
I have a DataFrame with 83k rows and a column "Text" that I have to search for ~200 masks. Is there a way to pass a column to .str.contains()? I'm able to do it like this:

import time

start = time.time()
counts = [a["Text"].str.contains(m).sum() for m in b["mask"].values]
print(time.time() - start)

But it's taking 34.013s. Is there any faster way?

Edit: b["mask"] looks like:

'PR347856|P5478'

'BS7623|B5763'

and I want the count of occurrences for each mask, so I can't join them.

Edit:

a["Text"] contains strings roughly three sentences long.

TobSta
  • You could try something like `a["Text"].str.contains("|".join(b["mask"].values))` where the `"|"` is the regex OR operator. – pault May 01 '18 at 15:02
  • A [mcve] would be helpful here. You're using a list comp to get a list of ints (the sums) yet you're asking about passing a column to `str.contains()`. I'm not sure what your desired output is. – pault May 01 '18 at 15:04
  • "so i can't join them.", well, technically you can, as long as you escape those pipe characters (use `re.escape` or the like) and you can get it done. Although I'd recommend trying what I've outlined in my answer below, and then revisiting the regex if that is too slow for you. – cs95 May 01 '18 at 15:11
  • *Note*: There is a solution [described by @unutbu](https://stackoverflow.com/a/48600345/9209546) which is more efficient than using `pd.Series.str.contains`. If performance is an issue, then this may be worth investigating. – jpp May 06 '18 at 22:09
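To make the question concrete, here is a minimal reproducible sketch of the per-mask counting from the question, using toy stand-ins for the real frames (the names `a` and `b` and the column names come from the question; the sample rows are invented):

```python
import pandas as pd

# Toy stand-ins for the real 83k-row frame and the ~200 masks.
a = pd.DataFrame({"Text": ["order PR347856 shipped",
                           "ref BS7623 pending",
                           "no ids here"]})
b = pd.DataFrame({"mask": ["PR347856|P5478", "BS7623|B5763"]})

# Each mask is itself a small regex alternation, so str.contains
# treats the embedded "|" as OR. One count per mask:
counts = [a["Text"].str.contains(m).sum() for m in b["mask"].values]
print(counts)  # [1, 1]
```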

1 Answer


Maybe you can vectorize the containment operation.

text_contains = a['Text'].str.contains
b['mask'].map(lambda m: text_contains(m).sum())
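For illustration, a self-contained sketch of this mapping on toy data (the frame contents are assumptions, not the asker's real data):

```python
import pandas as pd

a = pd.DataFrame({"Text": ["alpha PR347856", "beta B5763", "gamma"]})
b = pd.DataFrame({"mask": ["PR347856|P5478", "BS7623|B5763"]})

# Bind the accessor method once, then map each mask to its count.
text_contains = a["Text"].str.contains
counts = b["mask"].map(lambda m: text_contains(m).sum())
print(counts.tolist())  # [1, 1]
```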
BallpointBen