
I have a DataFrame with 83k rows and a column "Text" that I have to search for ~200 masks. Is there a way to pass a column to .str.contains()? I'm able to do it like this:

import time

start = time.time()
counts = [a["Text"].str.contains(m).sum() for m in b["mask"].values]
print(time.time() - start)

But it's taking 34.013s. Is there any faster way?

Edit: b["mask"] looks like:

'PR347856|P5478'

'BS7623|B5763'

and I want the count of occurrences for each mask, so I can't join them.

Edit:

a["Text"] contains strings roughly three sentences long.

TobSta
  • You could try something like `a["Text"].str.contains("|".join(b["mask"].values))` where the `"|"` is the regex OR operator. – pault May 01 '18 at 15:02
  • A [mcve] would be helpful here. You're using a list comp to get a list of ints (the sums) yet you're asking about passing a column to `str.contains()`. I'm not sure what your desired output is. – pault May 01 '18 at 15:04
  • "so i can't join them.", well, technically you can, as long as you escape those pipe characters (use `re.escape` or the like) and you can get it done. Although I'd recommend trying what I've outlined in my answer below, and then revisiting the regex if that is too slow for you. – cs95 May 01 '18 at 15:11
  • *Note*: There is a solution [described by @unutbu](https://stackoverflow.com/a/48600345/9209546) which is more efficient than using `pd.Series.str.contains`. If performance is an issue, then this may be worth investigating. – jpp May 06 '18 at 22:09
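To make the question concrete, here is a minimal reproducible sketch of the per-mask counting from the question, using toy stand-ins for the real frames (the names `a` and `b` and the column names come from the question; the sample rows are invented):

```python
import pandas as pd

# Toy stand-ins for the real 83k-row frame and the ~200 masks.
a = pd.DataFrame({"Text": ["order PR347856 shipped",
                           "ref BS7623 pending",
                           "no ids here"]})
b = pd.DataFrame({"mask": ["PR347856|P5478", "BS7623|B5763"]})

# Each mask is itself a small regex alternation, so str.contains
# treats the embedded "|" as OR. One count per mask:
counts = [a["Text"].str.contains(m).sum() for m in b["mask"].values]
print(counts)  # [1, 1]
```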

1 Answer


Maybe you can vectorize the containment operation.

text_contains = a['Text'].str.contains
b['mask'].map(lambda m: text_contains(m).sum())
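For illustration, a self-contained sketch of this mapping on toy data (the frame contents are assumptions, not the asker's real data):

```python
import pandas as pd

a = pd.DataFrame({"Text": ["alpha PR347856", "beta B5763", "gamma"]})
b = pd.DataFrame({"mask": ["PR347856|P5478", "BS7623|B5763"]})

# Bind the accessor method once, then map each mask to its count.
text_contains = a["Text"].str.contains
counts = b["mask"].map(lambda m: text_contains(m).sum())
print(counts.tolist())  # [1, 1]
```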
BallpointBen