I am using Python and regex to extract, from a list of tweets, every sentence that contains a given word, for each word in a pandas Series.

My df stocks_df contains certain stock names e.g.

  Symbol
0   $GSX
1  $NVDA
2  $MBRX
5  $BBBY
6   $DIS

I want to extract every sentence in the tweets that contains one of these strings. My attempt builds on an earlier regex question of mine, "Key error when using regex quantifier python".

However, my pattern mostly grabs sentences that have the symbol at the start and misses sentences where the symbol appears in the middle. It also sometimes seems to match only the symbol without the rest of the sentence. My code is:

pattern2 = r'(?:{}) (?:[^.]*[^.]*\.)'.format("|".join(map(re.escape, stocks_df['Symbol'])))

Does anyone understand why full sentences are not being matched?

geds133
  • Not sure what you intended by `[^.]*[^.]*`. Why repeat it? Give us some examples of the input. – Happy Green Kid Naps Jun 02 '20 at 13:28
  • There are not enough details here. If you plan to match "sentences" with no abbreviations that contain the "words" you have, you may try `r'[^.?!]*(?:{})\b[^.?!]*[.?!]'` instead of `r'(?:{}) (?:[^.]*[^.]*\.)'`. – Wiktor Stribiżew Jun 02 '20 at 13:28
  • @WiktorStribiżew Thanks, my explanation may be a little poor but your pattern there is what I was attempting to get. Thanks. – geds133 Jun 02 '20 at 13:32

1 Answer

If you do not have to deal with abbreviations and other messy formats, you may match those strings using

r'[^.?!]*(?:{})\b[^.?!]*[.?!]'.format("|".join(map(re.escape, stocks_df['Symbol'])))

The pattern will look like [^.?!]*(?:\$GSX|\$NVDA|...)\b[^.?!]*[.?!] and will match

  • [^.?!]* - 0 or more chars other than !, ? and .
  • (?:\$GSX|\$NVDA) - a word from the Symbol column
  • \b - a word boundary, so the symbol must end there (e.g. $DIS will not match inside $DISH)
  • [^.?!]* - 0 or more chars other than !, ? and .
  • [.?!] - a ?, ! or .
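
To make the difference concrete, here is a minimal sketch that runs both patterns on a few made-up tweets. The `tweets` list and the plain `symbols` list are assumptions for illustration only (the list stands in for `stocks_df['Symbol']`, which can be passed to `map` the same way). Note that a sentence is only matched if it ends in `.`, `?` or `!`.

```python
import re

# Stand-in for stocks_df['Symbol'] (values taken from the question).
symbols = ["$GSX", "$NVDA", "$MBRX", "$BBBY", "$DIS"]

# Hypothetical sample tweets, invented for this illustration.
tweets = [
    "Markets are choppy. I just bought $NVDA today! Wish me luck.",
    "$GSX is falling again. No idea why.",
    "Is $DISH a buy? Nothing else to report.",
]

alternation = "|".join(map(re.escape, symbols))

# Pattern from the question: the match can only START at the symbol,
# needs a space right after it, and stops at the next "." only.
old_pattern = re.compile(r"(?:{}) (?:[^.]*[^.]*\.)".format(alternation))

# Suggested pattern: text before the symbol, the symbol, a word boundary,
# text after the symbol, then the closing ".", "?" or "!".
new_pattern = re.compile(r"[^.?!]*(?:{})\b[^.?!]*[.?!]".format(alternation))

old_matches = [m for t in tweets for m in old_pattern.findall(t)]
sentences = [m.strip() for t in tweets for m in new_pattern.findall(t)]

print(old_matches)
print(sentences)
```

With the old pattern, the `$NVDA` match begins at the symbol and runs past the `!` to the next `.`, which is why the start of the sentence is lost. With the new pattern the whole sentence is returned, and `$DISH` is skipped because `\b` requires the symbol to end right after `$DIS`. The `strip()` call only removes the leading space left over from the previous sentence.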
Wiktor Stribiżew