
I am trying to capture the words that follow specified stocks in a pandas df. I have several stocks in the format $IBM and am building a Python regex pattern to search each tweet for the 3-5 words following the stock, if it is found.

My df, called stock_news, looks like this:

   Word    Count
0  $IBM    10
1  $GOOGL  8
etc.

pattern = ''
for word in stock_news.Word:
    pattern += '{} (\w+\s*\S*){3,5}|'.format(re.escape(word))

My understanding is that {} should be a quantifier, in my case matching between 3 and 5 times. However, I receive the following KeyError:

KeyError: '3,5'

I have also tried a raw string, r'{} (\w+\s*\S*){3,5}|', but to no avail. The pattern seems to work on regex101, but not in my PyCharm IDE. Any help would be appreciated.

Code for finding:

pat = re.compile(pattern, re.I)

for i in tweet_df.Tweets:
    for x in pat.findall(i):
        print(x)
geds133

1 Answer


When you build your pattern this way, the trailing | leaves an empty alternative at the end, so the pattern can match the empty string at any position where none of the stock alternatives match.
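As a rough illustration of the empty alternative, here is a minimal sketch. The sample text is made up for the demonstration, and the pattern is reduced to a single ticker to keep it short.

```python
import re

# Hypothetical sample text, not taken from the question's data.
text = "buy $IBM"

# A trailing "|" adds an empty alternative, so the pattern can also
# match the empty string wherever "$IBM" does not match.
with_trailing = re.compile(r"\$IBM|")
without_trailing = re.compile(r"\$IBM")

print(with_trailing.findall(text))     # includes empty strings
print(without_trailing.findall(text))  # only the ticker itself
```

Joining the alternatives with "|".join(...) instead of appending "|" after each one avoids the trailing empty alternative.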

You need to build the pattern like

(?:\$IBM|\$GOOGL)\s+(\w+(?:\s+\S+){3,5})

You may use

pattern = r'(?:{})\s+(\w+(?:\s+\S+){{3,5}})'.format(
              "|".join(map(re.escape, stock_news['Word'])))

Mind that literal curly braces inside an f-string or a format string must be doubled; otherwise str.format treats {3,5} as a replacement field, which is what raises KeyError: '3,5'.
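The brace behaviour can be checked in isolation. This is a small sketch using the question's original template and a single ticker; the variable names are only for illustration.

```python
import re

ticker = re.escape("$IBM")

# Single braces: str.format reads "{3,5}" as a replacement field named "3,5".
bad_template = r'{} (\w+\s*\S*){3,5}|'
try:
    bad_template.format(ticker)
    error = None
except KeyError as exc:
    error = exc

# Doubled braces: "{{3,5}}" is emitted as the literal text "{3,5}".
good_template = r'{} (\w+\s*\S*){{3,5}}'
built = good_template.format(ticker)

print(repr(error))
print(built)
```

A raw string does not help here, because the r prefix only changes how backslashes are read; the braces are still interpreted by str.format.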

Regex details

  • (?:\$IBM|\$GOOGL) - a non-capturing group matching either $IBM or $GOOGL
  • \s+ - 1+ whitespaces
  • (\w+(?:\s+\S+){3,5}) - Capturing group 1 (when using findall, only this part will be returned):
    • \w+ - 1+ word chars
    • (?:\s+\S+){3,5} - a non-capturing* group matching three, four or five occurrences of 1+ whitespaces followed by 1+ non-whitespace characters

Note that non-capturing groups are meant to group some patterns, or quantify them, without actually allocating any memory buffer for the values they match, so that you can capture only what you need to return or keep.
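Putting the pieces together, the following is a minimal end-to-end sketch. It assumes a plain list of tickers in place of stock_news['Word'] and a few invented tweets in place of tweet_df.Tweets, so that it runs without pandas.

```python
import re

# Stand-in for stock_news['Word'] (assumption: same values as in the question).
words = ["$IBM", "$GOOGL"]

# Join the escaped tickers with "|" and double the quantifier braces for str.format.
pattern = r'(?:{})\s+(\w+(?:\s+\S+){{3,5}})'.format("|".join(map(re.escape, words)))
pat = re.compile(pattern, re.I)

# Hypothetical tweets, made up for this example.
tweets = [
    "Analysts say $IBM shares could rise sharply after strong earnings today",
    "No ticker here",
    "$GOOGL up",
]

# With a single capturing group, findall returns only that group's text.
results = [pat.findall(tweet) for tweet in tweets]
for tweet, found in zip(tweets, results):
    print(found)
```

Note that the capturing group takes one word plus three to five further tokens, so it returns four to six words in total. If exactly three to five words are wanted, a {2,4} quantifier would be one way to adjust it.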

Wiktor Stribiżew