I am trying to do a regex extract with pandas, using the value from another column as a variable.

import pandas as pd

df = pd.DataFrame({
    'text': [
        "The final is one of the most famous snooker matches of all time and pa",
        "Davis trailed for the first time at the event in the quarter-finals, as he played Terry Griffiths. ",
    ],
    'key': ["snooker", "quarter-finals"],
})

I was thinking of building the pattern as a string and passing it to the function like so:

reg = '((?:\S+\s+){0,10}\b'+'snooker'+'\b\s*(?:\S+\b\s*){0,10})'
df['text'].str.extract(r'reg')

but it generates this error:

ValueError: pattern contains no capture groups

which I assume is due to the syntax of "(r'reg')".

ulrich
  • There are a couple of issues here: 1) word boundaries are set with literal `\b`, not with a backspace char, 2) you cannot place variables into a string literal which is not an f-string like that, but 3) what do you need to do? – Wiktor Stribiżew Apr 27 '21 at 17:22
  • nop I need the "r" in the parameter ```(r'something')``` – ulrich Apr 27 '21 at 17:22
  • Try just `df['text'].str.extract(fr'((?:\S+\s+){{0,10}}\b{keyword_var}\b(?:\s+\S+){{0,10}})')` where `keyword_var` is your alphanumeric word variable. – Wiktor Stribiżew Apr 27 '21 at 17:25
  • yes that works Wiktor Stribiżew thanks – ulrich Apr 27 '21 at 17:28

1 Answer

There are a couple of issues here:

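  • Word boundaries are written as the two-character sequence \b (r"\b"), not as a backspace character ("\b" in a non-raw string).
  • You cannot place a variable inside an ordinary string literal: r'reg' is the literal text "reg", not the contents of the variable reg. Use format() or an f-string instead.
  • The pattern passed to str.extract must contain a capturing group. The literal "reg" has none, which is what raises the ValueError.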
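The three points above can be checked with Python's re module alone. This is only a minimal sketch: the sample text and the variable names (keyword, wrong, right, literal, built) are made up for illustration and are not part of the original question.

```python
import re

# Hypothetical sample data for illustration.
text = "one of the most famous snooker matches of all time"
keyword = "snooker"

# 1) "\b" in a normal string is a single backspace character (0x08);
#    r"\b" is backslash + "b", which the regex engine reads as a word boundary.
backspace = "\b"
boundary = r"\b"

wrong = re.search("\b" + keyword + "\b", text)    # looks for literal backspaces
right = re.search(r"\b" + keyword + r"\b", text)  # looks for the whole word

# 2) and 3) r'reg' is just the three letters "reg": it does not read the
#    variable reg, and it contains no capturing group.
literal = re.compile(r'reg')
built = re.compile(r"(\b" + keyword + r"\b)")

print(len(backspace), len(boundary))
print(wrong, right.group(0))
print(literal.groups, built.groups)
```

Here literal.groups is the number of capturing groups in the compiled pattern, which is the property str.extract complains about when it is zero.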

You can use

df['result'] = df['text'].str.extract(fr'((?:\S+\s+){{0,10}}\b{keyword_var}\b(?:\s+\S+){{0,10}})')

Note:

  • fr'...' - define a raw f-string literal with variable interpolation support and parsing backslashes as literal chars
  • ((?:\S+\s+){{0,10}}\b{keyword_var}\b(?:\s+\S+){{0,10}}) - a pattern with a single capturing group wrapping the whole pattern, this group value will be the return value.
  • If your keyword is not a purely alphanumeric string, you will need to reconsider using word boundaries and will have to escape the contents, e.g. {re.escape(keyword_var)}
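Since the question asks for the keyword to come from another column, here is a rough sketch of one way to do that row by row. It assumes plain re.search per row instead of str.extract (which takes a single pattern for the whole column), and it applies re.escape as suggested in the last note. The helper name extract_context and its width parameter are hypothetical, not part of pandas.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'text': [
        "The final is one of the most famous snooker matches of all time and pa",
        "Davis trailed for the first time at the event in the quarter-finals, as he played Terry Griffiths. ",
    ],
    'key': ["snooker", "quarter-finals"],
})

def extract_context(text, key, width=10):
    """Return up to `width` words on each side of `key`, or None if absent.

    Hypothetical helper: builds the same pattern as above for a single row.
    """
    # Doubled braces produce a literal {0,10} quantifier inside the f-string.
    pattern = rf'((?:\S+\s+){{0,{width}}}\b{re.escape(key)}\b(?:\s+\S+){{0,{width}}})'
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Pair each text with the key from the same row.
df['result'] = [extract_context(t, k) for t, k in zip(df['text'], df['key'])]
print(df['result'].tolist())
```

Note that the trailing part of the pattern, (?:\s+\S+), expects whitespace right after the keyword, so a keyword followed directly by punctuation (as in "quarter-finals,") gets no right-hand context with this pattern.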
Wiktor Stribiżew