How to extract a word before another word in pandas

Question

I've got this pandas DataFrame, which is a description of plays during a football match:

play_id	type	Text
1	pass	Jon pass complete to Ben.
2	pass	Clock 14:52, Jon pass complete to Mitch.
3	rush	Frank rush.

My objective is to create a new column called "passer" with a script that will go through the description in the "text' column, and will take the name that is placed before the word 'pass'. So I first used this:

df['passer'] = df['Text'].str.extract(r'(.*?)pass', expand=False).str.strip()

Which gives me this:

play_id	type	Text	passer
1	pass	Jon pass complete to Ben.	Jon
2	pass	Clock 14:52, Jon pass complete to Mitch.	Clock 14:52, Jon
3	rush	Frank rush.	NaN

It works correctly for the 1st and 3rd playid, but not for the second one, as it takes the clock, that can sometimes be included in the description.

I have tried to implement conditions on the creation of my column, where the code checks if 'Clock' is included in the description or not, and use the correct regex, but this does not work:

conditions = [
    (np.where(df.Text.str.contains('Clock', case=False))),
    (np.where(~df.Text.str.contains('Clock', case=False)))
    ]

choices = [
    df['Text'].str.extract(r', (.*?) pass', expand=False).str.strip(), 
    df['Text'].str.extract('(.*?) pass', expand=False).str.strip()
    ]

df['passerNEW'] = np.select(conditions, choices, default='NaN')
df

I get the following error:

TypeError: invalid entry 0 in condlist: should be boolean ndarray

Is there a way to make this function work? That seemed like a good way to do it, as in other cases I can have three different conditions to check in order to know which regex to use.

Trenton McKinney · Accepted Answer · 2021-02-04T17:28:32.193

Use pandas.Series.str.extract with a positive lookahead conditional.
- See SO: Regex lookahead, lookbehind and atomic groups for a thorough explanation of lookahead.
- \s+pass will work for any amount of whitespace between the name and 'pass' (e.g. 'Jon PaSs')
flags=re.IGNORECASE is used to ignore the case of 'pass'
- df.Text.str.lower().str.extract('(\w+(?=\s+pass))') can be used instead of importing re for the flag.

import pandas as pd
import re

# test dataframe
data = {'play_id': ['1', '2', '3'], 'type': ['pass', 'pass', 'rush'], 'Text': ['Jon PASS complete to Ben.', 'Clock 14:52, Jon pass complete to Mitch.', 'Frank rush.']}
df = pd.DataFrame(data)

# display(df)
play_id type                                     Text
      1 pass                Jon PASS complete to Ben.
      2 pass Clock 14:52, Jon pass complete to Mitch.
      3 rush                              Frank rush.

# extract
df['passer'] = df.Text.str.extract('(\w+(?=\s+pass))', flags=re.IGNORECASE)

# display(df)
play_id type                                     Text passer
      1 pass                Jon PASS complete to Ben.    Jon
      2 pass Clock 14:52, Jon pass complete to Mitch.    Jon
      3 rush                              Frank rush.    NaN

It works perfectly, thank you for the code and the extra information! — Nicotops, Feb 04 '21 at 18:09

xcapt · Answer 2 · 2021-02-04T17:27:36.453

Try to use (?=) a positive lookahead, a type of zero-width assertion. So, the code will look like:

df['passer'] = df['Text'].str.extract(r'(\w+)(?=\spass)', expand=False)

Output:

| play_id  | type  | Text                                     | passer |
| -------- | ----- | ---------------------------------------- | ------ |
| 1        | pass  | Jon pass complete to Ben.                | Jon    |
| 2        | pass  | Clock 14:52, Jon pass complete to Mitch. | Jon    |
| 3        | rush  | Frank rush.                              | NaN    |

How to extract a word before another word in pandas

2 Answers2