Create a DataFrame by Extracting Words with Periods After a Specific Word

Question

I have the following text:

text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."

I need to extract all the sport names (which come after sport:) and styles (which come after style:) and put them into new columns named sports and style. I'm using the following code to extract the main sentence first (the text is sometimes huge):

import re

m = re.split(r'(?<=\.)\s+(?=[A-Z]\w+)', text_main)
text = list(filter(lambda x: re.search(r'leagues were identified', x, flags=re.IGNORECASE), m))[0]
print(text)

The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d.

Then I'm extracting the sport and style names and putting them into a dataframe:

import pandas as pd

# re.findall returns an empty list when 'sport:' is absent, so no guard is needed
sport_list = re.findall(r'sport:\W*(\w+)', text)

df = pd.DataFrame({'sports': sport_list})
print(df)

    sports
0   basketball
1   soccer
2   football

However, I'm having trouble extracting the styles, because every style has a period (.) after the first letter (c) and some contain a > sign. Also, not all sports have style info.

Desired output:

    sports        style
0   basketball    c.123>d
1   soccer        NA
2   football      c.124>d

What would be the smartest way of doing it? Any suggestions would be appreciated. Thanks!

score 1 · Accepted Answer · answered May 18 '22 at 14:35

You can use

\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?

See the regex demo. Details:

\b - a word boundary
sport: - a fixed string
\s* - zero or more whitespaces
(\w+) - Group 1: one or more word chars
(?: - start of an optional non-capturing group:
- (?:(?!\bsport:).)*? - any char other than line break chars, zero or more occurrences but as few as possible, that does not start a whole word sport: char sequence
- \bstyle: - a whole word style and then :
- \s* - zero or more whitespaces
- (\S+) - Group 2: one or more non-whitespace chars
)? - end of the optional non-capturing group.
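To make the role of the `(?:(?!\bsport:).)*?` part more concrete, here is a minimal sketch that compares the pattern above with a simpler variant that uses a plain `.*?` instead. The `sample` string and the `naive` pattern are made up for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer: the style part is optional, and the tempered
# token (?:(?!\bsport:).)*? cannot step over the start of the next "sport:".
tempered = re.compile(r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?')

# Hypothetical simpler variant for comparison: a plain lazy dot.
naive = re.compile(r'\bsport:\s*(\w+)(?:.*?\bstyle:\s*(\S+))?')

# Illustrative input: soccer has no style, football does.
sample = "sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d"

tempered_pairs = tempered.findall(sample)
naive_pairs = naive.findall(sample)

print(tempered_pairs)  # [('soccer', ''), ('football', 'c.124>d')]
print(naive_pairs)     # [('soccer', 'c.124>d')]
```

With the plain lazy dot, the first match runs past `sport: football` to reach the only `style:` in the text, so soccer is given football's style and football is never reported. The tempered version stops at the next `sport:` and leaves Group 2 empty instead.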

See the Python demo:

import re
import pandas as pd
text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."
matches = re.findall(r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?', text_main)
df = pd.DataFrame(matches, columns=['sports', 'style'])

Output:

>>> df
       sports    style
0  basketball   c.123>d
1      soccer          
2    football  c.124>d.
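Note that this output still differs from the desired one in the question: the last style keeps the sentence-final period (`c.124>d.`) and the missing style is an empty string rather than a missing value. A possible clean-up step is sketched below. It assumes that a style value never legitimately ends with a period, which the question does not confirm, and it uses `Series.mask` as one of several ways to blank out the empty strings.

```python
import re
import pandas as pd

text_main = "The following leagues were identified: sport: basketball league: N.B.A. style: c.123>d sport: soccer league: E.P.L. sport: football league: N.F.L. style: c.124>d. The other leagues aren't that important."
pattern = r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?'

df = pd.DataFrame(re.findall(pattern, text_main), columns=['sports', 'style'])

# Assumption: a trailing "." is sentence punctuation, not part of the style.
df['style'] = df['style'].str.rstrip('.')

# Replace the empty strings (sports without a style) with missing values.
df['style'] = df['style'].mask(df['style'] == '')

print(df)
```

After these two steps the style column holds `c.123>d`, a missing value, and `c.124>d`. If a style could really end in a period, the `rstrip` step would have to be dropped or made more selective.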

Wiktor Stribiżew


FYI: if you really need `NaN`s instead of empty strings, add the `df.loc[df['style'] == '', 'style'] = np.nan` line (and make sure you `import numpy as np`). – Wiktor Stribiżew May 18 '22 at 14:40
Thank you @wiktor-stribiżew. You are so insightful and generous to explain. Much appreciated. One more thing: if I have another style2 element `style2: p.b123`, how should I proceed? – Roy May 18 '22 at 14:56
@Roy If it always comes after `style`, `r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+)(?:(?:(?!\bsport:).)*?\bstyle2:\s*(\S+))?)?'`. Note that the `(?:(?!\bsport:).)*?` [tempered greedy token](https://stackoverflow.com/a/37343088/3832970) is used to make sure we stay within one single sports section. – Wiktor Stribiżew May 18 '22 at 15:00
If the info for all styles is there, it comes sequentially. However, sometimes, only `style` is there; sometimes only `style2` @wiktor-stribiżew – Roy May 18 '22 at 15:07
@Roy Then use `r'\bsport:\s*(\w+)(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?(?:(?:(?!\bsport:).)*?\bstyle2:\s*(\S+))?'` – Wiktor Stribiżew May 18 '22 at 16:03
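As a rough check of the last pattern, the sketch below runs it on a made-up string in which `style` and `style2` are each present or absent independently. The `sample` text (including the `tennis` and `p.b456` entries) is hypothetical and only serves to exercise the four combinations.

```python
import re

# Pattern from the comment above: style and style2 are independently optional,
# and style2 (when present) is assumed to come after style.
pattern = re.compile(
    r'\bsport:\s*(\w+)'
    r'(?:(?:(?!\bsport:).)*?\bstyle:\s*(\S+))?'
    r'(?:(?:(?!\bsport:).)*?\bstyle2:\s*(\S+))?'
)

# Hypothetical input covering: both styles, only style2, only style, neither.
sample = (
    "sport: basketball style: c.123>d style2: p.b123 "
    "sport: soccer style2: p.b456 "
    "sport: football style: c.124>d "
    "sport: tennis"
)

rows = pattern.findall(sample)
for row in rows:
    print(row)
```

Each tuple has three items (sport, style, style2), with an empty string where a group did not take part in the match, so the list can be passed to `pd.DataFrame(rows, columns=['sports', 'style', 'style2'])` in the same way as in the answer.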
