Python Pandas Extract text between words and symbols with regex

Question

I am trying to extract certain words between some text and symbol using regex in Python Pandas. The problem is that sometimes there are non-characters after the word I want to extract, and sometimes there is nothing.

Here is the input table.

Here is the expected output.

I've tried this str.extract(r'[a-zA-Z\W]+\s+Reason to buy:([a-zA-Z\s]+)[a-zA-Z\W]+\s+') but does not work.

Any advice is appreciated.

Thanks.

lemon · Accepted Answer · 2023-01-25T21:15:16.167

2

You can use a lookaround to extract your reason to buy:

(?<=Reason to buy: )([^<]+)

And use it in your Python code as follows:

import pandas as pd
import re

df = pd.DataFrame([
    [1, 'Team: Asian<>Reason to buy: leisure'             ],
    [2, 'Team: SouthAmerica<>Reason to buy: educational<>'],
    [3, 'Team: Australia<>Reason to buy: commercial'      ],
    [4, 'Team: North America<>Reason to buy: leisure<>'   ],
    [5, 'Team: Europe<>Reason to buy: leisure<>'          ],
], columns = ['team_id', 'reasons'])

pattern = r'(?<=Reason to buy: )([A-Za-z]+)'

df['reasons'] = df['reasons'].str.extract(pattern)

Output:

   team_id      reasons
0        1      leisure
1        2  educational
2        3   commercial
3        4      leisure
4        5      leisure

Check the Regex demo and the Python demo.

edited Jan 25 '23 at 21:15

answered Jan 25 '23 at 19:16

lemon

14,875
6
18
38

Thanks for the comment. What does ?<= do here exactly? – user13617491 Jan 25 '23 at 19:55
Checks if before the match there's the words `Reason to buy: `, but doesn't include it in the final match. That's why it is called lookaround (specifically, lookbehind). – lemon Jan 25 '23 at 19:56
so it checks for any words,digits and symbols? – user13617491 Jan 25 '23 at 19:57
You need to change `([A-Za-z]+)` to change what the regex matches. If you want to match anything other than `<`, you can use `(?<=Reason to buy: )([^<]+)`. – lemon Jan 25 '23 at 20:01

score 2 · Answer 2 · answered Jan 25 '23 at 20:34

You just need to capture the part matching any zero or more chars up to <> or end of string:

df['reasons'] = df['reasons'].str.extract(r"Reason to buy:\s*(.*?)(?=<>|$)", expand=False)

See the regex demo.

Details:

Reason to buy: - a string
\s* - zero or more whitespace
(.*?) - a capturing group with ID 1 that matches zero or more chars other than line break chars as few as possible
(?=<>|$) - a positive lookahead that requires a <> string or end of string immediately to the right of the current location.

Note that Series.str.extract requires at least one capturing group in the pattern to actually return a value, so you can even use a non-capturing group instead of the positive lookahead:

df['reasons'] = df['reasons'].str.extract(r"Reason to buy:\s*(.*?)(?:<>|$)", expand=False)

For the capturing group (.*?), do we need the '?' ? I try to capture any text/digits/symbols from the capturing group. Thanks. — user13617491, Jan 26 '23 at 02:24
@user13617491 Yes, `.*?` "*matches zero or more chars other than line break chars as few as possible*" — Wiktor Stribiżew, Jan 26 '23 at 08:18

Python Pandas Extract text between words and symbols with regex

2 Answers2