1

I am trying to extract certain words between some text and symbol using regex in Python Pandas. The problem is that sometimes there are non-characters after the word I want to extract, and sometimes there is nothing.

Here is the input table.

enter image description here

Here is the expected output.

enter image description here

I've tried this str.extract(r'[a-zA-Z\W]+\s+Reason to buy:([a-zA-Z\s]+)[a-zA-Z\W]+\s+') but does not work.

Any advice is appreciated.

Thanks.

2 Answers2

2

You can use a lookaround to extract your reason to buy:

(?<=Reason to buy: )([^<]+)

And use it in your Python code as follows:

import pandas as pd
import re

df = pd.DataFrame([
    [1, 'Team: Asian<>Reason to buy: leisure'             ],
    [2, 'Team: SouthAmerica<>Reason to buy: educational<>'],
    [3, 'Team: Australia<>Reason to buy: commercial'      ],
    [4, 'Team: North America<>Reason to buy: leisure<>'   ],
    [5, 'Team: Europe<>Reason to buy: leisure<>'          ],
], columns = ['team_id', 'reasons'])

pattern = r'(?<=Reason to buy: )([A-Za-z]+)'

df['reasons'] = df['reasons'].str.extract(pattern)

Output:

   team_id      reasons
0        1      leisure
1        2  educational
2        3   commercial
3        4      leisure
4        5      leisure

Check the Regex demo and the Python demo.

lemon
  • 14,875
  • 6
  • 18
  • 38
  • Thanks for the comment. What does ?<= do here exactly? – user13617491 Jan 25 '23 at 19:55
  • Checks if before the match there's the words `Reason to buy: `, but doesn't include it in the final match. That's why it is called lookaround (specifically, lookbehind). – lemon Jan 25 '23 at 19:56
  • so it checks for any words,digits and symbols? – user13617491 Jan 25 '23 at 19:57
  • You need to change `([A-Za-z]+)` to change what the regex matches. If you want to match anything other than `<`, you can use `(?<=Reason to buy: )([^<]+)`. – lemon Jan 25 '23 at 20:01
2

You just need to capture the part matching any zero or more chars up to <> or end of string:

df['reasons'] = df['reasons'].str.extract(r"Reason to buy:\s*(.*?)(?=<>|$)", expand=False)

See the regex demo.

Details:

  • Reason to buy: - a string
  • \s* - zero or more whitespace
  • (.*?) - a capturing group with ID 1 that matches zero or more chars other than line break chars as few as possible
  • (?=<>|$) - a positive lookahead that requires a <> string or end of string immediately to the right of the current location.

Note that Series.str.extract requires at least one capturing group in the pattern to actually return a value, so you can even use a non-capturing group instead of the positive lookahead:

df['reasons'] = df['reasons'].str.extract(r"Reason to buy:\s*(.*?)(?:<>|$)", expand=False)
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563