How to find multiple substrings between <> in one column in pandas data frame + python

Question

I am using Pandas and Python. My data is:

import pandas as pd

a=pd.DataFrame({'ID':[1,2,3,4,5],
                'Str':['aa <aafae><afre> ht4',
                       'v fef <><433>',
                       '<1234334> <a>',
                       '<bijf> 04<9tu0>q4g <vie>',
                       'aaa 1']})

I want to extract all the substrings between < and > and join them with a space. For the data above, the result should be:

aafae afre
  433
1234334 a
bijf 9tu0 vie
nan

So all the substrings between < and > are extracted, and the result is nan if there are none. I have already tried the re library and the str functions, but I am really new to regex. Could anyone help me out here?

score 3 · Accepted Answer · answered Aug 09 '19 at 05:13

Use pandas.Series.str.findall:

a['Str'].str.findall('<(.*?)>').apply(' '.join)

Output:

0       aafae afre
1              433
2        1234334 a
3    bijf 9tu0 vie
4                 
Name: Str, dtype: object
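
Note that the last row above is an empty string rather than nan. The following is a minimal sketch, not part of the original answer, of one way to get nan for rows without any <...> pair. The column name `extracted` and the variable `matches` are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                  'Str': ['aa <aafae><afre> ht4',
                          'v fef <><433>',
                          '<1234334> <a>',
                          '<bijf> 04<9tu0>q4g <vie>',
                          'aaa 1']})

# findall returns one list per row; an empty list means no <...> was found.
matches = a['Str'].str.findall(r'<(.*?)>')

# Join each list with spaces, but map an empty list to NaN instead of ''.
# 'extracted' is a hypothetical column name chosen for this sketch.
a['extracted'] = matches.apply(lambda m: ' '.join(m) if m else np.nan)

print(a)
```

Row 1 contains an empty <> pair, so its joined value starts with a space (the empty match followed by 433), which is in line with the expected output in the question.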

Chris

Thanks a lot. Could you please also explain why we have to put `?` after `.*`? I left it out at first, and then the match ran to the last > instead of the immediately next one. – Feng Chen Aug 09 '19 at 05:20
@FengChen `?` makes the quantifier _non-greedy_: it matches as little as possible and stops at the first closing >, rather than running on to the last one. Perhaps https://stackoverflow.com/questions/2824302/how-to-make-regular-expression-into-non-greedy explains it better. – Chris Aug 09 '19 at 05:27
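
To make the greedy versus non-greedy difference from the comments above concrete, here is a small sketch using only the standard `re` module on the first sample string. The variable names are illustrative, and the negated character class `[^>]*` is an alternative added here for comparison, not something the answer used.

```python
import re

s = 'aa <aafae><afre> ht4'

# Greedy: .* grabs as much as it can, so the match runs to the LAST '>'.
greedy = re.findall(r'<(.*)>', s)

# Non-greedy: .*? grabs as little as it can, so each match stops at the NEXT '>'.
non_greedy = re.findall(r'<(.*?)>', s)

# Alternative: a negated character class can never cross a '>' at all.
negated = re.findall(r'<([^>]*)>', s)

print(greedy)      # ['aafae><afre']
print(non_greedy)  # ['aafae', 'afre']
print(negated)     # ['aafae', 'afre']
```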

score 1 · Answer 2 · answered Aug 09 '19 at 05:17

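Maybe this expression will also work, at least to some extent. It replaces each match with the captured group followed by a space, so the results carry trailing whitespace: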

import pandas as pd

a=pd.DataFrame({'ID':[1,2,3,4,5],
                'Str':['aa <aafae><afre> ht4',
                       'v fef <><433>',
                       '<1234334> <a>',
                       '<bijf> 04<9tu0>q4g <vie>',
                       'aaa 1']})

a["new_str"]=a["Str"].str.replace(r'.*?<([^>]+)>|(?:.+)', r'\1 ',regex=True)

print(a)
