I am using Pandas and Python. My data is:

import pandas as pd

a = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                  'Str': ['aa <aafae><afre> ht4',
                          'v fef <><433>',
                          '<1234334> <a>',
                          '<bijf> 04<9tu0>q4g <vie>',
                          'aaa 1']})

I want to extract all the substrings between < and > and join them with a space. For the data above, the result should be:

aafae afre
  433
1234334 a
bijf 9tu0 vie
nan

So all the substrings between < and > are extracted, and the result is nan if a row has none. I have already tried the re library and the str functions, but I am really new to regex. Could anyone help me out here?

Feng Chen

2 Answers

Use pandas.Series.str.findall:

a['Str'].str.findall('<(.*?)>').apply(' '.join)

Output:

0       aafae afre
1              433
2        1234334 a
3    bijf 9tu0 vie
4                 
Name: Str, dtype: object
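Note that the last row in the output above is an empty string rather than nan, because findall returns an empty list there and joining an empty list gives ''. If nan is needed for rows without any < > pair, one possible variation is sketched below. It is not part of the original answer; the column name `extracted` is made up for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                  'Str': ['aa <aafae><afre> ht4',
                          'v fef <><433>',
                          '<1234334> <a>',
                          '<bijf> 04<9tu0>q4g <vie>',
                          'aaa 1']})

# One list of captured substrings per row; rows without brackets give [].
matches = a['Str'].str.findall(r'<(.*?)>')

# Join the pieces with a space, but use NaN where nothing was found.
# 'extracted' is a hypothetical column name chosen for this sketch.
a['extracted'] = matches.apply(lambda m: ' '.join(m) if len(m) else np.nan)

print(a['extracted'])
```

The empty pair `<>` in the second row still contributes an empty string, so that row becomes ' 433' with a leading space, which is close to what the question asks for.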
Chris
  • Thanks a lot. Could you please also explain why we have to put ? after .*? When I left it out, the match ran to the last > instead of the immediately next one. – Feng Chen Aug 09 '19 at 05:20
  • @FengChen `?` makes the quantifier _non-greedy_: it stops at the first point where the match can succeed, rather than going on to the last possible one. Perhaps https://stackoverflow.com/questions/2824302/how-to-make-regular-expression-into-non-greedy will explain it better. – Chris Aug 09 '19 at 05:27
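The difference discussed in the comments can be seen with the plain re module on the first string from the question. This is only an illustrative sketch; the negated character class `[^>]*` is an extra alternative that the answer does not mention.

```python
import re

s = 'aa <aafae><afre> ht4'

# Greedy: .* takes as much as it can, so the match runs to the last '>'.
greedy = re.findall(r'<(.*)>', s)

# Non-greedy: .*? stops at the first '>' that lets the match succeed.
lazy = re.findall(r'<(.*?)>', s)

# A negated character class cannot cross a '>', so no lazy quantifier is needed.
negated = re.findall(r'<([^>]*)>', s)

print(greedy)   # ['aafae><afre']
print(lazy)     # ['aafae', 'afre']
print(negated)  # ['aafae', 'afre']
```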
This expression might also work, at least to some extent:

import pandas as pd

a=pd.DataFrame({'ID':[1,2,3,4,5],
                'Str':['aa <aafae><afre> ht4',
                       'v fef <><433>',
                       '<1234334> <a>',
                       '<bijf> 04<9tu0>q4g <vie>',
                       'aaa 1']})

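# Each non-empty <...> capture is kept, followed by a space; leftover text with no
# brackets after it matches the second alternative and becomes a single space.
a["new_str"] = a["Str"].str.replace(r'.*?<([^>]+)>|(?:.+)', r'\1 ', regex=True)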

print(a)
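The replace-based result keeps padding spaces, and a row without any brackets becomes a single space rather than nan. A possible cleanup step, sketched here as an assumption rather than as part of the answer, is to strip the result and map empty strings to nan. The column name `new_str_clean` is hypothetical.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'Str': ['aa <aafae><afre> ht4',
                          'v fef <><433>',
                          '<1234334> <a>',
                          '<bijf> 04<9tu0>q4g <vie>',
                          'aaa 1']})

# Same pattern and replacement as in the answer.
replaced = a['Str'].str.replace(r'.*?<([^>]+)>|(?:.+)', r'\1 ', regex=True)

# Trim the padding spaces, then turn rows that ended up empty into NaN.
# 'new_str_clean' is a hypothetical column name chosen for this sketch.
a['new_str_clean'] = replaced.str.strip().replace('', np.nan)

print(a['new_str_clean'])
```

Because `[^>]+` needs at least one character, the empty pair `<>` in the second row is skipped entirely, so that row becomes '433' with no leading space, unlike the findall approach.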
Emma