
I am trying to extract every match that sits between ">" and "<" in a string.

The code below only returns the first match in the string.

In:    
import pandas as pd
import re
df = pd.Series(['<option value="85">APOE</option><option value="636">PICALM1<'])
reg = r'(>([A-Z])\w+<)'
df2 = df.str.extract(reg)
print(df2)

Out:
    0   1
0   >APOE<  A

I would like to return both "APOE" and "PICALM1", not just "APOE".

Thanks for your help!

alacoste

2 Answers

import re
import pandas as pd
df['new_col'] =  df['old_col'].str.findall(r'>([A-Z][^<]+)<')

This stores all matches for each row as a list in the new_col column of the DataFrame.
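For reference, here is a minimal sketch of that approach applied to the Series from the question. The variable names (s, pattern, as_lists, as_rows, matches) are just illustrative. It also shows str.extractall, which is an alternative I'm adding here rather than part of the answer above; it returns one row per match instead of a list per row.

```python
import pandas as pd

# The sample data from the question.
s = pd.Series(['<option value="85">APOE</option><option value="636">PICALM1<'])

# One capture group: an uppercase letter followed by anything up to the next "<".
pattern = r'>([A-Z][^<]+)<'

# str.findall returns, for each row, a list of every captured match.
as_lists = s.str.findall(pattern)
print(as_lists[0])

# str.extractall returns a DataFrame with one row per match,
# indexed by (original row label, match number).
as_rows = s.str.extractall(pattern)
matches = as_rows[0].tolist()
print(matches)
```

Which one is more convenient probably depends on whether you want to keep one row per input string (findall) or flatten everything into one row per match (extractall).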

Qaswed
user2335580

No need for pandas.

import re
df = '<option value="85">APOE</option><option value="636">PICALM1<'
reg = r'>([A-Z][^<]+)<'
print(re.findall(reg, df))
['APOE', 'PICALM1']

Parsing HTML with regular expressions may not be the best idea. Have you considered using BeautifulSoup?
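To give a rough idea of what parser-based extraction looks like, here is a sketch that uses the standard library's html.parser rather than BeautifulSoup (which is a third-party package), so it runs without installing anything. The class name OptionTextParser and its labels attribute are made up for this example, and I have assumed a well-formed input where the last option tag is closed, unlike the truncated string in the question.

```python
from html.parser import HTMLParser


class OptionTextParser(HTMLParser):
    """Hypothetical helper: collects the text inside each <option> element."""

    def __init__(self):
        super().__init__()
        self.in_option = False
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self.in_option = True
            self.labels.append("")  # start a new label for this option

    def handle_endtag(self, tag):
        if tag == "option":
            self.in_option = False

    def handle_data(self, data):
        # Only keep text that appears inside an <option> element.
        if self.in_option:
            self.labels[-1] += data


# Assumed well-formed variant of the string from the question.
html = '<option value="85">APOE</option><option value="636">PICALM1</option>'

parser = OptionTextParser()
parser.feed(html)
parser.close()
print(parser.labels)
```

The point is that the parser keys on the tag structure rather than on the characters around the text, so it does not depend on the labels starting with an uppercase letter.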

Josep Valls
  • Thank you for the comprehensive answer. This was easier than I made it to be. I didn't know about BeautifulSoup, but I will definitely check it out! It looks very useful. – alacoste Nov 04 '15 at 22:09