Return multiple matches of regular expression within a string in python pandas

Question

I am trying to extract every match found between ">" and "<" in a string.

The code below only returns the first match in the string.

In:    
import pandas as pd
import re
df = pd.Series(['<option value="85">APOE</option><option value="636">PICALM1<'])
reg = r'(>([A-Z])\w+<)'
df2 = df.str.extract(reg)
print(df2)

Out:
    0   1
0   >APOE<  A

I would like to return both "APOE" and "PICALM1", not just "APOE".

Thanks for your help!

Why you should not parse xml with a regex: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not. You might consider using a proper xml or html parser instead — Emilien, Nov 04 '15 at 17:29
Agreed with @Emilien, for HTML you may want to use BeautifulSoup although in some specific tasks this may be overkill. — Josep Valls, Nov 04 '15 at 17:30

score 2 · Answer 1 · edited Jul 29 '20 at 12:23


import re
import pandas as pd
df['new_col'] = df['old_col'].str.findall(r'>([A-Z][^<]+)<')

This stores all matches for each row as a list in the new_col column of the DataFrame.
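As a minimal sketch using the sample string from the question (old_col and new_col are the placeholder column names from this answer, not names from the question), str.findall returns one list per row. If one row per match is preferred, Series.str.extractall is a possible alternative, assuming a pandas version that provides it (0.18 or later).

```python
import pandas as pd

df = pd.DataFrame({
    'old_col': ['<option value="85">APOE</option><option value="636">PICALM1<']
})

# One capture group, so findall returns a list of captured strings per row.
pattern = r'>([A-Z][^<]+)<'
df['new_col'] = df['old_col'].str.findall(pattern)
print(df['new_col'].iloc[0])  # ['APOE', 'PICALM1']

# Alternative: one row per match, indexed by (original row, match number).
matches = df['old_col'].str.extractall(pattern)
print(matches[0].tolist())  # ['APOE', 'PICALM1']
```

The list form keeps the original row count, while extractall yields a longer frame that is easier to filter or join on individual matches.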

answered Mar 25 '20 at 22:04 by user2335580, edited Jul 29 '20 at 12:23 by Qaswed

score 0 · Answer 2 · answered Nov 04 '15 at 17:28


No need for pandas.

import re

text = '<option value="85">APOE</option><option value="636">PICALM1<'
reg = r'>([A-Z][^<]+)<'
print(re.findall(reg, text))
# ['APOE', 'PICALM1']

Parsing HTML with regular expressions may not be the best idea. Have you considered using BeautifulSoup?
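For comparison, a parser-based version might look like the sketch below. It uses Python 3's standard-library html.parser instead of BeautifulSoup so that it runs without extra installs. The class name OptionTextParser is a hypothetical helper, and the sample string is the one from the question with the final option tag closed so that the markup is well formed.

```python
from html.parser import HTMLParser


class OptionTextParser(HTMLParser):
    """Hypothetical helper: collects the text inside each <option> element."""

    def __init__(self):
        super().__init__()
        self.in_option = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == 'option':
            self.in_option = True

    def handle_endtag(self, tag):
        if tag == 'option':
            self.in_option = False

    def handle_data(self, data):
        # Only keep text that appears between <option> and </option>.
        if self.in_option:
            self.values.append(data)


# Sample from the question, with the last tag closed (an assumption).
html = '<option value="85">APOE</option><option value="636">PICALM1</option>'

parser = OptionTextParser()
parser.feed(html)
parser.close()
print(parser.values)  # ['APOE', 'PICALM1']
```

Unlike the regex, this does not depend on the text starting with a capital letter, so it should keep working if the option labels change.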

answered Nov 04 '15 at 17:28 by Josep Valls

Thank you for the comprehensive answer. This was easier than I made it to be. I didn't know about BeautifulSoup, but I will definitely check it out! It looks very useful. – alacoste Nov 04 '15 at 22:09
