python regular expression did not catch some field

Question

Hi all: I have a string

s2 = '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, u\'\\n\', <td class="rightnobr">+7'
pat = re.compile('<a href=.+>(.+)</a>')
re.findall(pat, s2) only returns ['Ultra VIX Short-Term Futures ETF'].

Why doesn't it catch the field ['UVXY']? If I shorten the string:

s22 ='[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><'
then re.findall(pat, s22) does return ['UVXY'].

This is a rather strange input data format, where is it coming from? — alecxe, Jul 13 '14 at 04:47
or, instead of using regex to match HTML, which all agree is a generally awful idea, why not use a parser like [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/)? — MattDMo, Jul 13 '14 at 04:58
@alecxe http://etfdb.com/compare/volume/ don't worry guys. problem solved — baozi, Jul 13 '14 at 05:05
@XunBao so this is an HTML you are parsing with regex. You should not do this, there are HTML parsers out there. — alecxe, Jul 13 '14 at 05:48
@alecxe i don't know. i used Beautifulsoup to deal with webpage first.. — baozi, Jul 13 '14 at 05:52
@XunBao I'm pretty sure you can extract the data without regex here. What is your desired output? — alecxe, Jul 13 '14 at 06:33
@alecxe i am building a simple local database, and collect [s&500][ndx100][etfs] symbols from different webiste. the way i did it is using beautifulsoup to clean up the website `File = urllib2.urlopen(item)` `redditHtml = File.read()` `soup = BeautifulSoup(redditHtml)`, and then use re.findall or re.search to retrive all the symbols.. — baozi, Jul 13 '14 at 06:46
@XunBao ok, I've posted an answer with the code that retrieves all of the data from the table. Check it out. Hope that helps. — alecxe, Jul 13 '14 at 06:53

score 3 · Accepted Answer · answered Jul 13 '14 at 04:46

+ is a greedy quantifier, so <a href=.+> matches <a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/"> and only the remainder is captured by (.+). That is why you get only Ultra VIX Short-Term Futures ETF. You need to make the quantifiers non-greedy, like this:

pat = re.compile('<a href=.+?>(.+?)</a>')

Output

['UVXY', 'Ultra VIX Short-Term Futures ETF']
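
To see what the greedy first `.+` actually swallowed in the original pattern, one option is to wrap it in an extra capture group. This is only a sketch: the extra group and the names `greedy`, `swallowed`, and `text` are added for illustration and are not part of the question's code.

```python
import re

# The string from the question, split across lines for readability.
s2 = (
    '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, '
    'u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, '
    'u\'\\n\', <td class="rightnobr">+7'
)

# Original pattern, with one extra group around the first `.+`
# so that the text it consumed becomes visible.
greedy = re.compile(r'<a href=(.+)>(.+)</a>')

m = greedy.search(s2)
swallowed, text = m.groups()

print(swallowed)  # everything the first greedy `.+` consumed, including the first link
print(text)       # what was left over for the second group
```

The first group runs past the first link and stops only at the last `>` that still leaves a `</a>` to match, which is why `UVXY` never reaches the second group.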

If you make only the first quantifier non-greedy, (.+) will still match everything up to the last </a>. So, if the regex is

pat = re.compile('<a href=.+?>(.+)</a>')

then the output will be

['UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF']

That is why you need to make both greedy quantifiers non-greedy, as in the first example.
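
The combinations discussed above can be compared side by side. The following is a minimal sketch; the dictionary labels and the `results` name are made up for this example, and the "greedy, lazy" variant is included only for completeness.

```python
import re

# The string from the question, split across lines for readability.
s2 = (
    '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, '
    'u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, '
    'u\'\\n\', <td class="rightnobr">+7'
)

# Each label describes the first and second quantifier, in that order.
patterns = {
    'greedy, greedy': r'<a href=.+>(.+)</a>',
    'lazy, greedy': r'<a href=.+?>(.+)</a>',
    'greedy, lazy': r'<a href=.+>(.+?)</a>',
    'lazy, lazy': r'<a href=.+?>(.+?)</a>',
}

results = {label: re.findall(pattern, s2) for label, pattern in patterns.items()}

for label, found in results.items():
    print(label, found)
```

Only the variant with both quantifiers non-greedy returns the two link texts separately.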

score 1 · Answer 2 · answered Jul 13 '14 at 04:46

.+ is a greedy match: href=.+> matches up to the last > that still lets the rest of the pattern match. Use the non-greedy version, .+?, instead.

>>> import re
>>> s2 = '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, u\'\\n\', <td class="rightnobr">+7'
>>> pat = re.compile('<a href=.+?>(.+?)</a>')
>>> re.findall(pat,s2)
['UVXY', 'Ultra VIX Short-Term Futures ETF']
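
A related technique, not used in any of the answers here, is to replace `.` with negated character classes so that neither part can run past a tag boundary. This is a sketch under the assumption that the link text contains no nested tags; the name `pat_negated` is hypothetical.

```python
import re

# The string from the question, split across lines for readability.
s2 = (
    '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, '
    'u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, '
    'u\'\\n\', <td class="rightnobr">+7'
)

# [^>]+ cannot cross the end of the opening tag,
# and [^<]+ cannot cross the start of the closing tag.
pat_negated = re.compile(r'<a href=[^>]+>([^<]+)</a>')

found = pat_negated.findall(s2)
print(found)
```

The quantifiers here are still greedy, but the character classes limit how far they can extend. A link whose text contains markup such as `<b>` would not match this pattern.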

answered Jul 13 '14 at 04:46

falsetru

http://stackoverflow.com/questions/766372/python-non-greedy-regexes Related Question – Casey Falk Jul 13 '14 at 04:46

hwnd · Answer 3 · 2014-07-13T05:16:34.017

The problem is that your match is greedy: the pattern consumes as many characters as possible. Technically speaking, it is the quantifier + that is greedy. To get a non-greedy match, use +? instead.

>>> pat = re.compile('<a href=.+?>(.+?)</a>')
>>> re.findall(pat, s2)
['UVXY', 'Ultra VIX Short-Term Futures ETF']

You may consider using a dedicated tool for the job as well.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s2, 'html.parser')
>>> links = [str(x.text) for x in soup.find_all('a')]
>>> links
['UVXY', 'Ultra VIX Short-Term Futures ETF']
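
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser` module instead. Note the assumptions: this is Python 3, and `LinkTextCollector` is a hypothetical helper class written for this example, not something from the answers or from any library.

```python
from html.parser import HTMLParser

# The string from the question, split across lines for readability.
s2 = (
    '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, '
    'u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, '
    'u\'\\n\', <td class="rightnobr">+7'
)


class LinkTextCollector(HTMLParser):
    """Collect the text found inside each <a>...</a> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True while the parser is inside an <a> element
        self.links = []       # one string per <a> element seen so far

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_link = True
            self.links.append('')

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current link.
        if self.in_link:
            self.links[-1] += data


collector = LinkTextCollector()
collector.feed(s2)
collector.close()
print(collector.links)
```

This does less than BeautifulSoup (no selectors, no tree), but it extracts the link text without relying on regex quantifier behaviour.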

score 0 · Answer 4 · edited May 23 '17 at 12:33

Do not use regex to parse HTML; use a specialized tool, an HTML parser such as BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup

URL = 'http://etfdb.com/compare/volume/'

soup = BeautifulSoup(urllib2.urlopen(URL))
for row in soup.select('table.msdata tr')[1:]:
    print [td.text.strip() for td in row('td')]

Prints:

[u'SPY', u'SPDR S&P 500', u'86,697,703', u'$172,868.1 M']
[u'EEM', u'iShares MSCI Emerging Markets ETF', u'46,298,734', u'$40,803.4 M']
[u'IWM', u'iShares Russell 2000 ETF', u'45,452,871', u'$25,882.6 M']
[u'QQQ', u'QQQ', u'35,422,355', u'$43,725.0 M']
...

staggart · Answer 5 · 2014-07-14T03:32:19.133

I don't have enough StackOverflow juice to post a comment, so this appears as an answer. I regularly use online RE parsers to experiment and test my REs. Here is one of the better ones that also includes some good documentation: http://www.freeformatter.com/regex-tester.html

edited Jul 14 '14 at 03:32

answered Jul 14 '14 at 03:23

staggart
