-1

Hi all: I have a string

s2 = '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, u\'\\n\', <td class="rightnobr">+7'
pat = re.compile('<a href=.+>(.+)</a>')

re.findall(pat, s2) returns only ['Ultra VIX Short-Term Futures ETF'].

Why doesn't it also capture 'UVXY'? If I use a shorter string,

s22 = '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><'

then re.findall(pat, s22) does return ['UVXY'].
alecxe
baozi
  • This is a rather strange input data format, where is it coming from? – alecxe Jul 13 '14 at 04:47
  • Does the input string have two lines? – Avinash Raj Jul 13 '14 at 04:55
  • or, instead of using regex to match HTML, which all agree is a generally awful idea, why not use a parser like [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/)? – MattDMo Jul 13 '14 at 04:58
  • @alecxe http://etfdb.com/compare/volume/ don't worry guys. problem solved – baozi Jul 13 '14 at 05:05
  • @XunBao so this is an HTML you are parsing with regex. You should not do this, there are HTML parsers out there. – alecxe Jul 13 '14 at 05:48
  • @alecxe I don't know. I used BeautifulSoup to process the web page first. – baozi Jul 13 '14 at 05:52
  • @XunBao I'm pretty sure you can extract the data without regex here. What is your desired output? – alecxe Jul 13 '14 at 06:33
  • @alecxe I am building a simple local database and collecting [S&P 500], [NDX 100] and [ETF] symbols from different websites. The way I did it was to use BeautifulSoup to clean up the page, `File = urllib2.urlopen(item)` `redditHtml = File.read()` `soup = BeautifulSoup(redditHtml)`, and then use re.findall or re.search to retrieve all the symbols. – baozi Jul 13 '14 at 06:46
  • @XunBao ok, I've posted an answer with the code that retrieves all of the data from the table. Check it out. Hope that helps. – alecxe Jul 13 '14 at 06:53

5 Answers

3

+ is a greedy quantifier, so <a href=.+> matches <a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/"> and (.+) captures only what remains before the final </a>. That is why you get only Ultra VIX Short-Term Futures ETF. You need to make the quantifiers non-greedy, like this:

pat = re.compile('<a href=.+?>(.+?)</a>')

Output

['UVXY', 'Ultra VIX Short-Term Futures ETF']

If you make only the first quantifier non-greedy, (.+) still matches everything up to the last </a>. So, if the regex is

pat = re.compile('<a href=.+?>(.+)</a>')

then the output will be

['UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF']

That is why you need to make both greedy quantifiers non-greedy, as in the first example.
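The behaviour described in this answer can be checked with a short script. This is only a sketch: the extra capture group around the first .+ and the variable names are added here for illustration and are not part of the original pattern.

```python
import re

# The string from the question; each \\n is a literal backslash followed by "n".
s2 = '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, u\'\\n\', <td class="rightnobr">+7'

# Illustration only: wrap the first .+ in a group to see what it consumes.
greedy = re.search(r'<a href=(.+)>(.+)</a>', s2)
print(greedy.group(1))  # runs through the first link and into the second <a href=...>
print(greedy.group(2))  # only the second link text is left for the capture

# The three variants discussed above, side by side.
both_greedy = re.findall(r'<a href=.+>(.+)</a>', s2)
first_lazy = re.findall(r'<a href=.+?>(.+)</a>', s2)
both_lazy = re.findall(r'<a href=.+?>(.+?)</a>', s2)

print(both_greedy)
print(first_lazy)
print(both_lazy)
```

Only the variant with both quantifiers made non-greedy yields the two link texts as separate items.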

thefourtheye
1

.+ is a greedy match: href=.+> matches up to the last > that still lets the rest of the pattern match. Use the non-greedy version, .+?, instead:

>>> import re
>>> s2 = '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, u\'\\n\', <td class="rightnobr">+7'
>>> pat = re.compile('<a href=.+?>(.+?)</a>')
>>> re.findall(pat,s2)
['UVXY', 'Ultra VIX Short-Term Futures ETF']
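A related alternative that is not part of this answer: instead of relying on non-greedy quantifiers, the pattern can forbid the delimiter characters with negated character classes. The sketch below assumes the link text never contains nested tags, since a < inside the link would stop the capture.

```python
import re

# The string from the question; each \\n is a literal backslash followed by "n".
s2 = '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, u\'\\n\', <td class="rightnobr">+7'

# [^>]+ cannot run past the end of the opening tag,
# and [^<]+ cannot run past the start of the closing tag.
pat = re.compile(r'<a href=[^>]+>([^<]+)</a>')
names = pat.findall(s2)
print(names)  # ['UVXY', 'Ultra VIX Short-Term Futures ETF']
```

Because each class excludes its own terminator, the result does not depend on greedy versus non-greedy matching.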
falsetru
1

The problem is that your match is greedy: the pattern consumes as many characters as possible. Technically, it is the quantifier + that is greedy. To get a non-greedy match, use +?:

>>> pat = re.compile('<a href=.+?>(.+?)</a>')
>>> re.findall(pat, s2)
['UVXY', 'Ultra VIX Short-Term Futures ETF']

You may also consider using a tool built for the job:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s2, 'html.parser')
>>> links = [str(x.text) for x in soup.find_all('a')]
>>> links
['UVXY', 'Ultra VIX Short-Term Futures ETF']
hwnd
0

Do not use regex to parse HTML; use a specialized tool, an HTML parser such as BeautifulSoup:

import urllib2
from bs4 import BeautifulSoup

URL = 'http://etfdb.com/compare/volume/'

soup = BeautifulSoup(urllib2.urlopen(URL))
for row in soup.select('table.msdata tr')[1:]:
    print [td.text.strip() for td in row('td')]

Prints:

[u'SPY', u'SPDR S&P 500', u'86,697,703', u'$172,868.1 M']
[u'EEM', u'iShares MSCI Emerging Markets ETF', u'46,298,734', u'$40,803.4 M']
[u'IWM', u'iShares Russell 2000 ETF', u'45,452,871', u'$25,882.6 M']
[u'QQQ', u'QQQ', u'35,422,355', u'$43,725.0 M']
...
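If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's html.parser module. This is a minimal illustration assuming Python 3; the AnchorTextParser class and its attribute names are hypothetical helpers, and it is applied to the question's string rather than the live page.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Hypothetical helper: collects the text found inside <a> tags."""

    def __init__(self):
        super().__init__()
        self._in_anchor = False
        self._buffer = []
        self.link_texts = []

    def handle_starttag(self, tag, attrs):
        # Start collecting text when an <a> tag opens.
        if tag == "a":
            self._in_anchor = True
            self._buffer = []

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_anchor:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Store the collected text when the <a> tag closes.
        if tag == "a" and self._in_anchor:
            self.link_texts.append("".join(self._buffer).strip())
            self._in_anchor = False


# The string from the question; each \\n is a literal backslash followed by "n".
s2 = '[u\'\\n\', <td><a href="/etf/UVXY/">UVXY</a></td>, u\'\\n\', <td><a href="/etf/UVXY/">Ultra VIX Short-Term Futures ETF</a></td>, u\'\\n\', <td class="rightnobr">+7'

parser = AnchorTextParser()
parser.feed(s2)
parser.close()
print(parser.link_texts)  # ['UVXY', 'Ultra VIX Short-Term Futures ETF']
```

The parser tracks tag boundaries itself, so no greedy or non-greedy tuning is needed.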
alecxe
0

I don't have enough StackOverflow juice to post a comment, so this appears as an answer. I regularly use online RE parsers to experiment and test my REs. Here is one of the better ones that also includes some good documentation: http://www.freeformatter.com/regex-tester.html

staggart