Regex not returning a match, but there is clearly a match

Question

I am searching an HTML page for a string ("s") that has this format:

<td class="number">$0.48</td>

I am trying to return the "$0.48" by using a regex. It was working until today and I have no idea what changed, but here is my snippet of code:

def scrubdividata(ticker):
    sleep(1.0) # Time in seconds.
    f = urllib2.urlopen('the url')
    lines = f.readlines()
    for i in range(0,len(lines)):
        line = lines[i]
        if "Annual Dividend:" in line:
            print 'for ticker %s, annual dividend is in line'%(ticker)
            s = str(lines[i+1])
            print s
            start = '>$'
            end = '</td>'
            AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)

Here is the result:

for ticker A, annual dividend is in line

    <td class="number">$0.48</td>

Traceback (most recent call last):
  File "test.py", line 115, in <module>
    scrubdividata(ticker)
  File "test.py", line 34, in scrubdividata
    LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
AttributeError: 'NoneType' object has no attribute 'group'

I am using Python 2.5 (I believe). I have heard that you should never use regex with HTML, but I needed to get the job done quickly with my limited knowledge, and regex is the only way I know. Am I now suffering the consequences, or is some other issue causing this? Any insights would be wonderful!

Thanks, B

take a look at this post: http://stackoverflow.com/questions/8692/how-to-use-xpath-in-python — Casimir et Hippolyte, Jul 19 '13 at 18:45
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — SethMMorton, Jul 19 '13 at 19:56

score 3 · Accepted Answer · answered Jul 19 '13 at 18:48

From the docs:

"$" Matches the end of the string or just before the newline at the end of the string.

So you probably want to escape the dollar sign on this line like so:

start = r'>\$'
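A minimal sketch of the difference, using the sample line from the question. The variable names are illustrative, and `re.escape` is shown as an alternative to escaping by hand, not as something the original code used:

```python
import re

# Sample line as printed in the question (with its trailing newline).
s = '    <td class="number">$0.48</td>\n'
end = '</td>'

# Unescaped: "$" is an end-of-string anchor, so nothing can follow ">$".
unescaped = re.search('%s(.*)%s' % ('>$', end), s)

# Escaped by hand; the raw string keeps the backslash literal.
escaped = re.search('%s(.*)%s' % (r'>\$', end), s)

# Escaped automatically, which is handy when the prefix is not a fixed literal.
auto = re.search('%s(.*)%s' % (re.escape('>$'), end), s)

print(unescaped)         # None
print(escaped.group(1))  # 0.48
print(auto.group(1))     # 0.48
```

Only the prefix needs escaping here, because `</td>` contains no regex metacharacters.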

If you're considering doing more searching through HTML in the future, I would suggest you take a look at the Beautiful Soup module. It's a bit more forgiving than regex.
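As a rough illustration of the parser-based approach, here is a sketch that uses the standard library's `HTMLParser` rather than Beautiful Soup, so it runs without installing anything. The class name `NumberCellParser` and the surrounding `<tr>` markup are assumptions made for the example; only the `<td class="number">` cell comes from the question:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2


class NumberCellParser(HTMLParser):
    """Collects the text of every <td class="number"> cell (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == 'td' and dict(attrs).get('class') == 'number':
            self.in_cell = True
            self.values.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        # Accumulate text only while inside a matching cell.
        if self.in_cell:
            self.values[-1] += data


# Assumed markup: a label cell followed by the value cell.
page = '<tr><td>Annual Dividend:</td>\n<td class="number">$0.48</td></tr>'
parser = NumberCellParser()
parser.feed(page)
print(parser.values)  # ['$0.48']
```

Because the parser works on tags rather than on characters, a literal `$` in the cell text needs no special handling.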

score 1 · Answer 2 · jh314 · 2013-07-19T19:03:29.743

You need to escape the dollar sign.

start = r'>\$'
end = '</td>'
AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)

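The reason is that $ is a special character in regex: it matches the end of the string, or just before the newline at the end of the string. An unescaped $ in the middle of a pattern therefore can never be followed by more text on the same line, so the search returns None.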

This will set AnnualDiv to the string '0.48'. If you want to add the $, you can do it using this:

AnnualDiv = "$%s" % re.search('%s(.*)%s' % (start, end), s).group(1)
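The traceback in the question comes from calling `.group(1)` on `None`, so it may be worth checking the match before using it. A small sketch, assuming a hypothetical helper named `extract_dividend` and a non-greedy `.*?` (a change from the original pattern) so the match stops at the first `</td>`:

```python
import re

# Escaped "$" plus a non-greedy capture that stops at the first closing tag.
DIVIDEND_RE = re.compile(r'>\$(.*?)</td>')


def extract_dividend(line):
    """Return the amount with its dollar sign, or None if the line has no match."""
    match = DIVIDEND_RE.search(line)
    if match is None:
        return None
    return '$' + match.group(1)


print(extract_dividend('    <td class="number">$0.48</td>\n'))  # $0.48
print(extract_dividend('    <td class="number">N/A</td>\n'))    # None
```

Returning `None` lets the caller decide how to handle a page whose layout has changed, instead of failing with an `AttributeError`.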

edited Jul 19 '13 at 19:03

answered Jul 19 '13 at 18:43

