
This is the code I am using from Christopher Reeves's tutorial on stock scraping; it's his third video on the subject on YouTube.

import urllib
import re

symbolslist = ["aapl","spy","goog","nflx"]

i=0
while i<len(symbolslist):
    url = "http://finance.yahoo.com/q?s=" +symbolslist[i] +"&q1=1"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_'+symbolslist[i] +'">(.?+)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print "The price of", symbolslist[i]," is", price
    i+=1

I get the following error when I run this code in Python 2.7.5:

Traceback (most recent call last):
  File "fundamentalism)stocks.py", line 12, in <module>
    pattern = re.compile(regex)
  File "C:\Python27\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat

I don't know if the problem is with the way my library is installed, my version of Python, or something else. I appreciate your help.

LostAvatar
Rob B.

2 Answers


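The problem is the two repeat qualifiers in a row, ? followed by +, in (.?+). The + has nothing to repeat except another qualifier, which Python 2.7's re module rejects with "multiple repeat".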
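
To see the error in isolation, here is a minimal sketch (the helper name compile_error is made up for illustration). It uses the pattern (.+*), which stacks one greedy qualifier on another and is rejected with "multiple repeat". On Python 2.7 the (.?+) from the question fails the same way, though I believe Python 3.11 and later read ?+ as a possessive quantifier and accept it, so that exact pattern is not used here.

```python
import re

def compile_error(pattern):
    """Return the re.error message for pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# A greedy qualifier applied directly to another qualifier is rejected.
stacked = compile_error(r"(.+*)")

# A '?' after '+' is not a second repeat; it only makes the '+' non-greedy.
lazy = compile_error(r"(.+?)")
```

Here stacked holds an error message mentioning "multiple repeat", while lazy is None because the pattern compiles.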

Non-greedy matching was probably intended instead, which is written (.+?). From the Python re documentation:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behavior isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
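
The difference is easy to check interactively. The sketch below first repeats the documentation's '<H1>title</H1>' example, then applies both forms to a line of markup shaped like what the question's regex expects. That markup and the price in it are made up for illustration; the real Yahoo page may look different.

```python
import re

text = "<H1>title</H1>"
greedy = re.match(r"<.*>", text).group()   # runs on to the last '>'
lazy = re.match(r"<.*?>", text).group()    # stops at the first '>'

# Hypothetical markup; the id follows the question's pattern, the values are invented.
html = '<span id="yfs_l84_aapl">488.58</span> <span id="other">n/a</span>'
greedy_price = re.findall(r'<span id="yfs_l84_aapl">(.+)</span>', html)
lazy_price = re.findall(r'<span id="yfs_l84_aapl">(.+?)</span>', html)
```

With the greedy group, the capture runs through to the last </span> on the line and drags the second span along; the non-greedy group stops at the first </span> and captures only the price text.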

alecxe
  • Thanks for the help. That makes sense. Where could I find out more about that? – Rob B. Sep 02 '13 at 19:02
  • Well, [here](http://docs.python.org/2/howto/regex.html), [here](http://www.youtube.com/watch?v=FH-ZFOZxjMs), [here](http://stackoverflow.com/questions/4273987/python-re-sub-use-non-greedy-mode-with-end-of-string-it-comes-greedy) etc :) – alecxe Sep 02 '13 at 19:16

Others have answered about the greedy match, but on an unrelated note, you'll want to write the loop more like this:

for symbol in symbolslist:
    url = "http://finance.yahoo.com/q?s=%s&q1=1" % symbol
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    # (.+?) is the non-greedy group; (.?+) raises "multiple repeat" on Python 2.7
    regex = '<span id="yfs_l84_%s">(.+?)</span>' % symbol
    price = re.findall(regex, htmltext)[0]
    print "The price of", symbol, " is", price
  • The standard Python idiom is to iterate across all the values in a list, not to pick them out by index.
  • "String interpolation" is a lot easier to manage than string concatenation, especially if you're adding several values into the mix (like maybe you want to specify the value of q1 in a later version).
  • re.findall accepts a pattern string as its first argument. Explicitly compiling a pattern and then throwing it away on the next iteration doesn't get you anything.
  • re.findall returns a list, and you only want the first element from it. Note that indexing with [0] raises IndexError if nothing matched.
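
As a rough sketch of the last two points, the matching step can be pulled into a small function that is testable without hitting the network. The name extract_price and the sample markup are made up for illustration, and the function returns None instead of raising when the span is missing. It avoids the print statement, so it should run on Python 3 as well as 2.7.

```python
import re

def extract_price(symbol, htmltext):
    """Return the first captured price for symbol, or None when nothing matches."""
    # re.escape guards against symbols that contain regex metacharacters, such as "brk.b".
    regex = r'<span id="yfs_l84_%s">(.+?)</span>' % re.escape(symbol)
    matches = re.findall(regex, htmltext)  # a plain pattern string is enough here
    return matches[0] if matches else None

# Hypothetical page fragment; the price is invented.
sample = '<span id="yfs_l84_goog">846.90</span>'
found = extract_price("goog", sample)
missing = extract_price("nflx", sample)
```

In the loop, htmltext from urllib.urlopen(url).read() would be passed in place of sample, and a None result signals that the page layout no longer matches the pattern.
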
Kirk Strauser