Python Web Scraping Problems

Question

I am using Python to scrape AAPL's stock price from Yahoo Finance, but the program always returns []. I would appreciate it if someone could point out why the program is not working. Here is my code:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price

The original source is like this:

<span id="yfs_l84_aapl" class>112.31</span>

Here I just want the price, 112.31. When I copy and paste the source, 'class' changes to 'class=""'. I also tried this pattern:

regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'

But it does not work either.

Why not use a proper DOM parser and the `.getElementByID('yfs_l84_aapl')`? That would be more appropriate than trying to use regex to parse HTML/XML... — David Zemens, Sep 09 '15 at 00:59
Thank you for your comment. I am just a beginner and I will definitely try your code. — Allen, Sep 09 '15 at 02:45
Cheers. Although not specific to python, [this](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg) discusses in detail why RegEx is ill-suited for the task. For a very simple case like yours, with no traversing, and a relatively known structure, regex is probably OK. But then again, the `id` attribute is a unique identifier so there's no need for RegEx or even DOM "parsing" if the elements can be uniquely identified :) — David Zemens, Sep 09 '15 at 02:58
There are also some APIs available [here](https://code.google.com/p/yahoo-finance-managed/wiki/YahooFinanceAPIs). I have not used them, but they most likely return the data in XML or JSON, both of which are widely supported in Python. Again, that is better than reading a web page's source and parsing the HTML :) Good luck !! — David Zemens, Sep 09 '15 at 02:59

score 5 · Accepted Answer · answered Sep 09 '15 at 00:51

Well, the good news is that you are getting the data. You were nearly there. I would recommend that you work out your regex problems in a tool that helps, e.g. regex101.

Anyway, here is your working regex:

regex='<span id="yfs_l84_aapl">(\d*\.\d\d)'

You are capturing only a number, so avoid the catch-all (.+?) and be specific where you can. This pattern matches any run of digits, then a literal decimal point, then exactly two more digits.
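
As a rough illustration of the difference, here is a minimal sketch that runs the pattern from the question and the pattern from this answer against a hard-coded snippet rather than the live page. The `sample_html` string is an assumption standing in for the downloaded source, and the code is written for Python 3, unlike the Python 2 code in the question.

```python
import re

# Hypothetical stand-in for the page source (not fetched from Yahoo).
sample_html = '<div><span id="yfs_l84_aapl">112.31</span></div>'

# Pattern from the question: requires class="" after the id, so it finds nothing here.
original = re.compile(r'<span id="yfs_l84_aapl" class="">(.+?)</span>')

# Pattern from this answer: digits, a literal dot, then exactly two digits.
specific = re.compile(r'<span id="yfs_l84_aapl">(\d*\.\d\d)')

original_matches = original.findall(sample_html)
specific_matches = specific.findall(sample_html)

print(original_matches)  # []
print(specific_matches)  # ['112.31']
```

Using a raw string (`r'...'`) keeps the backslashes in `\d` and `\.` from being interpreted by Python before the regex engine sees them.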

answered Sep 09 '15 at 00:51

Shawn Mehan

Thank you very much for your suggestion. I tried your code and it works well! I am a beginner to Python and there is a lot for me to learn. – Allen Sep 09 '15 at 02:48

score 2 · Answer 2 · answered Sep 09 '15 at 00:58

When I went to the Yahoo site you provided, I saw a span tag without a class attribute.

<span id="yfs_l84_aapl">112.31</span>

I am not sure what you are trying to do with "class". Without it, I get 112.31:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
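
Since the markup after the `id` seems to vary between requests (a bare `class`, `class=""`, or nothing at all), one option is to let the pattern skip whatever attributes follow the id. This is only a sketch for Python 3; the three sample strings are assumptions modelled on the variants mentioned in this thread, not fresh captures from the site.

```python
import re

# Allow any further attributes (or none) between the id and the closing '>'.
pattern = re.compile(r'<span id="yfs_l84_aapl"[^>]*>(.+?)</span>')

# Hypothetical variants of the tag, modelled on the ones reported in this thread.
variants = [
    '<span id="yfs_l84_aapl">112.31</span>',
    '<span id="yfs_l84_aapl" class>112.31</span>',
    '<span id="yfs_l84_aapl" class="">112.31</span>',
]

# One list of matches per variant.
prices = [pattern.findall(html) for html in variants]
print(prices)  # [['112.31'], ['112.31'], ['112.31']]
```

The `[^>]*` part matches any characters up to the end of the opening tag, so the pattern no longer depends on how the `class` attribute happens to be written.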

answered Sep 09 '15 at 00:58

Pyrogrammer

Yes, when I went to Yahoo this time, I got the same span tag as you did. I am not sure why I got the other span tag this afternoon. Thanks for your help! – Allen Sep 09 '15 at 02:50
No problem. Have fun with the project XD – Pyrogrammer Sep 09 '15 at 03:12

score 1 · Answer 3 · answered Sep 09 '15 at 01:33

I am using BeautifulSoup to get the text from the span tag:

import urllib
from BeautifulSoup import BeautifulSoup

response = urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
html = response.read()
soup = BeautifulSoup(html)
# find all spans whose id is 'yfs_l84_aapl'
target = soup.findAll('span', {'id': "yfs_l84_aapl"})
# target is a list; the first element is the price span
print(target[0].string)
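
If installing BeautifulSoup is not an option, the same lookup by `id` can probably be done with the standard library alone. The sketch below assumes Python 3 and uses `html.parser.HTMLParser`; the class name `SpanTextById`, its attributes, and the `sample_html` string are all hypothetical and only stand in for the real page.

```python
from html.parser import HTMLParser


class SpanTextById(HTMLParser):
    """Collect the text inside the first <span> with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.in_target = False   # True while inside the target span
        self.found_text = None   # text of the target span once seen

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a bare attribute has value None.
        if (tag == "span" and self.found_text is None
                and dict(attrs).get("id") == self.target_id):
            self.in_target = True
            self.found_text = ""

    def handle_data(self, data):
        if self.in_target:
            self.found_text += data

    def handle_endtag(self, tag):
        if tag == "span" and self.in_target:
            self.in_target = False


# Hypothetical stand-in for the page source (not fetched from Yahoo).
sample_html = '<div><span id="yfs_l84_aapl" class>112.31</span></div>'

parser = SpanTextById("yfs_l84_aapl")
parser.feed(sample_html)
print(parser.found_text)  # 112.31
```

Because the parser looks only at the `id` attribute, it does not matter whether the tag carries `class`, `class=""`, or no class at all. It does not handle spans nested inside the target span, which is enough for this simple case.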
