I am using Python to scrape AAPL's stock price from Yahoo Finance, but the program always returns []. I would appreciate it if someone could point out why the program is not working. Here is my code:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price

The original source looks like this:

<span id="yfs_l84_aapl" class>112.31</span>

Here I just want the price, 112.31. When I copy and paste the element, 'class' changes to 'class=""'. I also tried this code:

regex='<span id=\"yfs_l84_aapl\" class="">(.+?)</span>'

But it does not work either.

MattDMo
Allen
  • Why not use a proper DOM parser and the `.getElementByID('yfs_l84_aapl')`? That would be more appropriate than trying to use regex to parse HTML/XML... – David Zemens Sep 09 '15 at 00:59
  • Thank you for your comment. I am just a beginner and I will definitely try your code. – Allen Sep 09 '15 at 02:45
  • Cheers. Although not specific to python, [this](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg) discusses in detail why RegEx is ill-suited for the task. For a very simple case like yours, with no traversing, and a relatively known structure, regex is probably OK. But then again, the `id` attribute is a unique identifier so there's no need for RegEx or even DOM "parsing" if the elements can be uniquely identified :) – David Zemens Sep 09 '15 at 02:58
  • There are also some API available [here](https://code.google.com/p/yahoo-finance-managed/wiki/YahooFinanceAPIs) which I have not used, but which most likely return the data in XML or JSON format which are widely supported by python. Again, it's better than trying to read a web page source and parse the HTML :) Good luck !! – David Zemens Sep 09 '15 at 02:59

3 Answers

Well, the good news is that you are getting the data. You were nearly there. I would recommend that you work out your regex problems in a tool that helps, e.g. regex101.

Anyway, here is your working regex:

regex='<span id="yfs_l84_aapl">(\d*\.\d\d)'

You are collecting only digits, so don't use a general catch-all; be specific where you can. This pattern matches zero or more digits (\d*), a literal decimal point (\.), and exactly two more digits (\d\d).
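A minimal sketch of how this pattern behaves, run against a hard-coded sample string instead of the live page. The `sample_html` and `price_pattern` names are only for illustration, and the sample assumes the span is served without a class attribute:

```python
import re

# Stand-in for the downloaded page (assumption: the span has no class attribute).
sample_html = '<div><span id="yfs_l84_aapl">112.31</span></div>'

# Digits, a literal dot, then exactly two digits, right after the opening tag.
price_pattern = re.compile(r'<span id="yfs_l84_aapl">(\d*\.\d\d)')

prices = price_pattern.findall(sample_html)
print(prices)  # ['112.31']
```

Note that findall returns a list of strings, so the price would still need float(prices[0]) before any arithmetic.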

Shawn Mehan
  • Thank you very much for your suggestion. I tried your code and it works well! I am a beginner to Python and there is a lot for me to learn. – Allen Sep 09 '15 at 02:48

When I went to the Yahoo site you provided, I saw a span tag without a class attribute.

<span id="yfs_l84_aapl">112.31</span>

Not sure what you are trying to do with "class". Without it, I get 112.31:

import urllib
import re
htmlfile=urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
htmltext=htmlfile.read()
regex='<span id=\"yfs_l84_aapl\">(.+?)</span>'
pattern=re.compile(regex)
price=re.findall(pattern,htmltext)
print price
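To see why the extra attribute matters, here is a small offline comparison of the two patterns. It is only a sketch: `raw_html` is a hard-coded stand-in for the page source (assumed to contain the span without a class attribute), and the variable names are made up for illustration.

```python
import re

# Assumed raw page source: the span carries only an id attribute.
raw_html = '<span id="yfs_l84_aapl">112.31</span>'

# Pattern from the question: requires a literal class="" in the tag.
with_class = re.compile(r'<span id="yfs_l84_aapl" class="">(.+?)</span>')
# Pattern without the class attribute.
without_class = re.compile(r'<span id="yfs_l84_aapl">(.+?)</span>')

no_match = with_class.findall(raw_html)
match = without_class.findall(raw_html)

print(no_match)  # []
print(match)     # ['112.31']
```

A regex has to match the text character for character, so any attribute that is not in the downloaded source makes findall return an empty list.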
Pyrogrammer

I am using BeautifulSoup to get the text from the span tag:

import urllib
from BeautifulSoup import BeautifulSoup

response =urllib.urlopen("https://ca.finance.yahoo.com/q?s=AAPL&ql=0")
html = response.read()
soup = BeautifulSoup(html)
# find all the spans that have id = 'yfs_l84_aapl'
target = soup.findAll('span',{'id':"yfs_l84_aapl"})
# target is a list; the first element is the span we want
print(target[0].string)
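If installing BeautifulSoup is not an option, the same lookup by id can be sketched with the standard library alone. This is a different technique from the one above, not a drop-in replacement: it assumes Python 3 (`html.parser`), the `SpanTextById` class is a hypothetical helper written for this example, and it runs on a hard-coded sample string rather than the live page.

```python
from html.parser import HTMLParser


class SpanTextById(HTMLParser):
    """Hypothetical helper: collects the text of the first <span> with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False   # True while we are inside the matching span
        self.text = None      # text of the first matching span, once found

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; valueless attributes have value None.
        if tag == "span" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_data(self, data):
        if self.inside and self.text is None:
            self.text = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False


# Sample markup; the bare "class" attribute does not affect the lookup.
sample_html = '<div><span id="yfs_l84_aapl" class>112.31</span></div>'

parser = SpanTextById("yfs_l84_aapl")
parser.feed(sample_html)
print(parser.text)  # 112.31
```

Because the parser matches on the id attribute rather than on the exact tag text, it gives the same result whether or not the span carries a class attribute.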
galaxyan