0

I am using Python 3 to scrape a website and print a value. Here is the code:

import urllib.request
import re

url = "http://in.finance.yahoo.com/q?s=spy"  
hfile = urllib.request.urlopen(url)
htext = hfile.read().decode('utf-8')
regex = '<span id="yfs_l84_SPY">(.+?)</span>'
code = re.compile(regex)
price = re.findall(code,htext)
print (price)

When I run this snippet, it prints an empty list, i.e. [], but I am expecting a value such as 483.33.

What am I getting wrong?

Shivamshaz
  • 4
    Please, _please_ don't use regex for parsing HTML. Use (`*gasp!*`) an HTML parser. – Matt Ball Oct 28 '13 at 19:32
  • Matt, why can't we use regex ? whats the issue – Shivamshaz Oct 28 '13 at 19:34
  • 2
    its not that you CANT use it, its just that there are WAAAAY better premade, built-in tools already for parsing this type of thing. Check [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?lq=1) post out – TehTris Oct 28 '13 at 19:35
  • 1
    @user2762289 in the source code of the webpage you're trying to scrape, "spy" is in lowercase while you're using uppercase; you need to match case-insensitively or make everything lowercase. – HamZa Oct 28 '13 at 19:36
  • @HamZa Case Sensitive is not an issue, because the webpage automatically converts it into lowercase. Output is not changed even when i use lowercase SPY. – Shivamshaz Oct 28 '13 at 19:43
  • @user2762289 lolwut, please don't tell me that `spy === SPY`. Also why do you think you could set the `i` modifier ? I'm talking about `<span id="yfs_l84_spy">(.+?)</span>`, you're using `<span id="yfs_l84_SPY">(.+?)</span>` – HamZa Oct 28 '13 at 19:44
  • 2
    You know that there is a Yahoo Finance API? – fanti Oct 28 '13 at 19:46
  • 1
    @HamZa. You are absolutely correct !!! This issue is solved by changing SPY to lowercase. Thank you So much for your support. – Shivamshaz Oct 28 '13 at 19:49
  • actually you **can't** parse arbitrary HTML/XML with regular expression because it isn't a regular language. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags –  Oct 28 '13 at 19:55
  • Parsing html with regexes can only result in a downvote. –  Oct 28 '13 at 19:59
  • @JarrodRoberson regular expressions are no longer "regular" in some flavors. See the [power of modern regex](http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html) or see [this awesome answer](http://stackoverflow.com/a/4234491/). – HamZa Oct 28 '13 at 20:04
  • @HamZa I qualified it with ***arbitrary***, and none of the flavors of regex I know of support that. –  Oct 28 '13 at 20:38

2 Answers

2

I'd recommend not using regex to parse HTML, because HTML is not a regular language. Yes, it can work here for a single fixed tag, but it's not a good habit to get into.

The main issue is most likely that the real id of the span on that page is yfs_l84_spy, in lowercase, while your pattern uses yfs_l84_SPY. Regex matching is case-sensitive by default, so the pattern finds nothing.
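To see the case mismatch in isolation, here is a minimal offline sketch. The `sample_html` string is a made-up stand-in for the downloaded page (the real markup may differ), so treat it as an illustration rather than a reproduction of the site:

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = '<div><span id="yfs_l84_spy">176.12</span></div>'

upper_pattern = '<span id="yfs_l84_SPY">(.+?)</span>'
lower_pattern = '<span id="yfs_l84_spy">(.+?)</span>'

# The id in the pattern is uppercase, the id in the text is lowercase.
no_match = re.findall(upper_pattern, sample_html)
# Same case on both sides.
exact_match = re.findall(lower_pattern, sample_html)
# re.IGNORECASE lets the uppercase pattern match the lowercase id.
ignore_case = re.findall(upper_pattern, sample_html, flags=re.IGNORECASE)

print(no_match)     # []
print(exact_match)  # ['176.12']
print(ignore_case)  # ['176.12']
```

Either lowercasing the id in the pattern or passing `re.IGNORECASE` gets a match on this sample.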

That said, here is a quick implementation in BeautifulSoup.

import urllib.request
from bs4 import BeautifulSoup

url = "http://in.finance.yahoo.com/q?s=spy"  
hfile = urllib.request.urlopen(url)
htext = hfile.read().decode('utf-8')
soup = BeautifulSoup(htext, "html.parser")  # name the parser explicitly to avoid bs4's "no parser specified" warning
soup.find('span',id="yfs_l84_spy")
Out[18]: <span id="yfs_l84_spy">176.12</span>

And to get at that number:

found_tag = soup.find('span', id="yfs_l84_spy")  # found_tag is a bs4 Tag object
found_tag.string  # the text of the tag's only child
Out[36]: '176.12'
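If installing BeautifulSoup is not an option, roughly the same lookup can be sketched with the standard library's `html.parser`. The class name `SpanTextParser` and the `sample_html` string are illustrative assumptions, not part of the original code, and the sketch does not handle spans nested inside the target span:

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text inside <span> tags that carry a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside_target = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and dict(attrs).get("id") == self.target_id:
            self.inside_target = True

    def handle_endtag(self, tag):
        # Any closing </span> ends collection (nested spans are not handled)
        if tag == "span":
            self.inside_target = False

    def handle_data(self, data):
        if self.inside_target:
            self.chunks.append(data)


# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = '<div><span id="yfs_l84_spy">176.12</span></div>'

parser = SpanTextParser("yfs_l84_spy")
parser.feed(sample_html)
parser.close()
span_text = "".join(parser.chunks)
print(span_text)  # 176.12
```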
roippi
0

Your call, re.findall(code, htext), does accept a compiled pattern, but there are two more conventional ways of doing this:

1.

regex = '<span id="yfs_l84_spy">(.+?)</span>'
code = re.compile(regex)
price = code.findall(htext)

2.

regex = '<span id="yfs_l84_spy">(.+?)</span>'
price = re.findall(regex, htext)

Note that Python's re module caches recently compiled patterns internally, so precompiling with re.compile has only a limited effect.
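As a quick check that the call forms are interchangeable, here is a small sketch run against a hard-coded string. The `sample_html` value is an assumption standing in for the real page:

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
sample_html = '<span id="yfs_l84_spy">176.12</span>'

regex = '<span id="yfs_l84_spy">(.+?)</span>'
code = re.compile(regex)

via_method = code.findall(sample_html)       # way 1: method on the compiled pattern
via_module = re.findall(regex, sample_html)  # way 2: module function with a string pattern
# re.findall also accepts an already compiled pattern as its first argument.
via_mixed = re.findall(code, sample_html)

print(via_method, via_module, via_mixed)
```

All three calls return the same list on this sample, so an empty result points at the pattern not matching the text rather than at the call form.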

Wolph
  • I tried both ways but the result is still the same, An Empty list. – Shivamshaz Oct 28 '13 at 19:40
  • In that case your regex simply isn't matching. This could be pretty much anything, an extra space, capital letters versus lowercase, multiple lines... etc – Wolph Oct 28 '13 at 20:41