Scraping in python using regex not giving any result?

Question

I am using Python 3 to scrape a website and print a value. Here is the code:

import urllib.request
import re

url = "http://in.finance.yahoo.com/q?s=spy"  
hfile = urllib.request.urlopen(url)
htext = hfile.read().decode('utf-8')
regex = '<span id="yfs_l84_SPY">(.+?)</span>'
code = re.compile(regex)
price = re.findall(code,htext)
print (price)

When I run this snippet, it prints an empty list, i.e. [], but I am expecting a value such as 483.33.

What am I getting wrong?

Please, _please_ don't use regex for parsing HTML. Use (`*gasp!*`) an HTML parser. — Matt Ball, Oct 28 '13 at 19:32
its not that you CANT use it, its just that there are WAAAAY better premade, built-in tools already for parsing this type of thing. Check [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags?lq=1) post out — TehTris, Oct 28 '13 at 19:35
@user2762289 in the sourcecode of the webpage you're trying to scrap, "spy" is in lowercase while you're using uppercase, you need to match case insensitive or make everything lowercase. — HamZa, Oct 28 '13 at 19:36
@HamZa Case Sensitive is not an issue, because the webpage automatically converts it into lowercase. Output is not changed even when i use lowercase SPY. — Shivamshaz, Oct 28 '13 at 19:43
@user2762289 lolwut, please don't tell me that `spy === SPY`. Also why do you think you could set the `i` modifier ? I'm talking about `(.+?)`, you're using `(.+?)` — HamZa, Oct 28 '13 at 19:44
@HamZa. You are absolutely correct !!! This issue is solved by changing SPY to lowercase. Thank you So much for your support. — Shivamshaz, Oct 28 '13 at 19:49
actually you **can't** parse arbitrary HTML/XML with regular expression because it isn't a regular language. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — , Oct 28 '13 at 19:55
@JarrodRoberson regular expressions are no longer "regular" in some flavors. See the [power of modern regex](http://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html) or see [this awesome answer](http://stackoverflow.com/a/4234491/). — HamZa, Oct 28 '13 at 20:04
@HamZa I qualified it with ***arbitrary***, and none of the flavors of regex I know of support that. — , Oct 28 '13 at 20:38

score 2 · Answer 1 · edited May 23 '17 at 10:31

I have to recommend that you not use regex to parse HTML, because HTML is not a regular language. Yes, you could use it here. It's not a good habit to get into.

The most likely cause of the empty list is that the real id of the span on that page is yfs_l84_spy, all lowercase, whereas your pattern looks for yfs_l84_SPY. Regex matching is case-sensitive by default, so the pattern never matches.
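
To make the case issue concrete, here is a minimal sketch that runs against a hard-coded HTML fragment instead of the live page. The fragment and the variable names are assumptions for illustration; only the span itself is taken from this answer.

```python
import re

# Illustrative stand-in for the downloaded page (assumption: only the span
# quoted in this answer is reproduced here).
sample_html = '<div><span id="yfs_l84_spy">176.12</span></div>'

upper_pattern = '<span id="yfs_l84_SPY">(.+?)</span>'  # pattern from the question
lower_pattern = '<span id="yfs_l84_spy">(.+?)</span>'  # id as it appears in the page

no_match = re.findall(upper_pattern, sample_html)
exact_match = re.findall(lower_pattern, sample_html)
# re.IGNORECASE lets the original uppercase pattern match as well
ignorecase_match = re.findall(upper_pattern, sample_html, re.IGNORECASE)

print(no_match)          # []
print(exact_match)       # ['176.12']
print(ignorecase_match)  # ['176.12']
```

Either fixing the case in the pattern or passing re.IGNORECASE should work here; fixing the pattern is the narrower change.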

That said, here is a quick implementation in BeautifulSoup.

import urllib.request
from bs4 import BeautifulSoup

url = "http://in.finance.yahoo.com/q?s=spy"
hfile = urllib.request.urlopen(url)
htext = hfile.read().decode('utf-8')
# Name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(htext, 'html.parser')
soup.find('span', id="yfs_l84_spy")
Out[18]: <span id="yfs_l84_spy">176.12</span>

And to get at that number:

found_tag = soup.find('span', id="yfs_l84_spy")  # found_tag is a bs4 Tag object
found_tag.string  # the tag's only child, i.e. its text
Out[36]: '176.12'
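
If installing BeautifulSoup is not an option, the standard library's html.parser module can do a similar lookup. This is a swapped-in technique, not the approach used above, and it is only a rough sketch: the class name SpanTextFinder and the sample HTML are hypothetical, and nested spans are not handled.

```python
from html.parser import HTMLParser


class SpanTextFinder(HTMLParser):
    """Collect the text of the first <span> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False  # True while we are inside the wanted span
        self.text = None     # filled in once the span's text is seen

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_data(self, data):
        # Keep only the first piece of text found inside the span
        if self.inside and self.text is None:
            self.text = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False


# Illustrative stand-in for the downloaded page (an assumption)
sample_html = '<div><span id="yfs_l84_spy">176.12</span></div>'

finder = SpanTextFinder("yfs_l84_spy")
finder.feed(sample_html)
print(finder.text)  # 176.12
```

In real use, the decoded page text (htext above) would be passed to feed() in place of sample_html.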

Thanks for the suggestion. I will switch over :) – Shivamshaz, Oct 28 '13 at 19:56

score 0 · Answer 2 · edited Oct 28 '13 at 19:55

Your call mixes the two usual ways of using findall. Passing a compiled pattern to re.findall does work, but the idiomatic forms are:

1.

regex = '<span id="yfs_l84_spy">(.+?)</span>'
code = re.compile(regex)
price = code.findall(htext)

2.

regex = '<span id="yfs_l84_spy">(.+?)</span>'
price = re.findall(regex, htext)

Note that Python's re module caches recently compiled patterns internally, so precompiling has only a limited effect.
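
As a quick check, the sketch below compares the two forms above with the call style from the question on a hard-coded fragment (sample_html is an assumption standing in for the downloaded page). All three should return the same list, which suggests an empty result comes from the pattern not matching rather than from how findall is called.

```python
import re

# Illustrative stand-in for the downloaded page (an assumption)
sample_html = '<span id="yfs_l84_spy">176.12</span>'

regex = '<span id="yfs_l84_spy">(.+?)</span>'
code = re.compile(regex)

via_method = code.findall(sample_html)       # option 1: method on the compiled pattern
via_module = re.findall(regex, sample_html)  # option 2: module function with a string pattern
via_mixed = re.findall(code, sample_html)    # the question's form: module function with a compiled pattern

print(via_method == via_module == via_mixed)  # True
print(via_method)                             # ['176.12']
```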

answered Oct 28 '13 at 19:37 by Wolph · edited Oct 28 '13 at 19:55 by HamZa

I tried both ways but the result is still the same, An Empty list. – Shivamshaz Oct 28 '13 at 19:40
In that case your regex simply isn't matching. This could be pretty much anything, an extra space, capital letters versus lowercase, multiple lines... etc – Wolph Oct 28 '13 at 20:41
