Web scraping using regex

Question

I can't figure out why this code does not work, even though it's the same code as in the online tutorial Python Web Scraping Tutorial 5 (Network Requests). I also tried running it in an online Python interpreter.

import urllib
import re

htmltext = urllib.urlopen("https://www.google.com/finance?q=AAPL")

regex = '<span id="ref_[^.]*_l">(.+?)</span>'
pattern = re.compile(regex)
results = re.findall(pattern,htmltext)
results

I get:

re.pyc in findall(pattern, string, flags)
175 
176     Empty matches are included in the result."""
--> 177     return _compile(pattern, flags).findall(string)
178 
179 if sys.hexversion >= 0x02020000:

TypeError: expected string or buffer

Expected result(s):

112.71

Help appreciated. I tried calling "read()" on the URL object, but that didn't work. According to the documentation, even empty matches should be included in the result. Thanks

There is an error in your regex pattern; the correct pattern would be `(.+?)<\/span>` — ZdaR, Sep 24 '16 at 09:45
If the tutorial you're using suggests using regular expressions to scrape the web, find a different one; HTML parsers exist for a reason. — jonrsharpe, Sep 24 '16 at 09:46
@ZdaR well no... `/` doesn't require escaping in regular expressions... — Jon Clements, Sep 24 '16 at 09:48
@ZdaR Thanks, but it doesn't seem to get the code going. Same error. — Smolo, Sep 24 '16 at 09:48
@jonrsharpe I def could do that. Any idea what could be wrong here ? thanks! — Smolo, Sep 24 '16 at 09:50
@Smolo yeah... definitely listen to Jon - bad tutorials won't help you learning here and this one definitely isn't... anyway... try `htmltext.read().decode('utf8')` and see if that does it... — Jon Clements, Sep 24 '16 at 09:53
Any tutorial that tells you to parse html with a regex should be avoided. Beautifulsoup can do this reliably in a single line `BeautifulSoup(htmltext).select("span[id^=ref_]")` — Padraic Cunningham, Sep 24 '16 at 10:00
as mentioned above a few times - dont use regex to parse html — yedpodtrzitko, Sep 24 '16 at 11:29
Sure, don't use regex to do that. But read/listen the video tutorial until the end ! After this ugly fetching the guy explains a quite better practice by using google API detected from network panel from google dev tools — Gilles Quénot, Sep 24 '16 at 11:56
@GillesQuenot I think people here offering their free time and experience aren't really going to listen to a youtube video attempting: "You really don't want to do this, but now we've wasted your time watching this and causing you having to ask a question to clarify things, now let's show you a correct way"? — Jon Clements, Sep 24 '16 at 13:02
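
Several comments above recommend an HTML parser over a regex. Here is a minimal sketch of that idea, using the standard library's `html.parser` swapped in for BeautifulSoup so that nothing needs installing. It is written for Python 3 (the thread itself uses Python 2), and `SAMPLE_HTML` and `RefSpanParser` are made up for illustration; the real page source may look different.

```python
from html.parser import HTMLParser

# Made-up stand-in for the page source; the real markup may differ.
SAMPLE_HTML = (
    '<div><span id="ref_22144_l">112.71</span>'
    '<span id="other">ignored</span></div>'
)


class RefSpanParser(HTMLParser):
    """Collect the text of every <span> whose id starts with 'ref_'."""

    def __init__(self):
        super().__init__()
        self._inside = False  # True while we are inside a matching span
        self.values = []

    def handle_starttag(self, tag, attrs):
        span_id = dict(attrs).get("id") or ""
        if tag == "span" and span_id.startswith("ref_"):
            self._inside = True
            self.values.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.values[-1] += data


parser = RefSpanParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.values)
```

This sketch does not handle spans nested inside a matching span; a full parser library such as BeautifulSoup deals with such cases more gracefully.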

Gilles Quénot · Answer 1 · 2016-09-24T15:12:00.627

1

If you follow the tutorial until the end :) :

% python2
>>> import urllib
>>> data = urllib.urlopen('https://www.google.com/finance/getprices?q=AAPL&x=NASD&i=10&p=25m&f=c&auto=1').read()
>>> print data.split()[-1]
112.71
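
The `split()[-1]` step can be tried without a network call. The sketch below is Python 3 and assumes only that the response is whitespace-separated text whose last token is the most recent price; `SAMPLE_PAYLOAD` and `last_token` are hypothetical, and the real response layout may differ.

```python
# Made-up payload for illustration only; the real response may look different.
SAMPLE_PAYLOAD = "COLUMNS=CLOSE\n112.65\n112.70\n112.71\n"


def last_token(payload):
    """Return the last whitespace-separated token of `payload`."""
    tokens = payload.split()  # splits on spaces and newlines alike
    if not tokens:
        raise ValueError("empty payload")
    return tokens[-1]


latest = float(last_token(SAMPLE_PAYLOAD))
print(latest)
```

Converting to `float` will raise `ValueError` if the last token is not numeric, which is a cheap way to notice that the payload is not what was expected.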

Never use regex to web scrape

I simplified the code that fetches the last array element.

edited Sep 24 '16 at 15:12

answered Sep 24 '16 at 12:11

Gilles Quénot


Thanks Gilles, but that's a different URL that you're opening. I did follow tutorial until the end, but I still don't understand why the same piece of code works differently for different people/environments. I appreciate tho! – Smolo Sep 24 '16 at 14:09
This is the URL used at the end of the tutorial – Gilles Quénot Sep 24 '16 at 15:11

score 0 · Answer 2 · answered Sep 24 '16 at 09:56

The problem is that you have not actually read the HTML from the request.

htmltext = urllib.urlopen("https://www.google.com/finance?q=AAPL").read()
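
To see why the missing `read()` produces the `TypeError`, here is a small offline sketch. It is Python 3 (the question uses Python 2) and uses `io.BytesIO` as a stand-in for the object returned by `urlopen`; `SAMPLE_HTML` and `find_prices` are made up for illustration. In Python 3 the body arrives as bytes, so the sketch also decodes it, as suggested in the comments on the question.

```python
import io
import re

# Made-up stand-in for the page source; the real page may differ.
SAMPLE_HTML = b'<div><span id="ref_22144_l">112.71</span></div>'

pattern = re.compile(r'<span id="ref_[^.]*_l">(.+?)</span>')


def find_prices(source):
    """Return every captured value; `source` must be a str."""
    return pattern.findall(source)


# File-like object, similar in spirit to what urlopen() returns.
response = io.BytesIO(SAMPLE_HTML)

try:
    find_prices(response)  # passing the response object itself
    failed = False
except TypeError:
    failed = True  # the regex engine wants text, not a file-like object

# Reading and decoding first gives the regex a real string to search.
prices = find_prices(response.read().decode("utf8"))
print(failed, prices)
```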

answered Sep 24 '16 at 09:56

Daniel Roseman


Umm... the OP has said *I tried using "read()" on the url but that didn't work*... – Jon Clements Sep 24 '16 at 09:57
Well they should show that code; this works for me. And this is definitely Python 2, since `urllib.urlopen` doesn't exist in Python 3. – Daniel Roseman Sep 24 '16 at 09:59
Right, so I'm not getting any error, but instead just an empty result ... which I shouldn't because the pattern occurs few times within the page. – Smolo Sep 24 '16 at 10:01
@Smolo, it returns `['112.71']` and there is one occurrence that matches the pattern not multiple. If you get nothing you are probably not getting the same source back for various reasons. – Padraic Cunningham Sep 24 '16 at 10:11
@PadraicCunningham I'm executing the code via Python shell 2.7.6, as well as via Python run in the cloud (if there are to be any differences in output between the two). Here's a link to the code http://www.tutorialspoint.com/execute_python_online.php?PID=0Bw_CjBb95KQMSkplYjVvbklUUzQ How can I dig deeper into why we're getting different results ? Thanks! – Smolo Sep 24 '16 at 10:20
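
One way to dig into differing results is to summarise what each environment actually received and compare the summaries. The sketch below is Python 3 and works on any string already fetched; `diagnose` and both sample pages are hypothetical, not taken from the real site.

```python
import re

PATTERN = re.compile(r'<span id="ref_[^.]*_l">(.+?)</span>')


def diagnose(source):
    """Summarise a fetched page so two environments can be compared."""
    return {
        "length": len(source),               # very different sizes hint at different pages
        "has_ref_id": 'id="ref_' in source,  # is the expected marker present at all?
        "matches": PATTERN.findall(source),  # what the regex actually finds
        "head": source[:80],                 # first characters, for a quick look
    }


# Made-up samples: one page with the expected span, one without it.
expected_page = '<span id="ref_22144_l">112.71</span>'
other_page = "<html><body>Please enable JavaScript</body></html>"

report_ok = diagnose(expected_page)
report_bad = diagnose(other_page)
print(report_ok["matches"], report_bad["has_ref_id"])
```

If `has_ref_id` is false in one environment, the server returned different markup there, and the regex is not the cause of the empty result.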
