
I'm fairly new to Python and am trying to make a web parser for a stock app. I'm essentially using urllib to open the desired webpage for each stock in the argument list and reading the full contents of the HTML for that page. Then I slice that to find the quote I'm looking for. The method works, but I doubt it's the most efficient way to achieve this result. I've spent some time looking into other potential methods for reading files more rapidly, but none seem to pertain to web scraping. Here's my code:

from urllib.request import urlopen

def getQuotes(stocks):
    quoteList = {}
    for stock in stocks:
        html = urlopen("https://finance.google.com/finance?q={}".format(stock))
        webpageData = html.read()
        # Slice the raw bytes around the <span class="pr"> tag to isolate the quote
        scrape1 = webpageData.split(str.encode('<span class="pr">\n<span id='))[1].split(str.encode('</span>'))[0]
        scrape2 = scrape1.split(str.encode('>'))[1]
        quote = bytes.decode(scrape2)
        quoteList[stock] = float(quote)
    return quoteList

print(getQuotes(['FB', 'GOOG', 'TSLA']))

Thank you all so much in advance!

  • Check out [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) – Mako212 Sep 12 '17 at 20:51
  • I would work with the `requests` package instead of `urllib` directly. I would think the above code runs pretty fast, does it not? When you have many requests, you can look into multithreading. Should work well to speed things up depending on the code. – M3RS Sep 12 '17 at 20:53
  • Oh, yes, and check Beautiful Soup or lxml, as suggested above. – M3RS Sep 12 '17 at 20:54
  • @Andras It does run rather quickly, but there's a great deal of deviation in its speed, taking anywhere from 1 to 5 seconds. I'm almost certain the inconsistency has to do with the internet connection, though. I was still curious to see what other options all of you beautiful people had to offer. – Sep 12 '17 at 20:57
  • Thanks Mako212, I'll definitely look into it! – Sep 12 '17 at 20:57
  • If you will be reading *many* pages *often*, then consider https://github.com/kennethreitz/grequests. Otherwise, http://docs.python-requests.org/en/master/ will do you very well. As far as parsing is concerned, have a look at https://doc.scrapy.org/en/latest/topics/selectors.html (a short sketch follows below). *Good* and fast. – Bill Bell Sep 12 '17 at 20:58
  • Thanks for all of the options! I have plenty of things to look into now. – Sep 12 '17 at 21:05
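
For reference, here's a minimal sketch of the Scrapy Selector approach Bill Bell's comment points to, applied to the same `<span class="pr">` structure the question slices by hand (the selector is an assumption inferred from that markup, untested against the live page):

import requests
from scrapy.selector import Selector

def get_quote(stock):
    # Fetch the page with requests, then parse it with Scrapy's standalone Selector
    html = requests.get('https://finance.google.com/finance?q={}'.format(stock)).text
    # The question's slicing suggests the price sits in a span nested inside <span class="pr">
    price = Selector(text=html).css('span.pr span::text').extract_first()
    return float(price.strip().replace(',', '')) if price else None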

1 Answer


> I'm essentially using urllib to open the desired webpage for each stock in the argument list and reading the full contents of the HTML for that page. Then I slice that to find the quote I'm looking for.

Here's that implementation in Beautiful Soup and requests:

import requests
from bs4 import BeautifulSoup

def get_quotes(*stocks):
    quotelist = {}
    base = 'https://finance.google.com/finance?q={}'
    for stock in stocks:
        url = base.format(stock)
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        quote = soup.find('span', attrs={'class' : 'pr'}).get_text().strip()
        quotelist[stock] = float(quote)
    return quotelist

print(get_quotes('AAPL', 'GE', 'C'))
# {'AAPL': 160.86, 'GE': 23.91, 'C': 68.79}
# 1 loop, best of 3: 1.31 s per loop

As mentioned in the comments you may want to look into multithreading or grequests.

Using grequests to make asynchronous HTTP requests:

import grequests
from bs4 import BeautifulSoup

def get_quotes(*stocks):
    quotelist = {}
    base = 'https://finance.google.com/finance?q={}'
    rs = (grequests.get(u) for u in [base.format(stock) for stock in stocks])
    rs = grequests.map(rs)  # send all requests concurrently; responses return in order
    for r, stock in zip(rs, stocks):
        soup = BeautifulSoup(r.text, 'html.parser')
        quote = soup.find('span', attrs={'class' : 'pr'}).get_text().strip()
        quotelist[stock] = float(quote)
    return quotelist

%%timeit 
get_quotes('AAPL', 'BAC', 'MMM', 'ATVI',
           'PPG', 'MS', 'GOOGL', 'RRC')
1 loop, best of 3: 2.81 s per loop

Update: here's a modified version of an example from Dusty Phillips' *Python 3 Object-oriented Programming* that uses the built-in `threading` module.

from threading import Thread

from bs4 import BeautifulSoup
import numpy as np
import requests


class QuoteGetter(Thread):
    def __init__(self, ticker):
        super().__init__()
        self.ticker = ticker
    def run(self):
        base = 'https://finance.google.com/finance?q={}'
        response = requests.get(base.format(self.ticker))
        soup = BeautifulSoup(response.text, 'html.parser')
        try:
            self.quote = float(soup.find('span', attrs={'class':'pr'})
                                .get_text()
                                .strip()
                                .replace(',', ''))
        except AttributeError:
            self.quote = np.nan


def get_quotes(tickers):
    threads = [QuoteGetter(t) for t in tickers]
    for thread in threads:        
        thread.start()
    for thread in threads:
        thread.join()
    quotes = dict(zip(tickers, [thread.quote for thread in threads]))
    return quotes

tickers = [
    'A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABT', 'ACN', 'ADBE', 'ADI', 
    'ADM',  'ADP', 'ADS', 'ADSK', 'AEE', 'AEP', 'AES', 'AET', 'AFL', 'AGN', 
    'AIG', 'AIV', 'AIZ', 'AJG', 'AKAM', 'ALB', 'ALGN', 'ALK', 'ALL', 'ALLE',
    ]

%time get_quotes(tickers)
# Wall time: 1.53 s
  • Your first solution with BeautifulSoup actually ended up being slightly slower than my initial implementation... but oh boy, pairing it with grequests really did the trick! Much faster results. Thanks again! – Sep 12 '17 at 23:51
  • @ChaseShankula yes, not surprised; BeautifulSoup isn't particularly known for its speed. In this case what's taking up time is the underlying request and parser. What bs4 is useful for is pulling multiple pieces of data from a document [tree](http://web.simmons.edu/~grabiner/comm244/weekfour/document-tree.html) (see the sketch below). Have a read through the [docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) when you can; it will come in handy at some point down the road. – Brad Solomon Sep 13 '17 at 01:23
  • @ChaseShankula updated to use `threading` rather than `grequests`, as I'm having some issues with the latter. – Brad Solomon Sep 13 '17 at 20:15
  • Thanks again for all the help! I ended up using threading, but the thing was so darn fast that I hit Google's rate limit, which triggered a reCAPTCHA verification. I've found another source for real-time data that I'm using instead, as they don't seem to monitor scraping as robustly as Google does. Thanks to your threading idea and my new source that shows 100 quotes on the screen at once, I'm now pulling in quotes at an average rate of 500/second! Really cool stuff, you totally opened up my world of possibilities here haha – Sep 15 '17 at 22:13
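
To illustrate the point above about pulling multiple pieces of data from one parsed tree, here's a minimal, self-contained sketch; the HTML snippet and class names are invented for the example:

from bs4 import BeautifulSoup

# Toy document standing in for a real page; the markup is invented for illustration
html = '''
<div class="stock">
  <span class="ticker">AAPL</span>
  <span class="pr">160.86</span>
  <span class="chg">+1.23</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Parse once, then run several lookups against the same tree
ticker = soup.find('span', attrs={'class': 'ticker'}).get_text()
price = float(soup.find('span', attrs={'class': 'pr'}).get_text())
change = soup.find('span', attrs={'class': 'chg'}).get_text()
print(ticker, price, change)  # AAPL 160.86 +1.23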