
In my code I am generating many different URLs and pulling a specific table from each of those pages. Without any concurrency the process is very slow, and I would like to optimize it for speed.

from lxml import html 

# ticker_list is defined elsewhere; each pass makes three blocking downloads.
for eachTicker in ticker_list:
    bs_url = 'http://finance.yahoo.com/q/bs?s=%s' % eachTicker
    is_url = 'http://finance.yahoo.com/q/is?s=%s' % eachTicker
    cf_url = 'http://finance.yahoo.com/q/cf?s=%s' % eachTicker

    bs_tree = html.parse(bs_url)
    is_tree = html.parse(is_url)
    cf_tree = html.parse(cf_url)

    cf_content = cf_tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr/td")
    bs_content = bs_tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr/td")
    is_content = is_tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr/td")

I want to use asynchronous I/O (`asyncio`) to make this process much faster. Any ideas?

I'm currently playing around with the code below to see if I can get it to work. I'd like to put it in a for loop and run a list of URLs through it (a sketch of what I mean follows the code).

 import asyncio
 import aiohttp

 @asyncio.coroutine
 def print_page(url):
     # Fetch the page asynchronously, then read and print the raw body.
     response = yield from aiohttp.request('GET', url)
     body = yield from response.read_and_close(decode=False)
     print(body)

 loop = asyncio.get_event_loop()
 loop.run_until_complete(print_page('http://www.google.com/'))

 loop.run_until_complete(asyncio.wait([print_page('http://www.finance.yahoo.com/q/cf?s=ABT'),
                                       print_page('http://www.finance.yahoo.com/q/cf?s=MMM')]))
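
Extending that to a whole list, the sketch below is what I have in mind: build one `print_page` coroutine per URL with a list comprehension and hand them all to `asyncio.wait`. It reuses the same `print_page` and aiohttp calls as above; the two tickers are just assumed examples standing in for my real `ticker_list`.

 import asyncio
 import aiohttp

 @asyncio.coroutine
 def print_page(url):
     # Fetch the page asynchronously, then read and print the raw body.
     response = yield from aiohttp.request('GET', url)
     body = yield from response.read_and_close(decode=False)
     print(body)

 # Assumed example tickers; substitute the real ticker_list here.
 ticker_list = ['ABT', 'MMM']
 urls = ['http://finance.yahoo.com/q/cf?s=%s' % t for t in ticker_list]

 loop = asyncio.get_event_loop()
 # One coroutine per URL; asyncio.wait schedules them concurrently
 # and returns once every download has finished.
 loop.run_until_complete(asyncio.wait([print_page(url) for url in urls]))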
Aran Freel
  • I don't see any asyncio code in your question. What specific issues have you encountered? Are you sure that network I/O is the bottleneck and the html parsing is not limited by CPU? Do you know how to run several asyncio coroutines concurrently? Do you know how to make an http request using asyncio? Are you sure `yahoo.com` does not throttle requests from your IP? Have you considered [`multiprocessing.ThreadPool`](http://stackoverflow.com/a/23284285/4279) instead? (A sketch of that approach follows these comments.) – jfs Jun 22 '15 at 13:32
  • @J.F.Sebastian I'm not sure where to start, period, or whether multiprocessing, threading, or asyncio would work best. I really need to optimize this program for speed. I've done some research, and I'm playing around with some code right now; I've edited the question to include it. – Aran Freel Jun 22 '15 at 15:18
  • What is the issue with the current asyncio code? You have `[f(url1), f(url2)]`. To make a for-loop, use list comprehension: `[f(url) for url in [url1, url2]]`, [code example](http://stackoverflow.com/a/20722204/4279) – jfs Jun 22 '15 at 15:30
  • @J.F.Sebastian That seems logical... What is your opinion on asyncio vs. threading vs. multiprocessing? I'm not sure which would be best to learn. – Aran Freel Jun 22 '15 at 16:46
  • Learn all of it. It is all about concurrency. To understand what it is all about, see [David Beazley - Python Concurrency From the Ground Up: LIVE! - PyCon 2015](http://www.youtube.com/watch?v=MCs5OvhV9S4) – jfs Jun 22 '15 at 17:40
  • @J.F.Sebastian Thanks for the reference, I've gone through a few of the slides on Beazley's website and they are extremely informative; a perfect source! – Aran Freel Jun 23 '15 at 13:59
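
For reference, a minimal sketch of the `multiprocessing.pool.ThreadPool` alternative linked in the first comment. The `fetch_table` helper and the pool size of 10 are assumptions for illustration; each thread simply blocks on its own download, so the requests overlap much like the asyncio version:

 from multiprocessing.pool import ThreadPool

 from lxml import html

 def fetch_table(url):
     # Parse the page and pull the financials table (same xpath as the question).
     tree = html.parse(url)
     return tree.xpath("//table[@class='yfnc_tabledata1']/tr[1]/td/table/tr/td")

 ticker_list = ['ABT', 'MMM']  # assumed example tickers
 urls = ['http://finance.yahoo.com/q/cf?s=%s' % t for t in ticker_list]

 pool = ThreadPool(10)  # assumed pool size; tune to the number of URLs
 results = pool.map(fetch_table, urls)
 pool.close()
 pool.join()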

0 Answers