
I am extracting information from a website and storing it in a database using Python with MySQLdb and BeautifulSoup.

The website is organized into about 15 different cities, and each city has anywhere from 10 to 150 pages. There are about 500 pages in total.

For each page in each city, I open the page, parse it with BeautifulSoup, extract all the necessary information, then perform an INSERT or UPDATE SQL query.

Currently I am not using threads, and it takes a few minutes to go through all 500 pages, because the Python program:

  1. Opens a page.
  2. Extracts the information.
  3. Performs the SQL query.
  4. Opens the next page...

Ideally I would want to load-balance the work by having, say, 10 concurrent threads that each handle about 50 pages. But I think that may be too complicated to code.

So instead I am thinking of having one thread per city. How would I accomplish this?

Currently my code looks something like this:

# import threading
from BeautifulSoup import BeautifulSoup
import urllib2
import MySQLdb

con = MySQLdb.connect( ... )

def open_page( url ):
    cur = con.cursor()
    # do SQL query

# Get a list of city URLs

cities = [
    'http://example.com/atlanta/',
    'http://example.com/los-angeles/',
    # ...
    'http://example.com/new-york/'
]

for city_url in cities:
    soup = BeautifulSoup( urllib2.urlopen( city_url ) )

    # find every page per city
    pages = soup.findAll( 'div', { 'class' : 'page' } )

    for page in pages:
        page_url = page.find( 'a' )[ 'href' ]
        open_page( page_url )
hobbes3
I'm not sure that multithreading is necessary to improve the performance of your script. If it were me, I would first ensure that autocommit is disabled, as suggested in this post: http://stackoverflow.com/questions/6482004/pythonmysql-bulk-insert, and see if that alone improves performance. Then I would consider executing one or more statements *per city*, not *per page*. That would drastically decrease the number of statements being executed. Only after exhausting those options would I look at multithreading. – Rob Marrowstone Feb 10 '12 at 00:08
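
For what it's worth, a minimal sketch of what that comment suggests (autocommit off, one batched INSERT and one commit per city); the table name, columns, and connection parameters below are placeholders, not taken from the question:

import MySQLdb

con = MySQLdb.connect(host="localhost", user="user", passwd="passwd", db="scraper")
con.autocommit(False)   # make sure each INSERT is not committed on its own

def save_city(rows):
    # rows: list of (city, url, title) tuples gathered from one city's pages
    cur = con.cursor()
    cur.executemany(
        "INSERT INTO pages (city, url, title) VALUES (%s, %s, %s)",
        rows
    )
    con.commit()        # one commit per city instead of one per page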

2 Answers

  1. Parse the URLs in N threads.
  2. Create a CSV file from the parsed data.
  3. Create a TEMP table.
  4. Load the CSV into the TEMP table with LOAD DATA INFILE (http://dev.mysql.com/doc/refman/5.1/en/load-data.html).
  5. Insert into the MAIN table from the TEMP table, skipping dupes (a rough sketch follows).
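
A rough sketch of steps 3-5, assuming a main table pages(city, url, title) and a CSV file produced in step 2; the table, columns, and file path are made up for illustration:

import MySQLdb

con = MySQLdb.connect(host="localhost", user="user", passwd="passwd",
                      db="scraper", local_infile=1)
cur = con.cursor()

# step 3: temporary staging table with the same structure as the main table
cur.execute("CREATE TEMPORARY TABLE pages_tmp LIKE pages")

# step 4: bulk-load the CSV produced by the parsing threads
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/pages.csv'
    INTO TABLE pages_tmp
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    (city, url, title)
""")

# step 5: copy only rows whose url is not already in the main table
cur.execute("""
    INSERT INTO pages (city, url, title)
    SELECT t.city, t.url, t.title
    FROM pages_tmp t
    LEFT JOIN pages p ON p.url = t.url
    WHERE p.url IS NULL
""")
con.commit()

If the url column has a UNIQUE index, INSERT IGNORE or ON DUPLICATE KEY UPDATE would achieve the same dedup step.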
cetver

Your initial idea is absolutely feasible. Just start 10 worker threads that wait for input on one and the same queue. Then your main process puts the URLs into this queue. The load-balancing will happen automatically.

If your SQL bindings are thread-safe, you can do the INSERT or UPDATE stuff in the worker threads. Otherwise, I'd add one more thread for the SQL stuff, waiting for input on a different queue. Then your worker threads would put the query into this queue, and the SQL thread would execute it.
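
A minimal sketch of that approach, using the same Python 2 modules as the question (Queue, urllib2, MySQLdb); parse_page() and all_page_urls are hypothetical stand-ins for your extraction code and the list of page URLs you already collect per city:

import threading
import Queue
import urllib2
import MySQLdb

url_queue = Queue.Queue()   # page URLs waiting to be fetched
sql_queue = Queue.Queue()   # (query, params) tuples waiting to be executed

def worker():
    while True:
        page_url = url_queue.get()
        try:
            html = urllib2.urlopen(page_url).read()
            query, params = parse_page(html)   # hypothetical extraction helper
            sql_queue.put((query, params))
        finally:
            url_queue.task_done()

def sql_writer():
    # one connection, owned by this thread only
    con = MySQLdb.connect(host="localhost", user="user", passwd="passwd", db="scraper")
    cur = con.cursor()
    while True:
        query, params = sql_queue.get()
        cur.execute(query, params)
        con.commit()
        sql_queue.task_done()

for _ in range(10):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

sql_thread = threading.Thread(target=sql_writer)
sql_thread.daemon = True
sql_thread.start()

for page_url in all_page_urls:   # the per-city page URLs from your existing loop
    url_queue.put(page_url)

url_queue.join()   # wait until every page has been fetched and parsed
sql_queue.join()   # wait until every query has been executed

Using two queues keeps all database access in a single thread, so the MySQLdb connection is never shared between threads.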

If you google for "python worker threads queue" you'll find a few examples.

uselpa