I am extracting information from a website and storing it in a database using Python with MySQLdb and BeautifulSoup.
The website is organized into about 15 different cities, and each city has anywhere from 10 to 150 pages, for a total of about 500 pages.
For each page in each city, I fetch the page with urllib2, parse it with BeautifulSoup, extract all the necessary information, then run an INSERT or UPDATE SQL query.
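The insert-or-update step can be a single statement in MySQL, assuming a unique key on the URL column. Roughly what my query looks like (the `pages` table and its columns here are placeholders, not my real schema):

cur = con.cursor()
cur.execute( """
    INSERT INTO pages ( url, title, body )
    VALUES ( %s, %s, %s )
    ON DUPLICATE KEY UPDATE title = VALUES( title ), body = VALUES( body )
""", ( page_url, title, body ) )
con.commit()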
Currently I am not using threads, and it takes a few minutes to go through all 500 pages because the Python program does everything sequentially:
- Opens a page.
- Extracts the information.
- Performs the SQL query.
- Opens the next page...
Ideally I would want to balance the load by having, say, 10 concurrent threads that each open about 50 pages. But I think that may be too complicated to code.
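Roughly, the pooled version I imagine would look like this (only a sketch; I'm assuming all ~500 page URLs have already been collected into a flat list called all_page_urls, and that open_page handles one page end to end with its own database connection):

import threading
import Queue

NUM_WORKERS = 10
q = Queue.Queue()

# hypothetical flat list of all ~500 page URLs, gathered beforehand
for page_url in all_page_urls:
    q.put( page_url )

def worker():
    # each worker pulls URLs until the queue runs dry
    while True:
        try:
            page_url = q.get_nowait()
        except Queue.Empty:
            return
        open_page( page_url )

threads = [ threading.Thread( target = worker ) for _ in range( NUM_WORKERS ) ]
for t in threads:
    t.start()
for t in threads:
    t.join()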
So instead I am thinking of having one thread per city. How would I accomplish this? I've put a rough sketch of what I mean after my current code below.
Currently my code looks something like this:
#import threading
from BeautifulSoup import BeautifulSoup
import urllib2
import MySQLdb

con = MySQLdb.connect( ... )

def open_page( url ):
    cur = con.cursor()
    # fetch the page, extract the data, then run the INSERT/UPDATE query

# list of city URLs
cities = [
    'http://example.com/atlanta/',
    'http://example.com/los-angeles/',
    ...
    'http://example.com/new-york/'
]

for city_url in cities:
    soup = BeautifulSoup( urllib2.urlopen( city_url ) )
    # find every page for this city
    pages = soup.findAll( 'div', { 'class' : 'page' } )
    for page in pages:
        page_url = page.find( 'a' )[ 'href' ]
        open_page( page_url )
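For the one-thread-per-city version, this is roughly what I have in mind. I'm assuming open_page is changed to accept the connection as a second argument, since as far as I know a MySQLdb connection should not be shared between threads:

import threading

def crawl_city( city_url ):
    con = MySQLdb.connect( ... )  # per-thread connection; same parameters as above
    soup = BeautifulSoup( urllib2.urlopen( city_url ) )
    # find every page for this city and process it on this thread
    pages = soup.findAll( 'div', { 'class' : 'page' } )
    for page in pages:
        page_url = page.find( 'a' )[ 'href' ]
        open_page( page_url, con )
    con.close()

threads = [ threading.Thread( target = crawl_city, args = ( city_url, ) )
            for city_url in cities ]
for t in threads:
    t.start()
for t in threads:
    t.join()

Giving each thread its own connection is why open_page takes it as a parameter here, instead of using the global con.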