I am using urllib2 and BeautifulSoup in Python for web scraping, and I am continuously saving the scraped content to a file. I have noticed that my progress gets slower and slower and eventually stops within 4 to 8 hours, even for something as simple as:
import urllib2
from bs4 import BeautifulSoup

def searchBook():
    fb = open(r'filePath', 'a')
    for index in range(3510000, 3520000):
        url = 'http://www.qidian.com/Book/' + str(index) + '.aspx'
        try:
            html = urllib2.urlopen(url, 'html').read()
            soup = BeautifulSoup(html)
            stats = getBookStats(soup)
            fb.write(str(stats))
            fb.write('\n')
        except:
            print url + " doesn't exist"
    fb.close()
def getBookStats(soup):  # extract book info from the page
    stats = {}
    stats['trialStatus'] = soup.find_all('span', {'itemprop': 'trialStatus'})[0].string
    stats['totalClick'] = soup.find_all('span', {'itemprop': 'totalClick'})[0].string
    stats['monthlyClick'] = soup.find_all('span', {'itemprop': 'monthlyClick'})[0].string
    stats['weeklyClick'] = soup.find_all('span', {'itemprop': 'weeklyClick'})[0].string
    stats['genre'] = soup.find_all('span', {'itemprop': 'genre'})[0].string
    stats['totalRecommend'] = soup.find_all('span', {'itemprop': 'totalRecommend'})[0].string
    stats['monthlyRecommend'] = soup.find_all('span', {'itemprop': 'monthlyRecommend'})[0].string
    stats['weeklyRecommend'] = soup.find_all('span', {'itemprop': 'weeklyRecommend'})[0].string
    stats['updataStatus'] = soup.find_all('span', {'itemprop': 'updataStatus'})[0].string
    stats['wordCount'] = soup.find_all('span', {'itemprop': 'wordCount'})[0].string
    stats['dateModified'] = soup.find_all('span', {'itemprop': 'dateModified'})[0].string
    return stats
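As an aside, the eleven near-identical find_all() calls in getBookStats() can be driven from a single list of itemprop names. A minimal sketch (the ITEMPROPS name is mine; using find() instead of find_all(...)[0] also returns None for a missing span rather than raising an IndexError):

ITEMPROPS = ['trialStatus', 'totalClick', 'monthlyClick', 'weeklyClick',
             'genre', 'totalRecommend', 'monthlyRecommend',
             'weeklyRecommend', 'updataStatus', 'wordCount', 'dateModified']

def getBookStats(soup):
    # Same result as above: one lookup per itemprop, collected in a dict.
    stats = {}
    for name in ITEMPROPS:
        span = soup.find('span', {'itemprop': name})
        stats[name] = span.string if span else None
    return stats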
My questions are:

1) What is the bottleneck in this code, urllib2.urlopen() or soup.find_all()? (A timing sketch follows the list.)

2) The only way I can tell that the code has stopped is by examining the output file; I then manually restart the process from where it stopped. Is there a more efficient way to detect that the code has stopped, and is there a way to automate the restart? (See the resume sketch below.)

3) The best thing, of course, would be to prevent the code from slowing down and stopping in the first place. What are the possible places I should check? (See the timeout/flush sketch below.)
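For question 1, a rough way to locate the bottleneck is to time the fetch and the parse separately on a single page. A minimal sketch (the index is just an example):

import time
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.qidian.com/Book/3510000.aspx'  # example index

t0 = time.time()
html = urllib2.urlopen(url).read()   # network fetch
t1 = time.time()
soup = BeautifulSoup(html)           # HTML parsing
t2 = time.time()

print 'fetch: %.2fs  parse: %.2fs' % (t1 - t0, t2 - t1)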
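For question 2, one way to automate the restart is to prefix each output line with the book index (e.g. fb.write('%d\t%s\n' % (index, stats))) and, on startup, scan the file for the highest index already written. A minimal sketch, assuming tab-separated lines that begin with the index:

import os

OUTPATH = r'filePath'  # the same output file as above

def last_done_index(path):
    # Highest index already written, or None if the file is empty or missing.
    last = None
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                try:
                    idx = int(line.split('\t', 1)[0])
                except ValueError:
                    continue  # skip malformed lines
                if last is None or idx > last:
                    last = idx
    return last

last = last_done_index(OUTPATH)
start = 3510000 if last is None else last + 1
# then: for index in range(start, 3520000): ...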
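For question 3, two things worth ruling out are sockets that hang forever (urlopen() has no timeout by default, so one stuck connection can halt the whole loop) and buffered writes that make the output file look stalled. A minimal sketch of a fetch helper with a timeout:

import socket
import urllib2

def fetch(url, timeout=30):
    # A hung connection raises an exception after `timeout` seconds
    # instead of blocking the loop forever.
    try:
        return urllib2.urlopen(url, timeout=timeout).read()
    except (urllib2.URLError, socket.error):
        return None  # caller logs the failure and moves on

In the loop, calling fb.flush() after each write makes progress visible in the file immediately, and a short time.sleep() between requests helps if the server is throttling repeated hits.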
I am currently trying suggestions from the answers and comments:
1) @DavidEhrmann
url = 'http://www.qidian.com/BookReader/' + str(3532901) + '.aspx'
with urllib2.urlopen(url,'html') as u: html = u.read()
# html = urllib2.urlopen(url,'html').read()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-8b6f635f6bd5> in <module>()
      1 url = 'http://www.qidian.com/BookReader/' + str(3532901) + '.aspx'
----> 2 with urllib2.urlopen(url,'html') as u: html = u.read()
      3 # html = urllib2.urlopen(url,'html').read()
      4 soup = BeautifulSoup(html)

AttributeError: addinfourl instance has no attribute '__exit__'
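The AttributeError above occurs because in Python 2 the object returned by urllib2.urlopen() is not a context manager (it has no __enter__/__exit__), so it cannot be used in a with statement directly; wrapping it in contextlib.closing() gives the same close-on-exit behavior. A minimal sketch (note that the second positional argument to urlopen() is POST data, not a format hint, so 'html' is dropped here):

import contextlib
import urllib2

url = 'http://www.qidian.com/BookReader/' + str(3532901) + '.aspx'
with contextlib.closing(urllib2.urlopen(url)) as u:  # closes u on exit
    html = u.read()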
2) @Stardustone

The program still stops after adding time.sleep() calls at various locations.