
I am using urllib2 and BeautifulSoup in Python for web scraping, and I am constantly saving the scraped content to a file. I've noticed that my progress gets slower and slower and eventually stops within 4 to 8 hours, even for something as simple as:

import urllib2
from bs4 import BeautifulSoup

def searchBook():
    fb = open(r'filePath', 'a')
    for index in range(3510000,3520000):
        url = 'http://www.qidian.com/Book/' + str(index) + '.aspx'
        try:
            html = urllib2.urlopen(url,'html').read()
            soup = BeautifulSoup(html)
            stats = getBookStats(soup)
            fb.write(str(stats))
            fb.write('\n')                
        except:
            print url + " doesn't exist"
    fb.close()


def getBookStats(soup):                                         # extract book info from script
    stats = {}
    stats['trialStatus'] = soup.find_all('span',{'itemprop':'trialStatus'})[0].string
    stats['totalClick'] = soup.find_all('span',{'itemprop':'totalClick'})[0].string
    stats['monthlyClick'] = soup.find_all('span',{'itemprop':'monthlyClick'})[0].string
    stats['weeklyClick'] = soup.find_all('span',{'itemprop':'weeklyClick'})[0].string
    stats['genre'] = soup.find_all('span',{'itemprop':'genre'})[0].string
    stats['totalRecommend'] = soup.find_all('span',{'itemprop':'totalRecommend'})[0].string
    stats['monthlyRecommend'] = soup.find_all('span',{'itemprop':'monthlyRecommend'})[0].string
    stats['weeklyRecommend'] = soup.find_all('span',{'itemprop':'weeklyRecommend'})[0].string
    stats['updataStatus'] = soup.find_all('span',{'itemprop':'updataStatus'})[0].string
    stats['wordCount'] = soup.find_all('span',{'itemprop':'wordCount'})[0].string
    stats['dateModified'] = soup.find_all('span',{'itemprop':'dateModified'})[0].string
    return stats

My questions are:

1) What is the bottleneck of this code, urllib2.urlopen() or soup.find_all()?

2) The only way I can tell that the code has stopped is by examining the output file, and I then manually restart the process from where it stopped. Is there a more efficient way to tell that the code has stopped, and is there a way to automate the restart? (A rough checkpointing sketch of what I have in mind follows these questions.)

3) The best thing, of course, would be to prevent the code from slowing down and stopping in the first place. What are the possible places I should check?
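
To make question 2) concrete, this is roughly the kind of checkpointing I have in mind (a sketch only; the checkpoint file path and helper names are made up and not part of my current code):

import os

CHECKPOINT = r'checkpointPath'   # hypothetical path for the resume marker

def loadStartIndex(default):
    # resume one past the last index recorded, if a checkpoint exists
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read().strip()) + 1
    return default

def saveIndex(index):
    # record the last index that was successfully written to the output file
    with open(CHECKPOINT, 'w') as f:
        f.write(str(index))

searchBook() would then start its loop at loadStartIndex(3510000) and call saveIndex(index) after each successful write, so an outer shell loop or a scheduler could simply rerun the script whenever it exits.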


I am currently trying the suggestions from the answers and comments:

1) @DavidEhrmann

url = 'http://www.qidian.com/BookReader/' + str(3532901) + '.aspx'
with urllib2.urlopen(url,'html') as u: html = u.read()
# html = urllib2.urlopen(url,'html').read()
--------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-8b6f635f6bd5> in <module>()
      1 url = 'http://www.qidian.com/BookReader/' + str(3532901) + '.aspx'
----> 2 with urllib2.urlopen(url,'html') as u: html = u.read()
      3 html = urllib2.urlopen(url,'html').read()
      4 soup = BeautifulSoup(html)

AttributeError: addinfourl instance has no attribute '__exit__'
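
A possible workaround (a sketch, assuming Python 2, where the object returned by urllib2.urlopen() is an addinfourl instance that does not implement the context-manager protocol) is to wrap the response in contextlib.closing(). Note also that urlopen()'s second positional argument is POST data, so passing 'html' there is probably not doing what was intended; it is dropped below:

import contextlib
import urllib2

url = 'http://www.qidian.com/BookReader/' + str(3532901) + '.aspx'
# contextlib.closing() supplies the missing __exit__ and closes the response for us
with contextlib.closing(urllib2.urlopen(url)) as u:
    html = u.read()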

2) @Stardustone

The program still stops after adding sleep() commands at various locations.

Ye Tian
    First `except Exception as e:` and `print e` so you actually get some error info – Padraic Cunningham Jul 28 '15 at 21:44
  • Good idea! Thank you! – Ye Tian Jul 28 '15 at 21:46
  • You really should close the file handle you opened. Try `with urllib2.urlopen(url,'html') as u: html = u.read()`. Exhausting the pool of file descriptors could even cause performance issues. – David Ehrmann Jul 28 '15 at 21:54
  • How do you mean it actually stops, are you saying it crashes? Also just looking at the output file would not guarantee the code is still not working, any error traceback will be a lot more informative – Padraic Cunningham Jul 28 '15 at 22:09
  • Is your machine doing something else e.g. backups when it gets slow? How is the available memory? Look at `top` (or the equivalent on Windows) to see what is taking the machine's resources. – halfer Jul 28 '15 at 22:11
  • Also, maybe your scrape target is detecting you've made a number of fetches, and is starting to throttle them. Maybe add a timer to see where the slowness comes in - if it's in the fetch itself, you perhaps need to add some sleeps to prevent the throttle from kicking in. How fast are you fetching (requests/sec) and how many do you do in a session? – halfer Jul 28 '15 at 22:12
  • (Is your [previous question](http://stackoverflow.com/q/31667730/472495) a duplicate?) – halfer Jul 28 '15 at 22:14
  • @DavidEhrmann Unfortunately that doesn't work. – Ye Tian Jul 28 '15 at 22:21
  • @halfer My previous question was aiming at multi-threading, i.e., I would like to fetch the URLs in parallel. This one describes a situation where the jobs terminate on their own. How do I figure out how many requests I fetch per second? Do I add a timer, or is there a urllib built-in function that I can call? (a rough request-rate sketch follows these comments) – Ye Tian Jul 28 '15 at 22:25
  • "How do I figure out how many requests I fetch per sec?" - I don't know, I don't use Python. But I'd say that's an important thing to research. – halfer Jul 28 '15 at 23:23

2 Answers


I suspect the system load average is getting too high; try adding a time.sleep(0.5) call in the try part of each iteration:

    try:
        html = urllib2.urlopen(url,'html').read()
        soup = BeautifulSoup(html)
        stats = getBookStats(soup)
        fb.write(str(stats))
        fb.write('\n')
        time.sleep(0.5)    # requires `import time` at the top of the script
Gilles Quénot
  • Since you're making several requests that don't appear to be dependent on one another, maybe you can look into making concurrent requests or multi-threading libraries? – Jul 28 '15 at 21:46
  • How would this improve performance? Also, URL fetching and HTML parsing tends to be bound more on IO than on parsing the HTML, so I doubt the machine's under much CPU load. – David Ehrmann Jul 28 '15 at 21:56
  • @Coeus This is a subroutine of my code, the other parts of which have nested, although independent, URL requests (see http://stackoverflow.com/questions/31667730/how-to-speed-up-web-scraping-with-nested-urllib2-urlopen-in-python). I do believe this is a different issue from multi-threading, which I also would like to have. – Ye Tian Jul 28 '15 at 22:02
  • @StardustOne Unfortunately this approach doesn't work, either. – Ye Tian Jul 29 '15 at 00:10

See this answer on how to test how long a function call is taking. This will allow you to determine whether it's the urlopen() that's getting slower.

It could well be, as @halfer said, that the web site you're scraping doesn't want you to scrape a lot, and is progressively throttling your requests. Check their terms of service, and also check whether they offer an API as an alternative to scraping.
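
For instance (a quick sketch, not taken from the linked answer; timedFetch is an illustrative helper), timing the fetch and the parse separately will show which stage is degrading over time:

import time
import urllib2
from bs4 import BeautifulSoup

def timedFetch(url):
    # time the network fetch and the HTML parse separately
    t0 = time.time()
    html = urllib2.urlopen(url).read()
    t1 = time.time()
    soup = BeautifulSoup(html)
    t2 = time.time()
    print 'fetch: %.2fs  parse: %.2fs' % (t1 - t0, t2 - t1)
    return soup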

LarsH