
I have several lists of URLs whose HTML content I want to fetch. The URLs come from Twitter, and I don't know anything about the content behind the links; they might point to web pages as well as music or video. This is how I read the HTML content of the links in a list of URLs:

import pickle
import requests
from multiprocessing.dummy import Pool as ThreadPool

def fetch_url(url):

    output = None

    print "processing url {}".format(url)

    try:
        # sending the request
        req = requests.get(url, stream=True)

        # checking if it is an html page
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:

            # reading the contents
            html = req.content
            req.close()

            output = html

        else:
            print "\t{} is not an HTML file".format(url)
            req.close()

    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)

    return output


with open('url_list_1.pkl', 'rb') as fp:
    url_list = pickle.load(fp)

"""
The url_list has such structure:
url_list = [u'http://t.co/qmIPqQVBmW',
            u'http://t.co/mE8krkEejV',
            ...]
"""

pool = ThreadPool(N_THREADS)

# fetch each URL in its own thread and collect the results
results = pool.map(fetch_url, url_list)

# close the pool and wait for the work to finish
pool.close()
pool.join()

The code works without any problems for most of the lists, but for some of them it gets stuck and never finishes. I think some of the URLs don't return a response. How can I remedy this? For example, could I wait X seconds for a request and, if it doesn't respond, forget about it and move on to the next URL? Why is this happening?

Adham
  • Did you hit my bot trap? – Skaperen Apr 13 '15 at 09:13
  • When a site provides an API, [as Twitter does](https://dev.twitter.com/overview/documentation), you really should use that API instead of trying to scrape it manually. Besides the fact that you're usually violating the ToS, and that many of them will deliberately try to break your scraping code, it's just a lot easier and more robust to use the API. – abarnert Apr 13 '15 at 09:29
  • @abarnert I have used the Twitter API to collect the tweets and all the related metadata such as URLs used in the tweets, and then constructed the url lists from the tweets metadata. I am talking about these URLs, not links to Twitter, but links extracted from the tweets. – Adham Apr 13 '15 at 14:05

1 Answer


Of course you can set a timeout (in seconds) for your requests; it's really easy!

req = requests.get(url, stream=True, timeout=1)

Quoted from the Python requests documentation:

timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).

More info: http://docs.python-requests.org/en/latest/user/quickstart/#timeouts
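
Applied to the fetch_url function from the question, that might look roughly like the sketch below. This is not part of the original answer: the function name and the 10-second value are illustrative. requests raises requests.exceptions.Timeout when the limit is hit, so the URL can simply be skipped.

import requests

def fetch_url_with_timeout(url):
    # hypothetical variant of fetch_url: slow or unresponsive URLs are skipped
    output = None
    try:
        # give up if no bytes arrive on the socket for 10 seconds
        req = requests.get(url, stream=True, timeout=10)

        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            output = req.content
        else:
            print "\t{} is not an HTML file".format(url)
        req.close()

    except requests.exceptions.Timeout:
        print "\t{} timed out, skipping it".format(url)
    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)

    return output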

VGe0rge
  • Thanks for your reply. I have tried your solution, but unfortunately it also gets stuck. I think there is something with the content of some of the links. Or maybe one of their servers stops responding in the middle of the request and keeps the connection alive. Is there a way to limit the response download time? Or the response download size? – Adham Apr 13 '15 at 14:48
  • What version of requests are you using? – VGe0rge Apr 13 '15 at 15:17
  • I am using requests 2.4.1 – Adham Apr 13 '15 at 15:21
  • Check the first answer here: http://stackoverflow.com/questions/22346158/python-requests-how-to-limit-received-size-transfer-rate-and-or-total-time (a rough sketch of that approach follows below) – VGe0rge Apr 13 '15 at 16:01
  • But the problem with this method is that not all the requests return the Content-Length header – Adham Apr 13 '15 at 22:56
  • That's true, but the timeout if-statement will work in any case, right? – VGe0rge Apr 14 '15 at 08:09
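
A rough sketch of what that linked answer and these comments describe (not code from the thread): stream the body with iter_content and abandon the download once either an overall time limit or a size limit is exceeded, so a missing Content-Length header does not matter. The limits and the function name below are illustrative.

import time
import requests

MAX_SECONDS = 30              # overall deadline per URL (illustrative)
MAX_BYTES = 5 * 1024 * 1024   # give up on bodies larger than ~5 MB (illustrative)

def fetch_html_limited(url):
    output = None
    try:
        start = time.time()
        req = requests.get(url, stream=True, timeout=10)

        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            chunks = []
            received = 0
            for chunk in req.iter_content(chunk_size=8192):
                chunks.append(chunk)
                received += len(chunk)
                # stop if the body is too large or the download takes too long
                if received > MAX_BYTES or time.time() - start > MAX_SECONDS:
                    chunks = None
                    break
            if chunks is not None:
                output = "".join(chunks)
        req.close()

    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)

    return output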