I have several lists of URLs whose HTML content I wish to fetch. The URLs were collected from Twitter, and I don't know anything about the content behind them: a link might point to a web page, but it might just as well point to music or video. This is how I read the HTML content of the links in one list of URLs:
import pickle
import requests
from multiprocessing.dummy import Pool as ThreadPool

def fetch_url(url):
    output = None
    print "processing url {}".format(url)
    try:
        # sending the request; stream=True avoids downloading the body up front
        req = requests.get(url, stream=True)
        # checking if it is an html page
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            # reading the contents
            html = req.content
            req.close()
            output = html
        else:
            print "\t{} is not an HTML file".format(url)
            req.close()
    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)
    return output
with open('url_list_1.pkl', 'rb') as fp:
    url_list = pickle.load(fp)

"""
The url_list has the following structure:
url_list = [u'http://t.co/qmIPqQVBmW',
            u'http://t.co/mE8krkEejV',
            ...]
"""

pool = ThreadPool(N_THREADS)
# map fetch_url over the URLs using the pool's worker threads
results = pool.map(fetch_url, url_list)
# close the pool and wait for the work to finish
pool.close()
pool.join()
The code runs without any problems for most of the lists, but for some of them it gets stuck and never finishes. I suspect that some of the URLs simply never return a response. How can I remedy this? For example, can I wait for a request for X seconds and, if it hasn't responded by then, forget about it and move on to the next URL? Why is this happening?
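Is something like the sketch below the right approach? Here I am assuming that the timeout parameter of requests.get is the proper way to give up on a slow server (the 10-second value is just a placeholder I picked for illustration):

def fetch_url(url):
    output = None
    try:
        # give up if the server does not respond within 10 seconds
        req = requests.get(url, stream=True, timeout=10)
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            output = req.content
        req.close()
    except requests.exceptions.Timeout:
        print "\t request for {} timed out".format(url)
    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)
    return output

As far as I understand, timeout only limits the time to connect and the time between bytes, not the total download time, so I am not sure it covers every case where a URL can hang.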