urllib2.Request freezes sometimes

Question

I have the following code to scrape a website:

import json
import os
import urllib2

from bs4 import BeautifulSoup

# data_content_file and count_file are paths defined elsewhere in the script.

# Load previously scraped data, if any.
if os.path.isfile(data_content_file):
    try:
        with open(data_content_file) as data_file:
            question_answer = json.load(data_file)
    except Exception as e:
        question_answer = {}
else:
    question_answer = {}

# Resume from the last page number written to count_file.
if os.path.isfile(count_file):
    f = open(count_file, 'r')
    try:
        start = int(f.read())
    except Exception as e:
        start = 1
    f.close()
else:
    start = 1

f = open(count_file, 'w+')
for x in xrange(start, 500000):
    try:
        print(x)
        # Record the current page number so a restart can resume here.
        f.seek(0)
        f.truncate()
        f.write(str(x))
        req = urllib2.Request("https://islamqa.info/en/" + str(x), headers={'User-Agent': "Magic Browser"})
        con = urllib2.urlopen(req)
        soup = BeautifulSoup(con.read(), "lxml")

I don't know why it freezes at some values of x.

If I stop my script and run again for the same x value, it runs fine.

I tried using a timeout, but then no page loads at all, even with a timeout of 1000:

req = urllib2.Request("https://islamqa.info/en/"+str(x), headers={'User-Agent' : "Magic Browser"},timeout=10000)

What's the best way to avoid this, or to continue the loop even if the site freezes?
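
For reference, a minimal sketch of one way to keep the loop going. In `urllib2`, `timeout` is an argument of `urlopen()`, not of `Request()`, so passing it to `Request()` raises a `TypeError`; that may be why no page loaded, assuming the exception is swallowed by an `except` clause that is not shown above. The helper name `fetch` and its `opener` parameter are hypothetical and only exist for this illustration. The import fallback lets the same sketch run on Python 3.

```python
import socket

try:  # Python 2, as in the question
    from urllib2 import Request, urlopen, URLError
except ImportError:  # Python 3 equivalent
    from urllib.request import Request, urlopen
    from urllib.error import URLError


def fetch(url, timeout=10, opener=urlopen):
    """Return the page body, or None if the request fails or times out.

    `fetch` and `opener` are hypothetical names used only in this sketch.
    """
    req = Request(url, headers={'User-Agent': 'Magic Browser'})
    try:
        # timeout (in seconds) belongs to urlopen(), not to Request()
        con = opener(req, timeout=timeout)
        return con.read()
    except (URLError, socket.error) as e:
        # Covers HTTP errors, connection failures and socket timeouts
        print("skipping %s: %s" % (url, e))
        return None
```

Inside the loop, a `None` result could simply be followed by `continue`. Note that the timeout applies to each blocking socket operation rather than to the whole download, so a server that sends data very slowly can still hold the script up for longer than the timeout value.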

Interesting question. It could be that there is no clean way to abort it if the server never responds: http://stackoverflow.com/questions/11817337/how-do-i-gracefully-interrupt-urllib2-downloads — Bemmu, Mar 26 '17 at 12:35
(Also, one reason the server might just not respond is if they are actively trying to prevent you from requesting so many pages in quick succession) — Bemmu, Mar 26 '17 at 12:35
Can't I get any status from the server? Moreover, yesterday I parsed the whole website in almost one go. Today, I am facing some issues. — learner, Mar 26 '17 at 12:41
For HTTP in Python, the [`requests`](http://docs.python-requests.org/en/master/) module is usually a better option than the various `urllib` versions. — Roland Smith, Mar 26 '17 at 12:57
