
I need to gather some data from a website, specifically text for further analysis. Since I am no expert in web scraping, I have only done the first step of collecting the URLs of the documents I need. The problem is that sometimes I can download the documents, but sometimes I get a connection timeout error. So I want a way to keep trying until I get a response from the website. This is what I have:

import codecs
import urllib2
import html2text
from bs4 import BeautifulSoup

id = 1
with open("urls.txt") as f:
    for url in f:
        print url
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html, "html.parser")

        # save the extracted page text as documentos/<id>.txt
        with codecs.open("documentos/" + str(id) + ".txt", "w", "utf-8-sig") as temp:
            temp.write(soup.get_text())
        id += 1

where urls.txt contains the desired URLs, one per line. An example of a URL:

How can I achieve this? I could handle it if I only needed 10 documents, but I need more than 500, so I cannot do it manually.

In summary:

Sometimes I can get the documents and sometimes I cannot because of timeouts; I want Python to keep trying until it CAN get the document...

dpalma

2 Answers


You would have to structure your code a bit better by moving the site-fetching logic into its own function. Once you have that function, you can wrap it with a retry decorator.
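A minimal sketch of that idea, assuming a hypothetical helper fetch_html and placeholder values for the number of tries and the delay:

import socket
import time
import urllib2
from functools import wraps

def retry(tries=5, delay=3, backoff=2):
    # simple retry decorator with exponential backoff;
    # tries/delay/backoff are assumed defaults, tune them as needed
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            mtries, mdelay = tries, delay
            while mtries > 1:
                try:
                    return func(*args, **kwargs)
                except (urllib2.URLError, socket.timeout) as e:
                    print "%s, retrying in %d seconds..." % (e, mdelay)
                    time.sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            # last attempt: let any remaining error propagate
            return func(*args, **kwargs)
        return wrapper
    return decorator

@retry(tries=5, delay=3)
def fetch_html(url):
    # hypothetical helper wrapping the call from the question
    return urllib2.urlopen(url, timeout=30).read()

In the loop from the question you would then call html = fetch_html(url) instead of urllib2.urlopen(url).read().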

Jerome Anthony

You can use urllib2.urlopen()'s timeout argument (see Handling urllib2's timeout? - Python) together with a retry decorator.
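A rough sketch of that combination, written here as an explicit retry loop rather than a decorator; the timeout, number of attempts, and wait time are assumptions you would adjust:

import socket
import time
import urllib2

def fetch_with_timeout(url, timeout=10, attempts=5, wait=5):
    # hypothetical helper: fail each request quickly via the timeout argument,
    # then retry up to `attempts` times before giving up
    for attempt in range(attempts):
        try:
            return urllib2.urlopen(url, timeout=timeout).read()
        except (urllib2.URLError, socket.timeout) as e:
            print "attempt %d failed (%s), waiting %d seconds" % (attempt + 1, e, wait)
            time.sleep(wait)
    raise IOError("could not fetch %s after %d attempts" % (url, attempts))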

chyoo CHENG