
I need to gather some data from a website, specifically text for further analysis. Since I am no expert in web scraping, I have only done the first step of collecting the URLs of the documents I need. The problem is that sometimes I can download the documents, but sometimes I get a connection timeout error. So I want a way to keep trying until I get a response from the website. This is what I have:

import codecs
import urllib2
import html2text
from bs4 import BeautifulSoup

id = 1
with open("urls.txt") as f:
    for url in f:
        print url
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html, "html.parser")

        # save the extracted page text as documentos/<id>.txt
        with codecs.open("documentos/" + str(id) + ".txt", "w", "utf-8-sig") as temp:
            temp.write(soup.get_text())
        id += 1

where urls.txt contains the desired URLs, one per line. An example of a URL:

How can I achieve this? I could handle it if I only needed 10 documents, but I need more than 500, so I cannot do it manually.

In summary:

Sometimes I can get the documents and sometimes I cannot because of timeouts; I want Python to keep trying until it CAN get the document...

dpalma

2 Answers


You would have to structure your code a bit better by moving the site-fetching logic into its own function. Once you have that function, you can wrap it with a retry decorator.
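A minimal sketch of that idea, assuming a hypothetical helper fetch_html and placeholder values for the number of tries and the delay:

import socket
import time
import urllib2
from functools import wraps

def retry(tries=5, delay=3, backoff=2):
    # simple retry decorator with exponential backoff;
    # tries/delay/backoff are assumed defaults, tune them as needed
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            mtries, mdelay = tries, delay
            while mtries > 1:
                try:
                    return func(*args, **kwargs)
                except (urllib2.URLError, socket.timeout) as e:
                    print "%s, retrying in %d seconds..." % (e, mdelay)
                    time.sleep(mdelay)
                    mtries -= 1
                    mdelay *= backoff
            # last attempt: let any remaining error propagate
            return func(*args, **kwargs)
        return wrapper
    return decorator

@retry(tries=5, delay=3)
def fetch_html(url):
    # hypothetical helper wrapping the call from the question
    return urllib2.urlopen(url, timeout=30).read()

In the loop from the question you would then call html = fetch_html(url) instead of urllib2.urlopen(url).read().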

Jerome Anthony

You can use urllib2.urlopen()'s timeout argument (see Handling urllib2's timeout? - Python) together with a retry decorator.
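A rough sketch of that combination, written here as an explicit retry loop rather than a decorator; the timeout, number of attempts, and wait time are assumptions you would adjust:

import socket
import time
import urllib2

def fetch_with_timeout(url, timeout=10, attempts=5, wait=5):
    # hypothetical helper: fail each request quickly via the timeout argument,
    # then retry up to `attempts` times before giving up
    for attempt in range(attempts):
        try:
            return urllib2.urlopen(url, timeout=timeout).read()
        except (urllib2.URLError, socket.timeout) as e:
            print "attempt %d failed (%s), waiting %d seconds" % (attempt + 1, e, wait)
            time.sleep(wait)
    raise IOError("could not fetch %s after %d attempts" % (url, attempts))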

chyoo CHENG