I need to gather some data from a website; specifically, text for further analysis. Since I am no expert in web scraping, I have done the first step of collecting the URLs of the documents I need. The problem is that sometimes I can fetch the documents, but sometimes I get a connection timeout error. So I want a way to keep retrying until I get a response from the website. This is what I have:
import urllib2
import html2text
import codecs
from bs4 import BeautifulSoup

id = 1
with open("urls.txt") as f:
    for url in f:
        url = url.strip()  # drop the trailing newline from each line
        print url
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html, "html.parser")
        # write the plain text of the page to its own numbered file
        with codecs.open("documentos/" + str(id) + ".txt", "w", "utf-8-sig") as temp:
            temp.write(soup.get_text())
        id += 1
where urls.txt contains the desired URLs, one per line. An example of a URL:
How can I achieve this? I could handle it by hand if I only needed 10 documents, but I need more than 500, so doing it manually is not an option.
In summary:
Sometimes I can get the documents and sometimes I cannot because of a timeout; I want Python to keep retrying until it CAN get the document...
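For illustration, this is roughly the kind of retry wrapper I have in mind: a minimal sketch assuming a simple try/except around urlopen, with a pause between attempts and a cap on retries (the function name, max_retries, delay, and timeout values here are just placeholders I made up, not something I already have working):

import socket
import time
import urllib2

def fetch_with_retry(url, max_retries=5, delay=10):
    # Try to download a URL, retrying on timeouts or other URL errors.
    # max_retries and delay are placeholder values, not requirements.
    for attempt in range(max_retries):
        try:
            return urllib2.urlopen(url, timeout=30).read()
        except (urllib2.URLError, socket.timeout) as e:
            print "attempt %d failed for %s: %s" % (attempt + 1, url, e)
            time.sleep(delay)  # wait a bit before trying again
    raise IOError("giving up on %s after %d attempts" % (url, max_retries))

The idea would be to replace the urllib2.urlopen(url).read() line in my loop with fetch_with_retry(url), but I am not sure this is the right approach.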