I have a problem and a question. The site www.listindiario.com redirects, and I can't scrape it with BeautifulSoup.
It redirects to the root, and I don't know how to scrape the home page, because the request is always redirected and urllib2
fails.
I want to access the home page, not the splash page. Any suggestions?
I understand that the code is not optimized, but I just want to know how to skip that redirect.
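import re
import urllib2
from httplib import responses
from urllib2 import URLError
# Assumed alias: the post calls bs4(...) directly, so BeautifulSoup is imported under that name.
from bs4 import BeautifulSoup as bs4

# Assumed: hdr is not shown in the post; a browser-like User-Agent is only a guess.
hdr = {'User-Agent': 'Mozilla/5.0'}
key = 'la'
htmlfile_test = urllib2.Request('http://www.listindiario.com', headers=hdr)
try:
    htmlfile = urllib2.urlopen(htmlfile_test)
    soup = bs4(htmlfile)
    print soup
except URLError as e:
    # HTTPError (a URLError subclass) carries .code; a plain URLError only has .reason.
    if hasattr(e, 'code'):
        print 'The server could not complete the response.'
        print 'Error code: ', e.code
        if e.code in responses:
            print 'Reason: ', responses[e.code]
    elif hasattr(e, 'reason'):
        print 'Could not get a response from the server.'
        print 'Reason: ', e.reason
else:
    # Runs only when the request succeeded, so htmlfile and soup are defined here.
    print 'URL: ', htmlfile.geturl()
    for resultado in soup.find_all('a', href=True, text=re.compile(key)):
        print "Found ! <>", resultado['href']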
All I get back is a page that says: Object moved to here.
Why use urllib2 instead of requests? – papabomay Nov 27 '14 at 19:31
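One possible cause, and this is only an assumption since nothing in the thread confirms it, is that the splash page sets a cookie and the server keeps redirecting until the client sends that cookie back. A plain urlopen call follows redirects but does not keep cookies between them. Here is a minimal sketch of a cookie-aware opener using only the standard library; make_cookie_opener and fetch are hypothetical helper names, and the imports fall back to the Python 2 module names used in the question.

```python
try:  # Python 3 module names
    from http.cookiejar import CookieJar
    from urllib.request import HTTPCookieProcessor, Request, build_opener
except ImportError:  # Python 2, as in the question
    from cookielib import CookieJar
    from urllib2 import HTTPCookieProcessor, Request, build_opener


def make_cookie_opener():
    """Return (opener, jar). The opener stores cookies and resends them on redirects."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def fetch(url, headers=None):
    """Fetch url with cookies enabled and return (final_url, body_bytes).

    Assumption: the redirect loop is cookie-based. If it is not, this
    behaves like a normal urlopen call and the same page comes back.
    """
    opener, _ = make_cookie_opener()
    request = Request(url, headers=headers or {})
    response = opener.open(request, timeout=10)
    try:
        return response.geturl(), response.read()
    finally:
        response.close()
```

It could be called as fetch('http://www.listindiario.com', headers=hdr), and the returned body passed to BeautifulSoup in place of htmlfile. Comparing the returned final URL with the requested one shows whether a redirect was actually followed.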