The URL www.listindiario.com has a redirect, and I can't scrape it with BeautifulSoup. It redirects to the root, so I don't know how to scrape the home page: the request is always redirected and urllib2 fails.

I want to access the home page and not the splash page. Any suggestions?

I understand that the code is not optimized, but I just want to know how to skip that redirect.

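import re
import urllib2
from httplib import responses
from urllib2 import URLError
from bs4 import BeautifulSoup as bs4

key = 'la'
# Assumed placeholder: the original headers dict was not shown in the question.
hdr = {'User-Agent': 'Mozilla/5.0'}

htmlfile_test = urllib2.Request('http://www.listindiario.com', headers=hdr)

try:
    htmlfile = urllib2.urlopen(htmlfile_test)
    soup = bs4(htmlfile)

    print soup

except URLError as e:
    if hasattr(e, 'reason'):
        print 'Trouble getting a response from the server.'

    # Only HTTPError carries a code, so check that it exists before using it.
    if hasattr(e, 'code'):
        if e.code in responses:
            print 'Reason: ', responses[e.code]
        else:
            print 'The server could not complete the request.'
            print 'Error code: ', e.code

else:
    # Runs only when the request succeeded, so htmlfile and soup are defined.
    print 'URL: ', htmlfile.geturl()

    for resultado in soup.find_all('a', href=True, text=re.compile(key)):
        print "Found! <>", resultado['href']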
Jamie Bull
papabomay

1 Answer

I'd suggest using the requests module instead of urllib2. You can then use:

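import requests
from bs4 import BeautifulSoup as bs4

# allow_redirects=False returns the redirect response itself instead of following it
r = requests.get('http://www.listindiario.com', allow_redirects=False)
soup = bs4(r.text)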
Jamie Bull
  • Hmm, that just returns "Object moved" ("Object moved to here."). Why use requests instead of urllib2? – papabomay Nov 27 '14 at 19:31
  • `requests` wraps `urllib3` in an easy to use package which exposes things like the `allow_redirects` parameter. Take a look at [this answer](http://stackoverflow.com/a/14804320/1706564). – Jamie Bull Nov 28 '14 at 16:08
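
To make the "Object moved" result above easier to interpret, here is a minimal sketch of what happens when a client does and does not follow a redirect. It uses only the standard library in Python 3 (`urllib.request`, the successor to `urllib2`) and a throwaway local server, so everything about the server is an assumption: `SplashHandler`, the `/splash` path, and the page bodies are hypothetical stand-ins and do not describe how www.listindiario.com actually behaves.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SplashHandler(BaseHTTPRequestHandler):
    """Hypothetical site: "/" answers 302 and points at "/splash"."""

    def do_GET(self):
        if self.path == "/":
            body = b"Object moved"
            self.send_response(302)
            self.send_header("Location", "/splash")
        else:
            body = b"<a href='/la-republica'>la republica</a>"
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Declines every redirect, so urllib raises HTTPError for the 3xx reply."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch(url, follow_redirects):
    """Return (status, location, body).

    location is the final URL when redirects are followed, or the raw
    Location header when they are not.
    """
    # ProxyHandler({}) ignores proxy environment variables for this local demo.
    handlers = [urllib.request.ProxyHandler({})]
    if not follow_redirects:
        handlers.append(NoRedirect)
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=5) as resp:
            return resp.status, resp.geturl(), resp.read()
    except urllib.error.HTTPError as err:
        # The unfollowed 3xx response arrives as an HTTPError with a body.
        body = err.read()
        err.close()
        return err.code, err.headers.get("Location"), body


server = HTTPServer(("127.0.0.1", 0), SplashHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d/" % server.server_port

followed = fetch(base, follow_redirects=True)
raw = fetch(base, follow_redirects=False)

server.shutdown()
server.server_close()

print("followed:", followed)
print("not followed:", raw)
```

In this sketch, turning redirects off does not yield the page behind the redirect. It yields the redirect response itself, whose short body is what shows up as "Object moved", and the `Location` header tells you where the server wanted to send you. Whether the real site can be made to serve its home page directly (for example with particular headers or cookies) is not something this example can establish.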