
I'm trying to open multiple pages using urllib2. The problem is that some pages can't be opened: urllib2 raises urllib2.HTTPError: HTTP Error 400: Bad Request.

I'm getting the hrefs of these pages from another web page (the head of that page declares charset="utf-8"). The error occurs only when I try to open a page whose URL contains 'č', 'ž' or 'ř'.

Here is the code:

import urllib2
from bs4 import BeautifulSoup

def getSoup(url):
    req = urllib2.Request(url)

    response = urllib2.urlopen(req)
    page = response.read()
    soup = BeautifulSoup(page, 'html.parser')
    return soup




hovienko = getSoup("http://www.hovno.cz/hovna-az/a/1/")
lis = hovienko.find("div", class_="span12").find('ul').findAll('li')

for liTag in lis:

    aTag = liTag.find('a')['href']
    href = "http://www.hovno.cz"+aTag  # hrefs I'm trying to open using urllib2
    soup = getSoup(href.encode("iso-8859-2"))  # the error occurs here when 'č', 'ž' or 'ř' is in the URL

Does anybody know what I have to do to avoid these errors?

Thank you

Milano

3 Answers


This site is UTF-8, so why do you need href.encode("iso-8859-2")? I have taken the following code from http://programming-review.com/beautifulsoasome-interesting-python-functions/

import urllib2
import cgitb
cgitb.enable()
from BeautifulSoup import BeautifulSoup
from urlparse import urlparse

# print all links
def PrintLinks(localurl):
    data = urllib2.urlopen(localurl).read()
    print 'Encoding of fetched HTML : %s' % type(data)
    soup = BeautifulSoup(data)
    parse = urlparse(localurl)
    localurl = parse[0] + "://" + parse[1]
    print "<h3>Page links statistics</h3>"
    l = soup.findAll("a", attrs={"href":True})
    print "<h4>Total links count = " + str(len(l)) + '</h4>'
    externallinks = [] # external links list
    for link in l:
        # if it's an external link
        if link['href'].find("http://") == 0 and link['href'].find(localurl) == -1:
            externallinks = externallinks + [link]
    print "<h4>External links count = " + str(len(externallinks)) + '</h4>'


    if len(externallinks) > 0:
        print "<h3>External links list:</h3>"
        for link in externallinks:
          if link.text != '':
            print '<h5>' + link.text.encode('utf-8')
            print ' => [' + '<a href="' + link['href'] + '" >' + link['href'] + '</a>' +  ']' + '</h5>'
          else:
            print '<h5>' + '[image]',
            print ' => [' + '<a href="' + link['href'] + '" >' + link['href'] + '</a>' +  ']' + '</h5>'


PrintLinks( "http://www.zlatestranky.cz/pro-mobily/")
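As an aside on the href.encode("iso-8859-2") call in the question: the same Czech character yields different bytes under ISO-8859-2 and UTF-8, and neither byte sequence is valid in a URL until it is percent-encoded. A minimal sketch (Python 3 syntax; the byte values are the same in Python 2):

```python
# 'č' (U+010D) under the two encodings in play.
czech = u'\u010d'

utf8_bytes = czech.encode('utf-8')         # two bytes
latin2_bytes = czech.encode('iso-8859-2')  # one byte

print(utf8_bytes)    # b'\xc4\x8d'
print(latin2_bytes)  # b'\xe8'

# A UTF-8 site expects the UTF-8 bytes, percent-encoded as %C4%8D;
# sending the single ISO-8859-2 byte names something else entirely.
```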
minskster

The solution was very simple. I should have used urllib2.quote().

EDITED CODE:

for liTag in lis:

    aTag = liTag.find('a')['href']
    href = "http://www.hovno.cz"+urllib2.quote(aTag.encode("utf-8"))
    soup = getSoup(href)
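For illustration, a minimal sketch of what the quote() call does to such a path (shown with Python 3's urllib.parse.quote, which is the same function urllib2.quote refers to in Python 2; '/' is left unescaped by default):

```python
from urllib.parse import quote  # Python 2: urllib2.quote

path = u'/hovna-az/\u010d/1/'   # an href containing 'č'

# Encode to UTF-8 bytes first, then percent-encode; '/' stays as-is.
href = "http://www.hovno.cz" + quote(path.encode('utf-8'))
print(href)  # http://www.hovno.cz/hovna-az/%C4%8D/1/
```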
Milano

Couple of things here.

First, your URIs can't contain non-ASCII characters. You have to percent-encode them. See this: How to fetch a non-ascii url with Python urlopen?

Secondly, save yourself a world of pain and use requests for HTTP stuff.
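Not from the original answer, but a small sketch of that suggestion, assuming the requests package is installed: building a PreparedRequest shows the final URL without sending anything over the network, and requests percent-encodes the non-ASCII path for you:

```python
import requests

# Build (but don't send) a GET for a URL containing 'č'.
req = requests.Request('GET', u'http://www.hovno.cz/hovna-az/\u010d/1/')
prepared = req.prepare()

print(prepared.url)  # the 'č' comes back percent-encoded as %C4%8D
```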

Bobby Russell
  • Thank you Bobby, I've already solved it using urllib2.quote(). Could you tell me how using requests could help me with my code, for example the code above? Thank you – Milano Sep 09 '14 at 16:11
  • The requests library has a much better API. See [this example](https://gist.github.com/kennethreitz/973705) for a hint at what I'm talking about. – Bobby Russell Sep 09 '14 at 17:09