
I am trying to make a crawler in Python by following a Udacity course. I have this method get_page() which returns the content of the page.

from urllib.request import urlopen

def get_page(url):
    '''
    Open the given url and return the content of the page.
    '''

    data = urlopen(url)
    html = data.read()
    return html.decode('utf8')

The original method just returned data.read(), but that way I could not do operations like str.find(). After a quick search I found out that I need to decode the data. But now I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I have found similar questions on SO, but none of them addressed this specifically. Please help.

Sayantan Das

3 Answers


You are trying to decode bytes that are not valid UTF-8.

The first byte of a valid UTF-8 sequence is either in the range 0x00 to 0x7F (a single-byte character) or in the range 0xC2 to 0xF4 (the lead byte of a multi-byte sequence). 0x8B falls in the continuation-byte range 0x80 to 0xBF, so it can never start a sequence. From RFC 3629 Section 3:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number.

You should post the bytes you are trying to decode.
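
For example, a single 0x8B byte on its own already reproduces the error:

>>> bytes([0x8b]).decode('utf8')
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 0: invalid start byte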

Rei
  • This is actually a web page. For example, if I pass http://google.co.in as `url` then I get this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 9862: invalid start byte – Sayantan Das Dec 18 '16 at 09:11

Maybe the page is encoded with a character encoding other than 'utf-8', so the start byte is invalid for that codec. You could do this:

import urllib.request

def get_page(url):
    if url is None:
        return None
    response = urllib.request.urlopen(url)
    if response.getcode() != 200:
        print("HTTP code:", response.getcode())
        return None
    data = response.read()  # read the body once; a second read() would return empty bytes
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data  # decoding failed, return the raw bytes instead
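
Note that with this approach the caller gets back either str or bytes, so it has to check which one it received. A rough usage sketch (the URL is just the example from the comments above, and latin-1 is an assumption for the fallback):

html = get_page('http://google.co.in')
if html is not None and isinstance(html, bytes):
    # utf-8 decoding failed, so we still have raw bytes here;
    # latin-1 will decode any byte values, though not necessarily correctly
    html = html.decode('latin-1')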
Qi Liu

Web servers often serve HTML pages with a Content-Type header that includes the encoding used to encode the page. The header might look like this:

Content-Type: text/html; charset=UTF-8

We can inspect this header to find the encoding to use when decoding the page:

from urllib.request import urlopen

def get_page(url):
    """Open the given url and return the content of the page."""
    data = urlopen(url)
    content_type = data.headers.get('content-type', '')
    print(f'{content_type=}')
    encoding = 'latin-1'
    if 'charset' in content_type:
        _, _, encoding = content_type.rpartition('=')
        print(f'{encoding=}')
    html = data.read()
    return html.decode(encoding)

Using requests is similar:

response = requests.get(url)
content_type = response.headers.get('content-type', '')
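
requests also parses the charset for you, so when the server sends one you can usually rely on response.encoding and response.text rather than decoding manually:

import requests

response = requests.get(url)
print(response.encoding)  # encoding requests derived from the Content-Type header
html = response.text      # the body already decoded with that encoding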

Latin-1 (or ISO-8859-1) is a safe default: it will always decode any bytes (though the result may not be useful).

If the server doesn't serve a content-type header you can try looking for a <meta> tag that specifies the encoding in the HTML. Or pass the response bytes to Beautiful Soup and let it try to guess the encoding.
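
A minimal sketch of that last approach, assuming beautifulsoup4 is installed (Beautiful Soup inspects any <meta> charset declaration and the byte content to guess the encoding):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_page(url):
    raw = urlopen(url).read()                 # undecoded bytes
    soup = BeautifulSoup(raw, 'html.parser')  # let Beautiful Soup guess the encoding
    print(soup.original_encoding)             # the encoding it settled on
    return str(soup)                          # the document re-serialized as str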

snakecharmerb