
I'm trying to write a scraper, but I'm having issues with encoding. When I tried to copy the string I was looking for into my text file, Python 2.7 told me it didn't recognize the encoding, even though it contained no special characters. I don't know if that's useful info.

My code looks like this:

from urllib import FancyURLopener
import os

class MyOpener(FancyURLopener): # spoofs a real browser on Windows
   version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

print "What is the webaddress?"
webaddress = raw_input("8::>")

print "Folder Name?"
foldername = raw_input("8::>")

if not os.path.exists(foldername):
    os.makedirs(foldername)

def urlpuller(start, page):
   while page[start]!= '"':
      start += 1
   close = start
   while page[close]!='"':
      close += 1
   return page[start:close]

myopener = MyOpener()

response = myopener.open(webaddress)
site = response.read()

nexturl = ''
counter = 0

while(nexturl!=webaddress):
   counter += 1
   start = 0
   
   for i in range(len(site)-35):
       if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
         start = i + 40
         break
   else:
      print "Something's broken, chief. Error = 1"
   
   next = 0
   
   for i in range(start, 8, -1):
      if site[i:i+8] == u'<a href=':
         next = i
         break
   else:
      print "Something's broken, chief. Error = 2"
   
   nexturl = urlpuller(next, site)
   
   myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')

print("Retrieval of "+foldername+" completed.")

When I run it against the site I'm scraping, it returns this error:

Traceback (most recent call last):
  File "yada/yadayada/Python/scraper.py", line 37, in <module>
    if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

When pointed at http://google.com, it worked just fine. The page I'm scraping declares UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

but when I try to decode the page as UTF-8, as you can see, it does not work.

Any suggestions?

user3701032

4 Answers

site[i:i+35].decode('utf-8')

You cannot slice the bytes you've received at an arbitrary position and then ask UTF-8 to decode the slice. UTF-8 is a multibyte encoding: a single character can take anywhere from 1 to 4 bytes. If your slice cuts one of those multibyte sequences in half, decoding it raises exactly the "unexpected end of data" error you're seeing.

Look into a tool that handles this for you. BeautifulSoup and lxml are two alternatives.
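
A small, self-contained illustration of the problem (Python 2.7); the string here is just an example, not taken from the asker's page:

data = u'caf\xe9'.encode('utf-8')     # "cafe" with an accented e; UTF-8 needs two bytes (0xc3 0xa9) for it

print repr(data.decode('utf-8'))      # fine: the byte string contains only complete characters

try:
    data[:4].decode('utf-8')          # this slice cuts the accented e in half (only the 0xc3 byte remains)
except UnicodeDecodeError as e:
    print e                           # 'utf8' codec can't decode byte 0xc3 ... unexpected end of data

Decoding the complete byte string works; decoding a slice that ends mid-character fails with the same 0xc3 error shown in the traceback above.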

Martin Konecny
  • Is there a way to do the decoding myself or is that much more complicated? – user3701032 Jun 02 '14 at 23:11
  • You would need some kind of stream utf8 decoder so that you know when you can break off your string. Alternatively you can decode the whole page at once (don't split up your string) – Martin Konecny Jun 02 '14 at 23:42
  • Take a look here for a streamdecoder http://mikehadlow.blogspot.ca/2012/07/reading-utf-8-characters-from-infinite.html?m=1 – Martin Konecny Jun 02 '14 at 23:49
  • I'm trying to use BeautifulSoup now. What would I do to find the `img` with the ID `imgSized`? – user3701032 Jun 03 '14 at 00:00
  • I'm able to search `img`, but I'm not sure why it's having problems with the tags. I was able to isolate the image I need, but ideally I'd like to be able to search for the link associated with the mouse over text as well. – user3701032 Jun 03 '14 at 00:52
  • This should help http://stackoverflow.com/questions/11696745/beautifulsoup-extract-img-alt-data – Martin Konecny Jun 03 '14 at 01:30
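
A minimal sketch of the BeautifulSoup approach discussed in the comments above (Python 2.7 with the bs4 package). The id imgSized comes from the question; the src/alt attribute names and the enclosing <a> tag are assumptions about the page's markup:

from bs4 import BeautifulSoup   # pip install beautifulsoup4

soup = BeautifulSoup(site, 'html.parser')   # bs4 handles decoding the raw bytes for you

img = soup.find('img', id='imgSized')       # the id from the question
if img is not None:
    print img.get('src')                    # image URL, assuming a src attribute
    print img.get('alt')                    # mouse-over text, if the page puts it in alt (or try 'title')
    link = img.find_parent('a')             # nearest enclosing <a href=...>, if any
    if link is not None:
        print link.get('href')
else:
    print "No img with id 'imgSized' on this page"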

Open the file in Sublime Text and use "Save with Encoding" -> UTF-8.

ssareen

Instead of your for-loop, do something like:

start = site.decode('utf-8').find('<img id="imgSized" class="slideImg"') + 40
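
One caveat, as a sketch under the same assumptions (the marker string and the +40 offset come from the question's code): str.find() returns -1 when the marker isn't on the page, so it's worth checking before adding the offset.

text = site.decode('utf-8')                # decode the whole page once
pos = text.find('<img id="imgSized" class="slideImg"')
if pos == -1:
    print "Something's broken, chief. Error = 1"   # marker not on this page
else:
    start = pos + 40                       # same offset the original loop used
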
Daniel
site[i:i+35].decode('utf-8', errors='ignore')

Passing errors='ignore' makes the decoder silently drop any bytes it cannot decode instead of raising UnicodeDecodeError.
Xiaobing Mi