
I'm trying to write a scraper, but I'm having issues with encoding. When I tried to copy the string I was looking for into my text file, Python 2.7 told me it didn't recognize the encoding, even though it contained no special characters. I don't know if that's useful info.

My code looks like this:

from urllib import FancyURLopener
import os

class MyOpener(FancyURLopener): # spoofs a real browser on Windows
   version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

print "What is the webaddress?"
webaddress = raw_input("8::>")

print "Folder Name?"
foldername = raw_input("8::>")

if not os.path.exists(foldername):
    os.makedirs(foldername)

def urlpuller(start, page):
   while page[start]!= '"':
      start += 1
   close = start
   while page[close]!='"':
      close += 1
   return page[start:close]

myopener = MyOpener()

response = myopener.open(webaddress)
site = response.read()

nexturl = ''
counter = 0

while(nexturl!=webaddress):
   counter += 1
   start = 0
   
   for i in range(len(site)-35):
       if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
         start = i + 40
         break
   else:
      print "Something's broken, chief. Error = 1"
   
   next = 0
   
   for i in range(start, 8, -1):
      if site[i:i+8] == u'<a href=':
         next = i
         break
   else:
      print "Something's broken, chief. Error = 2"
   
   nexturl = urlpuller(next, site)
   
   myopener.retrieve(urlpuller(start,site),foldername+'/'+foldername+str(counter)+'.jpg')

print("Retrieval of "+foldername+" completed.")

When I run it against the site I'm scraping, it returns this error:

Traceback (most recent call last):
  File "yada/yadayada/Python/scraper.py", line 37, in <module>
    if site[i:i+35].decode('utf-8') == u'<img id="imgSized" class="slideImg"':
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 34: unexpected end of data

When pointed at http://google.com, it worked just fine. The page I'm scraping declares UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

but when I try to decode the page as UTF-8, as you can see, it does not work.

Any suggestions?

user3701032

4 Answers

site[i:i+35].decode('utf-8')

You cannot slice the bytes you've received at an arbitrary position and then ask UTF-8 to decode the slice. UTF-8 is a multibyte encoding: a single character can take anywhere from 1 to 4 bytes. If your slice cuts one of those multibyte sequences in half, decoding it raises exactly the "unexpected end of data" error you're seeing.

Look into a tool that handles this for you. BeautifulSoup and lxml are two alternatives.
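
A small, self-contained illustration of the problem (Python 2.7); the string here is just an example, not taken from the asker's page:

data = u'caf\xe9'.encode('utf-8')     # "cafe" with an accented e; UTF-8 needs two bytes (0xc3 0xa9) for it

print repr(data.decode('utf-8'))      # fine: the byte string contains only complete characters

try:
    data[:4].decode('utf-8')          # this slice cuts the accented e in half (only the 0xc3 byte remains)
except UnicodeDecodeError as e:
    print e                           # 'utf8' codec can't decode byte 0xc3 ... unexpected end of data

Decoding the complete byte string works; decoding a slice that ends mid-character fails with the same 0xc3 error shown in the traceback above.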

Martin Konecny
  • Is there a way to do the decoding myself or is that much more complicated? – user3701032 Jun 02 '14 at 23:11
  • You would need some kind of stream utf8 decoder so that you know when you can break off your string. Alternatively you can decode the whole page at once (don't split up your string) – Martin Konecny Jun 02 '14 at 23:42
  • Take a look here for a streamdecoder http://mikehadlow.blogspot.ca/2012/07/reading-utf-8-characters-from-infinite.html?m=1 – Martin Konecny Jun 02 '14 at 23:49
  • I'm trying to use BeautifulSoup now. What would I do to find the `img` with the ID `imgSized`? – user3701032 Jun 03 '14 at 00:00
  • I'm able to search `img`, but I'm not sure why it's having problems with the tags. I was able to isolate the image I need, but ideally I'd like to be able to search for the link associated with the mouse over text as well. – user3701032 Jun 03 '14 at 00:52
  • This should help http://stackoverflow.com/questions/11696745/beautifulsoup-extract-img-alt-data – Martin Konecny Jun 03 '14 at 01:30
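
A minimal sketch of the BeautifulSoup approach discussed in the comments above (Python 2.7 with the bs4 package). The id imgSized comes from the question; the src/alt attribute names and the enclosing <a> tag are assumptions about the page's markup:

from bs4 import BeautifulSoup   # pip install beautifulsoup4

soup = BeautifulSoup(site, 'html.parser')   # bs4 handles decoding the raw bytes for you

img = soup.find('img', id='imgSized')       # the id from the question
if img is not None:
    print img.get('src')                    # image URL, assuming a src attribute
    print img.get('alt')                    # mouse-over text, if the page puts it in alt (or try 'title')
    link = img.find_parent('a')             # nearest enclosing <a href=...>, if any
    if link is not None:
        print link.get('href')
else:
    print "No img with id 'imgSized' on this page"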

Open the file in Sublime Text and use "Save with Encoding" -> UTF-8.

ssareen

Instead of your for-loop, do something like:

start = site.decode('utf-8').find('<img id="imgSized" class="slideImg"') + 40
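
One caveat, as a sketch under the same assumptions (the marker string and the +40 offset come from the question's code): str.find() returns -1 when the marker isn't on the page, so it's worth checking before adding the offset.

text = site.decode('utf-8')                # decode the whole page once
pos = text.find('<img id="imgSized" class="slideImg"')
if pos == -1:
    print "Something's broken, chief. Error = 1"   # marker not on this page
else:
    start = pos + 40                       # same offset the original loop used
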
Daniel
site[i:i+35].decode('utf-8', errors='ignore')

Passing errors='ignore' makes the decoder silently drop any bytes it cannot decode instead of raising UnicodeDecodeError.
Xiaobing Mi