
I am using Python + bs4 + PySide. Please look at the part of the code below:

#coding:gb2312
import urllib2
import sys
import urllib
import urlparse
import random
import time
from datetime import datetime, timedelta
import socket
from bs4 import BeautifulSoup
import lxml.html
from PySide.QtGui import *
from PySide.QtCore import *
from PySide.QtWebKit import *

def download(self, url, headers, proxy, num_retries, data=None):
    print 'Downloading:', url
    request = urllib2.Request(url, data, headers or {})
    opener = self.opener or urllib2.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        response = opener.open(request)
        html = response.read()
        code = response.code
    except Exception as e:
        print 'Download error:', str(e)
        html = ''
        if hasattr(e, 'code'):
            code = e.code
            if num_retries > 0 and 500 <= code < 600:
                # retry 5XX HTTP errors by calling this method again
                return self.download(url, headers, proxy, num_retries-1, data)
        else:
            code = None
    return {'html': html, 'code': code}

def crawling_hdf(openfile):
    filename = open(openfile, 'r')
    namelist = filename.readlines()
    app = QApplication(sys.argv)
    for name in namelist:
        url = "http://so.haodf.com/index/search?type=doctor&kw=" + urllib.quote(name)
        # get doctor's home page
        D = Downloader(delay=DEFAULT_DELAY, user_agent=DEFAULT_AGENT, proxies=None, num_retries=DEFAULT_RETRIES, cache=None)
        html = D(url)
        soup = BeautifulSoup(html)
        tr = soup.find(attrs={'class': 'docInfo'})
        td = tr.find(attrs={'class': 'docName font_16'}).get('href')
        print td
        # get doctor's detail information page
        loadPage_bs4(td)

    filename.close()

if __name__ == '__main__':
    crawling_hdf("name_list.txt")

After I run the program, a warning message is shown:

Warning (from warnings module):
  File "C:\Python27\lib\site-packages\bs4\dammit.py", line 231
    "Some characters could not be decoded, and were "
UnicodeWarning: Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

I have used print str(html) and found that all the Chinese text inside the tags is garbled.

I have tried the "decode or encode" and "gzip" solutions found on this site, but they don't work in my case.
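
For illustration, here is a minimal Python 2 sketch of checking what encoding the server actually declares (it uses the search URL from the code above; the checks themselves are illustrative only):

import urllib2

# Sketch: inspect the encoding declared by the server and by the page itself.
resp = urllib2.urlopen('http://so.haodf.com/index/search')
print resp.headers.getparam('charset')    # charset from the Content-Type header
head = resp.read(1024).lower()            # the <meta charset> tag is usually near the top
print 'gb2312' in head or 'gbk' in head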

Thank you very much for your help!

ma yang

2 Answers


It looks like that page is encoded in gbk. The problem is that there is no direct conversion between utf-8 and gbk (that I am aware of).

I've seen this workaround used before, try:

html.encode('latin-1').decode('gbk').encode('utf-8')
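
Why this works: latin-1 maps every byte value 0x00-0xFF to the same code point, so encoding the mis-decoded text back to latin-1 restores the original GBK bytes losslessly, and those bytes can then be decoded with the right codec. A minimal self-contained demonstration (Python 2; the sample string is mine, not taken from the page):

# "doctor" in Chinese, encoded to GBK bytes as the server would send them
gbk_bytes = u'\u533b\u751f'.encode('gbk')
mojibake = gbk_bytes.decode('latin-1')    # wrong codec -> garbled unicode
restored = mojibake.encode('latin-1')     # byte-for-byte the original
print restored.decode('gbk') == u'\u533b\u751f'   # True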
  • "No direct conversion" – with that, you mean *in Python*? There is nothing special about GBK, it's just another [regular encoding](https://en.wikipedia.org/wiki/GBK). Also: UTF8 is just a storage system (for Unicode), so even if there is no 'direct' conversion, you'd better call it by its proper name and leave UTF8 out of this. – Jongware Nov 30 '16 at 14:59
  • I'm not sure what you mean. My understanding of the problem is that there is a gbk encoded page that dammit.py is failing to convert to utf-8. I'm giving a workaround I've seen used using latin-1 as a "translator". Given the context I would say that "in Python" is a given. FWIW, if there is a better way to get from a to b here, I'm all for it! –  Nov 30 '16 at 15:05
  • This question would have been a duplicate if the one you found had an accepted answer ... Still, the comment "The detour over latin-1 is shocking" should have told you something. – Jongware Nov 30 '16 at 15:07
  • Are you suggesting a better solution, or what? –  Nov 30 '16 at 15:09
  • @RadLexus if you *already* have a Unicode string that was decoded improperly, this is the only way to fix it. The proper solution is to decode it correctly in the first place, but often that's impossible because of [Mojibake](https://en.wikipedia.org/wiki/Mojibake). It's not "shocking" that `latin-1` works because Unicode took Latin-1 as its base, so the first 256 codepoints map directly to a Latin-1 byte - the conversion is effectively a no-op. – Mark Ransom Nov 30 '16 at 18:23
  • Sorry, I did not mention that I have added #coding:gb2312 at the beginning of the code. By the way, your solution still does not work. – ma yang Dec 01 '16 at 06:45
  • After I use html.decode('latin-1'), there is no warning message anymore. The output is still garbled when I print(html), but it doesn't matter. Thanks a lot, everyone! – ma yang Dec 01 '16 at 07:07

GBK is one of the encodings built into Python's codecs module.

That means that anywhere you have a string of raw bytes, you can call its decode method with the appropriate codec name (or an alias) to convert it to a native Unicode string.

The following works (adapted from https://stackoverflow.com/q/36530578/2564301), insofar as the returned text does not contain 'garbage' or 'unknown' characters, and is indeed encoded differently from the source page (as verified by saving it as a new file and comparing the values of the Chinese characters).

from urllib import request

def scratch(url, encoding='gbk'):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = {'User-Agent': user_agent}
    req = request.Request(url, headers=headers)
    result = request.urlopen(req)
    page = result.read()              # raw bytes from the server
    u_page = page.decode(encoding)    # decode the GBK bytes to a str
    result.close()
    print(u_page)
    return u_page

page = scratch('http://so.haodf.com/index/search')
print (page)
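
If you want to keep using BeautifulSoup, you can also hand it the raw bytes and state the encoding up front via its from_encoding parameter, which bypasses the dammit.py guessing that produced the warning in the question. A sketch (untested against the live site, same user agent as above):

from urllib import request
from bs4 import BeautifulSoup

req = request.Request('http://so.haodf.com/index/search',
                      headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
raw = request.urlopen(req).read()   # undecoded bytes from the server
# from_encoding tells bs4 to decode as GBK instead of guessing
soup = BeautifulSoup(raw, 'lxml', from_encoding='gbk')
print(soup.title)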
Jongware