I want to save a string to a new .txt file.

The encoding of the string is 'utf-8' (I think so) and it contains some Chinese characters.

But the file's encoding is GB2312.

Here is my code, with some parts omitted:

# -*- coding:utf-8 -*-
# Python 3.4, Windows 7

def getUrl(self, url, coding='utf-8'):
    self.__reCompile = {}
    req = request.Request(url)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 UBrowser/5.5.9703.2 Safari/537.36')
    with request.urlopen(req) as response:
        return response.read().decode(coding)

def saveText(self,filename,content,mode='w'):
    self._checkPath(filename)
    with open(filename,mode) as f:
        f.write(content)

joke= self.getUrl(pageUrl)
#some re transform such as re.sub('<br>','\r\n',joke)
self.saveText(filepath+'.txt',joke,'a')

Sometimes there is a UnicodeEncodeError.

Alastair McCormack
ZuoTao.Chou
3 Answers

Your exception is thrown in 'saveText', but I can't see how you implemented it, so I'll try to reproduce the error and then suggest a fix.

In 'getUrl' you return a decoded string ( .decode('utf-8') ), and my guess is that in 'saveText' you forgot to encode it before writing to the file.

Reproducing the error

Trying to reproduce the error, I did this:

# String with unicode chars, decoded like in your example (Python 2 syntax)
s = 'æøå'.decode('utf-8')

# How saveText could be:
# Encode before write
f = open('test', mode='w')
f.write(s)
f.close()

This gives a similar exception:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-36-1309da3ad975> in <module>()
      5 # Encode before write
      6 f = open('test', mode='w')
----> 7 f.write(s)
      8 f.close()

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

Two ways of fixing

You can do either:

# String with unicode chars, decoded like in your example (Python 2 syntax)
s = 'æøå'.decode('utf-8')

# How saveText could be:
# Encode before write
f = open('test', mode='w')
f.write(s.encode('utf-8'))
f.close()

or you can try writing the file using the module 'codecs':

import codecs

# String with unicode chars, decoded like in your example (Python 2 syntax)
s = 'æøå'.decode('utf-8')

# How saveText could be:
f = codecs.open('test', encoding='utf-8', mode='w')
f.write(s)  
f.close()
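The snippets above use Python 2 idioms (calling .decode() on a str literal), while the question targets Python 3.4. As a rough sketch of how the same two approaches might look under Python 3, assuming throwaway file names in a temporary folder (the paths below are illustrative, not from the question):

```python
import codecs
import os
import tempfile

# In Python 3 a str literal is already Unicode text, so no .decode() is needed.
s = "æøå"
folder = tempfile.mkdtemp()  # scratch folder, assumed for the example

# Way 1: encode to bytes yourself and write them in binary mode.
path_bytes = os.path.join(folder, "test_bytes")
with open(path_bytes, mode="wb") as f:
    f.write(s.encode("utf-8"))

# Way 2: let codecs.open do the encoding on every write.
path_codecs = os.path.join(folder, "test_codecs")
with codecs.open(path_codecs, mode="w", encoding="utf-8") as f:
    f.write(s)

# Both files should now hold the same UTF-8 bytes.
with open(path_bytes, "rb") as f:
    data_bytes = f.read()
with open(path_codecs, "rb") as f:
    data_codecs = f.read()
print(data_bytes == data_codecs)
```

Either way the bytes on disk are UTF-8 regardless of the system's default code page.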

Hope this helps.

Andreas Ryge

The encoding of the string is 'utf-8' (I think so) and it contains some Chinese characters.

You've decoded the response from the remote server using UTF-8. Once it's decoded to a Python string, it's no longer encoded; it is effectively stored as Unicode code points in memory.

The error you're getting is because Python is trying to use your codepage to convert the string to bytes. Due to your Windows region settings, it's chosen GBK, which doesn't support all of the Unicode characters.
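To make that failure concrete, here is a minimal sketch. The sample text is made up for illustration: it mixes Chinese characters with a non-breaking space (U+00A0), which often appears in scraped HTML and which the GBK codec cannot encode. The locale call shows the default that open() typically falls back to when no encoding is given.

```python
import locale

# Hypothetical sample: Chinese text plus a non-breaking space (U+00A0).
text = "中文笑话\u00a0end"

# The encoding open() typically falls back to when none is specified.
default_encoding = locale.getpreferredencoding(False)
print("default encoding:", default_encoding)

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = text.encode("utf-8")

# GBK covers Chinese characters but not U+00A0, so this raises.
try:
    text.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError as exc:
    gbk_failed = True
    print("gbk failed:", exc.reason)
```

On a Chinese-locale Windows machine the printed default would likely be a GBK-family code page, which would explain why the error only shows up for some pages.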

To save, you simply need to open the output file with a specified encoding, using the encoding argument to open() in Python 3 (in Python 2, use io.open() instead). In your case, "UTF-8" may be an appropriate encoding to use.

Your saveText() method needs to be updated to:

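def saveText(self, filename, content, mode='w', encoding="utf-8"):
    self._checkPath(filename)
    # encoding must be passed by keyword: the third positional argument of open() is buffering
    with open(filename, mode, encoding=encoding) as f:
        f.write(content)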
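As a standalone sketch of the same idea that can be run outside the class, the function below (save_text is a hypothetical name, and the temporary path is assumed for the example) writes and appends with an explicit encoding, then reads the file back:

```python
import os
import tempfile


def save_text(filename, content, mode="w", encoding="utf-8"):
    """Write text using an explicit encoding instead of the system default."""
    # encoding is given by keyword because open()'s third positional
    # parameter is buffering, not encoding.
    with open(filename, mode, encoding=encoding) as f:
        f.write(content)


# Scratch location, assumed for the example.
path = os.path.join(tempfile.mkdtemp(), "joke.txt")

save_text(path, "第一行\n")
save_text(path, "second line\u00a0\n", mode="a")  # append, as in the question

# Read back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip == "第一行\nsecond line\u00a0\n")
```

Anything that later reads the file needs to use the same encoding, since nothing in a plain .txt file records which one was used.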

You may run into an issue with your HTTP data. You're assuming the remote content is UTF-8 when you decode the response. This won't always be the case. You could analyse the HTTP response headers to get the right encoding, or use the Requests library, which does this for you. Your URL getter would look like:

import requests

def getUrl(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 UBrowser/5.5.9703.2 Safari/537.36'}
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # Throw an exception on HTTP errors
    return response.text
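For the header-analysis route without Requests, a minimal sketch using only the standard library might look like the following. The helper names and the UTF-8 fallback are assumptions for illustration. With urllib, the same lookup should be available directly as response.headers.get_content_charset().

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset in lower case, or `default` if none is declared.
    return msg.get_content_charset(default)


def decode_body(raw_bytes, content_type):
    """Decode a response body using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Simulated response from a server that declares GB2312 (made-up data).
body = "中文".encode("gb2312")
decoded = decode_body(body, "text/html; charset=GB2312")
print(decoded == "中文")
```

Some servers omit the charset or declare the wrong one, so the fallback is a guess and the result may still need checking against the page's own meta tags.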
Alastair McCormack

I think the encoding your terminal is using doesn't support that character. Python is handling it just fine; it's your output encoding that cannot handle it.

See also this question

Community
GiftZwergrapper