18

I am trying to write some strings to a file (the strings have been given to me by the HTML parser BeautifulSoup).

I can use "print" to display them, but when I use file.write() I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 6: ordinal not in range(128)

How can I parse this?

Ivy

3 Answers

16

This error occurs when you pass a Unicode string containing non-ASCII characters (code points above 127) to something that expects an ASCII bytestring. In Python 2, the default encoding used for implicit conversion to a bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert characters with code points above 127 produces the error.
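The failure can be reproduced without any file at all. Here is a minimal sketch, written to run on both Python 2 and Python 3; the variable names and the sample string are only illustrative. It shows that encoding the pound sign as ASCII raises the error from the question, while UTF-8 handles it.

```python
price = u'Price \xa3123'  # u'\xa3' is the pound sign, code point 163

try:
    price.encode('ascii')  # ASCII only covers code points 0-127
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every Unicode code point, so this succeeds.
utf8_bytes = price.encode('utf-8')
```

The same thing happens implicitly in Python 2 when a Unicode string reaches a byte-oriented `write()`: the string is encoded as ASCII behind the scenes and fails at the first character above 127.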

The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings.

The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, ASCII is used for the conversion, so characters greater than 127 are treated as errors.

Going the other way, call encode() with an explicit encoding to turn a Unicode string into bytes. For example:

s = u'La Pe\xf1a' 
print s.encode('latin-1')

or, when writing to an open file object f:

f.write(s.encode('latin-1'))

Both encode the string as Latin-1 before it is output.
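As a rough end-to-end sketch of the file case: encode the string yourself and write the resulting bytes to a file opened in binary mode. The temporary path and the name `data` are assumptions made for illustration, not part of the original question.

```python
import os
import tempfile

s = u'La Pe\xf1a'

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# Open in binary mode so the file accepts the encoded bytes as-is.
with open(path, 'wb') as f:
    f.write(s.encode('latin-1'))

# Read the raw bytes back to inspect what actually landed on disk.
with open(path, 'rb') as f:
    data = f.read()
```

Latin-1 stores each of these characters in a single byte. If the text may contain characters outside Latin-1, an encoding such as UTF-8 is the safer choice.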

yossi
  • The string it's outputting is a price like "£123" – Ivy Aug 04 '11 at 10:25
  • which is not valid ASCII. The pound sign is char code 163, outside of the ASCII range of 127. – Daniel Roseman Aug 04 '11 at 10:28
  • You must specify an encoding that can encode those characters. Files do not contain characters; they contain bytes. Encodings convert characters to bytes. – Karl Knechtel Aug 04 '11 at 10:29
  • 2
    Yes, when I say "you must do this" I understand perfectly that you aren't doing it yet. That's why you must do it: to fix the problem you describe. `write()` doesn't "understand Unicode" because (a) files do not contain characters, but bytes; and (b) there **is more than one way to do the encoding** and there is no particularly good way for it to choose on your behalf. Well, actually, it does: it picks the simplest possible encoding, that only handles the few characters that everyone agrees upon, so that an error comes up if anything special is required. – Karl Knechtel Aug 04 '11 at 11:08
2

The answer to your question is "use codecs". The appended code also shows some gettext magic, FWIW. http://wiki.wxpython.org/Internationalization

import codecs
import gettext

import wx  # needed for wx.Locale and wx.LANGUAGE_DEFAULT below

localedir = './locale'
langid = wx.LANGUAGE_DEFAULT # use OS default; or use LANGUAGE_JAPANESE, etc.
domain = "MyApp"             
mylocale = wx.Locale(langid)
mylocale.AddCatalogLookupPathPrefix(localedir)
mylocale.AddCatalog(domain)

translater = gettext.translation(domain, localedir, 
                                 [mylocale.GetCanonicalName()], fallback = True)
translater.install(unicode = True)

# translater.install() installs the gettext _() translater function into our namespace...

msg = _("A message that gettext will translate, probably putting Unicode in here")

# use codecs.open() so that Unicode strings are encoded as UTF-8 on write
# (logfile_name is assumed to hold the path of the output file)

Logfile = codecs.open(logfile_name, 'w', encoding='utf-8')

Logfile.write(msg + '\n')
Logfile.close()

Despite Google being full of hits on this problem, I found it rather hard to find this simple solution (it is actually in the Python docs about Unicode, but rather buried).
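Stripped of the wx and gettext parts, the codecs approach comes down to something like the sketch below. The message text and the temporary log path are stand-ins chosen for illustration; only `codecs.open()` with an `encoding` argument is the point.

```python
import codecs
import os
import tempfile

msg = u'\xa3123'  # stand-in for text obtained from BeautifulSoup

# Hypothetical log file path in a temporary directory.
logfile_name = os.path.join(tempfile.mkdtemp(), 'app.log')

# codecs.open() returns a file object that encodes Unicode on write.
with codecs.open(logfile_name, 'w', encoding='utf-8') as logfile:
    logfile.write(msg + u'\n')

# Reading through the same codec decodes the bytes back to Unicode.
with codecs.open(logfile_name, 'r', encoding='utf-8') as logfile:
    round_trip = logfile.read()
```

Because the file object does the encoding, the calling code can keep working with Unicode strings and never has to call encode() itself.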

So ... HTH...

GaJ

GreenAsJade
  • 1
    "Simple"? That's also showing a bunch of i18n machinery that OP doesn't care about - he's not trying to make sure that people see text in the right language, he's trying to grab text in a specific language from a specific source and put it in a file. So the only relevant part of your snippet is the first line and the last two, really. As for "hard to find", really? What did you Google for? I tried `UnicodeEncodeError: 'ascii' codec can't encode character`; the results seem helpful enough... – Karl Knechtel Aug 04 '11 at 11:13
1

I tried this and it works fine (the encoding argument to open() requires Python 3):

with open(r"C:\rag\sampleoutput.txt", 'w', encoding="utf-8") as f:
    f.write(text)  # text: the string to write (the body was omitted in the original snippet)
Ivy