Python - Change string to utf8

Question

I am trying to write Portuguese to an HTML file but I am getting some funny characters. How do I fix this?

first = """<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) 
f.write(first)

Expected Output: Hoje, nós nos unimos ao povo...

Actual Output in browser (Firefox on Ubuntu): ï»¿Hoje, nÃ³s nos unimos ao povo...

I tried doing this:

first = """<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) 
f.write(first.encode('utf8'))

Output in terminal: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 65: ordinal not in range(128)

Why am I getting this error and also how can I write other languages to an HTML doc without the funny characters?
Or, is there a different file type that I can write to with the above font formatting?

http://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte — liuzhidong, Mar 17 '15 at 14:42

score 1 · Answer 1 · answered Mar 17 '15 at 13:55

Your format string should be a Unicode string too:

first = u"""<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) 
f.write(first)

Selcuk

I am still getting this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128) – Mar 17 '15 at 14:02
Where does your `sentences1` list come from? Can you post the code for it too? – Selcuk Mar 17 '15 at 14:13
@kedar Traceback (most recent call last): File "p.py", line 80, in <module> first = u"""<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128) – Mar 17 '15 at 14:43
@selcuk the sentences1 list is derived from a file. Each sentence is read and stored in the list. My code works perfectly on English text. If I try to write a different language, I get funny symbols. So I tried to change the codec, and then I get the errors. – Mar 17 '15 at 14:46
Do you decode according to the encoding of the file when reading from it? – Kedar Mar 17 '15 at 14:50
Also, when I print to the terminal, all the Portuguese letters come out fine – Mar 17 '15 at 14:52
@kedar - I don't decode anything. This is the code for the source file: f = open('source1.txt', 'r') - text1 = f.read() – Mar 17 '15 at 14:55

score 0 · Answer 2 · answered Mar 17 '15 at 15:46

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

^ Read it!

This is what happens in Python 2 when you call .format on a unicode string and pass it text read from a file that contains non-ASCII characters:

>>> mystrf = u'special text here >> {} << special text'
>>> g = open('u.txt','r')
>>> lines = g.readlines()
>>> mystrf.format(lines[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>>

Python tries to decode the bytes from the file as ASCII, which fails on the first non-ASCII byte. So how do we fix that?

We simply tell Python the proper encoding:

>>> line = mystrf.format(lines[0].decode('utf-8'))
>>> print line
special text here >> ß << special text
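
The implicit step behind the traceback above can be made explicit. This is a minimal sketch, assuming the file holds UTF-8 bytes; the sample bytes are an illustration, not the asker's data. It is written to run on Python 3, where bytes and text are never mixed silently, so the decode has to be spelled out:

```python
# UTF-8 bytes for the character used in the example above (U+00DF).
raw = b"\xc3\x9f"

# Python 2 performs an ASCII decode like this one implicitly when a
# unicode format string meets a byte string. Done explicitly, it fails
# in the same way.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the real encoding succeeds and yields text that can be formatted.
text = raw.decode("utf-8")
formatted = u"special text here >> {} << special text".format(text)
```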

But writing the result to a file (towrite here is a file object opened for writing) doesn't work either:

>>> towrite.write(line)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 21: ordinal not in range(128)

So we encode the line back to UTF-8 before writing it to the file:

>>> towrite.write(line.encode('utf-8'))
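
An alternative to decoding and encoding by hand is to let the file objects do it. The sketch below assumes both files are UTF-8 and uses io.open, which accepts an encoding argument on Python 2.6+ and Python 3. The helper name copy_as_html, the shortened template, and the temporary file names are hypothetical, chosen only for illustration:

```python
import io
import os
import tempfile

# Shortened, hypothetical template; the asker's template has more styling.
TEMPLATE = u"<p style=\"color: red\">{}</p>\n"


def copy_as_html(src_path, dst_path, encoding="utf-8"):
    """Wrap each line of src_path in a <p> tag and write it to dst_path."""
    # io.open decodes on read and encodes on write, so the loop body
    # only ever handles text, never raw bytes.
    with io.open(src_path, "r", encoding=encoding) as src:
        with io.open(dst_path, "w", encoding=encoding) as dst:
            for line in src:
                dst.write(TEMPLATE.format(line.rstrip(u"\n")))


# Usage with throwaway files.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "source1.txt")
dst_path = os.path.join(tmp_dir, "out.html")
with io.open(src_path, "wb") as f:
    f.write(u"Hoje, n\xf3s nos unimos ao povo...\n".encode("utf-8"))

copy_as_html(src_path, dst_path)

with io.open(dst_path, "rb") as f:
    written = f.read()
```

With this approach the encoding is named once per file, which makes it harder to forget either the decode or the encode step.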

Mark Ransom · Answer 3 · 2015-03-17T16:02:20.280

It appears that you're working with a string that is already UTF-8 encoded, so that's OK. The problem is that the meta tag in the HTML output is identifying the text as something other than UTF-8. For example, you may have <meta charset="ISO-8859-1">; you need to change it to <meta charset="UTF-8">.
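
A minimal sketch of writing a page whose declared charset matches the bytes on disk. The page skeleton, the helper name write_page, and the output path are assumptions made for illustration; only the p tag comes from the question:

```python
import io
import os
import tempfile

# Hypothetical page skeleton; the meta tag declares the same encoding
# that is used when the file is written.
PAGE = u"""<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>
</body>
</html>
"""


def write_page(path, sentence):
    """Write the page as UTF-8 so it agrees with the declared charset."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(PAGE.format(sentence))


out_path = os.path.join(tempfile.mkdtemp(), "out.html")
write_page(out_path, u"Hoje, n\xf3s nos unimos ao povo...")

with io.open(out_path, "rb") as f:
    page_bytes = f.read()
```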

The term for this kind of character set confusion is Mojibake.
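
The garbled output in the question can be reproduced directly. This sketch assumes the reader interprets the bytes as ISO-8859-1, which is one encoding that produces exactly those characters; the browser's actual fallback encoding is not stated in the question:

```python
# The intended text, "Hoje, nós", written with an escape for the accent.
intended = u"Hoje, n\xf3s"

# What ends up on disk: a UTF-8 byte order mark followed by UTF-8 text.
on_disk = b"\xef\xbb\xbf" + intended.encode("utf-8")

# A reader that assumes ISO-8859-1 turns every byte into one character,
# so the three BOM bytes and the two bytes of the accented letter each
# show up as separate, wrong characters.
misread = on_disk.decode("iso-8859-1")

# Decoding with the right codec recovers the text (BOM still attached).
recovered = on_disk.decode("utf-8")
```

Here misread is the string shown as the actual browser output in the question.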

P.S. Your string starts with a Byte Order Mark (BOM); you might want to remove it before working with the string.
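
One way to drop the BOM, sketched under the assumption that the input is UTF-8, is to decode with Python's utf-8-sig codec, which skips a leading BOM if one is present. The sample bytes are illustrative:

```python
import codecs

# Illustrative input: a UTF-8 BOM followed by UTF-8 encoded text.
raw = codecs.BOM_UTF8 + u"Hoje, n\xf3s".encode("utf-8")

# The plain utf-8 codec keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode("utf-8")

# utf-8-sig removes the BOM when it is there...
clean = raw.decode("utf-8-sig")

# ...and leaves input without a BOM unchanged.
no_bom_input = u"Hoje".encode("utf-8").decode("utf-8-sig")
```

The same codec name can be passed as the encoding argument to io.open when reading the source file.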
