I am trying to write Portuguese to an HTML file but I am getting some funny characters. How do I fix this?

first = """<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) 
f.write(first)

Expected Output: Hoje, nós nos unimos ao povo...

Actual Output in browser (Firefox on Ubuntu): Hoje, nós nos unimos ao povo...

I tried doing this:

first = """<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) 
f.write(first.encode('utf8'))

Output in terminal: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 65: ordinal not in range(128)

Why am I getting this error and also how can I write other languages to an HTML doc without the funny characters?
Or, is there a different file type that I can write to with the above font formatting?

3 Answers

Your format string should be a Unicode string too:

first = u"""<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) 
f.write(first)
Selcuk
  • I am still getting this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128) –  Mar 17 '15 at 14:02
  • Where does your `sentences1` list come from? Can you post the code for it too? – Selcuk Mar 17 '15 at 14:13
  • @kedar Traceback (most recent call last): File "p.py", line 80, in <module> first = u"""<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>""".format(sentences1[i]) UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128) –  Mar 17 '15 at 14:43
  • @Selcuk The sentences1 list is derived from a file. Each sentence is read and stored in the list. My code works perfectly on English text. If I try to write a different language, I get funny symbols. So I tried to change the codec, and then I get the errors. –  Mar 17 '15 at 14:46
  • Do you decode according to the encoding of the file when reading from it? – Kedar Mar 17 '15 at 14:50
  • Also, when I print to the terminal, all the Portuguese letters come out fine –  Mar 17 '15 at 14:52
  • @kedar - I don't decode anything. This is the code for the source file: f = open('source1.txt', 'r') - text1 = f.read() –  Mar 17 '15 at 14:55

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

^ Read it!

This is what happens when you try to use .format on text read from a file with special characters.

>>> mystrf = u'special text here >> {} << special text'
>>> g = open('u.txt','r')
>>> lines = g.readlines()
>>> mystrf.format(lines[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>>

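In Python 2, the lines read from the file are byte strings. Formatting them into a Unicode template makes Python decode them implicitly as ASCII, which fails on any non-ASCII byte. So how do we fix that?

We simply tell Python the proper encoding.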

>>> line = mystrf.format(lines[0].decode('utf-8'))
>>> print line
special text here >> ß << special text

But when we try to write the result to a file, it fails again, this time because Python tries to encode the Unicode string as ASCII.

>>> towrite.write(line)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 21: ordinal not in range(128)

So we encode the line explicitly before writing it to the file.

>>> towrite.write(line.encode('utf-8'))
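
The decode-on-read and encode-on-write steps above can also be left to the file objects. This is only a sketch under assumptions: it presumes the input file is UTF-8, it uses `io.open` (available in Python 2.6+ and Python 3), and the function name `wrap_lines` and both paths are made up for illustration.

```python
import io

# Unicode template, mirroring the <p> tag used in the question.
TEMPLATE = u"""<p style="color: red; font-family: 'Liberation Sans',sans-serif">{}</p>\n"""


def wrap_lines(src_path, dst_path, encoding='utf-8'):
    """Read lines from src_path and write each one wrapped in a <p> tag.

    Hypothetical helper: the name and both paths are assumptions.
    Returns the number of lines written.
    """
    # io.open decodes while reading, so each line is already a Unicode string.
    with io.open(src_path, 'r', encoding=encoding) as src:
        lines = [line.rstrip(u'\n') for line in src]

    # io.open encodes while writing, so no explicit .encode() call is needed.
    with io.open(dst_path, 'w', encoding=encoding) as dst:
        for line in lines:
            dst.write(TEMPLATE.format(line))
    return len(lines)
```

Keeping everything as Unicode between the two `io.open` calls means the implicit ASCII conversions that raised the errors above never get a chance to run.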
Kedar

It appears that you're working with a string that is already UTF-8 encoded, so that's OK. The problem is that the meta tag in the HTML output is identifying the text as something other than UTF-8. For example, you may have <meta charset="ISO-8859-1">; you need to change it to <meta charset="UTF-8">.
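
As a rough illustration of keeping the declared charset and the written bytes in agreement, here is a minimal sketch. The helper name `write_page` and the page skeleton are assumptions, not part of the question's code, and the paragraphs are inserted without HTML escaping.

```python
import io

# Page skeleton whose meta tag declares the same encoding used when writing.
PAGE = u"""<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
</head>
<body>
{body}
</body>
</html>
"""


def write_page(path, paragraphs):
    """Write Unicode paragraphs to path as a UTF-8 HTML page (hypothetical helper)."""
    body = u'\n'.join(u'<p>{}</p>'.format(p) for p in paragraphs)
    # The bytes on disk are UTF-8, matching the charset declared in the meta tag.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(PAGE.format(body=body))
```

A call such as `write_page('out.html', [u'Hoje, n\xf3s nos unimos ao povo...'])` would produce a file that a browser should read as UTF-8 rather than guessing another encoding.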

The term for this kind of character set confusion is Mojibake.

P.S. Your string starts with a Byte Order Mark (BOM): the byte 0xef at position 0 in the error message is consistent with the UTF-8 BOM (EF BB BF). You might want to remove it before working with the string.
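
One way to drop the BOM is to decode with the standard `utf-8-sig` codec, which skips a leading EF BB BF if present and otherwise behaves like `utf-8`. A small sketch, with `strip_utf8_bom` as a made-up helper name:

```python
import codecs


def strip_utf8_bom(raw):
    """Decode UTF-8 bytes to text, dropping a leading BOM if there is one.

    Hypothetical helper; 'utf-8-sig' is a standard Python codec.
    """
    return raw.decode('utf-8-sig')


# Bytes as they might come from a file saved with a BOM.
with_bom = codecs.BOM_UTF8 + u'Hoje, n\xf3s'.encode('utf-8')
text = strip_utf8_bom(with_bom)
```

When reading straight from a file, passing `encoding='utf-8-sig'` to `io.open` should have the same effect.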

Mark Ransom