Special characters problems using Python unicode

Question

#!/usr/bin/env python
# -*- coding: utf_8 -*-
import re

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    # to split by multiple characters,
    # regular expressions are easiest (and fastest)
    # re.UNICODE is a compile() flag; passed to split() it would be read as maxsplit
    sentenceEnders = re.compile('[.!?][\s]{1,2}(?=[A-Z])', re.UNICODE)
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – Sheffield’s only mango tree is valued at £9.2 billion."

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()

Expected Output: While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – Sheffield’s only mango tree is valued at £9.2 billion.

Output Received: While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango ΓÇô SheffieldΓÇÖs only mango tree is valued at ┬ú9.2 billion.

Ignore the meaning of the sentence; the main point is that it isn't able to handle special characters such as " - ", " £ ", " ’ " and others. I tried setting the sitecustomize.py file and this code to other encodings such as ascii, utf-32, cp-500, iso8859_15 and utf-8, but wasn't able to solve it. Sorry, I am new to Python. Thanks in advance for the help.

Good thing you specified the encoding as UTF-8. You’d think that would be enough to tell Python you have Unicode, wouldn’t you think? — tchrist, Aug 10 '11 at 22:47
In the future, try to simplify your example down to the bare minimum. E.g. instead of requiring readers to scroll right to see your text and then having to tell them to ignore the meaning of the long sentence, why not just have a shorter sentence that shows the bug/issue? — dkamins, Aug 10 '11 at 22:51
... and give us a short, self-contained, correct example, as already asked about your last question -> http://sscce.org/ — Evpok, Aug 10 '11 at 22:59
@tchrist, I'm sure problems like this are the reason Unicode is default in Python 3. — Mark Ransom, Aug 11 '11 at 16:38
@tchrist, `coding: utf8` specifies the encoding of the **source file**. It has nothing to do with the encoding of stdout. — Mark Tolonen, Aug 11 '11 at 19:53
@Mark: I do realize that. It wasn't clear to me what the user's actual problem was. I thought maybe it was a unicode strings issue that would be cleared up by Python 3 or `from __future__ import unicode_literals` or such. Is the problem actually the one where Python tries to guess the output encoding the way Java does by default? I hate guessing; that never works. — tchrist, Aug 11 '11 at 20:25
@tchrist, Python encodes Unicode strings to the terminal encoding if it can determine it programmatically; otherwise it defaults to `ascii` for Python 2 and `utf-8` for Python 3. It doesn't guess, but it does throw an error if the encoding doesn't support the Unicode character being output. The Windows console is a poor choice for running Python scripts because of this. – Mark Tolonen, Aug 11 '11 at 20:58

score 2 · Accepted Answer · answered Jun 27 '13 at 08:11

I have found the solution to this.

The following piece of code works just fine:

p = p.encode('utf-8') if isinstance(p, unicode) else p

Sirius


score 2 · Answer 2 · answered Aug 11 '11 at 15:53

Using Unicode string literals as Nam suggested is correct, but if your terminal is using the cp437 code page as your output suggests, it will not be able to display some of the Unicode characters you want to use. The Windows console doesn't support UTF-8, which is what you are sending if you declare coding: utf-8 in your source file and do not use Unicode literals. coding: utf-8 declares the encoding of your source file, so make sure you are actually saving your source in UTF-8 encoding.
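The mismatch described above can be reproduced without a console. The following is only a sketch, written for Python 3 (where every str is already Unicode) rather than the Python 2 used in the question; the helper name as_seen_on is made up for illustration. It encodes text as UTF-8 and then decodes the resulting bytes as cp437, which is roughly what happens when UTF-8 bytes reach a cp437 terminal.

```python
# Sketch only (Python 3). as_seen_on is a hypothetical helper name.

def as_seen_on(text, sent_as='utf-8', shown_as='cp437'):
    """Encode text one way, then decode the bytes the way the console would."""
    return text.encode(sent_as).decode(shown_as)

pound = as_seen_on('\u00a39.2 billion')   # starts with POUND SIGN
quote = as_seen_on('Sheffield\u2019s')    # RIGHT SINGLE QUOTATION MARK
dash = as_seen_on('\u2013')               # EN DASH
```

Here pound comes out as "┬ú9.2 billion", quote as "SheffieldΓÇÖs" and dash as "ΓÇô", the same debris that appears in the question's output.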

When you use a Unicode literal, Python interprets the source string in the declared encoding, and converts it to a Unicode string. When printing a Unicode string, Python will encode the string in the terminal encoding, or lacking a terminal encoding, use a default encoding of ascii for Python 2.

An example:

# coding: utf8

print '£9.2 billion'  # Sends UTF-8 to cp437 terminal (gibberish)
print u'£9.2 billion' # Correctly prints on cp437 terminal.
print 'Sheffield’s'   # Sends UTF-8 to cp437 terminal (gibberish)

# Replaces Unicode characters that are unsupported in cp437.
print u'Sheffield’s £9.2 billion'.encode('cp437','xmlcharrefreplace')

print u'Sheffield’s'  # UnicodeEncodeError.

Output

┬ú9.2 billion
£9.2 billion
SheffieldΓÇÖs
Sheffield&#8217;s £9.2 billion
Traceback (most recent call last):
  File "C:\Documents and Settings\metolone\Desktop\x.py", line 10, in <module>
    print u'SheffieldΓÇÖs'  # UnicodeEncodeError.
  File "C:\dev\python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 9: character maps to <undefined>

So, don't expect things to print all Unicode correctly on a Windows console. Use a Python IDE that supports UTF-8, such as PythonWin (available in the pywin32 extension).

You need two things to display Unicode characters properly in the Windows console: An encoding that maps the Unicode characters you want to display, and a font that supports the correct glyph for those characters. For your example, if you change the console code page to Windows-1252 (chcp 1252) and change the console font to Consolas or Lucida Console instead of Raster Fonts, your original program will work if you use Unicode literals (p = u"...").
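The encoding half of that requirement can be checked ahead of time. The sketch below is written for Python 3 and uses a made-up helper name, unsupported; it lists the characters in a string that a given code page cannot encode. It says nothing about the font half, which still has to be checked in the console itself.

```python
import unicodedata

def unsupported(text, encoding):
    """Return the Unicode names of characters in text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            missing.append(unicodedata.name(ch, 'U+%04X' % ord(ch)))
    return missing

# EN DASH, RIGHT SINGLE QUOTATION MARK and POUND SIGN from the question's text
sample = 'indica \u2013 Sheffield\u2019s \u00a39.2'

print(unsupported(sample, 'cp437'))   # ['EN DASH', 'RIGHT SINGLE QUOTATION MARK']
print(unsupported(sample, 'cp1252'))  # []
```

An empty list for cp1252 is consistent with the advice to switch the console with chcp 1252.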

See this question for getting Unicode to work in a Windows console: http://stackoverflow.com/questions/7014430/getting-python-to-print-in-utf8-on-windows-xp-with-the-console — Mark Ransom, Aug 11 '11 at 16:28

score 1 · Answer 3 · answered Aug 10 '11 at 23:02

That looks like cp437. Try this:

import codecs, sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print u"valued at £9.2 billion."

This works for me in Python 2.6.
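To see what this wrapper actually sends, here is a small sketch for Python 3 that swaps sys.stdout for an in-memory byte buffer (the io.BytesIO stand-in is an assumption made so the bytes can be inspected; it is not part of the answer above). The writer emits UTF-8 bytes, so a terminal that reads its input as cp437 would presumably still show them garbled, while a UTF-8 terminal would show them correctly.

```python
import codecs
import io

raw = io.BytesIO()                       # stand-in for the byte stream behind stdout
writer = codecs.getwriter('utf-8')(raw)  # same kind of wrapper as in the answer
writer.write('valued at \u00a39.2 billion.')

sent = raw.getvalue()
print(sent)                   # b'valued at \xc2\xa39.2 billion.'

# What a cp437 console would make of those two bytes for the pound sign
shown = sent.decode('cp437')
```

In this sketch shown is "valued at ┬ú9.2 billion.", which matches the tail of the output reported in the question.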

wberry


Thanks for the reply. I tried doing that, but the output is still the same. – Sirius Aug 11 '11 at 11:47

score 0 · Answer 4 · answered Aug 11 '11 at 01:09

p = "While other species..."

should be changed to

p = u"While other species..."

Notice the u in front of the quote.

What you need is a so-called Unicode literal. In Python 2, string literals are not Unicode by default.

Nam Nguyen


Thanks for the reply. I tried doing that, but the output is still the same. – Sirius Aug 11 '11 at 11:45

What are you using to run your code? In a Windows console using cp437, correctly using Unicode literals gives a UnicodeEncodeError because cp437 supports the £ non-ASCII character, but not the EN DASH or RIGHT SINGLE QUOTATION MARK. – Mark Tolonen Aug 11 '11 at 15:32

Please do not use Windows' cmd.exe if you want to print unicode in the console! Another trick is to use `print (item, )` with item being a unicode string. It does not print out the character as you would want but at least it does not produce unicode error. – Nam Nguyen Aug 11 '11 at 16:27
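The tuple trick works in Python 2 because printing a tuple shows the repr of its items, which escapes non-ASCII characters. A rough Python 3 counterpart, offered only as a sketch, is the built-in ascii() or the backslashreplace error handler; both trade the real character for an escape sequence instead of raising an error.

```python
item = 'Sheffield\u2019s'   # contains RIGHT SINGLE QUOTATION MARK

# ascii() returns a repr-style string with non-ASCII characters escaped
escaped_repr = ascii(item)
print(escaped_repr)         # 'Sheffield\u2019s'

# backslashreplace escapes only what the target code page cannot encode
escaped_bytes = item.encode('cp437', 'backslashreplace')
print(escaped_bytes.decode('cp437'))   # Sheffield\u2019s
```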
