-2

I am trying to decode u'\uf04a' in Python so that I can print it without errors. In other words, I need to convert Microsoft Windows-1252 characters to actual Unicode.

The HTML containing these unusual characters comes from http://members.lovingfromadistance.com/showthread.php?12338-HAVING-SECOND-THOUGHTS

Details on u'\uf04a' (and, similarly, u'\uf04c') are at http://www.fileformat.info/info/unicode/char/f04a/index.htm

One example looks like this:

"Oh god please some advice ":

Out[408]: u'Oh god please some advice \uf04c'

Take a thread like this as a test case:

thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread.decode('utf8')

print u'\uf04a'
print u'\uf04a'.decode('utf8') # error!!!

'charmap' codec can't encode character u'\uf04a' in position 1526: character maps to undefined

With the help of two Python scripts, I successfully converted u'\x92', but I am still stuck on u'\uf04a'. Any suggestions?

References

https://github.com/AnthonyBRoberts/NNS/blob/master/tools/killgremlins.py

Handling non-standard American English Characters and Symbols in a CSV, using Python

Solution:

Following the comments below, I replace these characters with a question mark ('?'):

thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread = thread.replace(u'\uf04a', '?')
thread = thread.replace(u'\uf04c', '?')
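The two replacements above only cover the two codepoints seen so far. A minimal sketch that generalizes them, assuming every character in the Basic Multilingual Plane private use area (U+E000 to U+F8FF) should become a placeholder; the names `PRIVATE_USE` and `replace_private_use` are made up for this example:

```python
import re

# Basic Multilingual Plane private use area: U+E000 through U+F8FF.
PRIVATE_USE = re.compile(u'[\ue000-\uf8ff]')


def replace_private_use(text, placeholder=u'?'):
    """Return text with every BMP private-use character replaced."""
    return PRIVATE_USE.sub(placeholder, text)


thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
cleaned = replace_private_use(thread)
print(cleaned)  # who are you ? Why you are so harsh to her ?
```

The `u''` literals keep the snippet valid on both Python 2.7 and Python 3.3+.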

Hope this is helpful to other beginners.

Frank Wang
  • It's not really clear what you're trying to do, or where Windows 1252 comes in. What character are you really trying to print? Where do you get the data from? If that "string" is to be taken as a byte sequence, then it's not valid UTF-8... – Jon Skeet Jun 01 '14 at 15:55
  • I agree. The post above has been revised. – Frank Wang Jun 01 '14 at 21:56

2 Answers

5

The notation u'\uf04a' denotes the Unicode codepoint U+F04A, which is by definition a private use codepoint. This means that the Unicode standard does not assign any character to it, and never will; instead, it can be used by private agreements.
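The private-use status can be checked from Python itself. A small sketch using the standard `unicodedata` module, where the general category `Co` means "Other, private use"; the `u'<no name>'` default is just a placeholder chosen for this example:

```python
import unicodedata

ch = u'\uf04a'

# General category 'Co' marks a private-use codepoint.
category = unicodedata.category(ch)

# Private-use codepoints have no character name, so supply a default
# instead of letting unicodedata.name() raise ValueError.
name = unicodedata.name(ch, u'<no name>')

print('%s %s' % (category, name))  # Co <no name>
```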

It is thus meaningless to talk about printing it. If there is a private agreement on using it in some context, then you print it using a font that has a glyph allocated to that codepoint. Different agreements and different fonts may allocate completely different characters and glyphs to the same codepoint.
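If you do know the private agreement behind the data, you can translate the codepoints yourself. The sketch below assumes, purely for illustration, that U+F04A and U+F04C stand for a smiling and a frowning face, as they would in text written with a Wingdings-style symbol font. Nothing in the question confirms that, so treat `SYMBOL_MAP` as a hypothetical table to be replaced with the real mapping:

```python
# Hypothetical mapping from private-use codepoints to printable text.
# The chosen meanings are an assumption, not something Unicode defines.
SYMBOL_MAP = {
    0xF04A: u':)',
    0xF04C: u':(',
}

thread = u'who are you \uf04a Why you are so harsh to her \uf04c'

# translate() on a Unicode string takes a dict keyed by codepoint.
translated = thread.translate(SYMBOL_MAP)
print(translated)  # who are you :) Why you are so harsh to her :(
```

Mapping to plain ASCII keeps the result printable on any console; map to other Unicode characters instead if your output channel supports them.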

It is possible that U+F04A is a result of erroneous processing (e.g., wrong conversions) of character data at some earlier phase.

Jukka K. Korpela
4
u'\uf04a'

already is a Unicode object, which means there's nothing to decode. The only thing you can do with it is encode it, if you're targeting a specific file encoding like UTF-8 (which is not the same as Unicode, but is confused with it all the time).

u'\uf04a'.encode("utf-8")

gives you a string (Python 2) or bytes object (Python 3) which you can then write to a file or a UTF-8 terminal etc.
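As a quick sketch of that round trip (the variable names are arbitrary), encoding yields three bytes, and decoding those bytes gives back the original Unicode object:

```python
text = u'\uf04a'

# Encoding turns the Unicode object into bytes: EF 81 8A for U+F04A.
encoded = text.encode('utf-8')

# Decoding is the reverse step and only makes sense on bytes.
decoded = encoded.decode('utf-8')

print(len(encoded))     # 3
print(decoded == text)  # True
```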

You won't be able to encode it as cp1252 (Windows-1252) because that encoding doesn't have that character.
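A minimal sketch of that failure, catching the exception so you can see which character is rejected; `failed`, `bad_char` and `bad_position` are names invented for this example:

```python
text = u'who\uf04a why\uf04c'

failed = False
bad_char = None
bad_position = None

try:
    text.encode('cp1252')
except UnicodeEncodeError as exc:
    # The exception records where encoding stopped.
    failed = True
    bad_position = exc.start
    bad_char = text[exc.start:exc.end]

print(failed)        # True
print(bad_position)  # 3
```

This is the same kind of 'charmap' error as the one quoted in the question.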

What you can do is encode it to a character set that lacks those offending characters, telling the encoder to replace anything it can't represent with ?:

>>> u'who\uf04a why\uf04c'.encode("ascii", errors="replace")
'who? why?'
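"replace" is only one of the standard error handlers. If you would rather drop the characters or keep a recoverable escape, a sketch with the other built-in handlers looks like this (variable names are arbitrary):

```python
text = u'who\uf04a why\uf04c'

# Silently drop anything ASCII cannot represent.
ignored = text.encode('ascii', 'ignore')

# Keep a Python-style \uXXXX escape so the codepoint is not lost.
escaped = text.encode('ascii', 'backslashreplace')

# Keep a numeric character reference, handy if the output is HTML or XML.
as_refs = text.encode('ascii', 'xmlcharrefreplace')

for result in (ignored, escaped, as_refs):
    print(result.decode('ascii'))
```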
Tim Pietzcker
  • I need to convert it to meaningful unicode rather than its current form. – Frank Wang Jun 01 '14 at 16:04
  • Please define what you consider "meaningful". Maybe it would also help if you told us what your real problem is: what exactly are you trying to do? Where does your data come from, and what do you need to do with it? – mata Jun 01 '14 at 16:10
  • `>>> print u'\uf04a'.encode("utf-8")` gives `∩üè` with Python 2 on my Win-7 system. – martineau Jun 01 '14 at 16:15
  • @martineau - if you write utf8-encoded binary data to a terminal that doesn't support utf8 you'll end up with garbage, so your command doesn't really make sense. – mata Jun 01 '14 at 17:01
  • In my case, to convert it to ? is the right way to go. – Frank Wang Jun 01 '14 at 21:34
  • @FrankWANG: That's simple (and better than the solution you proposed in your question (where solutions don't belong, anyway :)). See my edit. – Tim Pietzcker Jun 02 '14 at 05:01