0

My code:

a = '汉'
b = u'汉'

These two are the same Chinese character, but a == b is False. How do I fix this? Note that I can't change how a is defined, because I have no access to that code. I need to convert b to the encoding that a is using.

So my question is: how do I turn the encoding of b into that of a?

Yuhuan Jiang
  • Read this: http://www.joelonsoftware.com/articles/Unicode.html (the first people to answer below should probably read it as well; they seem to have guessed at ways to make it work rather than properly understood what is going on) – jsbueno Feb 23 '14 at 15:02
  • I know what unicode is. I just need to know, *in Python*, what can I do to turn `b` into the same encoding as `a`, so that they can be compared. – Yuhuan Jiang Feb 24 '14 at 01:40
  • Read the article. You won't regret it. – jsbueno Feb 24 '14 at 17:19
  • @SMTNinja If one of the answers below helped you with your question, consider marking it correct so the question will close. – tsroten Mar 10 '14 at 06:29

3 Answers

3

If you don't know a's encoding, you'll need to:

  1. detect a's encoding
  2. encode b using the detected encoding

First, to detect a's encoding, let's use chardet.

$ pip install chardet

Now let's use it:

>>> import chardet
>>> a = '汉'
>>> chardet.detect(a)
{'confidence': 0.505, 'encoding': 'utf-8'}

So, to actually accomplish what you requested:

>>> encoding = chardet.detect(a)['encoding']
>>> b = u'汉'
>>> b_encoded = b.encode(encoding)
>>> a == b_encoded
True
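
If installing chardet is not an option and only a handful of encodings are plausible, a brute-force alternative is to try each candidate until encoding b reproduces the bytes of a. This is a different technique from chardet's statistical detection, and it is only a minimal sketch: the helper name find_encoding and the candidate list are hypothetical, and a is built with .encode() here purely so the example runs on its own.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper (not part of chardet): find which candidate encoding
# turns the unicode string into exactly the given bytes.
CANDIDATES = ('utf-8', 'gbk', 'big5', 'utf-16-le')  # assumed list, adjust as needed


def find_encoding(a_bytes, b_text, candidates=CANDIDATES):
    """Return the first encoding for which b_text encodes to a_bytes, else None."""
    for name in candidates:
        try:
            if b_text.encode(name) == a_bytes:
                return name
        except UnicodeEncodeError:
            # This codec cannot represent the character at all; skip it.
            continue
    return None


b = u'汉'
a = b.encode('gbk')              # stand-in for the byte string you cannot change
encoding = find_encoding(a, b)   # the matching codec name, or None
b_encoded = b.encode(encoding) if encoding else None
```

If find_encoding returns None, none of the candidates match and the list needs to be extended.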
tsroten
1

Decode the encoded string a using str.decode:

>>> a = '汉'
>>> b = u'汉'
>>> a.decode('utf-8') == b
True

NOTE: Replace utf-8 with the encoding actually used by the source file that defines a.
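
The reason the codec matters can be sketched as follows: a holds bytes while b holds a single code point, and decoding with the wrong codec may succeed silently but produce different text. This sketch assumes a is UTF-8 encoded and builds it with .encode() so that it runs on its own; the variable names wrong and right are illustrative only.

```python
# -*- coding: utf-8 -*-
b = u'汉'
a = b.encode('utf-8')        # stand-in for the byte string, assumed to be UTF-8

byte_count = len(a)          # number of bytes in the encoded form
char_count = len(b)          # number of code points in the unicode string

right = a.decode('utf-8')    # correct codec: round-trips to the original text
wrong = a.decode('latin-1')  # wrong codec: no error, but different text

matches_with_right_codec = (right == b)
matches_with_wrong_codec = (wrong == b)
```

Here the UTF-8 form takes three bytes for one character, and only the decode that uses the matching codec compares equal to b.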

falsetru
-1

Both a.decode and b.encode work:

In [133]: a.decode('utf') == b
Out[133]: True

In [134]: b.encode('utf') == a
Out[134]: True

Note that str.encode and unicode.decode also exist in Python 2, but they first perform an implicit conversion using the default ASCII codec, so don't mix them up. See "What is the difference between encode/decode?"
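
To illustrate why mixing them up fails, the sketch below performs explicitly the ASCII decode that Python 2 does implicitly when str.encode is called on a byte string. It assumes a is UTF-8 encoded, and the variable name error is illustrative only.

```python
# -*- coding: utf-8 -*-
a = u'汉'.encode('utf-8')   # stand-in for the non-ASCII byte string

error = None
try:
    # In Python 2, a.encode('utf-8') first decodes a with the ASCII codec.
    # Doing that step by hand shows where the exception comes from.
    a.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc
```

The non-ASCII bytes cannot be decoded as ASCII, so the call raises UnicodeDecodeError before any encoding happens.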

zhangxaochen
  • -1 for suggesting random usage of "decode" and "encode". Please, read the link I posted in the comments of the question before trying to code any other thing that deals with text. – jsbueno Feb 23 '14 at 15:04