How can I transform the str to Unicode?

Question

The content in my file is like:

This is a Japanese character: \u3046

And I want to transform the above string into this form:

This is a Japanese character: unicodeValue_3046|unidecoded_u

Here is my code:

def my_repl(match):
    return ' unicodeValue_' + match.group('uni')[2:] + '|unidecoded_' + unidecode(match.group('uni'))

re.sub(pattern=r'(?P<uni>\\u[a-f0-9]{4})', repl=my_repl, string=open('ja.txt', 'r').readline())

What I get is not what I expected:

Out[207]: u'This is a Japanese character:  unicodeValue_3046|unidecoded_\\u3046 '

After I write it to the file:

opt = re.sub(pattern=r'(?P<uni>\\u[a-f0-9]{4})', repl=my_repl, string=open('ja.txt', 'r').readline())
codecs.open('op', 'w', 'utf-8').write(opt)

What I see is this: This is a Japanese character: unicodeValue_3046|unidecoded_\u3046

So unidecode doesn't seem to work; it just outputs what it is given.

I know that unidecode(u'\u3046') and unidecode('\u3046') both return 'u', but in my case the result is different.

How can I fix this?

You are getting exactly what you wanted, you just need to understand what you are looking for. — tripleee, Aug 24 '18 at 04:44
As an aside, it's 2018, Python 2 should be dead already according to the original end-of-life plans. It got an extension and will be kept in terminal care until 2020; but really, in this day and age, you should be planning to transition to Python 3 or ideally already be in that process. — tripleee, Aug 24 '18 at 04:45
@tripleee It still is confusing. Please help me out. Thank you. — Lerner Zhang, Aug 24 '18 at 05:45
`print` the result from `re.sub` and you will see its `__str__` representation instead of its `__repr__`. This is a very common FAQ. — tripleee, Aug 24 '18 at 05:47
`unidecode` is not a defined function name; please try to update your question into a [mcve]. Avoiding the input file and using `string='This is a Japanese character: \\u3046'` makes this simpler for others to replicate. — tripleee, Aug 24 '18 at 06:03
Maybe also look at https://stackoverflow.com/questions/38385089/how-to-convert-repr-into-encoded-string to get the data in correctly in the first place. (There's an answer which suggests `open(..., encoding='unicode_escape')` in particular.) — tripleee, Aug 24 '18 at 06:08
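
A likely reading of the comments above is that the file holds six literal characters (a backslash, `u`, and four hex digits) rather than the single character U+3046. The regex match is therefore that six-character text, and unidecode hands it back unchanged because it is already ASCII. Below is a minimal sketch of one possible fix, written for Python 3: it converts the hex digits to the real character before transliterating. The `hex` group name and the fallback `unidecode` stand-in are assumptions added for illustration; they are not part of the original code or of the unidecode package.

```python
import re

try:
    from unidecode import unidecode  # third-party package (pip install Unidecode)
except ImportError:
    # Hypothetical stand-in so the sketch runs without the package installed;
    # it only handles the single character used in the question.
    def unidecode(text):
        return text.replace('\u3046', 'u')

# Match a literal backslash, 'u', and four hex digits; capture only the digits.
ESCAPE = re.compile(r'\\u(?P<hex>[0-9a-fA-F]{4})')

def my_repl(match):
    # Convert the hex digits into the actual character (e.g. U+3046)
    # so that unidecode receives a real non-ASCII character.
    char = chr(int(match.group('hex'), 16))
    return 'unicodeValue_' + match.group('hex') + '|unidecoded_' + unidecode(char)

# Same content as the line in the file: the backslash is a literal character here.
line = 'This is a Japanese character: \\u3046'
result = ESCAPE.sub(my_repl, line)
print(result)
```

Using `print` shows the string itself rather than its `repr`, which is why the doubled backslash from the interactive output does not appear in the written file. Reading the file with an escape-decoding step, as the linked question suggests, would be an alternative to converting each match by hand.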

