My file contains a line like this:
This is a Japanese character: \u3046
I want to transform that line into this form:
This is a Japanese character: unicodeValue_3046|unidecoded_u
Here is my code:
import re
from unidecode import unidecode

def my_repl(match):
    return ' unicodeValue_' + match.group('uni')[2:] + '|unidecoded_' + unidecode(match.group('uni'))
re.sub(pattern=r'(?P<uni>\\u[a-f0-9]{4})', repl=my_repl, string=open('ja.txt', 'r').readline())
What I get is not what I expected:
Out[207]: u'This is a Japanese character: unicodeValue_3046|unidecoded_\\u3046 '
After I write it to the file:
opt = re.sub(pattern=r'(?P<uni>\\u[a-f0-9]{4})', repl=my_repl, string=open('ja.txt', 'r').readline())
codecs.open('op', 'w', 'utf-8').write(opt)
What I see is this:
This is a Japanese character: unicodeValue_3046|unidecoded_\u3046
So unidecode has no effect here: it just returns the text it was given.
I know that unidecode(u'\u3046') and unidecode('\u3046') both return 'u', but in my case the result is different.
How can I fix this?
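One likely cause, sketched under assumptions: the pattern matches the six literal characters backslash, u, 3, 0, 4, 6 as they appear in the file, so match.group('uni') is that ASCII escape text and not the single character U+3046. unidecode has nothing to transliterate in plain ASCII, so it returns the text unchanged. A minimal sketch of one way around this is to convert the four hex digits to the real character before transliterating. The names rewrite_escapes and transliterate are hypothetical, the code assumes Python 3 (on Python 2, unichr would take the place of chr), and accepting uppercase hex digits is my own addition.

```python
import re

# Matches a literal backslash-u escape followed by four hex digits.
# Uppercase digits are accepted too (an assumption, not in the original pattern).
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def rewrite_escapes(line, transliterate):
    # `transliterate` is any callable mapping a character to ASCII text,
    # for example the unidecode function from the unidecode package.
    def repl(match):
        hex_digits = match.group(1)
        char = chr(int(hex_digits, 16))  # the actual character, e.g. U+3046
        return 'unicodeValue_' + hex_digits + '|unidecoded_' + transliterate(char)
    return ESCAPE.sub(repl, line)
```

With unidecode installed, this would be called as rewrite_escapes(line, unidecode), where line is the text read from ja.txt. Passing the transliteration function in as an argument keeps the sketch independent of any one library.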