My file contains a line like this:

This is a Japanese character: \u3046

I want to transform the above string into this form:

This is a Japanese character: unicodeValue_3046|unidecoded_u

Here is my code:

import re
from unidecode import unidecode

def my_repl(match):
    return ' unicodeValue_' + match.group('uni')[2:] + '|unidecoded_' + unidecode(match.group('uni'))

re.sub(pattern=r'(?P<uni>\\u[a-f0-9]{4})', repl=my_repl, string=open('ja.txt', 'r').readline())

What I get is not what I expected:

Out[207]: u'This is a Japanese character:  unicodeValue_3046|unidecoded_\\u3046 '

After I write it to the file:

opt = re.sub(pattern=r'(?P<uni>\\u[a-f0-9]{4})', repl=my_repl, string=open('ja.txt', 'r').readline())
codecs.open('op', 'w', 'utf-8').write(opt)

What I see in the file is this:

This is a Japanese character: unicodeValue_3046|unidecoded_\u3046

So unidecode doesn't seem to work here; it just outputs what it is given.

I know that unidecode(u'\u3046') and unidecode('\u3046') both return 'u', but in my case the result is different.

How can I fix this?

halfer
Lerner Zhang
  • You are getting exactly what you wanted, you just need to understand what you are looking for. – tripleee Aug 24 '18 at 04:44
  • As an aside, it's 2018, Python 2 should be dead already according to the original end-of-life plans. It got an extension and will be kept in terminal care until 2020; but really, in this day and age, you should be planning to transition to Python 3 or ideally already be in that process. – tripleee Aug 24 '18 at 04:45
  • @tripleee You are right. The problem disappears in Python 3. – Lerner Zhang Aug 24 '18 at 05:10
  • @tripleee It is still confusing. Please help me out. Thank you. – Lerner Zhang Aug 24 '18 at 05:45
  • `print` the result from `re.sub` and you will see its `__str__` representation instead of its `__repr__`. This is a very common FAQ. – tripleee Aug 24 '18 at 05:47
  • @tripleee Please see my updates. – Lerner Zhang Aug 24 '18 at 05:56
  • `unidecode` is not a defined function name; please try to update your question into a [mcve]. Avoiding the input file and using `string='This is a Japanese character: \\u3046'` makes this simpler for others to replicate. – tripleee Aug 24 '18 at 06:03
  • Maybe also look at https://stackoverflow.com/questions/38385089/how-to-convert-repr-into-encoded-string to get the data in correctly in the first place. (There's an answer which suggests `open(..., encoding='unicode_escape')` in particular.) – tripleee Aug 24 '18 at 06:08
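
One way to read the comments above: the regex captures the six-character text `\u3046` (a backslash, a `u`, and four hex digits) exactly as it sits in the file, not the single character う, so the transliteration step has nothing to convert. The sketch below is only an illustration under stated assumptions: it targets Python 3, the helper name `rewrite` and its `transliterate` parameter are hypothetical, and it turns the hex digits into a real character with `chr(int(..., 16))` before passing it on.

```python
import re

# Matches the six-character text "\uXXXX" as it appears literally in the file.
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def rewrite(line, transliterate):
    """Replace each literal \\uXXXX with 'unicodeValue_XXXX|unidecoded_<ascii>'.

    `transliterate` is any callable mapping a Unicode string to ASCII text
    (hypothetical parameter; unidecode.unidecode is the intended candidate).
    """
    def repl(match):
        hex_digits = match.group(1)
        # Convert the hex digits into the actual character before transliterating.
        char = chr(int(hex_digits, 16))
        return 'unicodeValue_' + hex_digits + '|unidecoded_' + transliterate(char)

    return ESCAPE.sub(repl, line)


# The escaped text and the real character are different strings.
escaped_length = len('\\u3046')  # backslash, 'u', and four hex digits
actual_length = len('\u3046')    # the single character itself
```

With the third-party Unidecode package installed, a call would presumably look like `rewrite(open('ja.txt').readline(), unidecode)` after `from unidecode import unidecode`. Passing the transliterator in as an argument is just a design choice here so that the regex logic can be checked without that dependency.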
