
I have a DataFrame like the one below, where some company names contain non-English text (Japanese, Chinese, and so on).

import pandas as pd

data = [['company1', '<U+042E><U+043F><U+0438><U+0442><U+0435><U+0440>'],
        ['company2', '<c1>lom<e9>kszer Kft.'],
        ['company3', 'Ernst and young'],
        ['company4', '<c5>bo Akademi']]

df = pd.DataFrame(data, columns=['Name', 'company_name'])

It looks like this:

[Image: the DataFrame above as displayed]

All I want is to convert these values into readable text and translate them into English.

Can I do that? If so, how?

  • Is `'Юпитер'` the name of the company? – Nima Oct 03 '21 at 11:21
  • @Nima not sure what you mean –  Oct 03 '21 at 11:25
  • I translated `'<U+042E><U+043F><U+0438><U+0442><U+0435><U+0440>'`, which is `Юпитер` (Jupiter). – Nima Oct 03 '21 at 11:30
  • Ah, great. Can you let me know how I can do it for all values, please? –  Oct 03 '21 at 11:34
  • @Nima could you please walk me through it? It's a bit critical for my delivery, and I have exhausted many options. –  Oct 03 '21 at 11:47
  • That is not an easy thing to do. I did it manually. I will post a detailed description of what I just did. – Nima Oct 03 '21 at 11:50

2 Answers


Your examples do not exhibit a single unified encoding. We can speculate that the two-digit ones are Latin-1, but I'm guessing (based also on the duplicate question) that the truth is really more complex than that.

Anyway, for general direction at least, try this:

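import re
...  # data is the list of lists from the question
for index in range(len(data)):
    # Inner call decodes two-digit markers such as <c1>;
    # outer call decodes four-digit markers such as <U+042E>.
    data[index][1] = re.sub(
        r'<U\+([0-9a-fA-F]{4})>',
        lambda x: chr(int(x.group(1), 16)),
        re.sub(
            r'<([0-9a-fA-F]{2})>',
            lambda x: chr(int(x.group(1), 16)),
            data[index][1]))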

Demo: https://ideone.com/X60x3Q

You can avoid the repeated lambda expression at the cost of a slightly more complex regular expression.

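# One pattern for both forms: the lookbehinds require four hex digits
# after "U+" and two hex digits directly after "<".
for index in range(len(data)):
    data[index][1] = re.sub(
        r'<(?:U\+)?((?<=\+)[0-9a-fA-F]{4}|(?<=<)[0-9a-fA-F]{2})>',
        lambda x: chr(int(x.group(1), 16)),
        data[index][1])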

Demo: https://ideone.com/SkuvAJ
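
The same idea can be wrapped in a reusable function, which makes it easier to apply per value. This is only a sketch: the name `decode_escapes` and the compiled pattern `ESCAPE` are made up for illustration, and it carries over the assumption above that the two-digit markers are Latin-1 (whose code points coincide with the first 256 Unicode code points).

```python
import re

# Matches <U+XXXX> (four hex digits) or <xx> (two hex digits).
ESCAPE = re.compile(r'<U\+([0-9a-fA-F]{4})>|<([0-9a-fA-F]{2})>')


def decode_escapes(text):
    """Replace each <U+XXXX> or <xx> marker with the character it names."""
    # Exactly one of the two groups is set for any given match.
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


data = [['company1', '<U+042E><U+043F><U+0438><U+0442><U+0435><U+0440>'],
        ['company2', '<c1>lom<e9>kszer Kft.'],
        ['company3', 'Ernst and young'],
        ['company4', '<c5>bo Akademi']]

decoded = [[name, decode_escapes(company)] for name, company in data]
for row in decoded:
    print(row)
```

With the DataFrame from the question, something like `df['company_name'] = df['company_name'].map(decode_escapes)` should apply the same function to the whole column.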

tripleee
  • Can we also add a step to translate to English at the end? –  Oct 03 '21 at 13:11
  • As already communicated elsewhere, it's not clear what you mean. There are probably services for transliteration of Russian but I can't recommend any particular one. Dropping accents from accented strings is a common FAQ which should be easy to search (basically, [convert to NFKD Unicode normal form, then extract the ASCII characters](https://stackoverflow.com/questions/51710082/what-does-unicodedata-normalize-do-in-python)) but generally of dubious or outright negative value. – tripleee Oct 03 '21 at 13:42
  • How can we remove these, that is, just drop them from the DataFrame? Using the same regex? –  Oct 03 '21 at 14:57
  • I mean df.drop(df[df.employer_name.str.contains(r'')].index, inplace=True) –  Oct 03 '21 at 15:01
  • I'm not a Pandas person, but as we have now decoded them, do you really want to drop them? The Russian one reads "Yupiter". You seem to be going back and forth between what you want; perhaps first decide what you actually want to ask, then post a new question specifically about that. – tripleee Oct 03 '21 at 15:53
  • Sorry about the misunderstanding. I wanted to decode them, but since they are not being translated properly, I just want to drop them using the following: `df.drop(df[df.employer_name.str.contains(r'<(?:U\+)?((?<=\+)[0-9a-fA-F]{4}|(?<=<)[0-9a-fA-F]{2})>')].index,inplace=True)` –  Oct 03 '21 at 16:33
  • I can only repeat that dropping anything which contains even a single accented character (or mathematical symbol, curly quote, emoji, etc) seems quite misdirected. This continues to look like an [XY problem](https://en.wikipedia.org/wiki/XY_problem) and feel like Groundhog Day. – tripleee Oct 03 '21 at 17:30
  • Thanks a lot for the help. Have a great one. –  Oct 03 '21 at 18:46
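
For reference, the accent-dropping approach mentioned in the comments above (NFKD normalization followed by keeping only ASCII) might look roughly like this. The function name `strip_accents` is hypothetical, and as noted in the comments, the result is lossy: non-Latin text such as Cyrillic disappears entirely rather than being transliterated.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented characters, then discard everything non-ASCII."""
    # NFKD splits e.g. 'Á' into 'A' plus a combining acute accent.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding with errors='ignore' drops the combining marks
    # (and any other character that has no ASCII form).
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(strip_accents('Álomékszer Kft.'))  # Alomekszer Kft.
print(strip_accents('Åbo Akademi'))      # Abo Akademi
print(repr(strip_accents('Юпитер')))     # '' (Cyrillic has no ASCII decomposition)
```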

This needs some work. I just translated it manually. Here it is:

>>> '<U+042E><U+043F><U+0438><U+0442><U+0435><U+0440>'
'<U+042E><U+043F><U+0438><U+0442><U+0435><U+0440>' # not useful!
>>> '\u042E\u043F\u0438\u0442\u0435\u0440' # changed the format manually
'Юпитер' # WOW that's it

I can't find a way to do it automatically. Hope it might be helpful.
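
One possible way to automate the manual step above is to rewrite each marker as a Python escape sequence and then let the `unicode_escape` codec interpret it. This is a sketch, not a tested general solution: the helper names `to_python_escapes` and `decode_markers` are invented here, and it assumes the input is pure ASCII and contains no literal backslashes.

```python
import re


def to_python_escapes(text):
    # Turn <U+XXXX> into a backslash-u escape and <xx> into a backslash-x escape.
    text = re.sub(r'<U\+([0-9a-fA-F]{4})>', r'\\u\1', text)
    return re.sub(r'<([0-9a-fA-F]{2})>', r'\\x\1', text)


def decode_markers(text):
    # Assumption: text is ASCII-only; encode('ascii') raises otherwise.
    return to_python_escapes(text).encode('ascii').decode('unicode_escape')


print(decode_markers('<U+042E><U+043F><U+0438><U+0442><U+0435><U+0440>'))  # Юпитер
print(decode_markers('<c5>bo Akademi'))  # Åbo Akademi
```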

Nima