I have a unicode string like " " and would like to convert it to the ASCII form "thug life".

I know I can achieve this in Python by

import unidecode
print(unidecode.unidecode(' '))
# thug life

However, this would also asciify other Unicode characters (such as Chinese/Japanese characters, emojis, accented characters, etc.), which I want to preserve.

Is there a way to detect this type of "artistic" Unicode character?

Some more examples:

thug life

Thanks for your help!

Gino Mempin
Martin
  • 1
    cf. [NFC/NFD/NFKD/NFKC](https://en.wikipedia.org/wiki/Unicode_equivalence) [`''..normalize()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize) – dakab Nov 20 '20 at 16:23
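To make the comment above concrete, here is a minimal Python sketch comparing the four normalization forms. The helper name `normal_forms` and the two sample characters are my own choices for illustration, not something from the question. It suggests that only the compatibility forms (NFKC/NFKD) fold a styled letter to its plain counterpart, and that NFKC keeps an accented letter composed while NFKD splits it.

```python
import unicodedata

def normal_forms(text):
    """Return the text under each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}

# Illustrative samples (assumptions, not taken from the question).
styled_t = "\U0001D565"   # MATHEMATICAL DOUBLE-STRUCK SMALL T
accented = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE

for label, sample in (("styled t", styled_t), ("accented e", accented)):
    for form, result in normal_forms(sample).items():
        # Print code points so the difference is visible in any terminal.
        print(label, form, [hex(ord(c)) for c in result])
```

If accented characters should stay as single code points, NFKC looks like the safer choice of the two compatibility forms.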

1 Answer

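import unicodedata

# Sample inputs: stylized variants of the same phrase, plus plain ASCII.
strings = [
    ' ',
    ' ',
    ' ',
    ' ',
    'thug life',
]
for x in strings:
    # NFKC replaces compatibility characters with their plain equivalents;
    # print the normalized form next to the original for comparison.
    print(unicodedata.normalize('NFKC', x), x)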

Output: .\62803325.py

thug life  
thug life  
thug life  
thug life  
thug life thug life
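Normalizing the whole string with NFKC also rewrites other compatibility characters (ligatures, superscripts, and so on). If only the "artistic" letters should be touched, one possible approach is to look at each character's compatibility decomposition tag and fold just the ones tagged as font, wide, or circled variants. The sketch below is only an illustration under that assumption: the `destyle` function and the `STYLE_TAGS` set are hypothetical names, and the choice of tags is a judgment call you may want to adjust.

```python
import unicodedata

# Assumed set of decomposition tags that mark purely stylistic variants.
STYLE_TAGS = {"<font>", "<wide>", "<circle>"}

def destyle(text, tags=STYLE_TAGS):
    """Fold styled letters to ASCII; leave every other character untouched."""
    out = []
    for ch in text:
        decomp = unicodedata.decomposition(ch)   # e.g. '<font> 0074' or ''
        tag = decomp.split(" ", 1)[0] if decomp.startswith("<") else ""
        folded = unicodedata.normalize("NFKC", ch)
        # Only replace when the tag is stylistic AND the result is plain ASCII.
        if tag in tags and folded.isascii():
            out.append(folded)
        else:
            out.append(ch)
    return "".join(out)

# Build a double-struck sample from code points (lowercase block starts at U+1D552).
styled = "".join(
    chr(0x1D552 + ord(c) - ord("a")) if c.isalpha() else c
    for c in "thug life"
)
print(destyle(styled))
print(destyle("café 日本語 😀"))  # accented, CJK and emoji characters are kept
```

Characters without a decomposition, such as CJK ideographs and emoji, and canonically decomposable ones, such as accented letters, carry none of these tags, so they pass through unchanged.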


JosefZ
  • 1
    Does not work with these cases: " ", "︎︎︎︎", "ωεłł", "ʜᴀᴋsʜɪ", "", "", "RᗅIPႮ", "ғʀᴇᴇ", "ᕼᗩᑭᑭIᗴᗴ", "υεεɴ". – Gokul NC Sep 17 '21 at 15:03
  • 1
    @GokulNC You need [Romanization/transliteration of Unicode text](https://stackoverflow.com/questions/9842527/) IMHO… See also https://pypi.org/project/Unidecode/ – JosefZ Sep 17 '21 at 16:23
  • @GokulNC, did you figure out how to convert such fonts to Unicode with Python? If yes, please comment your approach here. – Naveen Reddy Marthala Oct 18 '21 at 12:14
  • As commented above, `Unidecode` was able to clean most of them, although it did not completely 100% solve the problem. – Gokul NC Oct 18 '21 at 15:28
  • Yes, the same was the case for me. I wanted to know if you found a way to normalise fonts like " ", "︎︎︎︎", "ωεłł", "ʜᴀᴋsʜɪ", "", "", "RᗅIPႮ", "ғʀᴇᴇ", "ᕼᗩᑭᑭIᗴᗴ", "υεεɴ" that you had mentioned in your comment. – Naveen Reddy Marthala Oct 19 '21 at 05:17
  • I mean, the library mentioned in the answer (`unicodedata`) is different from what I said (`unidecode`). The latter solved the problem. What I meant by "_did not 100% solve_" is that some transliterations were wrong. For example, "ᕼᗩᑭᑭIᗴᗴ" was converted as "hpokikiIgaga". I have now [raised an issue](https://github.com/avian2/unidecode/issues/72) in the package's repo mentioning these cases. Feel free to add more examples there :) – Gokul NC Oct 30 '21 at 08:11
  • @GokulNC because `ᕼᗩᑭᑭIᗴᗴ` is `Canadian Syllabics Nunavut H`, `Canadian Syllabics Carrier Po`, `Canadian Syllabics Ki`, `Canadian Syllabics Ki`, `Latin Capital Letter I`, `Canadian Syllabics Carrier Ga`, `Canadian Syllabics Carrier Ga`. – JosefZ Oct 30 '21 at 18:07
  • Yeah you are right. So I am looking for ways to do an appearance-based conversion rather than approximate-phonetic conversion. – Gokul NC Oct 31 '21 at 08:10
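
The point made in the comments above can be checked directly: these glyphs are real letters from other scripts, so NFKC has nothing to fold. A rough Python sketch follows; `describe`, `fold_lookalikes`, and the tiny `LOOKALIKES` table are hypothetical and hand-written for this one example, not part of `unicodedata` or `unidecode`. An appearance-based conversion would need a much larger mapping of visually similar characters.

```python
import unicodedata

# Hypothetical, hand-picked lookalike table; extend it for your own data.
LOOKALIKES = {"ᕼ": "H", "ᗩ": "A", "ᑭ": "P", "ᗴ": "E"}

def describe(text):
    """Return (character, Unicode name) pairs to show where each glyph comes from."""
    return [(ch, unicodedata.name(ch, "<unnamed>")) for ch in text]

def fold_lookalikes(text, table=LOOKALIKES):
    """Replace characters by visual similarity, using the table above."""
    return "".join(table.get(ch, ch) for ch in text)

word = "ᕼᗩᑭᑭIᗴᗴ"
for ch, name in describe(word):
    print(ch, name)

# NFKC leaves the word unchanged because these letters have no compatibility mapping.
print(unicodedata.normalize("NFKC", word) == word)
print(fold_lookalikes(word))
```

The table lookup is deliberately simple: characters that are not listed pass through unchanged, so it can be applied after NFKC normalization without undoing it.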