How to convert fancy/artistic unicode text to ASCII?

Question

I have a unicode string like " " and would like to convert it to the ASCII form "thug life".

I know I can achieve this in Python by

import unidecode
print(unidecode.unidecode(' '))
# thug life

However, this would also asciify other Unicode characters (such as Chinese/Japanese characters, emojis, accented characters, etc.), which I want to preserve.

Is there a way to detect these types of "artistic" Unicode characters?

Some more examples:

ｔｈｕｇｌｉｆｅ

Thanks for your help!

cf. [NFC/NFD/NFKD/NFKC](https://en.wikipedia.org/wiki/Unicode_equivalence) and [`''.normalize()`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize) – dakab, Nov 20 '20 at 16:23

score 6 · Accepted Answer · answered Jul 09 '20 at 16:46

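import unicodedata

# Sample inputs: the same phrase written in different stylized alphabets
strings = [
    ' ',
    ' ',
    ' ',
    ' ',
    'ｔｈｕｇ ｌｉｆｅ']
for x in strings:
    # NFKC applies compatibility decomposition, then canonical composition
    print(unicodedata.normalize('NFKC', x), x)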

Output: .\62803325.py

thug life  
thug life  
thug life  
thug life  
thug life ｔｈｕｇ ｌｉｆｅ

Resources:

unicodedata — Unicode Database
Normalization forms for Unicode text
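
To leave CJK text, emoji, and accented letters untouched, as the question asks, one option is to apply NFKC one character at a time and accept the result only when it is pure ASCII. This is a minimal sketch under that assumption, not a complete solution: `is_stylized` and `asciify_stylized` are hypothetical helper names, and the rule also folds symbols such as "²" or "™" that you may prefer to keep.

```python
import unicodedata


def is_stylized(ch: str) -> bool:
    """Heuristic: True if NFKC turns this character into different, pure-ASCII text."""
    folded = unicodedata.normalize("NFKC", ch)
    return folded != ch and folded.isascii()


def asciify_stylized(text: str) -> str:
    """Fold only the characters flagged by is_stylized; keep everything else as-is."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_stylized(ch) else ch
        for ch in text
    )


if __name__ == "__main__":
    # Fullwidth letters are folded; kanji, accented letters and emoji are kept.
    print(asciify_stylized("ｔｈｕｇ ｌｉｆｅ 日本語 café 😀"))
```

Working per character keeps the check simple. Like plain NFKC, it does not help with look-alike letters borrowed from other scripts, because those have no compatibility mapping.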

answered by JosefZ

Does not work with these cases: " ", "︎︎︎︎", "ωεłł", "ʜᴀᴋsʜɪ", "", "", "RᗅIPႮ", "ғʀᴇᴇ", "ᕼᗩᑭᑭIᗴᗴ", "υεεɴ". – Gokul NC Sep 17 '21 at 15:03
@GokulNC You need [Romanization/transliteration of Unicode text](https://stackoverflow.com/questions/9842527/) IMHO… See also https://pypi.org/project/Unidecode/ – JosefZ Sep 17 '21 at 16:23
@GokulNC, did you figure out how to convert such fonts to unicode with Python? If yes, please comment your approach here. – Naveen Reddy Marthala Oct 18 '21 at 12:14
As commented above, `Unidecode` was able to clean most of them, although it did not completely 100% solve the problem. – Gokul NC Oct 18 '21 at 15:28
Yes, that was the case for me too. I wanted to know whether you found a way to normalise fonts like " ", "︎︎︎︎", "ωεłł", "ʜᴀᴋsʜɪ", "", "", "RᗅIPႮ", "ғʀᴇᴇ", "ᕼᗩᑭᑭIᗴᗴ", "υεεɴ" that you mentioned in your comment. – Naveen Reddy Marthala Oct 19 '21 at 05:17
I mean, the library mentioned in the answer (`unicodedata`) is different from what I said (`unidecode`). The latter solved the problem. What I meant by "_did not 100% solve_" is that some transliterations were wrong. For example, "ᕼᗩᑭᑭIᗴᗴ" was converted as "hpokikiIgaga". I have now [raised an issue](https://github.com/avian2/unidecode/issues/72) in the package's repo mentioning these cases. Feel free to add more examples there :) – Gokul NC Oct 30 '21 at 08:11
@GokulNC because `ᕼᗩᑭᑭIᗴᗴ` is `Canadian Syllabics Nunavut H`, `Canadian Syllabics Carrier Po`, `Canadian Syllabics Ki`, `Canadian Syllabics Ki`, `Latin Capital Letter I`, `Canadian Syllabics Carrier Ga`, `Canadian Syllabics Carrier Ga`. – JosefZ Oct 30 '21 at 18:07
Yeah you are right. So I am looking for ways to do an appearance-based conversion rather than approximate-phonetic conversion. – Gokul NC Oct 31 '21 at 08:10
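
An appearance-based conversion like the one described in the last comment can be sketched with a look-alike table and `str.translate`. The table below is hypothetical and hand-picked for the single example from this thread; the chosen Latin letters are my own reading of the glyph shapes, not data from `unidecode` or `unicodedata`.

```python
# Hypothetical, hand-picked look-alike table (an assumption, not library data).
# Keys are the Canadian Syllabics characters named in the comment above.
LOOKALIKES = str.maketrans({
    "ᕼ": "H",
    "ᗩ": "A",
    "ᑭ": "P",
    "ᗴ": "E",
})


def fold_lookalikes(text: str) -> str:
    """Replace characters found in the table; leave all others unchanged."""
    return text.translate(LOOKALIKES)


if __name__ == "__main__":
    print(fold_lookalikes("ᕼᗩᑭᑭIᗴᗴ"))
```

A real table would need many more entries, and it would also rewrite genuine text in those scripts, so it only makes sense when the input is known to be decorative.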
