2

This is a continuation of this question. I have this string:

s = 'A ligeira raposa marrom ataca o cão preguiçoso Быстрая коричневая лиса прыгает через ленивую собаку +='

I would like to keep the Russian letters and remove the rest. To do that, I would like to get all the possible letters of the Portuguese alphabet so that I can apply the same filter to any line.

My question: is it possible to get all possible letters of a certain language from a website, or directly from the computer itself? Whichever is easier.

Thanks & Best Regards

Michael


3 Answers

1

You can use str.translate to remove characters from a string by mapping them to nothing. I am using some string constants here (see, e.g., string.ascii_letters):

from string import ascii_letters, digits, punctuation

s = 'A ligeira raposa marrom ataca o cão preguiçoso Быстрая коричневая лиса прыгает через ленивую собаку +='

# first + second string are translations, last string will be removed from result

to_be_removed = ascii_letters + digits + punctuation + "+=áâãàçéêíóôõú"
t = str.maketrans("", "", to_be_removed)
k = s.translate(t)

print(k.strip())

Output

Быстрая коричневая лиса прыгает через ленивую собаку

To remove other non-ASCII letters as well, you would need to append them to to_be_removed. I took the accented letters manually from the "Diacritics" section of the Portuguese orthography article, which is a one-time manual effort.
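Note that the list above contains only lowercase accented letters. A minimal sketch of the same approach that also covers their uppercase forms, assuming the twelve accented letters listed above are the ones you care about (strip_portuguese is a hypothetical helper name, not part of the standard library):

```python
from string import ascii_letters, digits, punctuation

# Accented lowercase letters taken from the answer above; this list is an
# assumption and may not cover every character in your data.
accented = "áâãàçéêíóôõú"

# Add the uppercase variants so that words like "ÁGUA" are removed too.
to_be_removed = ascii_letters + digits + punctuation + accented + accented.upper()
table = str.maketrans("", "", to_be_removed)

def strip_portuguese(text):
    """Delete every character listed in to_be_removed (hypothetical helper)."""
    return text.translate(table).strip()

print(strip_portuguese("ÁGUA Ção Быстрая"))
```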

Patrick Artner
  • hi, thanks for the reply. However, this works only if I want to remove the Latin scripts. Thanks & Best Regards Michael – Alain Michael Janith Schroter Apr 13 '20 at 11:05
  • 1
    @michael In this case, does "keeping the Russian script" equate to "removing ASCII + Portuguese special diacritics"? – Patrick Artner Apr 13 '20 at 11:07
  • @downvoter - leaving a comment on what made you downvote allows me to fix it. – Patrick Artner Apr 14 '20 at 17:37
  • 1
    Downvoter here: the approach is far from optimal, since there are 100,000+ Unicode characters; explicitly blacklisting what you want to remove would hardly be truly functional. There is metadata about the characters themselves which can be used for filtering in cases like this. – jsbueno Jun 03 '20 at 12:51
  • @jsbueno your answer is superior - and the only upvote on it is by me. My answer is feasible for keeping Russian and removing Portuguese *shrug* - that's what was needed. I learned from yours that you can query for meta info as well. Thanks for the feedback. – Patrick Artner Jun 03 '20 at 16:34
  • I usually don't downvote things that at least work - and I usually don't downvote without commenting - it is more likely I was having a bad day. The system won't allow me to remove the downvote if the question is not edited, though - I will do a "no op" edit so I can change it. – jsbueno Jun 03 '20 at 22:15
1

Python's tools for dealing with Unicode include the unicodedata module, which has some tools to deal with this. Testing things on a character-by-character basis, and trying to check for all possible combinations of accented Latin letters in an if-like structure, not only looks and feels bad: it is a bad approach.

One of the most basic tools for dealing with Unicode is getting the character names themselves: all Latin letters have "LATIN" in their name, and all Cyrillic characters have "CYRILLIC" in their name.

In [1]: import unicodedata                                                                                          

In [2]: unicodedata.name("ã")                                                                                       
Out[2]: 'LATIN SMALL LETTER A WITH TILDE'

In [3]: unicodedata.name("ы")                                                                                       
Out[3]: 'CYRILLIC SMALL LETTER YERU'
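The same name lookup can be used to list the letters of a script directly from the interpreter's own Unicode database. A minimal sketch, assuming "letters of a script" means code points in a letter category whose name contains the script keyword (letters_of_script is a hypothetical helper, not part of unicodedata). Note that this yields every Cyrillic letter known to your Python version, not only the ones used in Russian; unicodedata has no per-language alphabets.

```python
import sys
import unicodedata

def letters_of_script(keyword):
    """Return all letters whose Unicode name contains `keyword` (hypothetical helper)."""
    found = []
    for codepoint in range(sys.maxunicode + 1):
        char = chr(codepoint)
        # name() returns the default "" for unnamed code points instead of raising
        name = unicodedata.name(char, "")
        if keyword in name and unicodedata.category(char).startswith("L"):
            found.append(char)
    return found

cyrillic = letters_of_script("CYRILLIC")
# The exact count depends on the Unicode version bundled with your Python.
print(len(cyrillic), "".join(cyrillic[:10]))
```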

Your strategy will vary if you want to keep whitespace, digits, and so on, but basically, if you want to remove all non-Cyrillic characters:

In [7]: s = 'A ligeira raposa marrom ataca o cão preguiçoso Быстрая коричневая лиса прыгает через ленивую собаку +='

In [8]: print(''.join(char for char in s if 'CYRILLIC' in unicodedata.name(char)))                                  
Быстраякоричневаялисапрыгаетчерезленивуюсобаку

Conversely, if you want to keep everything else and remove only the Latin characters:

In [9]: print(''.join(char for char in s if 'LATIN' not in unicodedata.name(char)))                                 
        Быстрая коричневая лиса прыгает через ленивую собаку +=
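The two one-liners above either drop the spaces between words or leave stray leading whitespace. A minimal sketch of a middle ground, assuming you want Cyrillic characters plus single spaces between words (keep_cyrillic is a hypothetical helper). It also passes a default to unicodedata.name, because that function raises ValueError for characters without a name, such as a newline.

```python
import unicodedata

def keep_cyrillic(text, keep_whitespace=True):
    """Keep Cyrillic characters and, optionally, whitespace (hypothetical helper)."""
    kept = []
    for char in text:
        # Default "" avoids ValueError for unnamed characters such as "\n"
        name = unicodedata.name(char, "")
        if "CYRILLIC" in name or (keep_whitespace and char.isspace()):
            kept.append(char)
    # Collapse runs of whitespace and trim both ends
    return " ".join("".join(kept).split())

s = 'A ligeira raposa marrom ataca o cão preguiçoso Быстрая коричневая лиса прыгает через ленивую собаку +='
print(keep_cyrillic(s))
```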

With that information alone, it is possible to achieve your objective, although characters carry more Unicode metadata than their name, such as their "category". If you need to refine your filters, unicodedata.category(...) returns a two-character code for a character's category. All letters (regardless of alphabet) have "L" in the first position of that code, for example:

In [10]: unicodedata.category("a")                                                                                  
Out[10]: 'Ll'

In [11]: unicodedata.category("ã")                                                                                  
Out[11]: 'Ll'

In [12]: unicodedata.category("л")                                                                                  
Out[12]: 'Ll'

In [13]: unicodedata.category("A")                                                                                  
Out[13]: 'Lu'

In [14]: unicodedata.category("2")                                                                                  
Out[14]: 'Nd'
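Name and category can be combined. A minimal sketch, assuming the goal is to drop every letter that is not Cyrillic while leaving digits, punctuation and whitespace untouched (drop_non_cyrillic_letters is a hypothetical helper). The NFC normalization step is an addition of mine: without it, an accent stored as a separate combining mark (category "Mn") would not be recognized as part of a Latin letter and would survive the filter.

```python
import unicodedata

def drop_non_cyrillic_letters(text):
    """Remove letters of any script except Cyrillic (hypothetical helper)."""
    # Compose base letter + combining accent into a single code point where possible
    text = unicodedata.normalize("NFC", text)

    def keep(char):
        if unicodedata.category(char).startswith("L"):
            # It is a letter: keep it only if it is Cyrillic
            return "CYRILLIC" in unicodedata.name(char, "")
        # Digits, punctuation, whitespace, etc. are kept as they are
        return True

    return "".join(char for char in text if keep(char))

print(drop_non_cyrillic_letters("abc 123 лиса αβ!"))
```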

jsbueno
-1

This does not seem to be a Python-related question, and I would also say it is not programming-related.

However - as always there is an answer on the StackExchange network, this time on the linguistics site: https://linguistics.stackexchange.com/questions/28766/character-sets-for-top-100-languages-as-opposed-to-unicode

Vaiden
  • 1
    If the OP is using Python for their code, it is obviously "Python related", since they will have immediate access to Python tools to deal with Unicode, such as Python's stdlib `unicodedata` module (which has the tools to deal with this "conundrum") – jsbueno Apr 14 '20 at 15:19