Purifying a text string in Python

Question

This is a continuation of this question. I have this string:

s = 'A ligeira raposa marrom ataca o cão preguiçoso Быстрая коричневая лиса прыгает через ленивую собаку +='

I would like to keep the Russian letters and remove the rest. To do that, I would like to get all the possible letters of the Portuguese alphabet, so that I can apply the same filter to any line.

My question: is it possible to get all the possible letters of a given language from a website, or directly from the computer itself? Whichever is easier.

Thanks & Best Regards

Michael

Maybe `s.encode()` (encoding to UTF-8) can help you with some ideas. At least for this example, the representation in bytes looks very different for each language word. — boechat107, Apr 13 '20 at 10:02

score 1 · Answer 1 · edited Jun 03 '20 at 22:15


You can use str.translate to remove characters from a string by mapping them to nothing. I am using some string constants here (see, for example, string.ascii_letters):

from string import ascii_letters, digits, punctuation

s = 'A ligeira raposa marrom ataca o cão preguiçoso Быстрая коричневая лиса прыгает через ленивую собаку +='

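# the first two arguments of str.maketrans map characters one-to-one (unused here);
# every character in the third argument is deleted from the result.
# string.punctuation already contains "+" and "=", so they need no separate entry
to_be_removed = ascii_letters + digits + punctuation + "áâãàçéêíóôõú"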
t = str.maketrans("", "", to_be_removed)
k = s.translate(t)

print(k.strip())

Output

Быстрая коричневая лиса прыгает через ленивую собаку

To handle other text you would need to add more non-ASCII letters to the removal string so that they are removed as well. I took these manually from "Portuguese orthography: Diacritics", which is a one-time manual effort.
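As a minimal sketch of extending that removal list: the hand-typed string above contains only lowercase accented letters, so an uppercase "É" would survive. The helper name strip_portuguese and the variable accented_lower are made up for this illustration, and the list is still not guaranteed to cover every character that can appear in Portuguese text.

```python
from string import ascii_letters, digits, punctuation

# Lowercase accented letters from the answer above (hand-collected list).
accented_lower = "áâãàçéêíóôõú"

# Assumption for this sketch: adding the uppercase variants is enough
# for the input at hand; other Latin-script letters would still pass.
to_be_removed = (
    ascii_letters + digits + punctuation
    + accented_lower + accented_lower.upper()
)
table = str.maketrans("", "", to_be_removed)


def strip_portuguese(text):
    """Delete every blacklisted character, then trim outer whitespace."""
    return text.translate(table).strip()


s = ('A ligeira raposa marrom ataca o cão preguiçoso '
     'Быстрая коричневая лиса прыгает через ленивую собаку +=')
print(strip_portuguese(s))
```

Building the translation table once and reusing it keeps the per-line cost low when the same filter is applied to many lines.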

edited Jun 03 '20 at 22:15

jsbueno

answered Apr 13 '20 at 10:08

Patrick Artner

hi, thanks for the reply. However, this works only if I want to remove the Latin scripts. Thanks & Best Regards Michael – Alain Michael Janith Schroter Apr 13 '20 at 11:05

@michael In this case, "keeping the Russian script" equates to "removing ASCII + Portuguese special diacritics"? – Patrick Artner Apr 13 '20 at 11:07
@downvoter - leaving a comment on what made you downvote allows me to fix it. – Patrick Artner Apr 14 '20 at 17:37

Downvoter here: the approach is far from optimal, since there are 100,000+ Unicode characters - explicitly blacklisting what you want to remove is hardly practical. There is metadata about the characters themselves which can be used for filtering in cases like this. – jsbueno Jun 03 '20 at 12:51
@jsbueno your answer is superior - and the only upvote on it is by me. My answer is feasible for keeping Russian and removing Portuguese *shrug* - that's what was needed. I learned from yours that you can query for meta info as well. Thanks for the feedback. – Patrick Artner Jun 03 '20 at 16:34
I usually don't downvote things that at least work - and I usually don't downvote without commenting - more likely I was having a bad day. The system won't allow me to remove the downvote if the question is not edited, though - I will do a "no-op" edit so I can change it. – jsbueno Jun 03 '20 at 22:15

score 1 · Accepted Answer · answered Apr 14 '20 at 15:36

Python's tools for dealing with Unicode include the unicodedata module, which has what is needed for this. Testing "character by character" against every possible accented Latin letter in an "if-esque" structure not only looks and feels bad: it is a bad approach.

One of the most basic tools for dealing with Unicode is looking up the character names themselves: all Latin letters have "LATIN" in their name, and all Cyrillic letters have "CYRILLIC" in their name.

In [1]: import unicodedata                                                                                          

In [2]: unicodedata.name("ã")                                                                                       
Out[2]: 'LATIN SMALL LETTER A WITH TILDE'

In [3]: unicodedata.name("ы")                                                                                       
Out[3]: 'CYRILLIC SMALL LETTER YERU'
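To see which scripts a string actually contains, the character names can be tallied. This is only a sketch: treating the first word of the name as the script is a heuristic (it is not an official Unicode script lookup), and script_counts is a name invented for this example.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by the first word of their Unicode name.

    Heuristic: 'LATIN SMALL LETTER A WITH TILDE' -> 'LATIN'.
    Non-letters (spaces, digits, punctuation) are skipped.
    """
    counts = Counter()
    for char in text:
        if char.isalpha():
            # The "" default avoids a ValueError for unnamed characters.
            name = unicodedata.name(char, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


print(script_counts("raposa лиса"))
```

A tally like this is a quick way to decide which keyword to filter on before writing the actual filter.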

Your strategy will vary if you want to keep whitespace, digits, and so on - but basically, if you want to remove all non-Cyrillic characters:

In [7]: s = 'A ligeira raposa marrom ataca o cão preguiçoso Быстрая коричневая лиса прыгает через ленивую собаку +='
   ...:                                                                                                             

In [8]: print(''.join(char for char in s if 'CYRILLIC' in unicodedata.name(char)))                                  
Быстраякоричневаялисапрыгаетчерезленивуюсобаку
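The one-liner above also drops the spaces between words. A possible variant that keeps whitespace is sketched below; keep_cyrillic and its keep_spaces parameter are hypothetical names, and collapsing runs of whitespace into a single space is a choice made for this example, not something the question requires. Passing a default to unicodedata.name also avoids the ValueError it raises for characters without a name, such as "\n".

```python
import unicodedata


def keep_cyrillic(text, keep_spaces=True):
    """Keep Cyrillic characters and, optionally, whitespace."""
    kept = []
    for char in text:
        # Unnamed characters (e.g. control characters) yield "".
        name = unicodedata.name(char, "")
        if "CYRILLIC" in name or (keep_spaces and char.isspace()):
            kept.append(char)
    # split() + join collapses whitespace runs and trims both ends.
    return " ".join("".join(kept).split())


s = ('A ligeira raposa marrom ataca o cão preguiçoso '
     'Быстрая коричневая лиса прыгает через ленивую собаку +=')
print(keep_cyrillic(s))
```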

And conversely, if you want to keep everything else and remove only the Latin characters:

In [9]: print(''.join(char for char in s if 'LATIN' not in unicodedata.name(char)))                                 
        Быстрая коричневая лиса прыгает через ленивую собаку +=

With that information alone it is possible to achieve your objective, although characters carry more Unicode metadata than their name, such as their "category". If you need to refine your filters, unicodedata.category(...) returns a two-character code for a character's category. All letters (regardless of alphabet) have "L" in the first position of that code, for example:

In [10]: unicodedata.category("a")                                                                                  
Out[10]: 'Ll'

In [11]: unicodedata.category("ã")                                                                                  
Out[11]: 'Ll'

In [12]: unicodedata.category("л")                                                                                  
Out[12]: 'Ll'

In [13]: unicodedata.category("A")                                                                                  
Out[13]: 'Lu'

In [14]: unicodedata.category("2")                                                                                  
Out[14]: 'Nd'
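Combining both pieces of metadata gives a more refined filter. The sketch below removes every letter that is not from the wanted script while leaving digits, whitespace and punctuation alone; drop_foreign_letters and its script parameter are names made up for this illustration, and matching the script by a substring of the character name remains a heuristic.

```python
import unicodedata


def drop_foreign_letters(text, script="CYRILLIC"):
    """Remove letters outside `script`; keep all non-letters."""
    out = []
    for char in text:
        # Category codes starting with "L" mark letters (Ll, Lu, ...).
        is_letter = unicodedata.category(char).startswith("L")
        in_script = script in unicodedata.name(char, "")
        if not is_letter or in_script:
            out.append(char)
    return "".join(out)


print(drop_foreign_letters("abc 42 αβγ собака!"))
```

Unlike the 'LATIN' not in name filter, this version also drops letters from any third script (Greek in the sample), which may or may not be what a given use case wants.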

score -1 · Answer 3 · answered Apr 13 '20 at 09:57


This does not seem to be a Python related question and I would also say it's not programming related.

However - as always there is an answer on the StackExchange network, this time on the linguistics site: https://linguistics.stackexchange.com/questions/28766/character-sets-for-top-100-languages-as-opposed-to-unicode

answered Apr 13 '20 at 09:57

Vaiden

If the OP is using Python for their code, it is obviously "Python related", since they will have immediate access to Python tools for dealing with Unicode, such as the stdlib `unicodedata` module (which has the tools to deal with this "conundrum") – jsbueno Apr 14 '20 at 15:19
