4

I need a way to transliterate words from various languages into the Latin alphabet, spelled the way an English speaker would read them. For example, привет (Russian) sounds like privet (in English).

Meaning and grammar don't matter; I just want the result to sound as close to the original as possible. Everything should be in Python. I have searched the internet diligently and haven't found a good approach.

For example, something similar to this:

translit("юу со беутифул", "ru") = juu so beutiful

translit("кар", "ru") = kar
Alex
  • There is no easy solution to this problem, but you could create a mapping file for letters so that п maps to P. You will have to manually create the files though. – Max Paymar Feb 23 '17 at 17:58
  • So you are looking for a cyrillic to roman/latin converter and not a language converter? Have you searched "cyrillic to roman converter library" or "python cyrillic to roman converter"? – scrappedcola Feb 23 '17 at 18:00
  • I saw a couple of approaches for a couple of languages. But I don't need a super solution; something simple will do. I think there is a big package for these things. – Alex Feb 23 '17 at 18:00
  • I understand from your example that translit doesn't do what you want, have you tried creating a custom language pack? – fvu Feb 23 '17 at 18:01
  • Not only Cyrillic; I need an approach for other languages too. There is a constant stream of words, and I have to convert them into English. – Alex Feb 23 '17 at 18:01
  • The conversion above is good for me, but Russian alone isn't sufficient. – Alex Feb 23 '17 at 18:03
  • Without using a specially built dictionary, that seems to be an impossible endeavor. In many languages, the same letters sound differently depending on the words they are used in. Heck, in some languages there are even words that have the exact same spelling yet different pronunciation (eg: French “le *président*” => the President / “ils *président* la réunion” => they chair the meeting ; first sounds like in English, second sounds like "presid"). – spectras Feb 23 '17 at 18:09
  • Thanks all, I hadn't thought about that. It seems I don't need a very accurate solution. If some words lose their correct pronunciation, it doesn't matter. You know, like when you don't have your own keyboard and type with another language's characters. – Alex Feb 23 '17 at 18:15
  • @spectras: Are you sure about "ils *président*"? My French is really rusty, but I think both are pronounced something like "presiDOH" (with a nasalized final vowel, obviously). – Schmuddi Feb 23 '17 at 18:21
  • @Schmuddi> absolutely certain. First example has the final nasal. In the second example however, it's the plural form of verb “présider”. Such plural forms are silent, unless the next word starts with a vowel (in which case the liaison can be pronounced as just “t”). Full disclosure: I am from Paris ;) – spectras Feb 23 '17 at 18:25
  • @spectras: Fair enough, I stand corrected. Things like these are why French and I never got along so well. This, and stuff like the subjonctif. :) – Schmuddi Feb 23 '17 at 18:30
  • "ils président" ->ils president ( it doesn't matter as you spell over syllables – Alex Feb 23 '17 at 18:36

4 Answers

6

Maybe you should give unidecode a try:

>>> import unidecode
>>> unidecode.unidecode("юу со беутифул")
'iuu so beutiful'
>>> unidecode.unidecode("die größten Probleme")
'die grossten Probleme'
>>> unidecode.unidecode("Avec Éloïse, ils président à l'assemblée")
"Avec Eloise, ils president a l'assemblee"

Install it with pip:

pip3 install unidecode
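
unidecode applies one generic table to every language, which is why "größten" above comes out as "grossten" rather than the conventional German "groessten". If the language of the input is known, a small pre-pass can apply language-specific rules first. The following is only a sketch for Python 3 using the standard library; the table `GERMAN_PREPASS` and the function name `german_prepass` are assumptions made for illustration and are not part of unidecode:

```python
# Hypothetical German-specific table (an assumption for illustration,
# not something unidecode provides).
GERMAN_PREPASS = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def german_prepass(text):
    """Expand German umlauts and sharp S; leave everything else as-is."""
    return text.translate(GERMAN_PREPASS)


print(german_prepass("die größten Probleme"))  # die groessten Probleme
```

The result can then be handed to unidecode.unidecode to deal with any remaining non-ASCII characters.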
lenz
  • Definitely better than transliterate: it cannot do the reverse direction, but it decodes all characters. wp – Kruupös Feb 24 '17 at 09:15
  • Definitely a great library made with nice hand-crafted tables! – alvas Feb 24 '17 at 09:59
  • Being hand-crafted is great, but it's also the reason why it's one-way – you only get ASCII output, nothing else. A modified version of @alvas' approach works well if you want to remove diacritics ("accents"), but stay within a script, e.g. "ὰ"→"α". – lenz Feb 24 '17 at 12:25
  • Just want to add that for German this is not quite right: 'größten' should be 'groessten'. – dre_84w934 Aug 08 '18 at 07:47
  • @dre_84w934 Indeed, `unidecode` doesn't have language-specific rules (that would require it to do language identification first). – lenz Aug 08 '18 at 08:53
2

Maybe you are already using it, but you can use the transliterate package.

Basic install with pip:

pip install transliterate

Then the code

# coding: utf-8

from transliterate import translit

print(translit(u"юу со беутифул", 'ru', reversed=True))  # juu so beutiful
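
To get a feel for what such a reverse mapping involves, here is a minimal standard-library sketch for Python 3 built on a hand-written table. The table `RU_TO_LATIN` and the function `ru_to_latin` are illustrative assumptions, not the tables that transliterate ships, so the output may differ from the package for some letters:

```python
# Hand-written, simplified Russian-to-Latin table (an assumption for
# illustration; not the mapping used by the transliterate package).
RU_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "jo",
    "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "h", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "sch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "ju", "я": "ja",
}

# Derive the uppercase entries, e.g. "Ж" -> "Zh".
RU_TO_LATIN.update(
    {key.upper(): value.capitalize() for key, value in list(RU_TO_LATIN.items())}
)

# str.maketrans accepts a dict that maps single characters to strings
# of any length, including the empty string.
_TABLE = str.maketrans(RU_TO_LATIN)


def ru_to_latin(text):
    """Replace each Cyrillic letter; leave every other character untouched."""
    return text.translate(_TABLE)


print(ru_to_latin("привет"))          # privet
print(ru_to_latin("юу со беутифул"))  # juu so beutiful
```

A table like this is easy to tune by ear (for instance "ю" as "yu" instead of "ju"), which matters here because the goal is similar sound rather than a formal standard.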

WITH CUSTOM CLASS

As @Schmuddi proposed, you can create your own custom class to handle German special characters (this works only with Python 3.x, though).

pip3 install transliterate

Then the code:

# coding: utf-8

from transliterate import translit
from transliterate.base import TranslitLanguagePack, registry

class GermanLanguagePack(TranslitLanguagePack):
    language_code = "de"
    language_name = "Deutsch"

    pre_processor_mapping = {
        u"ß": u"ss",
    }

    mapping = (
        u"ÄÖÜäöü",
        u"AOUaou",
    )

registry.register(GermanLanguagePack)

print(translit(u"Die größten Katzenrassen der Welt", "de"))
# Die grossten Katzenrassen der Welt

Bonus, the French one:

class FrenchLanguagePack(TranslitLanguagePack):
    language_code = "fr"
    language_name = "French"

    pre_processor_mapping = {
        u"œ": u"oe",
        u"Œ": u"OE",
        u"æ": u"ae",
        u"Æ": u"AE",
    }


    mapping = (
        u"àâçéèêëïîôùûüÿÀÂÇÉÈÊËÏÎÔÙÛÜŸ",
        u"aaceeeeiiouuuyAACEEEEIIOUUUY"
    )


registry.register(FrenchLanguagePack)

print(translit(u"Avec Éloïse, ils président à l'assemblée", 'fr'))
# Avec Eloise, ils president a l'assemblee
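
As the examples above suggest, the two strings in `mapping` are paired up by position, so they need the same number of characters; a letter missing from one of them would shift every later pair. A small standard-library check, run before registering a pack, can catch that. This is only a sketch for Python 3, and the helper `paired_mapping` is a hypothetical name, not part of transliterate:

```python
import unicodedata


def paired_mapping(source_chars, target_chars):
    """Pair up two equal-length strings character by character."""
    # Normalize to NFC so that each accented letter is a single code point;
    # decomposed input would otherwise count as two characters per letter.
    source_chars = unicodedata.normalize("NFC", source_chars)
    if len(source_chars) != len(target_chars):
        raise ValueError(
            "mapping strings differ in length: %d vs %d"
            % (len(source_chars), len(target_chars))
        )
    return dict(zip(source_chars, target_chars))


pairs = paired_mapping(
    "àâçéèêëïîôùûüÿÀÂÇÉÈÊËÏÎÔÙÛÜŸ",
    "aaceeeeiiouuuyAACEEEEIIOUUUY",
)
print(pairs["É"], pairs["ç"])  # E c
```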

OTHER POSSIBLE SOLUTION

Since transliterate doesn't cover the German language (yet?), you can use another package to translate sentences directly: py-translate. It uses Google Translate, though, so you do need an internet connection.

Basic install with pip:

pip install py-translate

Then your code:

# coding: utf-8

from translate import translator

print(translator('ru', 'en', u"юу со беутифул"))
print(translator('de', 'en', u"Die größten Katzenrassen der Welt"))
Kruupös
  • Thanks, but I've already used it. It is not sufficient. There are words in German. – Alex Feb 23 '17 at 18:09
  • Only Armenian, Bulgarian (beta), Georgian, Greek, Macedonian (alpha), Mongolian (alpha), Russian and Ukrainian (beta) are supported. – Alex Feb 23 '17 at 18:09
  • There is also the package that uses Google Translate, but you need an internet connection and it's not really relevant. Sorry that I misunderstood your question. – Kruupös Feb 23 '17 at 18:11
  • @Alex: Have you noticed that you can register your own transliteration mapping for ``transliterate``? Writing a mapping for German to something resembling English seems to be a fairly easy task. I don't think that you'll find any closer match for your problem. – Schmuddi Feb 23 '17 at 18:15
  • thanks for your efforts, but i don't need an online tool(: – Alex Feb 24 '17 at 11:27
1

Here's an alternative to @lenz's solution. But I do like @lenz's suggestion of unidecode better =)

From "Python - Replace non-ascii character in string (»)" and "Can somone explain how unicodedata.normalize(form, unistr) work with examples?"

To strip umlauts, acute accents and grave accents:

>>> import re
>>> from unicodedata import normalize
>>> re.sub(r'[^\x00-\x7f]',r'', normalize('NFD', u"Avec Éloïse, ils président à l'assemblée"))
u"Avec Eloise, ils president a l'assemblee"

But it doesn't handle the sharp S (ß) or Cyrillic; those characters are simply dropped:

>>> re.sub(r'[^\x00-\x7f]',r'', normalize('NFD', u"die größten Probleme"))
u'die groten Probleme'

>>> re.sub(r'[^\x00-\x7f]',r'', normalize('NFD', u"юу со беутифул"))
u'  '
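
A variant of the same idea, sketched here for Python 3, drops only the combining marks instead of every non-ASCII character, so letters without a decomposition (such as ß) and other scripts survive instead of disappearing. The function name `strip_diacritics` is an assumption for illustration:

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks but keep every base character."""
    # NFD splits "é" into "e" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for base characters.
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


print(strip_diacritics("Avec Éloïse, ils président à l'assemblée"))
# Avec Eloise, ils president a l'assemblee
print(strip_diacritics("die größten Probleme"))
# die großten Probleme
```

Note that this also strips marks that belong to a letter's identity in some alphabets; for example, Cyrillic "й" becomes "и". It removes diacritics only and does not transliterate anything.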
alvas
  • "ß" (LATIN SMALL LETTER SHARP S) is not the same thing as "β" (GREEK SMALL LETTER BETA)! – lenz Feb 24 '17 at 12:18
  • The NFD approach is pretty sweet, but it's a pity there are no decomposed versions of characters like "ø", "œ" etc. So for a good coverage, there's no way around hand-crafted tables. – lenz Feb 24 '17 at 12:32
  • Thanks! hahaha, sharp S it is. Edited the answer. – alvas Feb 25 '17 at 01:13
1

This is another possible solution using regex. You can configure this function to replace each special character with whatever character you want:

import re

def remove_accents(string):
    # Accept raw bytes as well as text; decode bytes as UTF-8 first.
    if isinstance(string, bytes):
        string = string.decode('utf-8')

    # Replace each group of accented lowercase letters with its base letter.
    string = re.sub(u"[àáâãäå]", 'a', string)
    string = re.sub(u"[èéêë]", 'e', string)
    string = re.sub(u"[ìíîï]", 'i', string)
    string = re.sub(u"[òóôõö]", 'o', string)
    string = re.sub(u"[ùúûü]", 'u', string)
    string = re.sub(u"[ýÿ]", 'y', string)

    return string
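
The same replacements can also be done in a single pass by building one pattern and looking each match up in a dictionary, which keeps the configuration in one place. This is a sketch for Python 3; the names `GROUPS`, `CHAR_MAP`, `PATTERN` and `remove_accents_single_pass` are assumptions for illustration:

```python
import re

# Same character groups as in the function above, keyed by replacement.
GROUPS = {
    "a": "àáâãäå",
    "e": "èéêë",
    "i": "ìíîï",
    "o": "òóôõö",
    "u": "ùúûü",
    "y": "ýÿ",
}

# Flatten into one lookup table: accented character -> base letter.
CHAR_MAP = {ch: base for base, chars in GROUPS.items() for ch in chars}

# One character class that matches any key of CHAR_MAP.
PATTERN = re.compile("[" + "".join(re.escape(ch) for ch in CHAR_MAP) + "]")


def remove_accents_single_pass(text):
    """Replace every mapped character in one scan of the string."""
    return PATTERN.sub(lambda match: CHAR_MAP[match.group(0)], text)


print(remove_accents_single_pass("ils président à l'assemblée"))
# ils president a l'assemblee
```

Like the original function, this only covers the lowercase letters listed; uppercase letters and anything else would need their own entries in `GROUPS`.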
AlvaroAV