I want to split a given string into segments by the alphabets it contains. For example, suppose the following string is given:

Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro.

Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog.

Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса.

Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче.

自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。

بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول.

The above text contains Spanish, English, Serbian, Bulgarian, Japanese, and Arabic paragraphs (the languages are listed in the same order as the paragraphs).

Then, after applying some magic function, I would like to get the following output:

{
    "langs": [
        {
            "alphabet": "latin",
            "text": "Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro. Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog."
        },
        {
            "alphabet": "cyrillic",
            "text": "Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса. Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче."
        },
        {
            "alphabet": "japanese",
            "text": "自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。"
        },
        {
            "alphabet": "arabic",
            "text": "بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول."
        }
    ]
}

As you can see, some of the languages are grouped by their alphabet family. For example, the Spanish and English paragraphs were grouped as Latin, and the Serbian and Bulgarian paragraphs were grouped as Cyrillic. This is because it is hard to detect a specific language at this stage, since most letters are shared between languages that use the same alphabet.

Ideally, my final output should be like this:

{
    "langs": [
        {
            "lang": "spanish",
            "text": "Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro."
        },
        {
            "lang": "english",
            "text": "Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog."
        },
        {
            "lang": "serbian",
            "text": "Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса."
        },
        {
            "lang": "bulgarian",
            "text":"Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче."
        },
        {
            "lang": "japanese",
            "text": "自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。"
        },
        {
            "lang": "arabic",
            "text": "بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول."
        }
    ]
}

I need to split the text into sub-strings according to the language. For that I am planning to use cld2, which can split text into sentences, but according to my experiments it does not do well when the string contains text with mixed alphabets (e.g. Cyrillic + Japanese). However, cld2 does well on text with mixed languages that share an alphabet family (e.g. French + English).

That's why I am planning to split the text into sub-strings by alphabet family first, and then apply cld2 to each family to predict the specific language.
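As a starting point, the alphabet-family split itself can be done deterministically with the standard library. The sketch below (my own illustration, not part of cld2) uses `unicodedata.name` to assign each character to a coarse family; the keyword list only covers the four families in the example and would need extending for other scripts:

```python
import unicodedata

# Coarse mapping from Unicode character names to alphabet families.
# The keyword list is illustrative; extend it for other scripts.
FAMILIES = [
    ("latin", ("LATIN",)),
    ("cyrillic", ("CYRILLIC",)),
    ("japanese", ("CJK", "HIRAGANA", "KATAKANA")),
    ("arabic", ("ARABIC",)),
]

def script_of(ch):
    """Return the alphabet family of a character, or None for
    script-neutral characters (digits, punctuation, whitespace)."""
    try:
        name = unicodedata.name(ch)
    except ValueError:  # unnamed characters, e.g. some control codes
        return None
    for family, keywords in FAMILIES:
        if name.startswith(keywords):
            return family
    return None

def split_by_alphabet(text):
    """Split text into (family, segment) runs; script-neutral
    characters are attached to the run in which they appear."""
    segments, chars, family = [], [], None
    for ch in text:
        fam = script_of(ch)
        if fam is not None and family is not None and fam != family:
            segments.append((family, "".join(chars).strip()))
            chars = []
        if fam is not None:
            family = fam
        chars.append(ch)
    if chars and family is not None:
        segments.append((family, "".join(chars).strip()))
    return segments

print(split_by_alphabet("fox Бързата 素早い"))
# → [('latin', 'fox'), ('cyrillic', 'Бързата'), ('japanese', '素早い')]
```

Each resulting segment could then be handed to cld2 for per-language detection within its family.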

Two other important requirements:

  • the mixed languages might not be separated cleanly by lines as in the above example (I did that for the sake of simplicity, to make the problem clear)
  • I need to be able to do this offline, without connecting to 3rd-party servers like Google (since there will be tons of data to handle)

I would appreciate any ideas that you might have on the above problems. Thanks in advance.

Sirojiddin Komolov
  • I think that an important part of the question has to do with character encoding. How is the input provided? I'm not very knowledgeable about this, but I think the UTF* specification can answer the first part of the question. – Erwan Dec 17 '22 at 17:40
  • The input is just text. I mean, we can encode it using any method (UTF-8, UTF-16, etc.) based on the format required by the next steps. I understand that we could do it with some manual work, i.e. checking each character against alphabet ranges and then making some assumptions. But that way is not very efficient (because of the loops, maybe even nested ones) and error-prone (because of all the possible 'if' cases). I wonder if there are machine learning ways of doing this that are faster and less error-prone. – Sirojiddin Komolov Dec 19 '22 at 20:46
  • I don't know what the source of the text is, but not every text can be represented with every encoding. In fact, the text being properly represented and usable for ML is a consequence of its correct encoding, so the character set (Latin, Arabic, etc.) can probably be determined directly. In theory the deterministic way is supposed to be faster and safer, but coding it might be more of a headache ;) – Erwan Dec 19 '22 at 21:50

1 Answer

The following solution makes use of Google Translate. Ensure that you use pip install googletrans==4.0.0-rc1 to install the 4.0.0 release candidate to avoid potential issues. Other language detection packages at the time of writing, such as langdetect and spacy_langdetect, failed to distinguish Serbian from Macedonian.

Note that all language detection modules in my experience conform to ISO 639-1 language codes, so the output will use these codes. If you need the actual language name (e.g. "Spanish" instead of "es"), you'll have to write a simple loop that makes the conversion using the produced languageDict. I believe this, along with creating a JSON-style output, is beside the main point of your question, so I have opted to omit it.
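For illustration, such a conversion loop might look like the following. The code-to-name mapping here is hand-written for the example languages only; a library such as pycountry could supply full coverage:

```python
# Hand-written ISO 639-1 code → language-name mapping for the example
# languages only; extend it (or use a library) for broader coverage.
ISO_639_1_NAMES = {"es": "spanish", "en": "english", "sr": "serbian",
                   "bg": "bulgarian", "ja": "japanese", "ar": "arabic"}

# Stand-in for the dictionary produced by the detection loop below
languageDict = {"es": ["Los eventos..."], "sr": ["Мотори су..."]}

named = {ISO_639_1_NAMES.get(code, code): lines
         for code, lines in languageDict.items()}
print(named)
# → {'spanish': ['Los eventos...'], 'serbian': ['Мотори су...']}
```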

As a side note, should you need to group the various languages based on their alphabet, this can also be done with a simple loop over the produced languageDict. Group the ISO 639-1 language codes under their alphabets and then programmatically categorise the text(s) accordingly.
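A minimal sketch of that grouping loop, with a hand-written code-to-alphabet mapping covering only the example languages:

```python
from collections import defaultdict

# Illustrative ISO 639-1 code → alphabet mapping for the example languages
ALPHABET_OF = {"es": "latin", "en": "latin", "sr": "cyrillic",
               "bg": "cyrillic", "ja": "japanese", "ar": "arabic"}

# Stand-in for the dictionary produced by the detection loop below
languageDict = {"es": ["hola"], "en": ["hello"], "ja": ["こんにちは"]}

byAlphabet = defaultdict(list)
for code, lines in languageDict.items():
    byAlphabet[ALPHABET_OF.get(code, "other")].extend(lines)
print(dict(byAlphabet))
# → {'latin': ['hola', 'hello'], 'japanese': ['こんにちは']}
```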

Solution

from googletrans import Translator
from collections import defaultdict

text = """
Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro.

Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog.

Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса.

An additional English sentence to see how it handles this.

Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче.

自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。

بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول.
"""

translator = Translator()  # Instantiate google translator
languageDict = defaultdict(list)  # Create default dictionary to elegantly store results
for line in text.splitlines():  # Iterate over text split by lines
    if line != '':  # Ignore blank lines
        detectedLang = translator.detect(line).lang  # Detect language
        languageDict[detectedLang].append(line)  # Store line under corresponding language key
print(dict(languageDict))

Output

{
'es': ['Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro.'], 
'en': ['Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog.', 'An additional English sentence to see how it handles this.'], 
'sr': ['Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса.'], 
'bg': ['Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче.'], 
'ja': ['自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。'], 
'ar': ['بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول.']
}
Kyle F Hartzenberg
  • Thank you a lot for the solution. I think I forgot to specify some other important requirements. I need to be able to do this 'offline' without connecting to 3rd-party servers (since there will be tons of data to handle). Another important requirement is that the mixed languages might not be separated cleanly by lines (I did that for the sake of simplicity, to make the problem clear). I've also edited the question to take these requirements into consideration. – Sirojiddin Komolov Dec 17 '22 at 11:56
  • @SirojiddinKomolov In that case, some decent libraries that can detect languages from mixed input in an offline environment include [polyglot](https://polyglot.readthedocs.io/en/latest/Detection.html) and [pycld2](https://pypi.org/project/pycld2/). [This question](https://stackoverflow.com/questions/39142778/how-to-determine-the-language-of-a-piece-of-text) and its answers provide a comprehensive list of options. – Kyle F Hartzenberg Dec 18 '22 at 09:23
  • Thank you for the useful links you provided. I think I can achieve some good results with the cld2 library with some tricks. Let me try it. – Sirojiddin Komolov Dec 19 '22 at 20:49