I want to split a given string into segments according to the alphabets (scripts) it contains. For example, suppose the following string is given:
Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro.
Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog.
Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса.
Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче.
自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。
بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول.
The above text contains Spanish, English, Serbian, Bulgarian, Japanese, and Arabic paragraphs, in that order.
Then, after applying some magic function, I would like to get the following output:
{
"langs": [
{
"alphabet": "latin",
"text": "Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro. Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog."
},
{
"alphabet": "cyrillic",
"text": "Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса. Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче."
},
{
"alphabet": "japanese",
"text": "自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。"
},
{
"alphabet": "arabic",
"text": "بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول."
}
]
}
As you can see, some of the languages are grouped by alphabet family. For example, the Spanish and English paragraphs are grouped as latin, and the Serbian and Bulgarian paragraphs are grouped as cyrillic. This is because it is hard to identify the specific language from the letters alone, since most letters are shared between languages that use the same alphabet.
Ideally, my final output should be like this:
{
"langs": [
{
"lang": "spanish",
"text": "Los eventos automovilísticos comenzaron poco después de la construcción exitosa de los primeros automóviles a gasolina. El veloz zorro marrón saltó sobre el perezoso perro."
},
{
"lang": "english",
"text": "Motoring events began soon after the construction of the first successful gasoline-fueled automobiles. The quick brown fox jumped over the lazy dog."
},
{
"lang": "serbian",
"text": "Мотори су почели убрзо након изградње првих успешних аутомобила на бензин.Брза смеђа лисица је прескочила лењог пса."
},
{
"lang": "bulgarian",
"text": "Автомобилните събития започнаха скоро след конструирането на първите успешни автомобили с бензиново гориво. Бързата кафява лисица прескочи мързеливото куче."
},
{
"lang": "japanese",
"text": "自動車イベントは、最初の成功したガソリン燃料自動車の製造直後に始まりました。 素早い茶色のキツネは怠け者の犬を飛び越えました。"
},
{
"lang": "arabic",
"text": "بدأت أحداث السيارات بعد وقت قصير من بناء أول سيارة ناجحة تعمل بالبنزين. قفز الثعلب البني السريع فوق الكلب الكسول."
}
]
}
I need to split the text into substrings according to language. For that I am planning to use cld2, which can split text into sentences. According to my experiments, however, it does not do well when the string mixes alphabets (e.g. Cyrillic + Japanese), although it does well on text whose languages share an alphabet family (e.g. French + English).
That is why I am planning to first split the text into substrings by alphabet family and then, for each family, apply cld2 to predict the specific language.
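To make the first step of that plan concrete, here is a rough sketch of the kind of alphabet splitter I have in mind. It uses only Python's standard `unicodedata` module, so it runs offline. The names `script_of` and `split_by_script` are my own placeholders, and labelling every Han, Hiragana, or Katakana character as "japanese" is an assumption that only holds for my sample (Han characters alone could just as well be Chinese).

```python
import unicodedata

# Assumption: only these three scripts are mapped by name prefix;
# any other letter is labelled "other".
PREFIX_TO_SCRIPT = {"LATIN": "latin", "CYRILLIC": "cyrillic", "ARABIC": "arabic"}


def script_of(ch):
    """Return a coarse script label for a letter, or None for neutral characters."""
    if not ch.isalpha():
        # Digits, punctuation, whitespace and combining marks carry no script
        # of their own; they are attached to the neighbouring run.
        return None
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK", "HIRAGANA", "KATAKANA")):
        return "japanese"  # assumption: Han is treated as Japanese here
    return PREFIX_TO_SCRIPT.get(name.split(" ", 1)[0], "other")


def split_by_script(text):
    """Split text into consecutive (script, substring) runs."""
    runs = []     # each entry is [script, list_of_characters]
    pending = []  # neutral characters seen before the first letter
    for ch in text:
        script = script_of(ch)
        if script is None:
            if runs:
                runs[-1][1].append(ch)
            else:
                pending.append(ch)
        elif runs and runs[-1][0] == script:
            runs[-1][1].append(ch)
        else:
            # The script changed (or this is the first letter): start a new run.
            runs.append([script, pending + [ch]])
            pending = []
    return [(script, "".join(chars).strip()) for script, chars in runs]


if __name__ == "__main__":
    sample = "The quick brown fox. Брза смеђа лисица. 素早い茶色のキツネ。 قفز الثعلب."
    for script, chunk in split_by_script(sample):
        print(script, "->", chunk)
```

Each run could then be passed to cld2 separately. Because the split is done character by character, it does not rely on the languages being on separate lines, but a single foreign word inside a sentence would also start a new run, which may or may not be desirable.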
Other important requirements:
- The languages might not be separated cleanly by line breaks as in the example above (I did that only for simplicity and to make the problem clear).
- I need to be able to do this offline, without connecting to third-party servers such as Google's, since there will be a very large amount of data to process.
I would appreciate any ideas that you might have on the above problems. Thanks in advance.