Tokenizing non-English text in Python into sentences

Question

I have an Arabic text file that looks like this:

اغاني و اغانياخلاق تربطنا ساخنه بن الخطاب حريم منتدى نضال و امراه اخرى قابيل و قوموا جميعا حاله الجو متى و انا نحن احبابك رامي مرض النقرس ماذا تاكل‪.‬ افضل من قلب راشد ليش اتعب دار

I want to generate a list of sentences from this paragraph using Python, assuming each sentence ends with a dot.

I found this answer: Tokenizing non English Text in Python

It splits the text into words, but not into sentences.

I also tried this:

from nltk.tokenize import sent_tokenize, word_tokenize
import regex
text = "اغاني و اغانياخلاق تربطنا ساخنه بن الخطاب حريم منتدى نضال و امراه اخرى قابيل و قوموا جميعا حاله الجو متى و انا نحن احبابك رامي مرض النقرس ماذا تاكل‪.‬ افضل من قلب راشد ليش اتعب" 
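# Extract runs of letters after removing the zero-width non-joiner (U+200C)
words = regex.findall(r'\p{L}+', text.replace('\u200c', ''))
# Ask NLTK to split the text into sentences
print(sent_tokenize(text))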

It returned the text with the dot wrapped in '\u202a' and '\u202c', not split into sentences:

زيز 240 و انا بدرب منال تاريخ\u202a.\u202c برقاء
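
For context, U+202A and U+202C are invisible Unicode bidirectional formatting characters, so the dot is present in the text but surrounded by control characters. A minimal sketch for inspecting them, using only the standard library; the `sample` string is a shortened stand-in for the output above and `describe_format_chars` is a hypothetical helper name:

```python
import unicodedata

# Shortened stand-in for the output above: a dot wrapped in two invisible characters.
sample = "تاريخ\u202a.\u202c برقاء"

def describe_format_chars(s):
    """Return (code point, Unicode name) for every format (Cf) character in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch))
            for ch in s if unicodedata.category(ch) == "Cf"]

for code, name in describe_format_chars(sample):
    print(code, name)
```

This prints the code point and official Unicode name of each invisible formatting character, which helps confirm what has to be stripped before splitting.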

NB: The sentence doesn't make any sense; it is just an example in Arabic characters.

I need the output to be in the form of sentences:

[اغاني و اغانياخلاق تربطنا ساخنه , بن الخطاب حريم منتدى نضال و امراه , انا نحن,  احبابك رامي مرض , النقرس ماذا]

which means:

[sentence 1, sentence 2, sentence 3]

For those who don't read Arabic, can you [edit] your question and add your desired output as well? — Jongware, Nov 23 '18 at 18:14
@usr2564301 Edited his question with translations *(using [google translate](https://translate.google.com/))*. It may not be 100% accurate, but someone needed to do this at the least, as leaving it untranslated limits the target audience. Hope this helps — Jab, Nov 23 '18 at 18:32
@Jaba: but that does not show a list of sentences, does it? I don't need to know what it *means*, I wanted to see where that long input line needs breaking on. — Jongware, Nov 23 '18 at 19:09
I just edited the original post. The paragraph doesn't make any sense. It is just an example. I have a text file with similar paragraphs and I need to tokenise it to sentences. — J.Doe, Nov 23 '18 at 20:25
I understand from your comment that sentences should be separated with a character. The problem is that I have a doc.text full of a text content without any characters separating between sentences. But let's assume that each sentence ends with a dot. Is there a way to tokenise the non english text? — J.Doe, Nov 24 '18 at 00:21
Thank you for your answer. I am really sorry, but I am quite new to Python. Can you please show me how to add result = text.split('.') if I use the regex script mentioned in the question? — J.Doe, Nov 24 '18 at 01:14
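
The `text.split('.')` idea from the comments could be sketched as follows. This is a minimal sketch, not a definitive tokenizer: it assumes every sentence ends with a plain `.` (so abbreviations and decimal numbers would also be split), and the names `INVISIBLE` and `split_sentences` are hypothetical. It first removes the invisible characters seen in the question (U+200C and the bidirectional controls U+202A to U+202E), then splits on the dot.

```python
import re

# Invisible characters to drop before splitting: the zero-width non-joiner
# (U+200C) and the bidirectional formatting controls U+202A..U+202E.
INVISIBLE = re.compile("[\u200c\u202a-\u202e]")

def split_sentences(text):
    """Split text on '.' after removing invisible formatting characters."""
    cleaned = INVISIBLE.sub("", text)
    # Trim whitespace around each piece and skip empty pieces (e.g. after a final dot).
    return [part.strip() for part in cleaned.split(".") if part.strip()]

# Short excerpt of the paragraph from the question, including the wrapped dot.
sample = "ماذا تاكل\u202a.\u202c افضل من قلب راشد"
print(split_sentences(sample))
```

For a whole file, the same function can be applied to the string read with `open(path, encoding="utf-8").read()`, where `path` stands for the location of the text file.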
