I have an Arabic text file that looks like this:

اغاني و اغانياخلاق تربطنا ساخنه بن الخطاب حريم منتدى نضال و امراه اخرى قابيل و قوموا جميعا حاله الجو متى و انا نحن احبابك رامي مرض النقرس ماذا تاكل‪.‬ افضل من قلب راشد ليش اتعب دار

I want to generate a list of sentences from this paragraph using Python, assuming each sentence ends with a dot.

I found this answer: Tokenizing non English Text in Python

It is splitting text into words but not into sentences.

I also tried this:

from nltk.tokenize import sent_tokenize, word_tokenize
import regex
text = "اغاني و اغانياخلاق تربطنا ساخنه بن الخطاب حريم منتدى نضال و امراه اخرى قابيل و قوموا جميعا حاله الجو متى و انا نحن احبابك رامي مرض النقرس ماذا تاكل‪.‬ افضل من قلب راشد ليش اتعب" 
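# str.replace() takes a literal string, so pass the character itself rather than a regex class
words = regex.findall(r'\p{L}+', text.replace('\u200c', ''))  # word tokens (letters only)
print(sent_tokenize(text))  # sentence tokens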

It returned the text unsplit, with the dot wrapped in '\u202a' and '\u202c':

زيز 240 و انا بدرب منال تاريخ\u202a.\u202c برقاء

NB: The sentence doesn't make any sense; it is just an example in Arabic characters.

I need the output to be in the form of sentences:

[اغاني و اغانياخلاق تربطنا ساخنه , بن الخطاب حريم منتدى نضال و امراه , انا نحن,  احبابك رامي مرض , النقرس ماذا]

which means:

[sentence 1, sentence 2, sentence 3]
KC.
J.Doe
  • For those who don't read Arabic, can you [edit] your question and add your desired output as well? – Jongware Nov 23 '18 at 18:14
  • @usr2564301 Edited the question with translations *(using [Google Translate](https://translate.google.com/))*. May not be 100% accurate, but someone needed to do this at the least, since leaving it untranslated limits the target audience. Hope this helps – Jab Nov 23 '18 at 18:32
  • @Jaba: but that does not show a list of sentences, does it? I don't need to know what it *means*, I wanted to see where that long input line needs breaking on. – Jongware Nov 23 '18 at 19:09
  • I just edited the original post. The paragraph doesn't make any sense. It is just an example. I have a text file with similar paragraphs and I need to tokenise it to sentences. – J.Doe Nov 23 '18 at 20:25
  • How does one know where a sentence ends in the Arabic language? – user8408080 Nov 23 '18 at 22:31
  • I understand from your comment that sentences should be separated by some character. The problem is that I have a doc.text file full of text without any characters separating the sentences. But let's assume that each sentence ends with a dot. Is there a way to tokenise the non-English text? – J.Doe Nov 24 '18 at 00:21
  • That would basically be `result = text.split('.')`. – Jongware Nov 24 '18 at 00:56
  • Thank you for your answer. I am really sorry, but I am quite new to Python. Can you please show me how to add `result = text.split('.')` if I use the regex script mentioned in the question? – J.Doe Nov 24 '18 at 01:14
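
The `text.split('.')` suggestion from the comments can be sketched as follows. This is only a minimal illustration, assuming every sentence really does end with a plain dot and that the invisible characters around it (such as '\u202a' and '\u202c') are Unicode format characters that are safe to drop. The function name `split_sentences` and its `delimiter` parameter are hypothetical, not part of NLTK or any other library.

```python
import unicodedata


def split_sentences(text, delimiter="."):
    """Split text into sentences on a delimiter (hypothetical helper).

    Assumption: characters in Unicode category "Cf" (format characters,
    e.g. U+202A, U+202C, U+200C) carry no meaning for this text and can
    be removed before splitting.
    """
    # Remove invisible format characters so they do not cling to the dot.
    cleaned = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Split on the delimiter, trim whitespace, and skip empty pieces.
    return [part.strip() for part in cleaned.split(delimiter) if part.strip()]


# Shortened sample; the directional marks are written as escapes so they are visible.
sample = "اغاني و اغاني\u202a.\u202c افضل من قلب"
print(split_sentences(sample))
```

Note that dropping every format character also removes the zero-width non-joiner (U+200C); if that character matters for the text at hand, filter only the directional marks instead.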

0 Answers