
I've been trying to write code for sentence splitting, and it worked very well in English and other left-to-right languages written in Latin script. When I tried to do the same with Arabic, the text came out totally disconnected, with each letter rendered individually. I'm not sure what the problem is.

My input text:

عندما يريد العالم أن يتكلّم، فهو يتحدّث بلغة يونيكود. سجّل الآن لحضور المؤتمر الدولي العاشر ليونيكود، الذي سيعقد في آذار بمدينة مَايِنْتْس، ألمانيا. و سيجمع المؤتمر بين خبراء من كافة قطاعات الصناعة على الشبكة العالمية انترنيت ويونيكود، حيث ستتم، على الصعيدين الدولي والمحلي على حد سواء مناقشة سبل استخدام يونكود في النظم القائمة وفيما يخص التطبيقات الحاسوبية، الخطوط، تصميم النصوص والحوسبة متعددة اللغات.

My code:

# -*- coding: utf-8 -*-

import nltk
from nltk import sent_tokenize

import codecs
import csv

sentences = codecs.open('SampleArabic.txt', 'r', 'utf-8-sig').read()

def split_sentences(sentences):
    with codecs.open('Output_AR.txt', 'w', encoding='utf-8') as writer:
        newcount = 0
        for sent in sent_tokenize(sentences):
            print(sent.encode('utf-8'))
            wr = csv.writer(writer,delimiter='\n')
            wr.writerow(str(sent))
            newcount = sentences.count(sentences)+newcount
        print(newcount)
    pass

split_sentences(sentences)

My first issue is that the console prints the text as raw bytes:

b'\xd8\xb9\xd9\x86\xd8\xaf\xd9\x85\xd8\xa7 \xd9\x8a\xd8\xb1\xd9\x8a\xd8\xaf \xd8\xa7\xd9\x84\xd8\xb9\xd8\xa7\xd9\x84\xd9\x85 \xd8\xa3\xd9\x86 \xd9\x8a\xd8\xaa\xd9\x83\xd9\x84\xd9\x91\xd9\x85 \xe2\x80\xac \xd8\x8c \xd9\x81\xd9\x87\xd9\x88 \xd9\x8a\xd8\xaa\xd8\xad\xd8\xaf\xd9\x91\xd8\xab \xd8\xa8\xd9\x84\xd8\xba\xd8\xa9 \xd9\x8a\xd9\x88\xd9\x86\xd9\x8a\xd9\x83\xd9\x88\xd8\xaf.'
b'\xd8\xb3\xd8\xac\xd9\x91\xd9\x84 \xd8\xa7\xd9\x84\xd8\xa2\xd9\x86 \xd9\x84\xd8\xad\xd8\xb6\xd9\x88\xd8\xb1 \xd8\xa7\xd9\x84\xd9\x85\xd8\xa4\xd8\xaa\xd9\x85\xd8\xb1 \xd8\xa7\xd9\x84\xd8\xaf\xd9\x88\xd9\x84\xd9\x8a \xd8\xa7\xd9\x84\xd8\xb9\xd8\xa7\xd8\xb4\xd8\xb1 \xd9\x84\xd9\x8a\xd9\x88\xd9\x86\xd9\x8a\xd9\x83\xd9\x88\xd8\xaf\xd8\x8c \xd8\xa7\xd9\x84\xd8\xb0\xd9\x8a \xd8\xb3\xd9\x8a\xd8\xb9\xd9\x82\xd8\xaf \xd9\x81\xd9\x8a \xd8\xa2\xd8\xb0\xd8\xa7\xd8\xb1 \xd8\xa8\xd9\x85\xd8\xaf\xd9\x8a\xd9\x86\xd8\xa9 \xd9\x85\xd9\x8e\xd8\xa7\xd9\x8a\xd9\x90\xd9\x86\xd9\x92\xd8\xaa\xd9\x92\xd8\xb3\xd8\x8c \xd8\xa3\xd9\x84\xd9\x85\xd8\xa7\xd9\x86\xd9\x8a\xd8\xa7.'
b'\xd9\x88 \xd8\xb3\xd9\x8a\xd8\xac\xd9\x85\xd8\xb9 \xd8\xa7\xd9\x84\xd9\x85\xd8\xa4\xd8\xaa\xd9\x85\xd8\xb1 \xd8\xa8\xd9\x8a\xd9\x86 \xd8\xae\xd8\xa8\xd8\xb1\xd8\xa7\xd8\xa1 \xd9\x85\xd9\x86 \xd9\x83\xd8\xa7\xd9\x81\xd8\xa9 \xd9\x82\xd8\xb7\xd8\xa7\xd8\xb9\xd8\xa7\xd8\xaa \xd8\xa7\xd9\x84\xd8\xb5\xd9\x86\xd8\xa7\xd8\xb9\xd8\xa9 \xd8\xb9\xd9\x84\xd9\x89 \xd8\xa7\xd9\x84\xd8\xb4\xd8\xa8\xd9\x83\xd8\xa9 \xd8\xa7\xd9\x84\xd8\xb9\xd8\xa7\xd9\x84\xd9\x85\xd9\x8a\xd8\xa9 \xd8\xa7\xd9\x86\xd8\xaa\xd8\xb1\xd9\x86\xd9\x8a\xd8\xaa \xd9\x88\xd9\x8a\xd9\x88\xd9\x86\xd9\x8a\xd9\x83\xd9\x88\xd8\xaf\xd8\x8c \xd8\xad\xd9\x8a\xd8\xab \xd8\xb3\xd8\xaa\xd8\xaa\xd9\x85\xd8\x8c \xd8\xb9\xd9\x84\xd9\x89 \xd8\xa7\xd9\x84\xd8\xb5\xd8\xb9\xd9\x8a\xd8\xaf\xd9\x8a\xd9\x86 \xd8\xa7\xd9\x84\xd8\xaf\xd9\x88\xd9\x84\xd9\x8a \xd9\x88\xd8\xa7\xd9\x84\xd9\x85\xd8\xad\xd9\x84\xd9\x8a \xd8\xb9\xd9\x84\xd9\x89 \xd8\xad\xd8\xaf \xd8\xb3\xd9\x88\xd8\xa7\xd8\xa1 \xd9\x85\xd9\x86\xd8\xa7\xd9\x82\xd8\xb4\xd8\xa9 \xd8\xb3\xd8\xa8\xd9\x84 \xd8\xa7\xd8\xb3\xd8\xaa\xd8\xae\xd8\xaf\xd8\xa7\xd9\x85 \xd9\x8a\xd9\x88\xd9\x86\xd9\x83\xd9\x88\xd8\xaf \xd9\x81\xd9\x8a \xd8\xa7\xd9\x84\xd9\x86\xd8\xb8\xd9\x85 \xd8\xa7\xd9\x84\xd9\x82\xd8\xa7\xd8\xa6\xd9\x85\xd8\xa9 \xd9\x88\xd9\x81\xd9\x8a\xd9\x85\xd8\xa7 \xd9\x8a\xd8\xae\xd8\xb5 \xd8\xa7\xd9\x84\xd8\xaa\xd8\xb7\xd8\xa8\xd9\x8a\xd9\x82\xd8\xa7\xd8\xaa \xd8\xa7\xd9\x84\xd8\xad\xd8\xa7\xd8\xb3\xd9\x88\xd8\xa8\xd9\x8a\xd8\xa9\xd8\x8c \xd8\xa7\xd9\x84\xd8\xae\xd8\xb7\xd9\x88\xd8\xb7\xd8\x8c \xd8\xaa\xd8\xb5\xd9\x85\xd9\x8a\xd9\x85 \xd8\xa7\xd9\x84\xd9\x86\xd8\xb5\xd9\x88\xd8\xb5 \xd9\x88\xd8\xa7\xd9\x84\xd8\xad\xd9\x88\xd8\xb3\xd8\xa8\xd8\xa9 \xd9\x85\xd8\xaa\xd8\xb9\xd8\xaf\xd8\xaf\xd8\xa9 \xd8\xa7\xd9\x84\xd9\x84\xd8\xba\xd8\xa7\xd8\xaa.'
3

But I think that is the minor problem.

The main issue, as I mentioned before, is that the output text file has the text totally disconnected.

In Notepad it looks like this: https://i.stack.imgur.com/Fhmqh.png

And in Notepad++ it looks like this: https://i.stack.imgur.com/gcA6z.png

I'm using Python 3.4, and this is only my second attempt with Python, so I might need some extra details.

ένας
  • "My first issue is .." – the `b` indicates you are printing byte strings, not regular strings. Also, does your console support Arabic? 'The main issue [..] is that the output text file has the text totally disconnected" – that is not a Python issue. The output printing engine *must support Arabic*. Then you get properly connected text. – Jongware Oct 11 '18 at 08:19
  • Try this: https://stackoverflow.com/a/51981566/610569 – alvas Oct 11 '18 at 08:44

1 Answer


First of all, I don't think nltk supports Arabic, so sent_tokenize won't work properly. If you look at the source code, you can see that it defaults to English when no language is specified.
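As a quick illustration of how that language parameter behaves (the 'arabic' call is my assumption of what you would see, since as far as I know no Arabic Punkt model ships with the nltk data):

from nltk import sent_tokenize

text = "First sentence. Second sentence."
print(sent_tokenize(text))                      # uses the default English model
print(sent_tokenize(text, language='english'))  # same thing, explicitly
# sent_tokenize(text, language='arabic') should raise a LookupError,
# because there is no tokenizers/punkt/arabic.pickle in the nltk data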

Your code example does not have the correct indentation.

Next, function names should start with a lowercase letter; only class names should be capitalized. See PEP 8 -- Style Guide for Python Code.

Your print(sent.encode('utf-8')) is what causes the console output: what you see is the bytes version of whatever string sent_tokenize considers to be a sentence. See the documentation for str.encode(). If you want it to look "normal", just do print(sent).
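A minimal demonstration, assuming (as Jongware noted in the comments) that your console font and encoding can actually display Arabic:

s = 'سجّل الآن'
print(s.encode('utf-8'))  # b'\xd8\xb3\xd8\xac\xd9\x91\xd9\x84 ...' - the raw UTF-8 bytes
print(s)                  # سجّل الآن - the string itself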

Lastly, I don't see a reason to write to csv here. In fact, csv.writer is the likely cause of your disconnected output file: writerow() expects a sequence of fields, and iterating over a string yields its individual characters, so wr.writerow(str(sent)) writes every Arabic letter as its own field, separated by your '\n' delimiter. One letter per line means each letter is rendered in its isolated form, which is exactly the disconnected text you see. If you want to output the text to a file, you can simply do

with open('Output_AR.txt', 'w', encoding='utf-8') as f:
    for sent in sent_tokenize(sentences):
        f.write(sent + '\n')  # one sentence per line

Or just write all the lines to the file at once like so:

with open('Output_AR.txt', 'w', encoding='utf-8') as f:
    # writelines() does not add newlines itself, so append them
    f.writelines(sent + '\n' for sent in sent_tokenize(sentences))

I don't really understand what you are trying to do with newcount, but you can just do

with open('Output_AR.txt', 'w', encoding='utf-8') as f:
    for i, sent in enumerate(sent_tokenize(sentences)):
        # str.format() instead of an f-string, since f-strings need Python 3.6+
        f.write('{} {}\n'.format(i, sent))

if you want to include the sentence number (which it looks like you do).
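If newcount was instead meant to count the sentences, note that sentences.count(sentences) is always 1 (a string contains itself exactly once), so your loop only ever counts its own iterations. You can get the number directly; something along these lines should do:

newcount = len(sent_tokenize(sentences))
print(newcount)  # prints 3 for your sample text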

Most likely, though, what you want to do won't work properly, because nltk doesn't support the language. Check this out to see if it helps you: Python Arabic NLP
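If nltk turns out not to help, one crude fallback is to split on sentence-final punctuation yourself. This is only a sketch, not a real tokenizer (naive_sentence_split is just an illustrative name), and it assumes sentences end with '.', '!', '?' or the Arabic question mark '؟':

import re

def naive_sentence_split(text):
    # split after ., !, ? or the Arabic question mark,
    # keeping the punctuation attached to its sentence
    parts = re.split(r'(?<=[.!?؟])\s+', text.strip())
    return [p for p in parts if p]

with open('SampleArabic.txt', encoding='utf-8-sig') as f:
    text = f.read()

for sent in naive_sentence_split(text):
    print(sent)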

Lomtrur