I have a training file and a testing file, and I want to detect emotion in tweets using machine learning algorithms. In the code below I apply the preprocessing steps to the Arabic training dataset, but the error shown at the end is raised while removing stop words. Do I need to install an Arabic stopwords file, or can I import one from NLTK?
import re
import pandas as pd
from pandas import DataFrame
from nltk.tokenize import word_tokenize
from nltk.stem.isri import ISRIStemmer

# CSV file for training
df = pd.read_csv("C:/Users/User/Desktop/2018-EI-oc-Ar-fear-train.csv")
# CSV file for testing
df_test = pd.read_csv("C:/Users/User/Desktop/2018-EI-oc-Ar-fear-test-gold.csv")
def stopWordRmove(text):
    ar_stop_list = open("ar_stop_word_list.txt", "r")  # the error is raised on this line
    stop_words = ar_stop_list.read().split('\n')
    needed_words = []
    words = word_tokenize(text)
    for w in words:
        if w not in stop_words:
            needed_words.append(w)
    filtered_sentence = " ".join(needed_words)
    return filtered_sentence
def noramlize(Tweet):
    # Unify common letter variants
    Tweet = re.sub(r"[إأٱآا]", "ا", Tweet)
    Tweet = re.sub(r"ى", "ي", Tweet)
    Tweet = re.sub(r"ؤ", "ء", Tweet)
    Tweet = re.sub(r"ئ", "ء", Tweet)
    # Keep only Arabic letters and spaces; the range starts at ء so the
    # hamza produced by the substitutions above is not stripped out
    Tweet = re.sub(r'[^ء-ي ]', "", Tweet)
    noise = re.compile(""" ّ    | # Tashdid
                           َ    | # Fatha
                           ً    | # Tanwin Fath
                           ُ    | # Damma
                           ٌ    | # Tanwin Damm
                           ِ    | # Kasra
                           ٍ    | # Tanwin Kasr
                           ْ    | # Sukun
                           ـ     # Tatwil/Kashida
                       """, re.VERBOSE)
    Tweet = re.sub(noise, '', Tweet)
    return Tweet
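As a sanity check on the normalization, here is a tiny example I would run (the sample tweet and the expected result are my own, not from the dataset):

# Example: hamza/alef variants are unified and diacritics are stripped
print(noramlize("إنَّ الخوفَ شعورٌ طبيعيٌّ"))
# expected: ان الخوف شعور طبيعي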
def stemming(Tweet):
    # Reduce each token to its root with NLTK's ISRI Arabic stemmer
    st = ISRIStemmer()
    stemmed_words = []
    words = word_tokenize(Tweet)
    for w in words:
        stemmed_words.append(st.stem(w))
    stemmed_sentence = " ".join(stemmed_words)
    return stemmed_sentence
def prepareDataSets(df):
    sentences = []
    for index, r in df.iterrows():
        # Chain the three preprocessing steps on the same text and
        # collect the result (the original never appended to sentences)
        text = stopWordRmove(r['Tweet'])
        text = noramlize(text)
        text = stemming(text)
        sentences.append([text, r['Affect Dimension']])
    df_sentences = DataFrame(sentences, columns=['Tweet', 'Affect Dimension'])
    return df_sentences
preprocessed_df = prepareDataSets(df)
Running this raises:

FileNotFoundError: [Errno 2] No such file or directory: 'ar_stop_word_list.txt'
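I understand the immediate cause: open() resolves ar_stop_word_list.txt against the current working directory. This is the workaround I would try first, assuming the list file actually exists somewhere, e.g. on the Desktop (the path is hypothetical, from my own setup):

from nltk.tokenize import word_tokenize

# Hypothetical absolute path; adjust to wherever the list actually lives
STOP_FILE = "C:/Users/User/Desktop/ar_stop_word_list.txt"

def stopWordRmove(text):
    # Read the list with an explicit path and encoding; the with-block
    # also closes the file handle, which the original code never did
    with open(STOP_FILE, "r", encoding="utf-8") as f:
        stop_words = set(f.read().split('\n'))
    words = word_tokenize(text)
    return " ".join(w for w in words if w not in stop_words)

But I would rather not maintain my own list file at all.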
How can I remove stopwords from Arabic tweets?
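From what I can tell, recent versions of NLTK's stopwords corpus include an Arabic list, so something like this sketch should work without any external file (treat the 'arabic' fileid as an assumption to verify after downloading the corpus):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')  # one-time download of NLTK's stopword lists
nltk.download('punkt')      # tokenizer models used by word_tokenize

# Build the stopword set once instead of re-reading a file per tweet
arabic_stops = set(stopwords.words('arabic'))

def stopWordRmove(text):
    # Drop every token that appears in the Arabic stopword set
    words = word_tokenize(text)
    return " ".join(w for w in words if w not in arabic_stops)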