FileNotFoundError in Python during Arabic text analysis

Question

I have a folder containing 150 Arabic text files, and I want to measure the similarity between each pair of them. How can I do that? I tried what is explained here:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f) for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

but I ran into a problem when declaring documents, so I modified it like this:

from sklearn.feature_extraction.text import TfidfVectorizer

text_files= r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training set\5K\ST"
for f in text_files:
    documents= open(f, 'r', encoding='utf-8-sig').read()
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

but it raises this error:

documents= open(f, 'r', encoding='utf-8-sig').read()
FileNotFoundError: [Errno 2] No such file or directory: 'C'

Is there a solution?

Edit:

I tried this also:

from sklearn.feature_extraction.text import TfidfVectorizer

import os

text_files= os.listdir(r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training set\5K\ST")

documents= []
for f in text_files:
    file= open(f, 'r', 'utf-8-sig')
    documents.append(file.read())
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

and this error occurred:

file= open(f, 'r', 'utf-8-sig')
TypeError: an integer is required (got type str)

There is no problem with comparison of Arabic texts, but with the path to your file. Is the `ST` really a text file? It seems more like a folder. — Lenka Vraná, Dec 13 '18 at 16:32
Moreover, your for loop now iterates over the characters of the string, which means that `open()` receives only one letter of `text_files` at a time. You could try `documents = open(text_files, 'r', encoding='utf-8-sig').read()` if `text_files` really is a path to an existing file. — Lenka Vraná, Dec 13 '18 at 16:42
ST is a folder that contains the text files I want to compare with each other. — Nujud Ali, Dec 13 '18 at 17:07
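
The point about looping over a string can be checked with a minimal sketch. The path below is a hypothetical stand-in, not the asker's folder; it only shows that iterating over a path string yields single characters, which is why `open()` was asked to open a file named 'C'.

```python
# Hypothetical path, used only to illustrate the behaviour.
path = r"C:\some\folder"

# Iterating over a string yields its characters one at a time,
# so the first value handed to open() would be just 'C'.
first_items = [ch for ch in path][:3]

print(first_items)  # ['C', ':', '\\']
```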

Lenka Vraná · Accepted Answer · 2018-12-16T20:42:11.580

3

You don't have a problem with the comparison of Arabic texts. You have trouble loading the documents into Python.

If ST is a folder, you need to get the list of all the files inside the folder:

import os

inputDir = r'your/path/here'
text_files = os.listdir(inputDir)

documents = []
for f in text_files:
    # os.listdir() returns bare file names, so join each one with the folder path
    with open(os.path.join(inputDir, f), 'r', encoding='utf-8-sig') as file:
        documents.append(file.read())

The current version of your code also keeps only the last document from the loop, not all of them. However, that is another issue for another question.
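
As a rough sketch of the same loading step wrapped in a reusable function: the name `load_documents`, the sorting, and the skipping of subfolders are assumptions added for illustration, not part of the answer above. The demo writes two small files to a temporary folder so it can run anywhere.

```python
import os
import tempfile


def load_documents(input_dir, encoding="utf-8-sig"):
    """Return (file_names, documents) for every regular file in input_dir."""
    file_names = []
    documents = []
    # sorted() gives a reproducible order; os.listdir() order is arbitrary.
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue  # skip subfolders
        # "with" closes the file even if reading fails.
        with open(path, "r", encoding=encoding) as handle:
            documents.append(handle.read())
        file_names.append(name)
    return file_names, documents


# Demo on a throwaway folder with two hypothetical files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name, text in [("a.txt", "مرحبا بالعالم"), ("b.txt", "مرحبا")]:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8-sig") as handle:
            handle.write(text)
    names, documents = load_documents(demo_dir)

print(names)           # ['a.txt', 'b.txt']
print(len(documents))  # 2
```

The resulting `documents` list is a list of strings, which is the input that `TfidfVectorizer().fit_transform(documents)` from the question expects. Keeping `names` alongside it makes it possible to map a row or column index of the similarity matrix back to a file.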

edited Dec 16 '18 at 20:42

answered Dec 13 '18 at 17:15


So how can I modify the code to work with multiple files inside the folder? It showed this error when I tried to run `file= open(f, 'r', 'utf-8-sig')`: `TypeError: an integer is required (got type str)`. – Nujud Ali Dec 13 '18 at 17:29
Did you try the `os.listdir()` function as I have suggested? – Lenka Vraná Dec 13 '18 at 17:33
You have to keep the order of the parameters or you need to use their names. The third parameter in the function `open()` is called `buffering` and expects integers. You probably meant the `encoding` parameter? – Lenka Vraná Dec 14 '18 at 12:21
Yes, the encoding parameter, and the order is correct; I have used it in this order many times. – Nujud Ali Dec 14 '18 at 14:59
You can check the parameters of the `open()` function [here](https://docs.python.org/3/library/functions.html#open). The first parameter is called `file`, the second `mode`, the third `buffering` and the fourth `encoding`. Therefore you need to either pass the encoding in the fourth position or explicitly write `encoding = 'utf-8-sig'`. Please check the updated example in my answer. – Lenka Vraná Dec 16 '18 at 11:16
Thanks for your help. This error appeared when I tried to run your edit: `text_files = os.listdir(inputDir)` raised `TypeError: listdir: path should be string, bytes, os.PathLike or None, not list`. – Nujud Ali Dec 19 '18 at 12:29
I suppose you define the inputDir as `inputDir = r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training set\5K\ST"`? – Lenka Vraná Dec 19 '18 at 14:08
I think it worked, and this is one of the results: `(0, 60) 0.299395929151`. Do you know what it means? Thanks for your help. – Nujud Ali Dec 21 '18 at 11:09
This is the rest of the code: `tfidf = TfidfVectorizer().fit_transform(documents)`, then `pairwise_similarity = tfidf * tfidf.T`, then `print(pairwise_similarity)`. – Nujud Ali Dec 21 '18 at 11:13
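
On the `open()` parameter order discussed in the comments above, here is a small sketch that can be run to see the difference. The temporary file and the variable names are assumptions made for the demo. The exact wording of the `TypeError` message varies between Python versions, so only the exception type is recorded.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    # Create a small throwaway file to read back.
    path = os.path.join(demo_dir, "sample.txt")
    with open(path, "w", encoding="utf-8-sig") as handle:
        handle.write("نص")

    # The third positional parameter is buffering, which must be an integer,
    # so passing the encoding string there is rejected.
    try:
        open(path, "r", "utf-8-sig")
        positional_error = None
    except TypeError as exc:
        positional_error = type(exc).__name__

    # Passing the encoding by name works.
    with open(path, "r", encoding="utf-8-sig") as handle:
        text_keyword = handle.read()

    # Passing it in the fourth position also works (-1 is the default buffering).
    with open(path, "r", -1, "utf-8-sig") as handle:
        text_fourth = handle.read()

print(positional_error)             # TypeError
print(text_keyword == text_fourth)  # True
```

Using the keyword form is usually the easier option to read, since it does not depend on remembering the position of `buffering`.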
