I have a folder containing 150 Arabic text files, and I want to measure how similar the files are to each other. How can I do that? I tried the approach explained here:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [open(f) for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

but I had a problem defining `documents`, so I modified the code like this:

from sklearn.feature_extraction.text import TfidfVectorizer

text_files= r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training set\5K\ST"
for f in text_files:
    documents= open(f, 'r', encoding='utf-8-sig').read()
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

but it raises this error:

documents= open(f, 'r', encoding='utf-8-sig').read()
FileNotFoundError: [Errno 2] No such file or directory: 'C'

Is there a solution?

Edit:

I also tried this:

from sklearn.feature_extraction.text import TfidfVectorizer

import os

text_files= os.listdir(r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training set\5K\ST")

documents= []
for f in text_files:
    file= open(f, 'r', 'utf-8-sig')
    documents.append(file.read())
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T

and it raised this error:

file= open(f, 'r', 'utf-8-sig')
TypeError: an integer is required (got type str)
Dominique
Nujud Ali
  • There is no problem with the comparison of Arabic texts; the problem is the path to your file. Is `ST` really a text file? It looks more like a folder. – Lenka Vraná Dec 13 '18 at 16:32
  • Moreover, your for loop now iterates over the characters of the string `text_files`, which means that `open()` receives only one letter at a time. You could try `documents = open(text_files, 'r', encoding='utf-8-sig').read()` if `text_files` really is a path to an existing file. – Lenka Vraná Dec 13 '18 at 16:42
  • ST is a folder that contains the text files I want to compare with each other. – Nujud Ali Dec 13 '18 at 17:07
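
To illustrate the point made in the comments above: iterating over a string yields one character at a time, which is why `open()` was called with `'C'`. A minimal sketch, using a shortened, hypothetical path rather than the real one:

```python
# Hypothetical path, for illustration only.
folder = r"C:\corpora\ST"

# A for loop over a string yields single characters, not file names.
first_items = [item for item in folder][:3]
print(first_items)  # ['C', ':', '\\']
```

The first character is `'C'`, which matches the file name reported in the `FileNotFoundError`.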

1 Answer

You don't have a problem with the comparison of Arabic texts. You have trouble loading the documents into Python.

If ST is a folder, you need to get the list of all the files inside the folder:

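import os

inputDir = r'your/path/here'
# os.listdir() returns bare file names, without the folder path
text_files = os.listdir(inputDir)

documents = []
for f in text_files:
    # join folder and file name so that open() receives a full path;
    # the with block closes each file after it has been read
    with open(os.path.join(inputDir, f), 'r', encoding='utf-8-sig') as file:
        documents.append(file.read())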

Your first modified version also keeps only the last document from the loop, not all of them, because `documents` is reassigned on every iteration. However, that is another issue for another question.
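
The loading step can also be wrapped in a small helper. This is only a sketch under some assumptions: `load_documents` is a hypothetical name, every regular file directly inside the folder is treated as a document, and the file names are sorted so that the document order is reproducible. `encoding` is passed by keyword because the third positional parameter of `open()` is `buffering`, which expects an integer.

```python
import os


def load_documents(input_dir, encoding="utf-8-sig"):
    """Return (names, texts) for every regular file directly inside input_dir."""
    names, texts = [], []
    # sorted() gives a reproducible order; os.listdir() order is arbitrary.
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)  # listdir() returns bare names
        if not os.path.isfile(path):
            continue  # skip subfolders
        # encoding is given by keyword, not as the third positional argument.
        with open(path, "r", encoding=encoding) as handle:
            names.append(name)
            texts.append(handle.read())
    return names, texts
```

Calling `names, documents = load_documents(inputDir)` would then give the list to pass to `TfidfVectorizer`, and `names[i]` tells you which file row `i` of the similarity matrix belongs to.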

Lenka Vraná
  • So how can I modify the code to work with multiple files inside the folder? It shows this error when I try to run `file= open(f, 'r', 'utf-8-sig')`: TypeError: an integer is required (got type str). – Nujud Ali Dec 13 '18 at 17:29
  • Did you try the `os.listdir()` function as I have suggested? – Lenka Vraná Dec 13 '18 at 17:33
  • You have to keep the order of the parameters or you need to use their names. The third parameter in the function `open()` is called `buffering` and expects integers. You probably meant the `encoding` parameter? – Lenka Vraná Dec 14 '18 at 12:21
  • Yes, I meant the encoding parameter, and the order is correct; I have used it in this order many times. – Nujud Ali Dec 14 '18 at 14:59
  • You can check the parameters of `open()` function [here](https://docs.python.org/3/library/functions.html#open). The first parameter is called `file`, second `mode`, third `buffering` and fourth `encoding`. Therefore you need to either use the parameter `encoding` on the fourth position or explicitly write `encoding = 'utf-8-sig'`. Please check the updated example in my answer. – Lenka Vraná Dec 16 '18 at 11:16
  • Thanks for your help. This error appears when I try to run your edited code: `text_files = os.listdir(inputDir)` TypeError: listdir: path should be string, bytes, os.PathLike or None, not list – Nujud Ali Dec 19 '18 at 12:29
  • I suppose you define the inputDir as `inputDir = r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training set\5K\ST"`? – Lenka Vraná Dec 19 '18 at 14:08
  • I think it worked, and this is one of the results: (0, 60) 0.299395929151. Do you know what it means? Thanks for your help. – Nujud Ali Dec 21 '18 at 11:09
  • This is the rest of the code: `tfidf = TfidfVectorizer().fit_transform(documents)`, `pairwise_similarity = tfidf * tfidf.T`, `print(pairwise_similarity)` – Nujud Ali Dec 21 '18 at 11:13
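
On the question in the last comments: printing a SciPy sparse matrix lists each stored entry as `(row, column) value`, so `(0, 60) 0.299395929151` is the similarity score between the document at index 0 and the document at index 60 of `documents`. Since `TfidfVectorizer` L2-normalizes each row by default, the product is a cosine similarity, which for tf-idf vectors lies between 0 (no shared terms) and 1 (same term profile). The sketch below uses three made-up sentences in place of the real files, and `@` instead of `*` for the matrix product; both are choices made for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents standing in for the files read from the folder.
documents = [
    "القطة تجلس على السجادة",
    "القطة تنام على السجادة",
    "الطقس جميل اليوم",
]

tfidf = TfidfVectorizer().fit_transform(documents)
# Rows are L2-normalized by default, so this product is cosine similarity.
pairwise_similarity = (tfidf @ tfidf.T).toarray()

# Entry [i, j] compares document i with document j.
for i in range(len(documents)):
    for j in range(i + 1, len(documents)):
        print(f"documents {i} and {j}: {pairwise_similarity[i, j]:.3f}")
```

Converting to a dense array with `toarray()` is fine for 150 documents; the diagonal entries compare each document with itself and are therefore 1.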