I have a folder containing 150 Arabic text files, and I want to compute the similarity between every pair of files. How can I do that? I tried the approach explained here:
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [open(f) for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T
But I ran into a problem when defining documents, so I modified the code like this:
from sklearn.feature_extraction.text import TfidfVectorizer
text_files= r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training set\5K\ST"
for f in text_files:
    documents= open(f, 'r', encoding='utf-8-sig').read()
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T
but it raises this error:
documents= open(f, 'r', encoding='utf-8-sig').read()
FileNotFoundError: [Errno 2] No such file or directory: 'C'
Is there a solution?
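For context, the 'C' in the message appears to come from looping over the path string itself: iterating over a str yields one character at a time, so the first open() call receives 'C' rather than a file name. A tiny illustration, using a made-up path that is only an assumption for the example:

```python
# Hypothetical path, used only to illustrate the behaviour.
folder = r"C:\some\folder"

# Iterating over a string yields its characters, not the files in the folder.
first_items = [ch for ch in folder][:3]
print(first_items)  # ['C', ':', '\\']
```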
Edit:
I also tried this:
from sklearn.feature_extraction.text import TfidfVectorizer
import os
text_files= os.listdir(r"C:\Users\Nujou\Desktop\Master\thesis\corpora\modified Corpora\Training set\5K\ST")
documents= []
for f in text_files:
    file= open(f, 'r', 'utf-8-sig')
    documents.append(file.read())
tfidf = TfidfVectorizer().fit_transform(documents)
# no need to normalize, since Vectorizer will return normalized tf-idf
pairwise_similarity = tfidf * tfidf.T
and it raised this error:
file= open(f, 'r', 'utf-8-sig')
TypeError: an integer is required (got type str)
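This TypeError seems to come from passing 'utf-8-sig' as the third positional argument of open(), which is buffering and expects an integer; the encoding has to be given as a keyword. Separately, os.listdir() returns bare file names, so each one probably needs to be joined with the folder path before opening. A minimal sketch of what the code may need to look like, where load_documents and pairwise_similarity are hypothetical helper names and not part of scikit-learn:

```python
import os


def load_documents(folder):
    """Read every file in `folder` as UTF-8 text (hypothetical helper)."""
    documents = []
    # sorted() keeps the row order of the similarity matrix reproducible
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)  # listdir returns bare names
        if not os.path.isfile(path):
            continue  # skip sub-folders
        # encoding must be a keyword argument; the third positional
        # argument of open() is buffering, which expects an integer
        with open(path, "r", encoding="utf-8-sig") as fh:
            documents.append(fh.read())
    return documents


def pairwise_similarity(documents):
    """Return the sparse matrix of TF-IDF cosine similarities."""
    # Imported here so load_documents can be used without scikit-learn.
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer().fit_transform(documents)
    # Rows are L2-normalized by default, so the product is cosine similarity.
    return tfidf @ tfidf.T
```

Under these assumptions, calling load_documents on the ST folder and passing the result to pairwise_similarity should give a 150 x 150 sparse matrix; row i, column j would hold the similarity between the i-th and j-th files in sorted name order, and .toarray() converts it to a dense array for inspection.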