I have multiple text files in a folder; there are 21,941 in total. My code works fine for a small number of text files, but when I run it for 5,000 files it gets stuck while reading. When I run it on the full dataset, reading alone takes over 3 hours and still does not finish. Please help me improve my code, or show me how I can use the GPU or multiprocessing for this task.
This block of code reads a file and returns its contents as a list of words:
import string
from nltk.corpus import stopwords

def wordList(doc):
    """
    1: Remove punctuation
    2: Remove stop words
    3: Return the remaining words as a list
    """
    file = open("C:\\Users\\Zed\\PycharmProjects\\ACL txt\\" + doc, 'r', encoding="utf8", errors='ignore')
    text = file.read().strip()
    file.close()
    # Strip punctuation character by character
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    # Drop English stop words
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
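For example, I call it on a single file like this (the file name here is just a placeholder):

    words = wordList("example_paper.txt")  # placeholder file name
    print(words[:10])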
This block of code reads the file names from the folder:
from pathlib import Path

file_names = []
for file in Path("ACL txt").rglob("*.txt"):
    file_names.append(file.name)
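As a quick sanity check, the list should cover all the files:

    print(len(file_names))  # expecting 21941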
And this block of code builds a dictionary of all documents, with the file name as the key and its content as a list of words:
documents = {}
for i in file_names[:5000]:
    documents[i] = wordList(i)
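For the multiprocessing part, this is a rough, untested sketch of what I was thinking (build_documents is just a helper name I made up; it reuses wordList and file_names from above):

    from concurrent.futures import ProcessPoolExecutor

    def build_documents(names):
        # wordList looks CPU-bound, so use worker processes rather than threads
        with ProcessPoolExecutor() as executor:
            word_lists = list(executor.map(wordList, names))
        return dict(zip(names, word_lists))

    if __name__ == "__main__":
        # The guard matters on Windows: worker processes re-import this module
        documents = build_documents(file_names[:5000])

Would something like this help on my quad-core i7, or is the GPU route better for this kind of task?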
This is the link to the dataset:
My system specs: Intel i7 quad-core CPU with 16 GB of RAM.