I have this function.
I have a folder called "Corpus" with many files inside.
I open the files one at a time and modify their contents. The modification removes periods, commas, question marks, and similar punctuation.
The problem is that I do not want to keep the modified text only in an array; I want to save the changes back into each file in the Corpus folder.
For example, check whether the first file in the Corpus folder contains a period or a comma; if it does, delete them from that file, then move on to the second file.
In other words, I want to modify the same files that are in the Corpus folder, in place, and return all files at the end.
How can I do that?
# read files from corpus folder + tokenize
import os
from nltk.tokenize import word_tokenize

def read_files_from_corpus():
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'
    all_tokens_without_sw = []
    for document in os.listdir(dir_path):
        file_path = os.path.join(dir_path, document)
        # read the whole file first; opening it with 'w' before reading would empty it
        with open(file_path, "r") as reader:
            text = reader.read()
        # --------
        text = text.replace('.', ' ').replace(',', ' ')
        text = text.replace(':', ' ').replace('?', ' ').replace('!', ' ')
        text = text.replace('  ', ' ')  # convert double space into single space
        text = text.replace('"', ' ').replace('``', ' ')
        text = text.strip()  # remove space at the end
        # ------
        text_tokens = word_tokenize(text)
        # write the cleaned tokens back into the same file
        with open(file_path, "w") as writer:
            writer.writelines(["%s " % item for item in text_tokens])
        all_tokens_without_sw = all_tokens_without_sw + text_tokens
    return all_tokens_without_sw
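
A minimal sketch of the in-place rewrite described above, under a few assumptions: the function name `clean_corpus_in_place` and the `PUNCTUATION` pattern are hypothetical, the files are assumed to be UTF-8 text, and a plain whitespace split stands in for `word_tokenize` so the example has no third-party dependency. The key point it illustrates is reading each file completely and closing it before reopening the same path in write mode.

```python
import os
import re

# Hypothetical pattern: roughly the characters replaced in the function above.
PUNCTUATION = re.compile(r'[.,:?!"`]')

def clean_corpus_in_place(dir_path):
    """Strip punctuation from every file in dir_path and overwrite each file."""
    all_tokens = []
    for name in sorted(os.listdir(dir_path)):
        file_path = os.path.join(dir_path, name)
        if not os.path.isfile(file_path):
            continue  # skip subfolders
        # Read everything first; the file is closed when the with-block ends.
        with open(file_path, "r", encoding="utf-8") as reader:
            text = reader.read()
        # Assumption: whitespace split used here instead of word_tokenize.
        tokens = PUNCTUATION.sub(" ", text).split()
        # Reopen the same path in write mode to replace the old contents.
        with open(file_path, "w", encoding="utf-8") as writer:
            writer.write(" ".join(tokens))
        all_tokens.extend(tokens)
    return all_tokens
```

Calling `clean_corpus_in_place(dir_path)` with the corpus folder path would overwrite every file there, so it is probably worth trying it on a copy of the folder first.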