
I have this function.

I have a folder called "Corpus" with many files inside.

I open the files one at a time and modify them. The modification removes periods, commas, question marks, and so on.

The problem is that I do not want to save the modified text in an array; I want to save it back into each file in the Corpus folder.

For example, check whether the first file in the Corpus folder contains a period or a comma. If it does, delete them from that file, then move on to the second file.

In other words, I want to modify the files in the Corpus folder themselves and return all of the files at the end.

How can I do that?

# read files from corpus folder + tokenize
def read_files_from_corpus():
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'

    all_tokens_without_sw = []

    for document in os.listdir(dir_path):
        with open(dir_path + document, "r") as reader:
            dir_path = open(dir_path + document, 'w')

            text = reader.read()

            # --------
            text = text.replace('.', ' ').replace(',', ' ')
            text = text.replace(':', ' ').replace('?', ' ').replace('!', ' ')
            text = text.replace('  ', ' ')  # convert double space into single space
            text = text.replace('"', ' ').replace('``', ' ')
            text = text.strip()  # remove space at the end

            # ------
            text_tokens = word_tokenize(text)
            dir_path.writelines(["%s " % item for item in text_tokens])

        all_tokens_without_sw = all_tokens_without_sw + text_tokens

    return all_tokens_without_sw
roula
  • no, the codes that found in stackoverflow links doesn't help me. – roula Jun 24 '21 at 16:18
  • Please clarify how it didn't help, considering that the one answer posted so far is very similar to the top solution in that post I linked. – Random Davis Jun 24 '21 at 16:28

1 Answer


You need to open the file for both reading and writing ("r+"). After reading the whole file, seek back to the start, write the modified text, and then truncate the file. Without the truncate, the new text is shorter than the original, so the tail of the old content would be left behind at the end of the file.

import os

from nltk.tokenize import word_tokenize


def read_files_from_corpus():
    dir_path = 'C:/Users/Super/Desktop/IR/homework/Lab4/corpus/corpus/'

    all_tokens_without_sw = []

    for document in os.listdir(dir_path):
        # open for reading and writing
        with open(dir_path + document, "r+") as reader:
            text = reader.read()

            # replace punctuation with spaces
            text = text.replace('.', ' ').replace(',', ' ')
            text = text.replace(':', ' ').replace('?', ' ').replace('!', ' ')
            text = text.replace('  ', ' ')  # convert double space into single space
            text = text.replace('"', ' ').replace('``', ' ')
            text = text.strip()  # remove leading and trailing whitespace

            text_tokens = word_tokenize(text)

            # seek to start of file to overwrite data
            reader.seek(0)
            # write data back to the file
            reader.writelines(["%s " % item for item in text_tokens])
            # cut off whatever remains of the longer original content
            reader.truncate()

        all_tokens_without_sw = all_tokens_without_sw + text_tokens

    return all_tokens_without_sw

This code only opens each file through reader and edits it in place. I hope that is what you want.
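If the goal is also to hand the modified files back to the caller, one option is to return their paths rather than a token list. The following is only a rough sketch under a few assumptions: the function name `clean_corpus_in_place` and the `PUNCTUATION` constant are made up for illustration, the files are assumed to be UTF-8 text, and plain `str.split()` stands in for `word_tokenize` so the example runs without NLTK.

```python
from pathlib import Path

# Characters the question removes; each one is mapped to a space.
# (Hypothetical constant, not part of the original code.)
PUNCTUATION = '.,:?!"`'
_TABLE = str.maketrans({ch: ' ' for ch in PUNCTUATION})


def clean_corpus_in_place(dir_path):
    """Strip punctuation from every file in dir_path, rewrite each file,
    and return the list of paths that were rewritten."""
    modified = []
    for path in sorted(Path(dir_path).iterdir()):
        if not path.is_file():
            continue  # skip sub-folders
        text = path.read_text(encoding='utf-8')
        # Assumption: whitespace splitting instead of nltk's word_tokenize.
        tokens = text.translate(_TABLE).split()
        # write_text replaces the whole file, so no seek/truncate is needed.
        path.write_text(' '.join(tokens), encoding='utf-8')
        modified.append(path)
    return modified
```

The caller can then reopen any returned path to read the cleaned text, which avoids passing around file objects that are already closed once the `with` block ends.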

KMG
  • This is pretty similar to the top solution of the potential duplicate I linked, so I'm not sure why @roula said that "the codes that found in stackoverflow links doesn't help me". – Random Davis Jun 24 '21 at 16:27
  • Thanks, my problem is solved. But how can I return the files through the word “return”? That is, I do not want to store the files in an array and then return the array; I want to return each file after modification through “return”. – roula Jun 24 '21 at 16:35
  • @roula you need to save the file objects (the reader objects) to an array at each loop iteration, then return this array alongside anything else you want. – KMG Jun 24 '21 at 16:45