I was working on a script that reads a folder of files (each between 20 MB and 100 MB in size), modifies some data in each line, and writes the result back to a copy of the file.
```python
import time

with open(inputPath, 'r+') as myRead:
    my_list = myRead.readlines()
    new_my_list = clean_data(my_list)
with open(outPath, 'w+') as myWrite:
    tempT = time.time()
    myWrite.writelines('\n'.join(new_my_list) + '\n')
    print(time.time() - tempT)
print(inputPath, 'Cleaning Complete.')
```
On running this code with a 90 MB file (~900,000 lines), it printed 140 seconds as the time taken to write the file. Here I used `writelines()`. So I searched for ways to improve file-writing speed, and most of the articles I read said that `write()` and `writelines()` should show no difference, since I am writing a single concatenated string. I also checked the time taken for only the following statement:
```python
new_string = '\n'.join(new_my_list) + '\n'
```
And it took only 0.4 seconds, so the large write time was not due to building the concatenated string.
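That isolation check can be sketched as follows (the sample lines here are synthetic stand-ins for the cleaned list; `time.perf_counter()` is used instead of `time.time()` since it is the recommended clock for short timings):

```python
import time

# Hypothetical stand-in for the cleaned list of ~900,000 lines
new_my_list = [f"line {i}" for i in range(900_000)]

# Time only the string construction, nothing else
t0 = time.perf_counter()
new_string = '\n'.join(new_my_list) + '\n'
print(f"join took {time.perf_counter() - t0:.3f}s")
```

On typical hardware this completes in well under a second, which matches the 0.4 s measurement above.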
Just to try out `write()`, I ran this code:
```python
import time

with open(inputPath, 'r+') as myRead:
    my_list = myRead.readlines()
    new_my_list = clean_data(my_list)
with open(outPath, 'w+') as myWrite:
    tempT = time.time()
    myWrite.write('\n'.join(new_my_list) + '\n')
    print(time.time() - tempT)
print(inputPath, 'Cleaning Complete.')
```
And it printed 2.5 seconds. Why is there such a large difference in the file-writing time between `write()` and `writelines()` when the data is the same? Is this normal behaviour, or is there something wrong in my code? The output file is identical in both cases, so I know there is no loss of data.