I am working with a set of ~1000 large (700MB+) CSVs containing GPS data. The timestamps are currently in the UTC timezone, and I would like to change them to PST.
I wrote a Python script that parses each file, updates the two timestamp fields with the corrected values, and writes the result to a new file. Originally I wanted to minimize the number of disk writes, so I appended each updated line to a string and, at the end, did one big write to the file. This works as expected with small files, but hangs with large files.
I then changed the script to write to the file as each line is processed. This works, and does not hang.
Why does the first solution not work with large files, and is there a better way to do this than writing the file one line at a time?
Building a large string:
def correct(d, s):
    # given a directory and a filename, corrects for timezone
    file = open(os.path.dirname(os.path.realpath(sys.argv[0])) + separator() + d + separator() + s)
    contents = file.read().splitlines()
    header = contents[0]
    corrected_contents = header + '\n'
    for line in contents[1:]:
        values = line.split(',')
        sample_date = correct_time(values[1])
        system_date = correct_time(values[-1])
        values[1] = sample_date
        values[-1] = system_date
        corrected_line = ','.join(map(str, values)) + '\n'
        corrected_contents += corrected_line
    corrected_file = os.path.dirname(os.path.realpath(sys.argv[0])) + separator() + d + separator() + "corrected_" + s
    with open(corrected_file, 'w') as text_file:
        text_file.write(corrected_contents)
    return corrected_file
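For comparison, a variant that keeps the single write but avoids growing one string with += is to collect the corrected lines in a list and join them once at the end. This is only a minimal sketch: `build_corrected_text` is a hypothetical name, and `correct_time` is passed in as a parameter because its real implementation is not shown here. It still holds the whole file in memory, so it may not help with 700MB+ inputs.

```python
def build_corrected_text(lines, correct_time):
    # lines: the header followed by the data rows, without trailing newlines
    # correct_time: callable that converts one timestamp string (assumed helper)
    out = [lines[0]]  # header is copied unchanged
    for line in lines[1:]:
        values = line.split(',')
        values[1] = correct_time(values[1])
        values[-1] = correct_time(values[-1])
        out.append(','.join(values))
    # One join at the end instead of many += operations on a growing string.
    return '\n'.join(out) + '\n'
```

The returned string could then be written with one `text_file.write(...)` call, as in the version above.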
Writing each line:
def correct(d, s):
    # given a directory and a filename, corrects for timezone
    file = open(os.path.dirname(os.path.realpath(sys.argv[0])) + separator() + d + separator() + s)
    contents = file.read().splitlines()
    header = contents[0]
    corrected_file = os.path.dirname(os.path.realpath(sys.argv[0])) + separator() + d + separator() + "corrected_" + s
    with open(corrected_file, 'w') as text_file:
        text_file.write(header + '\n')
        for line in contents[1:]:
            values = line.split(',')
            sample_date = correct_time(values[1])
            system_date = correct_time(values[-1])
            values[1] = sample_date
            values[-1] = system_date
            corrected_line = ','.join(map(str, values)) + '\n'
            text_file.write(corrected_line)
    return corrected_file
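Both versions above still load the entire input with `file.read().splitlines()`. A possible alternative is to stream the input as well, so that only one row is in memory at a time. The following is a rough sketch under several assumptions, not a tested replacement: it assumes Python 3, the name `correct_streaming` is made up, the timestamp format `"%Y-%m-%d %H:%M:%S"` is a guess, and this `correct_time` is a stand-in for the real helper that simply applies a fixed UTC-8 offset (ignoring daylight saving).

```python
import csv
from datetime import datetime, timedelta

# Assumption: timestamps look like "2015-03-01 12:00:00"; the real format is not shown.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
PST_OFFSET = timedelta(hours=-8)  # fixed UTC-8, ignores daylight saving


def correct_time(stamp):
    # Hypothetical stand-in for the correct_time helper used above.
    shifted = datetime.strptime(stamp, TIME_FORMAT) + PST_OFFSET
    return shifted.strftime(TIME_FORMAT)


def correct_streaming(src_path, dst_path):
    # Read and write one row at a time; the file object buffers the
    # writes, so this does not mean one disk write per line.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerow(next(reader))  # copy the header unchanged
        for row in reader:
            row[1] = correct_time(row[1])
            row[-1] = correct_time(row[-1])
            writer.writerow(row)
    return dst_path
```

Calling `correct_streaming("in.csv", "corrected_in.csv")` would write the shifted rows next to the original file; the paths here are placeholders.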