
I am working with a set of ~1000 large (700MB+) CSVs containing GPS data. The timestamps are currently in the UTC timezone, and I would like to change them to PST.

I wrote a Python script to parse each file, update the two timestamp fields with the corrected values, and then write the result back out. Originally I wanted to minimize the number of disk writes, so on each line I appended the updated line to a string. At the end, I did one big write to the file. This works as expected with small files, but hangs with large files.

I then changed the script to write to the file as each line is processed. This works, and does not hang.

How come the first solution does not work with large files, and is there a better way to do it than writing the file one line at a time?

Building a large string:

def correct(d, s):
    # Given a directory and a filename, corrects the two timestamp fields for
    # timezone. separator() and correct_time() are helpers defined elsewhere;
    # os and sys are imported at module level.
    file = open(os.path.dirname(os.path.realpath(sys.argv[0])) + separator() + d + separator() + s)
    contents = file.read().splitlines()

    header = contents[0]

    corrected_contents = header + '\n'

    for line in contents[1:]:
        values = line.split(',')

        sample_date = correct_time(values[1])
        system_date = correct_time(values[-1])

        values[1] = sample_date
        values[-1] = system_date

        corrected_line = ','.join(map(str, values)) + '\n'
        corrected_contents += corrected_line

    corrected_file = os.path.dirname(os.path.realpath(sys.argv[0])) + separator() + d + separator() + "corrected_" + s
    with open(corrected_file, 'w') as text_file:
        text_file.write(corrected_contents)
    return corrected_file

Writing each line:

def correct(d, s):
    # given a directory and a filename, corrects for timezone
    file = open(os.path.dirname(os.path.realpath(sys.argv[0])) + separator() + d + separator() + s)
    contents = file.read().splitlines()

    header = contents[0]

    corrected_file = os.path.dirname(os.path.realpath(sys.argv[0])) + separator() + d + separator() + "corrected_" + s
    with open(corrected_file, 'w') as text_file:
        text_file.write(header + '\n')

        for line in contents[1:]:
            values = line.split(',')

            sample_date = correct_time(values[1])
            system_date = correct_time(values[-1])

            values[1] = sample_date
            values[-1] = system_date

            corrected_line = ','.join(map(str, values)) + '\n'
            text_file.write(corrected_line)

    return corrected_file
greedIsGoodAha
  • I would say even the second approach is pretty bad. Check this out for a better way to read files: https://jeffknupp.com/blog/2016/03/07/python-with-context-managers/ – gout Jun 08 '18 at 21:25
  • I agree: there is no reason to read the whole file into memory and split it into lines, if you are going to process it one line at a time in any case. – NickD Jun 08 '18 at 22:01

1 Answer


I believe that this line:

   corrected_contents += corrected_line

is the culprit. IIUC (and I'm sure people will correct me if I'm wrong) this allocates a bigger string, copies the old contents over, and then appends the new data - for every line in the file. As the string gets longer, more and more has to be copied on each iteration, and you end up with the quadratic slowdown you are observing.

There is more information about string concatenation at How do I append one string to another in Python?, where it is mentioned that CPython apparently optimizes this in certain cases and turns it from quadratic to linear (so I may be wrong above: yours may be such an optimized case). It also mentions that PyPy does not, so it also depends on how you run your program. It might also be that the optimization does not apply because your string is too big (it's enough to fill a CD, after all).
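
If it is the quadratic behaviour, the usual workaround is to accumulate the corrected lines in a list and join them once at the end. A minimal sketch, reusing the names from your code (correct_time() is your helper, assumed unchanged):

corrected_lines = [header]
for line in contents[1:]:
    values = line.split(',')
    values[1] = correct_time(values[1])
    values[-1] = correct_time(values[-1])
    corrected_lines.append(','.join(map(str, values)))

# one linear-time join at the end instead of a copy of the whole string per line
corrected_contents = '\n'.join(corrected_lines) + '\n'

This keeps your single big write at the end, but each line is copied only a constant number of times rather than once for every line that follows it.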

The linked answer also has a wealth of information on methods to get around the problem (if it is indeed the problem). Well worth reading.
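
If you would rather not hold a 700 MB file in memory at all (as the comments suggest), you can also stream it row by row with the csv module. A rough sketch under the same assumptions - Python 3 and your existing correct_time() helper - with os.path.join standing in for your separator() helper:

import csv
import os
import sys

def correct_streaming(d, s):
    # Sketch only: same directory layout as in the question,
    # but only one row is held in memory at a time.
    base = os.path.dirname(os.path.realpath(sys.argv[0]))
    in_path = os.path.join(base, d, s)
    out_path = os.path.join(base, d, "corrected_" + s)

    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))  # copy the header row unchanged
        for values in reader:
            values[1] = correct_time(values[1])
            values[-1] = correct_time(values[-1])
            writer.writerow(values)
    return out_path

Writing row by row is cheaper than it sounds: open() buffers the output, so the actual disk writes are already batched.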

NickD
  • A classic [Shlemiel the painter’s algorithm](https://www.joelonsoftware.com/2001/12/11/back-to-basics/). It's also possible with a 32-bit build of Python to be running out of memory space. – Mark Ransom Jun 08 '18 at 21:50