
I have a project where I need to read data from a relatively large .txt file that contains 5 columns and about 25 million rows of comma-separated data, process the data, and then write the processed data to a new .txt file. My computer freezes when I try to process a file this large.

I've already written the function to process the data and it works on small input .txt files, so I just need to adjust it to work with the larger file.

Here's an abridged version of my code:

import csv
import sys

def process_data(input_file, output_file):

    prod_dict = {}
    with open(input_file, "r") as file:

        # some code that reads all data from input file into dictionary


    # some code that sorts dictionary into an array with desired row order

    # list comprehension code that puts array into desired output form

    with open(output_file, 'w') as myfile:
        wr = csv.writer(myfile)
        for i in final_array:
            wr.writerow(i)

def main():
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    process_data(input_file, output_file)

if __name__ == '__main__':
    main()
TJE
  • what's the problem with larger files? – Matthew Story Aug 03 '18 at 16:43
  • My computer freezes when I try to process the larger file. – TJE Aug 03 '18 at 16:44
  • Do you need to read all of the file at once, or could you read and process in chunks? – sundance Aug 03 '18 at 16:45
  • It's important to know why you need to read the entire file into memory to be able to provide an answer here. What operations are you performing on the read data? – Matthew Story Aug 03 '18 at 16:45
  • @sundance I don't need to read all of the file at once -- I could read it in chunks but I'm not sure how to do that. – TJE Aug 03 '18 at 16:47
  • If you need to sort the data in the file, you _will_ have to read it all in at once. I doubt that's the problem. Anyway, I suggest you profile your script to see where it's spending most of its execution time—and then try to optimize that part of the code. If you don't know how to do that, see question [How can you profile a script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-script) – martineau Aug 03 '18 at 17:20

2 Answers


It sounds like you need to process the file line by line rather than loading the entire thing into memory at once.

for line in open('really_big_file.dat'): process_data(line)

as explained: https://stackoverflow.com/a/519653/9914705
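For this question's CSV read-and-write pattern, that streaming idea might look roughly like the sketch below. The transform_row function is a hypothetical stand-in for whatever per-row processing you actually do; each row is read, transformed, and written immediately, so only one row is ever held in memory.

import csv
import sys

def transform_row(row):
    # Hypothetical placeholder for the real per-row processing.
    return row

def process_data(input_file, output_file):
    # Stream the file: read, transform, and write one row at a time so the
    # whole 25-million-row file never has to fit in memory.
    with open(input_file, "r", newline="") as infile, \
         open(output_file, "w", newline="") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        for row in reader:
            writer.writerow(transform_row(row))

if __name__ == "__main__":
    process_data(sys.argv[1], sys.argv[2])

Note this only helps if each row can be processed independently; the sorting step described in the question still needs to see all of the data, which is what the chunked merge approach in the other answer addresses.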


The file is obviously too large to read the entire thing into memory at once. Sounds like you need to process the file in chunks.

There are many sorting algorithms, including some that do not require reading the entire file into memory at once. In particular, please look into the concept of "merge sort". There is a nice animation of the technique in the Wikipedia article that demonstrates the concept. You can do a merge sort without ever having more than two of the items to be sorted in memory at once. It's basically just "divide and conquer".

The general procedure (a rough Python sketch follows the list):

  1. Choose a number of items that you can comfortably deal with in memory. (Maybe 10000, or 100000; it can be as small or as large as you like. I'll assume 10000.)
  2. Iteratively pull items from the source file, stopping when you have read that many lines (but leave your file open and its current file pointer in place). You can use the file object's readline method (there are other ways, such as iterating over the file object directly, but readline works fine).
  3. Sort those 10000 lines (and do whatever other transformations you may need to do) and write the resulting list to a temporary file. (You'll need to generate a unique name for each temporary file that allows you to find it later. Assume this first temp file is named "temp0".)
  4. Read another 10000 lines and sort those, storing the result into another temporary file ("temp1").
  5. Lather, rinse, repeat, until you've separated your original input file into 2500 sorted temporary files: [temp0, temp1, temp2, ... temp2499]
  6. Now you simply start merging file pairs, keeping them sorted as you go. First you merge (temp0 and temp1) into a new temp file (temp_0_1). Then merge (temp2 and temp3) into (temp_2_3). And so forth until you've merged (temp2498 and temp2499) into (temp_2498_2499). (You can remove the first set of temp files as you go.)
  7. Now merge file pairs again, this time you're merging (temp_0_1 with temp_2_3) to form (temp_0_1_2_3), and (temp_4_5 with temp_6_7) to produce (temp_4_5_6_7). And so forth up to (temp_2496_2497_2498_2499).
  8. Continue iteratively merging pairs of files. At each step, the number of files you have left is halved (though the file sizes are, on average, doubling). Eventually only a single, fully sorted file will remain.
  9. For every merge above, you never need to hold in memory more than one line from each of the two files you're merging. Since the files you started with were already sorted, the first line in each file is the one with the lowest sort key, so you can simply compare the lowest from file A with the lowest from file B. Whichever is lowest gets written to the output, then gets replaced with the next record from the respective file.
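
A rough sketch of that procedure in Python is below; treat it as an illustration rather than production code. It sorts whole lines lexicographically (a real version would pass a key function matching whatever ordering your processing needs), it uses the standard library's tempfile module for the intermediate files, and the helper names (split_into_sorted_chunks, merge_pair, external_merge_sort) are just made up for this example.

import os
import shutil
import sys
import tempfile

CHUNK_SIZE = 10000  # lines per chunk, as in step 1

def write_sorted_chunk(lines, index, tmpdir):
    # Steps 2-4: sort one chunk in memory and write it to its own temp file.
    lines.sort()  # assumes plain lexicographic order; use key=... as needed
    path = os.path.join(tmpdir, "temp%d" % index)
    with open(path, "w") as f:
        f.writelines(lines)
    return path

def split_into_sorted_chunks(input_file, tmpdir):
    chunk_files, lines = [], []
    with open(input_file) as f:
        for line in f:
            lines.append(line)
            if len(lines) >= CHUNK_SIZE:
                chunk_files.append(write_sorted_chunk(lines, len(chunk_files), tmpdir))
                lines = []
    if lines:
        chunk_files.append(write_sorted_chunk(lines, len(chunk_files), tmpdir))
    return chunk_files

def merge_pair(path_a, path_b, out_path):
    # Step 9: only one line from each input file is in memory at a time.
    with open(path_a) as fa, open(path_b) as fb, open(out_path, "w") as out:
        a, b = fa.readline(), fb.readline()
        while a and b:
            if a <= b:
                out.write(a)
                a = fa.readline()
            else:
                out.write(b)
                b = fb.readline()
        while a:                      # one input is exhausted; copy the rest
            out.write(a)
            a = fa.readline()
        while b:
            out.write(b)
            b = fb.readline()

def external_merge_sort(input_file, output_file):
    with tempfile.TemporaryDirectory() as tmpdir:
        files = split_into_sorted_chunks(input_file, tmpdir)
        round_no = 0
        while len(files) > 1:         # steps 6-8: merge pairs until one remains
            merged = []
            for i in range(0, len(files), 2):
                if i + 1 == len(files):
                    merged.append(files[i])   # odd file out, carry it forward
                    continue
                out_path = os.path.join(tmpdir, "merge_%d_%d" % (round_no, i))
                merge_pair(files[i], files[i + 1], out_path)
                merged.append(out_path)
            files = merged
            round_no += 1
        shutil.move(files[0], output_file)

if __name__ == "__main__":
    external_merge_sort(sys.argv[1], sys.argv[2])

Once the sorted chunk files exist, the standard library's heapq.merge can also do the merging in a single pass over all of the chunks at once; the pairwise version above just mirrors the steps as described.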
Gil Hamilton