
I have an app that generates very long output files (usually about 10^7 lines). I need to split each output file into 8 equal pieces, adding a header and footer in the process.

I have some background in Python (and the whole process is part of the backend of a large Django app), so I thought of something like:

with open('file', 'r') as file:
    for line in file.readlines():
        pass  # distribute the lines equally across the output files

But I don't think that is the best way. What is the recommended approach here? Should I call some Unix tools via subprocess, or is there a fast, Pythonic way to achieve it?

Jacek
  • Maybe some useful info [here](https://stackoverflow.com/questions/16669428/process-very-large-20gb-text-file-line-by-line) – DaveStSomeWhere Aug 30 '19 at 16:39
  • Since it is very large, I would probably do it in two passes. In pass 1 I would simply count the number of lines (perhaps `num_lines = sum(1 for line in open('file'))`) and compute how many lines need to go in each of the 8 files. Pass 2 reads the input file again and writes out the appropriate number of lines to each file. – Booboo Aug 30 '19 at 18:21
  • "Equal" by line count or byte size? What if the lines are of different lengths in different regions of the file? – tripleee Aug 31 '19 at 08:35
  • They are not; each line is a constant-length integer, a 10-digit ID. – Jacek Sep 02 '19 at 09:03
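
Given the fixed-width lines mentioned in the last comment, the line count could probably be derived from the file size instead of a counting pass. The following is only a minimal sketch, assuming every line is exactly 10 ASCII digits followed by a single `\n` (11 bytes) and that the last line also ends with a newline; `count_fixed_width_lines` and `record_size` are hypothetical names, not something from the thread:

```python
import os


def count_fixed_width_lines(path, record_size=11):
    """Return the number of lines in a file made of fixed-width records.

    Assumption: each line occupies exactly record_size bytes, newline
    included (10 digits + '\n' = 11). With '\r\n' endings it would be 12.
    """
    size = os.path.getsize(path)
    if size % record_size != 0:
        # The fixed-width assumption does not hold for this file.
        raise ValueError("file size is not a multiple of the record size")
    return size // record_size
```

If the assumption fails (mixed line endings, a missing final newline), the function raises rather than returning a wrong count, and a plain counting pass would be the safer choice.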

1 Answer


Since the input file can be extremely large, my concern is memory efficiency, so I am using a two-pass algorithm. In the first pass I count the lines in a memory-efficient way and compute how many lines should be apportioned to each output file. In the second pass I reread the input file one line at a time, writing to each output file in turn for the required number of lines:

def split(infile, outfiles, n_files=8):
    """Splits a file into n_files pieces

    :param infile name of the input file
    :type str

    :param outfiles list of output file names of length n_files
    :type list

    :param n_files number of equal pieces the input file should be split into
    :type int
    """

    # get the total number of lines:
    with open(infile, 'r') as f:
        num_lines = sum(1 for line in f)
    lines_per_file = num_lines // n_files
    if lines_per_file == 0 and num_lines > 0:
        lines_per_file = 1

    # compute number of lines to go into each file:
    count = []
    for i in range(n_files - 1):
        lines_per_file = min(lines_per_file, num_lines)
        count.append(lines_per_file)
        num_lines -= lines_per_file
    # and for the last file anything that is left over:
    count.append(num_lines)

    with open(infile, 'r') as f1:
        for i in range(n_files):
            with open(outfiles[i], 'w') as f2:
                for j in range(count[i]):
                    f2.write(f1.readline())
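
The question also asks for a header and footer, which the function above does not write. A possible variant is sketched below, under some assumptions: `split_with_header_footer`, `header` and `footer` are hypothetical names, the header and footer are plain strings written verbatim, and the remainder is spread over the first files (so piece sizes differ by at most one line) rather than all going to the last file as above:

```python
from itertools import islice


def split_with_header_footer(infile, outfiles, header, footer):
    """Split infile across outfiles, wrapping each piece in header/footer.

    header and footer are hypothetical strings written verbatim, so they
    should carry their own trailing newline if one is wanted.
    """
    n_files = len(outfiles)

    # Pass 1: count lines without loading the file into memory.
    with open(infile, 'r') as f:
        num_lines = sum(1 for _ in f)

    # Spread the remainder over the first files so sizes differ by at most 1.
    base, extra = divmod(num_lines, n_files)
    counts = [base + 1 if i < extra else base for i in range(n_files)]

    # Pass 2: copy each consecutive slice of lines into its own output file.
    with open(infile, 'r') as src:
        for name, count in zip(outfiles, counts):
            with open(name, 'w') as dst:
                dst.write(header)
                dst.writelines(islice(src, count))
                dst.write(footer)
    return counts
```

A call might look like `split_with_header_footer('file', ['part%d.txt' % i for i in range(8)], 'BEGIN\n', 'END\n')`, where the file names and the header and footer text are placeholders. `islice` pulls exactly `count` lines from the open file on each iteration, so memory use stays roughly constant regardless of input size.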
Booboo