
I routinely use PowerShell to split larger text or CSV files into smaller files for quicker processing. However, I have a few files that come over in an unusual format. These are basically print files written to a text file. Each record starts with a single line that begins with a 1 and has nothing else on it.

What I need to be able to do is split a file based on the number of statements. So, basically, if I want to split the file into chunks of 3000 statements, I would go down until I see the 3001st occurrence of 1 in position 1 and copy everything before that to the new file. I can run this from Windows, Linux, or OS X, so pretty much anything is open for the split.
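
Here is roughly what I have in mind, as a rough sketch only (the file names and the 3000-statement chunk size are placeholders):

# minimal sketch: split whenever the 3001st record-start line is seen,
# assuming the input is statements.txt and each record starts with a line
# that contains only "1"
chunk_size = 3000
count, part = 0, 0
out = open('part0.txt', 'w')
with open('statements.txt') as f:
    for line in f:
        if line.rstrip('\n') == '1':
            count += 1
            if count > chunk_size:   # this record starts a new output file
                out.close()
                part += 1
                out = open('part{}.txt'.format(part), 'w')
                count = 1
        out.write(line)
out.close()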

Any ideas would be greatly appreciated.

  • Questions like this can be closed easily. There are plenty of good options available to you for this. However, you need to show some coding / research effort. SO is not a code-writing service. If you have something you have tried, update your question and we would be more than happy to help. Answering questions like this is also discouraged. – Matt Oct 23 '14 at 03:47

2 Answers


Maybe try recognizing it by the fact that there is a '1' plus a new line?

with open(input_file, 'r') as f:
    my_string = f.read()   # reads the whole file into memory at once

# each record is introduced by a line that contains only "1"
my_list = my_string.split('\n1\n')

This separates each record into its own list element, assuming the file has the following format:

1
....
....
1
....
....
....

You can then output each element in the list to a separate file.

for x, record in enumerate(my_list):
    with open(str(x) + '.txt', 'w') as out:
        out.write(record)
Alex Huszagh

To avoid loading the whole file into memory, you could define a function that generates records incrementally and then use the itertools grouper recipe to write each group of 3000 records to a new file:

#!/usr/bin/env python3
from itertools import zip_longest

with open('input.txt') as input_file:
    # grouper recipe: group 3000 records at a time; the last group is padded with ()
    files = zip_longest(*[generate_records(input_file)]*3000, fillvalue=())
    for n, records in enumerate(files):
        with open('output{n}.txt'.format(n=n), 'w') as output_file:
            output_file.writelines(line for record in records
                                   for line in record)

where generate_records() yields one record at a time; each record is itself an iterator over the corresponding lines in the input file:

from itertools import chain

def generate_records(input_file, start='1\n'):
    eof = []  # becomes truthy once the underlying file is exhausted

    def record(yield_start=True):
        if yield_start and not eof:  # don't emit a start line once EOF was reached
            yield start
        for line in input_file:
            if line == start:  # start of the next record
                break
            yield line
        else:  # no break -> EOF
            eof.append(True)

    # the first record may include lines before the first 1\n
    yield chain(record(yield_start=False),
                record())
    while not eof:
        yield record()

generate_records() is a generator that yields generators, like itertools.groupby() does.
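
For illustration, a small usage sketch (the file name input.txt is an assumption); as with groupby(), each record has to be consumed before moving on to the next one:

with open('input.txt') as f:
    for i, record in enumerate(generate_records(f)):
        # consume each record fully before requesting the next one
        print('record', i, 'has', sum(1 for _ in record), 'lines')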

For performance reasons, you could read/write chunks of multiple lines at once.
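
For instance, a minimal sketch of buffering the writes (the helper name and the 1000-line chunk size are assumptions, not part of the code above):

from itertools import islice

def write_in_chunks(output_file, lines, chunk_size=1000):
    # hypothetical helper: joins chunk_size lines per write() call
    # instead of writing them one at a time
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        output_file.write(''.join(chunk))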

jfs