
I routinely use PowerShell to split larger text or CSV files into smaller files for quicker processing. However, I have a few files that come over in an unusual format. These are basically print files written to a text file. Each record starts with a single line that begins with a 1 and has nothing else on it.

What I need to be able to do is split a file based on the number of statements. So, basically, if I want to split the file into chunks of 3000 statements, I would go down until I see the 3001st occurrence of 1 in position 1 and copy everything before that to the new file. I can run this from Windows, Linux, or OS X, so pretty much anything is open for the split.
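
Here is roughly what I have in mind, as a rough sketch only (the file names and the 3000-statement chunk size are placeholders):

# minimal sketch: split whenever the 3001st record-start line is seen,
# assuming the input is statements.txt and each record starts with a line
# that contains only "1"
chunk_size = 3000
count, part = 0, 0
out = open('part0.txt', 'w')
with open('statements.txt') as f:
    for line in f:
        if line.rstrip('\n') == '1':
            count += 1
            if count > chunk_size:   # this record starts a new output file
                out.close()
                part += 1
                out = open('part{}.txt'.format(part), 'w')
                count = 1
        out.write(line)
out.close()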

Any ideas would be greatly appreciated.

  • Questions like this can be closed easily. There are plenty of good options available to you for this. However, you need to show some coding / research effort. SO is not a code-writing service. If you have something you have tried, update your question and we would be more than happy to help. Answering questions like this is also discouraged. – Matt Oct 23 '14 at 03:47

2 Answers


Maybe try recognizing it by the fact that there is a '1' plus a new line?

with open(input_file, 'r') as f:
    my_string = f.read()   # reads the whole file into memory at once

# each record is introduced by a line that contains only "1"
my_list = my_string.split('\n1\n')

This separates each record into its own list element, assuming the file has the following format:

1
....
....
1
....
....
....

You can then output each element in the list to a separate file.

for x, record in enumerate(my_list):
    with open(str(x) + '.txt', 'w') as out:
        out.write(record)
Alex Huszagh

To avoid loading the whole file into memory, you could define a function that generates records incrementally and then use the itertools grouper recipe to write each group of 3000 records to a new file:

#!/usr/bin/env python3
from itertools import zip_longest

with open('input.txt') as input_file:
    # grouper recipe: group 3000 records at a time; the last group is padded with ()
    files = zip_longest(*[generate_records(input_file)]*3000, fillvalue=())
    for n, records in enumerate(files):
        with open('output{n}.txt'.format(n=n), 'w') as output_file:
            output_file.writelines(line for record in records
                                   for line in record)

where generate_records() yields one record at a time; each record is itself an iterator over the corresponding lines in the input file:

from itertools import chain

def generate_records(input_file, start='1\n'):
    eof = []  # becomes truthy once the underlying file is exhausted

    def record(yield_start=True):
        if yield_start and not eof:  # don't emit a start line once EOF was reached
            yield start
        for line in input_file:
            if line == start:  # start of the next record
                break
            yield line
        else:  # no break -> EOF
            eof.append(True)

    # the first record may include lines before the first 1\n
    yield chain(record(yield_start=False),
                record())
    while not eof:
        yield record()

generate_records() is a generator that yields generators, like itertools.groupby() does.
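
For illustration, a small usage sketch (the file name input.txt is an assumption); as with groupby(), each record has to be consumed before moving on to the next one:

with open('input.txt') as f:
    for i, record in enumerate(generate_records(f)):
        # consume each record fully before requesting the next one
        print('record', i, 'has', sum(1 for _ in record), 'lines')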

For performance reasons, you could read/write chunks of multiple lines at once.
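
For instance, a minimal sketch of buffering the writes (the helper name and the 1000-line chunk size are assumptions, not part of the code above):

from itertools import islice

def write_in_chunks(output_file, lines, chunk_size=1000):
    # hypothetical helper: joins chunk_size lines per write() call
    # instead of writing them one at a time
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        output_file.write(''.join(chunk))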

jfs