
I need to split large text files into smaller chunks, where the files contain data that needs to stay together. Each related chunk of data is separated from the next by a blank line, like so:

Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1

More Data, belonging to chunk 2
More Data, belonging to chunk 2
More Data, belonging to chunk 2

How can I define a number of lines after which the file is split at the next blank line, so that the data chunks stay intact? I’d like to use Python for this, but I can’t figure out how to split after X lines.

karkraeg

2 Answers

from itertools import groupby

# Group consecutive non-blank lines together; blank lines act as separators
# and are dropped by the `if is_data` filter. `myfile` is the path to the input file.
with open(myfile, 'r') as f:
    chunks = [[line.strip() for line in group]
              for is_data, group in groupby(f, key=lambda line: bool(line.strip()))
              if is_data]
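As a follow-up sketch (not part of the answer above): once chunks has been built, each chunk could be written to its own numbered file, for example:

# Hypothetical follow-up: write each chunk to chunk1.txt, chunk2.txt, ...
for n, chunk in enumerate(chunks, start=1):
    with open("chunk{}.txt".format(n), "w") as out:
        out.write("\n".join(chunk) + "\n")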
Ohjeah

If you want to write a new chunk1.txt ... chunkN.txt for each chunk, you could do it like this:

def chunk_file(name, lines_per_chunk, chunks_per_file):

    def write_chunk(chunk_no, chunk):
        with open("chunk{}.txt".format(chunk_no), "w") as outfile:
            outfile.write("".join(chunk))

    count, chunk_no, chunk_count, chunk = 1, 1, 0, []
    with open(name, "r") as f:
        for row in f:
            # Split only at a blank line, and only once the minimum number
            # of lines for the current chunk has been read.
            if count > lines_per_chunk and row == "\n":
                chunk_count += 1
                count = 1
                chunk.append("\n")  # keep the blank line between chunks
                if chunk_count == chunks_per_file:
                    write_chunk(chunk_no, chunk)
                    chunk = []
                    chunk_count = 0
                    chunk_no += 1
            else:
                count += 1
                chunk.append(row)
    # Write whatever is left after the last complete file.
    if chunk:
        write_chunk(chunk_no, chunk)

chunk_file("test.txt", 3, 1)

You have to specify the number of lines that belong to a chunk, after which a blank line is expected to end it.

Say you want to chunk this file:

Some Data belonging to chunk 1

Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1

More Data, belonging to chunk 2
More Data, belonging to chunk 2
More Data, belonging to chunk 2

The first chunk differs considerably in line count from the second (7 lines vs. 3 lines).

The output for this example would be chunk1.txt:

Some Data belonging to chunk 1

Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1
Some Data belonging to chunk 1

And chunk2.txt:

More Data, belonging to chunk 2
More Data, belonging to chunk 2
More Data, belonging to chunk 2

This approach treats lines_per_chunk as a minimum chunk size, so it works even if the chunks have different line counts. A blank line only ends a chunk once the minimum chunk size has been reached. In the above example the blank line on line 2 is not a problem, since the minimum chunk size has not been reached yet. If a blank line occurred on line 4 and the chunk data continued afterwards, there would be a problem, since the specified criteria (line counts and blank lines) alone could not identify the chunks.
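To make the criterion concrete, here is a small illustrative sketch (not part of the answer; split_positions is a hypothetical helper) that reports the line numbers at which the counting logic above would split the example file:

def split_positions(lines, lines_per_chunk):
    # Mirror the counter in chunk_file: split only at a blank line,
    # and only once the minimum number of lines has been read.
    count, positions = 1, []
    for lineno, row in enumerate(lines, start=1):
        if count > lines_per_chunk and row == "\n":
            positions.append(lineno)
            count = 1
        else:
            count += 1
    return positions

example = (["Some Data belonging to chunk 1\n", "\n"]
           + ["Some Data belonging to chunk 1\n"] * 5
           + ["\n"]
           + ["More Data, belonging to chunk 2\n"] * 3)
print(split_positions(example, 3))  # [8]: only the blank line on line 8 splits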

Tristan
  • This would not work with large files (>1 million lines) or for chunks that differ vastly in line count (from 8 to 70 lines, perhaps), would it? – karkraeg Mar 02 '17 at 09:03
  • @kbecker87 I just modified the solution to only read the lines as they are evaluated and tested the script on a file with 1 million lines. It took ~8 seconds to chunk. It would also work if the chunks differ vastly in size. With your example you have to set the minimum size to 8 lines to recognize the first chunk. If there are no blank lines in the 70-line chunk after its first 8 lines, it will work. Otherwise you need another criterion to identify chunks. – Tristan Mar 02 '17 at 09:27
  • This works just fine for splitting a file into single files at each chunk. Actually I need to save, say, 1000 chunks into one file, the next 1000 into the next, and so on. – karkraeg Mar 02 '17 at 09:59
  • @kbecker87 I edited the answer to allow for an additional argument chunks_per_file, which lets you choose how many chunks go into one file. – Tristan Mar 02 '17 at 10:18
  • Perfect! I tried to keep the blank lines between the chunks but couldn’t figure that out. – karkraeg Mar 02 '17 at 10:39
  • @kbecker87 You could append a newline ("\n") to the list of lines for the current chunk file. I added that above. – Tristan Mar 02 '17 at 10:55
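Building on the comment thread, a hypothetical call for that use case might look like this (1000 chunks per output file, a minimum chunk size of 8 lines as discussed above, and a placeholder file name data.txt):

# Placeholder file name; writes chunk1.txt, chunk2.txt, ... with 1000 chunks each
chunk_file("data.txt", 8, 1000)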