Python: How do I split a .txt file into two or more files with the same number of lines in each?

Question

(I have been searching Stack Exchange and the wider internet for hours, but couldn't find the right answer.)

First I count the number of lines in the file, which I do with this code:

# Does not load the file into memory
def file_len(fname):
    i = 0  # stays 0 if the file is empty
    with open(fname) as f:
        for i, l in enumerate(f, 1):
            pass
        print(i)

file_len('bigdata.txt')

Then I take the number of lines and divide it by two, three, etc. (to make two, three, etc. files with the same number of lines). For example, if bigdata.txt has 1000000 lines, 1000000 / 2 = 500000, so I would get two files with 500000 lines each: one holding lines 1 to 500000 and the other lines 500001 to 1000000. I already have code that splits the original file (bigdata.txt) at a pattern, but I'm not looking for any pattern; I just want to split the file into two halves, or however many parts. Here is that code:

# Does not load the file into memory
with open('bigdata.txt', 'r') as r:
    with open('fhalf', 'w') as f:
        for line in r:
            # Splits the file at the first occurrence of the pattern.
            # Note that the pattern line itself ends up in neither file,
            # which is not good since I need all the data.
            if line == 'pattern\n':
                break
            f.write(line)
    with open('shalf.txt', 'w') as f:
        for line in r:
            f.write(line)

So I'm looking for a simple solution. I know there is one, I just can't figure it out at the moment. The expected output would be file1.txt, file2.txt, and so on, each with the same number of lines, give or take one. Thank you all for your time.

Just by the looks of it, your first function can't be right. Python is a zero-indexed language, so you'll need to `+1` that `i`. — Bram Vanroy, Sep 02 '18 at 13:37
Yes, you're correct. That was my first solution too, but then I ditched the +1. This was my first solution: `def file_len(fname): with open(fname) as f: for i, l in enumerate(f): pass return i + 1` — coredumped0x, Sep 02 '18 at 13:53

Joe Iddon · Accepted Answer · 2018-09-02T13:51:37.260

Read all the lines into a list with .readlines(), calculate how many lines go into each file, and then get writing!

num_files = 2
with open('bigdata.txt') as in_file:
    lines = in_file.readlines()
    lines_per_file = len(lines) // num_files
    for n in range(num_files):
        with open('file{}.txt'.format(n+1), 'w') as out_file:
            for i in range(n * lines_per_file, (n+1) * lines_per_file):
                out_file.write(lines[i])

And a full test:

$ cat bigdata.txt 
line1
line2
line3
line4
line5
line6
$ python -q
>>> num_files = 2
>>> with open('bigdata.txt') as in_file:
...     lines = in_file.readlines()
...     lines_per_file = len(lines) // num_files
...     for n in range(num_files):
...         with open('file{}.txt'.format(n+1), 'w') as out_file:
...             for i in range(n * lines_per_file, (n+1) * lines_per_file):
...                 out_file.write(lines[i])
... 
>>> 
$ more file*
::::::::::::::
file1.txt
::::::::::::::
line1
line2
line3
::::::::::::::
file2.txt
::::::::::::::
line4
line5
line6
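
One caveat with the .readlines() snippet above: because lines_per_file uses floor division, any leftover lines are dropped when the line count is not an exact multiple of num_files (7 lines and 2 files would write 3 + 3 and lose the last line). Below is a minimal sketch of one way around that, giving the first few files one extra line each. The names split_evenly and write_chunks are made up for this example and are not part of the original answer.

```python
def split_evenly(lines, num_files):
    """Split a list into num_files chunks whose sizes differ by at most one."""
    # base = minimum lines per chunk, extra = number of chunks that get one more
    base, extra = divmod(len(lines), num_files)
    chunks = []
    start = 0
    for n in range(num_files):
        # The first `extra` chunks each take one additional line.
        size = base + (1 if n < extra else 0)
        chunks.append(lines[start:start + size])
        start += size
    return chunks


def write_chunks(chunks, pattern='file{}.txt'):
    """Write each chunk to its own file: file1.txt, file2.txt, ..."""
    for n, chunk in enumerate(chunks, 1):
        with open(pattern.format(n), 'w') as out_file:
            out_file.writelines(chunk)
```

With the lines from bigdata.txt already in a list, write_chunks(split_evenly(lines, 2)) should produce file1.txt and file2.txt with line counts that differ by at most one, which matches the "give or take one" requirement in the question.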

If you can't read bigdata.txt into memory then the .readlines() solution won't cut it.

You will have to write the lines as you read them which is no big deal.

As for working out the length in the first place, this question discusses some methods, my favourite being Kyle's sum() method.

num_files = 2
with open('bigdata.txt') as f:
    num_lines = sum(1 for line in f)  # counts lines one at a time and closes the file afterwards
lines_per_file = num_lines // num_files
with open('bigdata.txt') as in_file:
    for n in range(num_files):
        with open('file{}.txt'.format(n+1), 'w') as out_file:
            for _ in range(lines_per_file):
                out_file.write(in_file.readline())

The whole point of `file_len` is that you can count the lines without reading the file in memory. Useful for huge files. Assuming that's the reason OP chose this `file_len` implementation, your solution will not be suitable. Fine answer if memory is not an issue though. — Bram Vanroy, Sep 02 '18 at 13:43
Yeah, I deliberately used the file_len since I'm dealing with +10000000 lines, so I needed to avoid reading that in memory. — coredumped0x, Sep 02 '18 at 13:49
@MurphyAdam See my updated answer. It does not read the whole file into memory, just one line at a time — Joe Iddon, Sep 02 '18 at 13:52
