22

Is it possible to split a file? For example, say you have a huge wordlist; I want to split it so that it becomes more than one file. How is this possible?

Georg Schölly
  • 124,188
  • 49
  • 220
  • 267
localhost
  • 863
  • 6
  • 12
  • 12
  • This is certainly possible. If you want useful answers, you may want to provide some useful details. – EBGreen Feb 13 '09 at 16:08
  • do you want to do it with python? how is this file structured? is it a text file? – Paolo Tedesco Feb 13 '09 at 16:10
  • Is this a duplicate? See: http://stackoverflow.com/questions/291740/how-do-i-split-a-huge-text-file-in-python – quamrana Feb 13 '09 at 17:00

10 Answers

22

This one splits a file up by newlines and writes it back out. You can change the delimiter easily. It also handles uneven amounts, if the number of lines in your input file is not a multiple of splitLen (20 in this example).

splitLen = 20         # 20 lines per file
outputBase = 'output' # output1.txt, output2.txt, etc.

# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
# (Renamed from "input" to avoid shadowing the built-in.)
data = open('input.txt', 'r').read().split('\n')

at = 1
for start in range(0, len(data), splitLen):
    # First, get the list slice
    outputData = data[start:start+splitLen]

    # Now open the output file, join the new slice with newlines
    # and write it out. Then close the file.
    output = open(outputBase + str(at) + '.txt', 'w')
    output.write('\n'.join(outputData))
    output.close()

    # Increment the counter
    at += 1
  • Might mention that for REALLY BIG FILES, open().read() chews a lot of memory and time. But mostly it's okay. – Sean Cavanagh Feb 13 '09 at 16:21
  • Oh, I know. I just wanted to throw together a working script quickly, and I normally work with small files. I end up with shorthand like that. –  Feb 15 '09 at 21:06
  • This method is actually very fast. I split 1GB file with 7M lines in 28 sec using 1.5GB memory. Compared to this: http://stackoverflow.com/questions/20602869/batch-file-to-split-csv-file it is much faster. – keiv.fly Feb 25 '16 at 10:13
15

A better loop for sli's example, which doesn't hog memory:

splitLen = 20         # 20 lines per file
outputBase = 'output' # output0.txt, output1.txt, etc.

src = open('input.txt', 'r')

count = 0
at = 0
dest = None
for line in src:
    if count % splitLen == 0:
        if dest: dest.close()
        dest = open(outputBase + str(at) + '.txt', 'w')
        at += 1
    dest.write(line)
    count += 1
if dest: dest.close()
src.close()
lacorbeille
  • 325
  • 1
  • 4
  • 8
  • 1
    Careful when copying this code! It leaves open file handles for dest and input. Also, not a great idea to over-write the built-in method "input" – Dhara Nov 07 '19 at 10:52
9

Solution to split binary files into chapters .000, .001, etc.:

FILE = 'scons-conversion.7z'

MAX  = 500*1024*1024  # 500 MiB - max chapter size
BUF  = 50*1024*1024   # 50 MiB  - memory buffer size

chapters = 0
uglybuf  = b''
with open(FILE, 'rb') as src:
  while True:
    tgt = open(FILE + '.%03d' % chapters, 'wb')
    written = 0
    while written < MAX:
      if len(uglybuf) > 0:
        tgt.write(uglybuf)
      tgt.write(src.read(min(BUF, MAX - written)))
      written += min(BUF, MAX - written)
      # Read one byte ahead: if it comes back empty, the source is
      # exhausted and we can stop without creating an empty extra chapter.
      uglybuf = src.read(1)
      if len(uglybuf) == 0:
        break
    tgt.close()
    if len(uglybuf) == 0:
      break
    chapters += 1
Robin
  • 519
  • 1
  • 7
  • 13
anatoly techtonik
  • 19,847
  • 9
  • 124
  • 140
3
def split_file(file, prefix, max_size, buffer=1024):
    """
    file: the input file
    prefix: prefix of the output files that will be created
    max_size: maximum size of each created file in bytes
    buffer: buffer size in bytes

    Returns the suffix of the last file created (numbering starts at 0).
    """
    with open(file, 'rb') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'wb') as tgt:
                written = 0
                while written < max_size:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                        written += len(data)  # count the bytes actually read
                    else:
                        return suffix
                suffix += 1


def cat_files(infiles, outfile, buffer=1024):
    """
    infiles: a list of files
    outfile: the file that will be created
    buffer: buffer size in bytes
    """
    with open(outfile, 'wb') as tgt:
        for infile in sorted(infiles):
            with open(infile, 'rb') as src:
                while True:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                    else:
                        break
NullUserException
  • 83,810
  • 28
  • 209
  • 234
michaelmeyer
  • 7,985
  • 7
  • 30
  • 36
  • 1
    There is a bug if `max_size` is an integer multiple of 1024: `written <= max_size` should be `written < max_size`. I can't edit it because the fix only removes a single character. – yangsibai Jun 24 '15 at 07:37
  • @osrpt Note that this introduces a different off-by-one error where it creates an extra file with zero bytes if the second-to-last file reads all the remaining bytes (eg: if you split a file in half it creates two files and a third file with zero bytes). I suppose this problem isn't as bad. – NullUserException Jun 25 '15 at 01:11
2
import re
PATENTS = 'patent.data'

def split_file(filename):
    # Open the file to read
    with open(filename, "r") as r:

        # Counter
        n = 0

        # Read the file line by line
        for i, line in enumerate(r):

            # If the line matches the template -- <?xml -- increase the counter n
            if re.match(r'<\?xml', line):
                n += 1

                # This "if" can be deleted; without it, naming starts from 1.
                # Whether you keep it depends on where the template is first
                # found. In my case it was the first line.
                if i == 0:
                    n = 0

            # Append the line to the current output file
            with open("{}-{}".format(PATENTS, n), "a") as f:
                f.write(line)

split_file(PATENTS)

As a result you will get:

patent.data-0
patent.data-1
...
patent.data-N

Igor Z
  • 601
  • 6
  • 7
2

You can use the filesplit module from PyPI.
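For instance, a minimal sketch, assuming filesplit's 4.x interface (a Split class with a bylinecount method; the filename and output directory here are placeholders, so check the package docs for the version you install):

from filesplit.split import Split

# Split 'wordlist.txt' (hypothetical name) into parts of 100000 lines
# each, written into the existing directory 'out'.
split = Split('wordlist.txt', 'out')
split.bylinecount(100000)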

Cœur
  • 37,241
  • 25
  • 195
  • 267
Ram
  • 575
  • 2
  • 8
  • 18
2

Sure it's possible:

open input file
open output file 1
count = 0
for each line in input file:
    write line to output file
    count = count + 1
    if count >= maxlines:
         close output file
         open next output file
         count = 0
close last output file
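A minimal runnable rendering of that pseudocode in Python (the filenames and the maxlines value are assumptions, not part of the original answer):

maxlines = 20

count = 0
at = 0
dest = open('output0.txt', 'w')
with open('input.txt') as src:
    for line in src:
        dest.write(line)
        count += 1
        if count >= maxlines:
            dest.close()
            at += 1
            dest = open('output{}.txt'.format(at), 'w')
            count = 0
# Like the pseudocode, this can leave an empty trailing file when the
# line count is an exact multiple of maxlines.
dest.close()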
Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
Charlie Martin
  • 110,348
  • 25
  • 193
  • 263
1

This is a late answer, but a new question was linked here and none of the answers mentioned itertools.groupby.

Assuming you have a (huge) file file.txt that you want to split into chunks of MAXLINES lines each, named file_part1.txt, ..., file_partn.txt, you could do:

import itertools

MAXLINES = 1000

with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin), lambda x: 1 + x[0] // MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
0
import subprocess
subprocess.run('split -l number_of_lines file_path', shell=True)

For example, if you want 50,000 lines per file and the path is /home/data, then you can run the command below:

subprocess.run('split -l 50000 /home/data', shell=True)

If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook or shell you can check the total number of lines using the command below and divide it by the number of splits you want (a Python version is sketched after the example):

! wc -l file_path

in this case

! wc -l /home/data
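Putting that together in Python (a sketch; the number of splits and the /home/data path are assumptions):

import subprocess

num_splits = 10  # how many parts you want
out = subprocess.check_output(['wc', '-l', '/home/data'])
total_lines = int(out.split()[0])          # wc -l prints "<count> <file>"
per_file = -(-total_lines // num_splits)   # ceiling division: at most num_splits parts
subprocess.run(f'split -l {per_file} /home/data', shell=True)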

Just so you know, the output files will not have a file extension; if you want the same extension as the input file, you can rename them manually (e.g. on Windows).

mks2192
  • 306
  • 2
  • 11
-1

All the provided answers are good and (probably) work. However, they need to load the file into memory (in whole or in part), and we know Python is not very efficient at this kind of task (or at least not as efficient as OS-level commands).

I found the following is the most efficient way to do it:

import os

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"  # use numeric suffixes (GNU split)
PREFIX = "__"

if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
    print("Done:")
    os.system(f"ls {PREFIX}??")  # os.system returns the exit status, so don't print() it
else:
    print("Failed!")

Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/

Borhan Kazimipour
  • 405
  • 1
  • 6
  • 13