22

Is it possible to split a file? For example, say you have a huge wordlist; I want to split it so that it becomes more than one file. How is this possible?

Georg Schölly
  • 124,188
  • 49
  • 220
  • 267
localhost
  • 863
  • 6
  • 12
  • 12
  • This is certainly possible. If you want useful answers, you may want to provide some useful details. – EBGreen Feb 13 '09 at 16:08
  • do you want to do it with python? how is this file structured? is it a text file? – Paolo Tedesco Feb 13 '09 at 16:10
  • Is this a duplicate? See: http://stackoverflow.com/questions/291740/how-do-i-split-a-huge-text-file-in-python – quamrana Feb 13 '09 at 17:00

10 Answers

22

This one splits a file up by newlines and writes it back out. You can change the delimiter easily. It also handles uneven amounts, if the number of lines in your input file is not a multiple of splitLen (20 in this example).

splitLen = 20         # 20 lines per file
outputBase = 'output' # output1.txt, output2.txt, etc.

# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
# (Renamed from "input" to avoid shadowing the built-in.)
data = open('input.txt', 'r').read().split('\n')

at = 1
for start in range(0, len(data), splitLen):
    # First, get the list slice
    outputData = data[start:start+splitLen]

    # Now open the output file, join the new slice with newlines
    # and write it out. Then close the file.
    output = open(outputBase + str(at) + '.txt', 'w')
    output.write('\n'.join(outputData))
    output.close()

    # Increment the counter
    at += 1
  • Might mention that for REALLY BIG FILES, open().read() chews a lot of memory and time. But mostly it's okay. – Sean Cavanagh Feb 13 '09 at 16:21
  • Oh, I know. I just wanted to throw together a working script quickly, and I normally work with small files. I end up with shorthand like that. –  Feb 15 '09 at 21:06
  • This method is actually very fast. I split 1GB file with 7M lines in 28 sec using 1.5GB memory. Compared to this: http://stackoverflow.com/questions/20602869/batch-file-to-split-csv-file it is much faster. – keiv.fly Feb 25 '16 at 10:13
15

A better loop for sli's example, which doesn't hog memory:

splitLen = 20         # 20 lines per file
outputBase = 'output' # output0.txt, output1.txt, etc.

src = open('input.txt', 'r')

count = 0
at = 0
dest = None
for line in src:
    if count % splitLen == 0:
        if dest: dest.close()
        dest = open(outputBase + str(at) + '.txt', 'w')
        at += 1
    dest.write(line)
    count += 1
if dest: dest.close()
src.close()
lacorbeille
  • 325
  • 1
  • 4
  • 8
  • 1
    Careful when copying this code! It leaves open file handles for dest and input. Also, not a great idea to over-write the built-in method "input" – Dhara Nov 07 '19 at 10:52
9

Solution to split binary files into chapters .000, .001, etc.:

FILE = 'scons-conversion.7z'

MAX  = 500*1024*1024  # 500 MiB - max chapter size
BUF  = 50*1024*1024   # 50 MiB  - memory buffer size

chapters = 0
uglybuf  = b''
with open(FILE, 'rb') as src:
  while True:
    tgt = open(FILE + '.%03d' % chapters, 'wb')
    written = 0
    while written < MAX:
      if len(uglybuf) > 0:
        tgt.write(uglybuf)
      tgt.write(src.read(min(BUF, MAX - written)))
      written += min(BUF, MAX - written)
      # Read one byte ahead: if it comes back empty, the source is
      # exhausted and we can stop without creating an empty extra chapter.
      uglybuf = src.read(1)
      if len(uglybuf) == 0:
        break
    tgt.close()
    if len(uglybuf) == 0:
      break
    chapters += 1
Robin
  • 519
  • 1
  • 7
  • 13
anatoly techtonik
  • 19,847
  • 9
  • 124
  • 140
3
def split_file(file, prefix, max_size, buffer=1024):
    """
    file: the input file
    prefix: prefix of the output files that will be created
    max_size: maximum size of each created file in bytes
    buffer: buffer size in bytes

    Returns the suffix of the last file created (numbering starts at 0).
    """
    with open(file, 'rb') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'wb') as tgt:
                written = 0
                while written < max_size:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                        written += len(data)  # count the bytes actually read
                    else:
                        return suffix
                suffix += 1


def cat_files(infiles, outfile, buffer=1024):
    """
    infiles: a list of files
    outfile: the file that will be created
    buffer: buffer size in bytes
    """
    with open(outfile, 'wb') as tgt:
        for infile in sorted(infiles):
            with open(infile, 'rb') as src:
                while True:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                    else:
                        break
NullUserException
  • 83,810
  • 28
  • 209
  • 234
michaelmeyer
  • 7,985
  • 7
  • 30
  • 36
  • 1
    There is a bug if `max_size` is an integer multiple of 1024: `written <= max_size` should be `written < max_size`. I can't edit it because the fix only removes a single character. – yangsibai Jun 24 '15 at 07:37
  • @osrpt Note that this introduces a different off-by-one error where it creates an extra file with zero bytes if the second-to-last file reads all the remaining bytes (eg: if you split a file in half it creates two files and a third file with zero bytes). I suppose this problem isn't as bad. – NullUserException Jun 25 '15 at 01:11
2
import re
PATENTS = 'patent.data'

def split_file(filename):
    # Open the file to read
    with open(filename, "r") as r:

        # Counter
        n = 0

        # Read the file line by line
        for i, line in enumerate(r):

            # If the line matches the template -- <?xml -- increase the counter n
            if re.match(r'<\?xml', line):
                n += 1

                # This "if" can be deleted; without it, naming starts from 1.
                # Whether you keep it depends on where the template is first
                # found. In my case it was the first line.
                if i == 0:
                    n = 0

            # Append the line to the current output file
            with open("{}-{}".format(PATENTS, n), "a") as f:
                f.write(line)

split_file(PATENTS)

As a result you will get:

patent.data-0
patent.data-1
...
patent.data-N

Igor Z
  • 601
  • 6
  • 7
2

You can use the filesplit module from PyPI.
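For instance, a minimal sketch, assuming filesplit's 4.x interface (a Split class with a bylinecount method; the filename and output directory here are placeholders, so check the package docs for the version you install):

from filesplit.split import Split

# Split 'wordlist.txt' (hypothetical name) into parts of 100000 lines
# each, written into the existing directory 'out'.
split = Split('wordlist.txt', 'out')
split.bylinecount(100000)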

Cœur
  • 37,241
  • 25
  • 195
  • 267
Ram
  • 575
  • 2
  • 8
  • 18
2

Sure it's possible:

open input file
open output file 1
count = 0
for each line in input file:
    write line to output file
    count = count + 1
    if count >= maxlines:
         close output file
         open next output file
         count = 0
close last output file
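A minimal runnable rendering of that pseudocode in Python (the filenames and the maxlines value are assumptions, not part of the original answer):

maxlines = 20

count = 0
at = 0
dest = open('output0.txt', 'w')
with open('input.txt') as src:
    for line in src:
        dest.write(line)
        count += 1
        if count >= maxlines:
            dest.close()
            at += 1
            dest = open('output{}.txt'.format(at), 'w')
            count = 0
# Like the pseudocode, this can leave an empty trailing file when the
# line count is an exact multiple of maxlines.
dest.close()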
Aaron Digulla
  • 321,842
  • 108
  • 597
  • 820
Charlie Martin
  • 110,348
  • 25
  • 193
  • 263
1

This is a late answer, but a new question was linked here and none of the answers mentioned itertools.groupby.

Assuming you have a (huge) file file.txt that you want to split into chunks of MAXLINES lines each, named file_part1.txt, ..., file_partn.txt, you could do:

import itertools

MAXLINES = 1000

with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin), lambda x: 1 + x[0] // MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)
Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
0
import subprocess
subprocess.run('split -l number_of_lines file_path', shell=True)

For example, if you want 50,000 lines per file and the path is /home/data, then you can run the command below:

subprocess.run('split -l 50000 /home/data', shell=True)

If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook or shell you can check the total number of lines using the command below and divide it by the number of splits you want (a Python version is sketched after the example):

! wc -l file_path

in this case

! wc -l /home/data
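Putting that together in Python (a sketch; the number of splits and the /home/data path are assumptions):

import subprocess

num_splits = 10  # how many parts you want
out = subprocess.check_output(['wc', '-l', '/home/data'])
total_lines = int(out.split()[0])          # wc -l prints "<count> <file>"
per_file = -(-total_lines // num_splits)   # ceiling division: at most num_splits parts
subprocess.run(f'split -l {per_file} /home/data', shell=True)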

Just so you know, the output files will not have a file extension; if you want the same extension as the input file, you can rename them manually (e.g. on Windows).

mks2192
  • 306
  • 2
  • 11
-1

All the provided answers are good and (probably) work. However, they need to load the file into memory (in whole or in part), and we know Python is not very efficient at this kind of task (or at least not as efficient as OS-level commands).

I found the following is the most efficient way to do it:

import os

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"  # use numeric suffixes (GNU split)
PREFIX = "__"

if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
    print("Done:")
    os.system(f"ls {PREFIX}??")  # os.system returns the exit status, so don't print() it
else:
    print("Failed!")

Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/

Borhan Kazimipour
  • 405
  • 1
  • 6
  • 13