Is it possible to split a file? For example, if you have a huge wordlist, I want to split it so that it becomes more than one file. How is this possible?
-
This is certainly possible. If you want useful answers, you may want to provide some useful details. – EBGreen Feb 13 '09 at 16:08
-
do you want to do it with python? how is this file structured? is it a text file? – Paolo Tedesco Feb 13 '09 at 16:10
-
Is this a duplicate? See: [http://stackoverflow.com/questions/291740/how-do-i-split-a-huge-text-file-in-python](http://stackoverflow.com/questions/291740/how-do-i-split-a-huge-text-file-in-python) – quamrana Feb 13 '09 at 17:00
10 Answers
This one splits a file up by newlines and writes it back out. You can change the delimiter easily. It also handles uneven amounts, in case you don't have a multiple of splitLen lines (20 in this example) in your input file.
splitLen = 20         # 20 lines per file
outputBase = 'output' # output.1.txt, output.2.txt, etc.

# This is shorthand and not friendly with memory
# on very large files (Sean Cavanagh), but it works.
input = open('input.txt', 'r').read().split('\n')

at = 1
for lines in range(0, len(input), splitLen):
    # First, get the list slice
    outputData = input[lines:lines+splitLen]

    # Now open the output file, join the new slice with newlines
    # and write it out. Then close the file.
    output = open(outputBase + str(at) + '.txt', 'w')
    output.write('\n'.join(outputData))
    output.close()

    # Increment the counter
    at += 1
-
Might mention that for REALLY BIG FILES, open().read() chews a lot of memory and time. But mostly it's okay. – Sean Cavanagh Feb 13 '09 at 16:21
-
Oh, I know. I just wanted to throw together a working script quickly, and I normally work with small files. I end up with shorthand like that. – Feb 15 '09 at 21:06
-
This method is actually very fast. I split a 1 GB file with 7M lines in 28 seconds using 1.5 GB of memory. Compared to this: http://stackoverflow.com/questions/20602869/batch-file-to-split-csv-file it is much faster. – keiv.fly Feb 25 '16 at 10:13
A better loop for sli's example, not hogging memory:
splitLen = 20         # 20 lines per file
outputBase = 'output' # output.1.txt, output.2.txt, etc.

input = open('input.txt', 'r')

count = 0
at = 0
dest = None
for line in input:
    if count % splitLen == 0:
        if dest: dest.close()
        dest = open(outputBase + str(at) + '.txt', 'w')
        at += 1
    dest.write(line)
    count += 1

-
Careful when copying this code! It leaves open file handles for dest and input. Also, it's not a great idea to overwrite the built-in function input. – Dhara Nov 07 '19 at 10:52
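To address that comment, here is a minimal sketch (not from the original answer) of the same loop that closes the handles and avoids shadowing the built-in name input; the file names are the same placeholders as above:

splitLen = 20         # 20 lines per file
outputBase = 'output' # output.1.txt, output.2.txt, etc.

at = 0
dest = None
with open('input.txt', 'r') as src:   # closed automatically when the block ends
    for count, line in enumerate(src):
        if count % splitLen == 0:
            if dest:
                dest.close()
            dest = open(outputBase + str(at) + '.txt', 'w')
            at += 1
        dest.write(line)
if dest:
    dest.close()   # close the last output file as well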
Solution to split binary files into chapters .000, .001, etc.:
FILE = 'scons-conversion.7z'

MAX = 500*1024*1024       # 500 MB - max chapter size
BUF = 50*1024*1024*1024   # 50 GB  - memory buffer size

chapters = 0
uglybuf = b''
with open(FILE, 'rb') as src:
    while True:
        tgt = open(FILE + '.%03d' % chapters, 'wb')
        written = 0
        while written < MAX:
            if len(uglybuf) > 0:
                tgt.write(uglybuf)
            tgt.write(src.read(min(BUF, MAX - written)))
            written += min(BUF, MAX - written)
            uglybuf = src.read(1)
            if len(uglybuf) == 0:
                break
        tgt.close()
        if len(uglybuf) == 0:
            break
        chapters += 1


def split_file(file, prefix, max_size, buffer=1024):
    """
    file: the input file
    prefix: prefix of the output files that will be created
    max_size: maximum size of each created file in bytes
    buffer: buffer size in bytes

    Returns the number of parts created.
    """
    with open(file, 'r+b') as src:
        suffix = 0
        while True:
            with open(prefix + '.%s' % suffix, 'w+b') as tgt:
                written = 0
                while written < max_size:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                        written += buffer
                    else:
                        return suffix
            suffix += 1


def cat_files(infiles, outfile, buffer=1024):
    """
    infiles: a list of files
    outfile: the file that will be created
    buffer: buffer size in bytes
    """
    with open(outfile, 'w+b') as tgt:
        for infile in sorted(infiles):
            with open(infile, 'r+b') as src:
                while True:
                    data = src.read(buffer)
                    if data:
                        tgt.write(data)
                    else:
                        break
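A hypothetical usage sketch, not part of the original answer; the file names are assumptions. Note that cat_files sorts the part names lexicographically, so with ten or more parts the unpadded numeric suffixes will not come back in numeric order:

import glob

# Hypothetical example: split an assumed file "data.bin" into ~1 MiB parts,
# then stitch the parts back together into a new file.
n = split_file('data.bin', 'data.part', max_size=1024 * 1024)
print('last suffix written:', n)

# cat_files sorts its input lexicographically, which only matches numeric
# order while there are fewer than ten parts (see the note above).
cat_files(glob.glob('data.part.*'), 'data_rejoined.bin')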


-
There is a bug if `max_size` is an integer multiple of 1024: `written <= max_size` should be `written < max_size`. I can't edit it because the change only removes a character. – yangsibai Jun 24 '15 at 07:37
-
@osrpt Note that this introduces a different off-by-one error where it creates an extra file with zero bytes if the second-to-last file reads all the remaining bytes (e.g., if you split a file in half it creates two files and a third file with zero bytes). I suppose this problem isn't as bad. – NullUserException Jun 25 '15 at 01:11
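A hedged variant (my sketch, not from the answer or the comments) that sidesteps both issues by reading a chunk before opening the next part, so a strict size check is kept and no zero-byte trailing file is created:

def split_file_no_empty_tail(file, prefix, max_size, buffer=1024):
    # Sketch only: same interface as split_file above, but it reads ahead
    # before opening each part file, so no empty trailing part is written.
    with open(file, 'rb') as src:
        suffix = 0
        data = src.read(buffer)
        while data:
            with open('{}.{}'.format(prefix, suffix), 'wb') as tgt:
                written = 0
                while data and written < max_size:
                    tgt.write(data)
                    written += len(data)   # count real bytes, not the buffer size
                    data = src.read(buffer)
            suffix += 1
        return suffix   # number of parts actually created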
import re

PATENTS = 'patent.data'

def split_file(filename):
    # Open the file to read
    with open(filename, "r") as r:
        # Counter
        n = 0
        # Start reading the file line by line
        for i, line in enumerate(r):
            # If the line matches the template -- <?xml -- increase counter n
            if re.match(r'\<\?xml', line):
                n += 1
            # This "if" can be deleted; without it, naming starts from 1.
            # Keep or drop it depending on where "re" first finds the
            # template. In my case it was the first line.
            if i == 0:
                n = 0
            # Write the line to the current output file
            with open("{}-{}".format(PATENTS, n), "a") as f:
                f.write(line)

split_file(PATENTS)
As a result you will get:

patent.data-0
patent.data-1
...
patent.data-N

Sure it's possible:

open input file
open output file 1
count = 0
for each line in file:
    write line to output file
    count = count + 1
    if count >= maxlines:
        close output file
        open next output file
        count = 0
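A rough Python translation of that pseudocode (a sketch only; the file names and the maxlines value are placeholders, not from the original answer):

maxlines = 1000                     # placeholder chunk size
count = 0
part = 1
out = open('output.1.txt', 'w')     # placeholder output name
with open('input.txt') as src:      # placeholder input name
    for line in src:
        out.write(line)
        count += 1
        if count >= maxlines:
            out.close()
            part += 1
            out = open('output.{}.txt'.format(part), 'w')
            count = 0
out.close()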


This is a late answer, but a new question was linked here and none of the answers mentioned itertools.groupby.
Assuming you have a (huge) file file.txt that you want to split into chunks of MAXLINES lines, file_part1.txt, ..., file_partn.txt, you could do:
import itertools

MAXLINES = 3  # lines per chunk

with open("file.txt") as fdin:
    for i, sub in itertools.groupby(enumerate(fdin), lambda x: 1 + x[0] // MAXLINES):
        with open("file_part{}.txt".format(i), "w") as fdout:
            for _, line in sub:
                fdout.write(line)

import subprocess

subprocess.run('split -l number_of_lines file_path', shell=True)

For example, if you want 50,000 lines per file and the path is /home/data, you can run the command below:

subprocess.run('split -l 50000 /home/data', shell=True)

If you are not sure how many lines to keep in each split file but know how many splits you want, then in a Jupyter Notebook or shell you can check the total number of lines with the command below and divide it by the number of splits you want (see the sketch after this answer):

! wc -l file_path

In this case:

! wc -l /home/data

Also note that the output files will not carry the input file's extension; you can add the extension manually, for example on Windows.
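Putting those two steps together, here is a small sketch (not from the original answer; the path and split count are placeholder values) that counts the lines in Python and derives the -l value from the number of splits you want:

import math
import subprocess

path = '/home/data'   # placeholder path from the answer
num_splits = 4        # placeholder: how many pieces you want

with open(path, 'rb') as f:
    total_lines = sum(1 for _ in f)   # same count that `wc -l` reports

lines_per_file = math.ceil(total_lines / num_splits)
subprocess.run(['split', '-l', str(lines_per_file), path], check=True)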

All the provided answers are good and (probably) work. However, they need to load the file into memory (in whole or in part). We know Python is not very efficient at this kind of task (or at least not as efficient as the OS-level commands).
I found the following is the most efficient way to do it:
import os

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"
PREFIX = "__"

if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
    print("Done:")
    print(os.system(f"ls {PREFIX}??"))
else:
    print("Failed!")
Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/
