Split large files using python

Question

I am having trouble splitting large files (say, around 10 GB). The basic idea is simply to read the lines and group every, say, 40000 lines into one file. But there are two ways of "reading" files.

1) The first is to read the WHOLE file at once and turn it into a LIST. But this requires loading the WHOLE file into memory, which is painful for a file this large. (I think I have asked such questions before.) In Python, the approaches I've tried for reading the WHOLE file at once include:

input1=f.readlines()

input1 = commands.getoutput('zcat ' + file).splitlines(True)

input1 = subprocess.Popen(["cat",file],
                              stdout=subprocess.PIPE,bufsize=1)

Then I can easily group 40000 lines into one file with slices such as list[40000:80000] or list[80000:120000]. The advantage of using a list is that we can easily point to specific lines.

2) The second way is to read line by line and process each line as it is read. The lines that have been read are not kept in memory. Examples include:

f=gzip.open(file)
for line in f: blablabla...

or

for line in fileinput.FileInput(fileName):

I'm sure that for gzip.open, this f is NOT a list but a file object, and it seems we can only process it line by line. How can I then do this "split" job? How can I point to specific lines of the file object?

Thanks

When you think about it, you can't. You can only know which line you are on after you have read all the previous lines and counted the line breaks (\n). (Ignoring the special case of a file in which every line has a known length.) – rplnt, Nov 11 '11 at 16:05
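
To illustrate the comment above: a file object cannot jump to line N, but it can read and discard everything before it. Here is a minimal sketch using itertools.islice; the helper name read_line_range is made up for this example, and it assumes a plain text file (for a gzip file, gzip.open(path, "rt") could stand in for open).

```python
from itertools import islice


def read_line_range(path, start, stop):
    """Return lines start..stop-1 (0-based) of the text file at `path`.

    The file is still read from the beginning: islice consumes and
    discards the first `start` lines, it cannot skip over them.
    """
    with open(path) as f:
        return list(islice(f, start, stop))
```

For example, read_line_range('myinput.txt', 40000, 80000) would return the same lines as list[40000:80000] in the question, while keeping only that one slice in memory.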

yurib · Accepted Answer · 2017-01-23T16:01:32.640

21

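NUM_OF_LINES = 40000
filename = 'myinput.txt'
with open(filename) as fin:
    # Text mode on both sides, so the lines read from fin can be written as-is.
    fout = open("output0.txt", "w")
    for i, line in enumerate(fin):
        fout.write(line)
        if (i + 1) % NUM_OF_LINES == 0:
            # The current file is full: close it and start the next one.
            fout.close()
            fout = open("output%d.txt" % (i // NUM_OF_LINES + 1), "w")

    # If the line count is an exact multiple of NUM_OF_LINES, this last file is empty.
    fout.close()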

edited Jan 23 '17 at 16:01

answered Nov 11 '11 at 16:08

yurib


If you want exactly 40,000 lines in the file, I think you should initialize `i` to `0`, not `1`. – martineau Nov 11 '11 at 16:27
what packages do you need? – L F Jan 23 '17 at 15:39
@LuisFelipe no external packages are needed, `fileinput` is a builtin package and not even required for this functionality, you could just as well use a plain `open()` – yurib Jan 23 '17 at 15:50
I tried the same code and it says "name 'filename' is not defined" – L F Jan 23 '17 at 15:52

@LuisFelipe `filename` is a variable that should contain the path to your input file – yurib Jan 23 '17 at 15:53
@yurib sorry, I was commenting about other question haha – L F Jan 23 '17 at 16:12

bgporter · Answer 2 · 2011-11-13T17:54:57.043

If there's nothing special about having a specific number of file lines in each file, the readlines() function also accepts a size 'hint' parameter that behaves like this:

If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.

...so you could write that code something like this:

# Assume that an average line is about 80 chars long, and that we want about
# 40K lines in each file.

SIZE_HINT = 80 * 40000

fileNumber = 0
with open("inputFile.txt", "rt") as f:
    while True:
        buf = f.readlines(SIZE_HINT)
        if not buf:
            # we've read the entire file in, so we're done.
            break
        # buf is a list of lines, so use writelines(); the with block closes the file.
        with open("outFile%d.txt" % fileNumber, "wt") as outFile:
            outFile.writelines(buf)
        fileNumber += 1

-1 (1) you don't explicitly close the output files (2) reading in text mode and writing in binary mode is GUARANTEED to "mung things if we're on windows" — John Machin, Nov 11 '11 at 19:23
The 'hint' parameter is documented here: https://docs.python.org/3/library/io.html?highlight=readlines#io.IOBase.readlines - the link in the answer does not mention it (anymore?). — natka_m, Apr 29 '22 at 16:35

score 4 · Answer 3 · edited Jun 03 '22 at 11:14


The best solution I have found is using the library filesplit.

You only need to specify the input file, the output folder and the desired size in bytes for output files. Finally, the library will do all the work for you.

from fsplit.filesplit import Filesplit

def split_cb(f, s):
    print("file: {0}, size: {1}".format(f, s))

fs = Filesplit()
fs.split(file="/path/to/source/file", split_size=900000, output_dir="/pathto/output/dir", callback=split_cb)

edited Jun 03 '22 at 11:14

hc_dev


answered Jan 23 '21 at 23:51

rafaoc


Useful for monitoring the generated files and their sizes with a printing callback. This seems to be the old version: [2.0 has the `Filesplit().split()` method](https://github.com/ram-jayapalan/filesplit/blob/2.0.0/fsplit/filesplit.py#L36). The current version [4.0 uses `Split().bysize()`](https://github.com/ram-jayapalan/filesplit/blob/master/src/split.py#L161). See [my answer](https://stackoverflow.com/a/72488739/5730279). – hc_dev Jun 03 '22 at 11:28

score 3 · Answer 4 · answered Nov 11 '11 at 16:07

For a 10GB file, the second approach is clearly the way to go. Here is an outline of what you need to do:

1. Open the input file.
2. Open the first output file.
3. Read one line from the input file and write it to the output file.
4. Maintain a count of how many lines you've written to the current output file; as soon as it reaches 40000, close the output file, and open the next one.
5. Repeat steps 3-4 until you've reached the end of the input file.
6. Close both files.
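
The outline above can be sketched roughly as follows. This is only one possible reading of the steps: the function name, the `pattern` argument and the returned file count are assumptions made for the example. It also opens each output file lazily, so no empty file is left behind when the line count is an exact multiple of the chunk size.

```python
def split_by_line_count(input_path, lines_per_file=40000, pattern="output%d.txt"):
    """Split input_path into files of at most lines_per_file lines each.

    Returns the number of output files written. `pattern` is a
    hypothetical naming scheme, not something the outline specifies.
    """
    file_index = 0
    lines_written = 0
    fout = None
    with open(input_path) as fin:                # step 1: open the input file
        for line in fin:                         # steps 3 and 5: read line by line
            if fout is None:
                # step 2: open an output file only once there is a line for it
                fout = open(pattern % file_index, "w")
            fout.write(line)
            lines_written += 1
            if lines_written == lines_per_file:  # step 4: file is full, close it
                fout.close()
                fout = None
                file_index += 1
                lines_written = 0
    if fout is not None:                         # step 6: close a partly filled last file
        fout.close()
        file_index += 1
    return file_index
```

Calling split_by_line_count('myinput.txt') would then write output0.txt, output1.txt and so on into the current directory.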

`if num_lines % 4000 == 0: avoid_writing_empty_file_at_end() # except when numlines == 0` — John Machin, Nov 11 '11 at 19:10

Jason Sundram · Answer 5 · 2011-11-11T19:23:53.340

3

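import fileinput

chunk_size = 40000
filename = 'myinput.txt'  # placeholder: path to your input file
fout = None
for (i, line) in enumerate(fileinput.FileInput(filename)):
    if i % chunk_size == 0:
        # Start a new output file every chunk_size lines.
        if fout: fout.close()
        fout = open('output%d.txt' % (i // chunk_size), 'w')
    fout.write(line)
if fout: fout.close()  # fout is still None when the input is empty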

edited Nov 11 '11 at 19:23

answered Nov 11 '11 at 16:15

Jason Sundram


You need to do `if fout: fout.close()` after you exit the loop – John Machin Nov 11 '11 at 19:18

score 0 · Answer 6 · answered May 19 '22 at 16:42

I created this small script to split the large file in a few seconds. It took only 20 seconds to split a text file with 20M lines into 10 small files each with 2M lines.

split_length = 2_000_000
file_count = 0
# Note: readlines() loads the whole file into memory.
with open('large-file.txt', encoding='utf-8', errors='ignore') as source:
    large_file = source.readlines()

# Step through the list one chunk at a time, so the last (possibly shorter) chunk is written too.
for split_start_value in range(0, len(large_file), split_length):
    split_end_value = split_start_value + split_length
    file_content_list = large_file[split_start_value:split_end_value]
    with open(f'splitted-file-{file_count}.txt', 'w', encoding='utf-8', errors='ignore') as new_file:
        new_file.writelines(file_content_list)
    file_count += 1
    print(f'created file {file_count}')

hc_dev · Answer 7 · 2022-06-03T11:31:12.393

To split a file line-wise:

group every, say 40000 lines into one file

You can use the filesplit module with its bylinecount method (version 4.0):

import os
from filesplit.split import Split

LINES_PER_FILE = 40_000  # see PEP515 for readable numeric literals 
filename = 'myinput.txt'
outdir = 'splitted/'  # to store split-files `myinput_1.txt` etc.

Split(filename, outdir).bylinecount(LINES_PER_FILE)

This is similar to rafaoc's answer which apparently used outdated version 2.0 to split by size.

score 0 · Answer 8 · answered Nov 11 '11 at 17:24

Obviously, as you are doing work on the file, you will need to iterate over the file's contents in some way -- whether you do that manually or you let a part of the Python API do it for you (e.g. the readlines() method) is not important. In big O analysis, this means you will spend O(n) time (n being the size of the file).

But reading the file into memory also requires O(n) space. Although sometimes we do need to read a 10 GB file into memory, your particular problem does not require it. We can iterate over the file object directly. Of course, the file object itself takes some space, but we have no reason to hold the contents of the file twice in two different forms.

Therefore, I would go with your second solution.
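
A middle ground between the two approaches is to pull the lines off the file object one chunk at a time, so that each group is available as a list while memory use stays bounded by the chunk size. The sketch below assumes a plain text input; iter_chunks, split_in_chunks and the `pattern` argument are names invented for this example.

```python
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of up to chunk_size items from an iterable."""
    lines = iter(lines)
    while True:
        chunk = list(islice(lines, chunk_size))
        if not chunk:
            return
        yield chunk


def split_in_chunks(input_path, chunk_size=40000, pattern="chunk%d.txt"):
    """Write each chunk of lines to its own file; return the number of files."""
    count = 0
    with open(input_path) as fin:
        # Only one chunk (at most chunk_size lines) is held in memory at a time.
        for count, chunk in enumerate(iter_chunks(fin, chunk_size), start=1):
            with open(pattern % (count - 1), "w") as fout:
                fout.writelines(chunk)
    return count
```

Because iter_chunks accepts any iterable of lines, a file opened with gzip.open(path, "rt") could be passed to it in the same way.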
