
I have a pretty big file (more than 20GB) and I'd like to split it into smaller ones, like multiple files of 2GB.

One constraint: each split has to happen just before a specific line (the `Recno::` header shown below).

I'm using Python, but if there's another solution, in shell for example, I'm up for it.

This is what the big file looks like:

bigfile.txt (20GB)

Recno:: 0
some data...

Recno:: 1
some data...

Recno:: 2
some data...

Recno:: 3
some data...

Recno:: 4
some data...

Recno:: 5
some data...

Recno:: x
some more data...

This is what I want:

file1.txt (2 GB +/-)

Recno:: 0
some data...

Recno:: 1
some data...

file2.txt (2GB +/-)

Recno:: 2
some data...

Recno:: 3
some data...

Recno:: 4
some data...

And so on, and so on...

Thanks!

Difender

2 Answers


You could do something like this:

import sys

try:
    _, size, file = sys.argv
    size = int(size)
except ValueError:
    sys.exit('Usage: splitter.py <size in bytes> <filename to split>')

with open(file) as infile:
    count = 0
    current_size = 0
    # you could do something more fancy with the
    # name, e.g. with os.path.splitext
    outfile = open(file + '_0', 'w')
    for line in infile:
        # only start a new file at a record boundary
        if current_size > size and line.startswith('Recno'):
            outfile.close()
            count += 1
            current_size = 0
            outfile = open(file + '_{}'.format(count), 'w')
        # len() counts characters, which matches bytes for ASCII data
        current_size += len(line)
        outfile.write(line)
    outfile.close()
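To sanity-check the approach, here is a self-contained version of the same loop run against a small generated sample (the temp directory, file name, and tiny 50-byte threshold are made up for the demo):

```python
import os
import tempfile

# Build a tiny sample file in the Recno:: format from the question
sample = ''.join('Recno:: {}\nsome data...\n\n'.format(i) for i in range(6))

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'bigfile.txt')
with open(src, 'w') as f:
    f.write(sample)

size = 50  # tiny threshold so the demo produces several parts
count = 0
current_size = 0
outfile = open(src + '_0', 'w')
with open(src) as infile:
    for line in infile:
        # only rotate to a new file at a record boundary
        if current_size > size and line.startswith('Recno'):
            outfile.close()
            count += 1
            current_size = 0
            outfile = open(src + '_{}'.format(count), 'w')
        current_size += len(line)
        outfile.write(line)
outfile.close()

# Every part should begin with a record header
for i in range(count + 1):
    with open(src + '_{}'.format(i)) as f:
        print(i, repr(f.readline()))
```

Because rotation only happens on a `Recno` line, each output file begins with a complete record, and parts run slightly over the threshold rather than cutting a record in half.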
Wayne Werner

As a comment above mentions, you can use `split` in the bash shell:

split -b 2000m <path-to-your-file>
JoshuaBox
  • As I said I do not want to split ONLY on the size. I must split on the size but also on a given line. For example, each file has to start with a `Recno:: x` – Difender Jul 26 '16 at 13:25
  • you could monitor the file size in Python with `os.stat('/path/to/file/').st_size` in a while loop – JoshuaBox Jul 26 '16 at 13:38