
I have read several posts, including this one, but none of them helped.

Here is the Python code I currently have that splits the file.

My input file is 15 GB, and I am splitting it into 128 MB chunks; my computer has 8 GB of memory.

import sys

def read_line(f_object,terminal_byte):
    # read one byte at a time until the terminal byte is consumed
    line = ''.join(iter(lambda:f_object.read(1),terminal_byte))
    line += "\x01"
    return line

def read_lines(f_object,terminal_byte):
    tmp = read_line(f_object,terminal_byte)
    while tmp:
        yield tmp
        tmp = read_line(f_object,terminal_byte)

def make_chunks(f_object,terminal_byte,max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object,terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            yield "".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield ''.join(current_chunk)

inputfile=sys.argv[1]

with open(inputfile,"rb") as f_in:
    for i,chunk in enumerate(make_chunks(f_in, bytes(chr(1)),1024*1000*128)):
        with open("out%d.txt"%i,"wb") as f_out:
            f_out.write(chunk)

When I execute the script, I get the following error:

Traceback (most recent call last):
  File "splitter.py", line 30, in <module>
    for i,chunk in enumerate(make_chunks(f_in, bytes(chr(1)),1024*1000*128)):
  File "splitter.py", line 17, in make_chunks
    for line in read_lines(f_object,terminal_byte):
  File "splitter.py", line 12, in read_lines
    tmp = read_line(f_object,terminal_byte)
  File "splitter.py", line 4, in read_line
    line = ''.join(iter(lambda:f_object.read(1),terminal_byte))
MemoryError
  • What's the terminal byte? Is it actually finding it before you use 8 gigabytes of memory? In other words, where are you expecting `\x01`? – juanpa.arrivillaga Aug 25 '17 at 22:57
  • Also, your `max_size` is 131072000. But that is in *number of lines*, so, just *the list itself, without counting the contents* will be `1024*1000*128*1e-9*(8)` gigabytes, which is about 1.05 gigabytes... Again, that isn't counting the *actual objects contained in* the `current_chunk` list. A string the size of `"the quick brown fox jumped over the lazy dog"` is about 81 bytes, so that many strings averaging that size would take `1024*1000*128*1e-8*81` gigabytes, which is about 10.6 gigs! Your code is doomed to fail from the start... – juanpa.arrivillaga Aug 25 '17 at 23:15
  • Fundamentally, if you are trying to read/write in `128MB` chunks, then all of this seems unnecessary... You can just `f_out.write(f_in.read(128000))` in a loop (a minimal sketch follows these comments)... What is the rest of this rigmarole supposed to accomplish, anyway? – juanpa.arrivillaga Aug 25 '17 at 23:19
  • https://stackoverflow.com/questions/45888081/reading-a-big-file-in-binary-with-custom-line-terminator-and-writing-in-smaller/45888380?noredirect=1#comment78740301_45888380 ... – Joran Beasley Aug 25 '17 at 23:29
  • @juanpa.arrivillaga if you read the link above, it explains everything. To explain: the terminal byte is the line delimiter, which is not a newline but `\x01` in my case. So I cannot just read exactly 128MB, since that would result in lines breaking at arbitrary places. – brain storm Aug 26 '17 at 03:54
  • Ok. Well, how many lines are you expecting? Are you sure the terminal byte occurs before you run out of memory? You can scan the file and count. Also, like I explained, your `max_size` is too big. – juanpa.arrivillaga Aug 26 '17 at 06:00
  • The error happens after almost 70% of the split files have been generated; that is, it did not complete for the entire input file. max_size is roughly 128MB, and the split files generated so far were of similar size. – brain storm Aug 27 '17 at 03:45
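
For reference, a minimal sketch of that fixed-size copy loop mentioned in the comments (it ignores the \x01 delimiter entirely, so records can be cut at arbitrary places; the output file names mirror the question's):

import sys

chunk_size = 1024 * 1000 * 128  # roughly 128 MB, as in the question

with open(sys.argv[1], "rb") as f_in:
    i = 0
    while True:
        chunk = f_in.read(chunk_size)
        if not chunk:
            break  # end of input
        with open("out%d.txt" % i, "wb") as f_out:
            f_out.write(chunk)
        i += 1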

1 Answer


Question: Splitting a big file into smaller files

Instead of finding every single \x01, do this only in the last chunk.
Either reset the file pointer to offset+1 of the last found \x01 and continue, or write up to offset into the current chunk file and the remaining part of the chunk into the next chunk file.

Note: Your chunk_size should be io.DEFAULT_BUFFER_SIZE or a multiple of it.
You gain no speedup by raising chunk_size too high.
Read this relevant SO Q&A: Default buffer size for a file
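
For instance, a suitable chunk_size could be picked like this (the factor 16 is only an assumption for illustration):

import io

# io.DEFAULT_BUFFER_SIZE is 8192 bytes on a typical CPython build
chunk_size = io.DEFAULT_BUFFER_SIZE * 16  # 128 KiB, a multiple of the default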

My example below shows how to reset the file pointer, for instance:

import io

large_data = b"""Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01"""

def split(chunk_size, split_size):
    with io.BytesIO(large_data) as fh_in:
        _size = 0
        # Used to verify chunked writes
        result_data = io.BytesIO()

        while True:
            chunk = fh_in.read(chunk_size)
            print('read({})'.format(bytearray(chunk)))
            if not chunk: break

            _size += chunk_size
            if _size >= split_size:
                _size = 0
                # Split on the last \x01
                l = len(chunk)
                print('\tsplit_on_last_\\x01({})\t{}'.format(l, bytearray(chunk)))

                # Reverse iterate to find the last \x01 in the chunk
                for p in range(l-1, -1, -1):
                    c = chunk[p:p+1]
                    if ord(c) == ord('\x01'):
                        offset = l-(p+1)

                        # Case: \x01 is the last byte in the chunk
                        if offset == 0:
                            print('\toffset={} write({})\t\t{}'.format(offset, l - offset, bytearray(chunk)))
                            result_data.write(chunk)
                        else:
                            # Reset the file pointer so the tail after the last \x01 is re-read
                            fh_in.seek(fh_in.tell()-offset)
                            print('\toffset={} write({})\t\t{}'.format(offset, l-offset, bytearray(chunk[:-offset])))
                            result_data.write(chunk[:-offset])
                        break
            else:
                print('\twrite({}) {}'.format(chunk_size, bytearray(chunk)))
                result_data.write(chunk)

        print('INPUT :{}\nOUTPUT:{}'.format(large_data, result_data.getvalue()))   

if __name__ == '__main__':
    split(chunk_size=30, split_size=60)

Output:

read(bytearray(b'Lorem ipsum\x01dolor sit\x01sadipsci'))
    write(30) bytearray(b'Lorem ipsum\x01dolor sit\x01sadipsci')
read(bytearray(b'ng elitr, sed\x01labore et\x01dolore'))
    split_on_last_\x01(30)  bytearray(b'ng elitr, sed\x01labore et\x01dolore')
    offset=6 write(24)      bytearray(b'ng elitr, sed\x01labore et\x01')
read(bytearray(b'dolores et ea rebum.\x01magna ali'))
    write(30) bytearray(b'dolores et ea rebum.\x01magna ali')
read(bytearray(b'quyam erat,\x01'))
    split_on_last_\x01(12)  bytearray(b'quyam erat,\x01')
    offset=0 write(12)      bytearray(b'quyam erat,\x01')
read(bytearray(b''))
INPUT :b'Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01'
OUTPUT:b'Lorem ipsum\x01dolor sit\x01sadipscing elitr, sed\x01labore et\x01dolores et ea rebum.\x01magna aliquyam erat,\x01'

Tested with Python: 3.4.2
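
For what it's worth, here is a sketch of the same reset-the-file-pointer idea applied to real files, writing numbered part files as in the question (the input file name is an assumption, and an empty trailing part file is possible in edge cases):

import io

def split_file(path, split_size, chunk_size=io.DEFAULT_BUFFER_SIZE):
    part = 0
    eof = False
    with open(path, "rb") as fh_in:
        while not eof:
            written = 0
            with open("out%d.txt" % part, "wb") as fh_out:
                while True:
                    chunk = fh_in.read(chunk_size)
                    if not chunk:
                        eof = True
                        break
                    if written + len(chunk) < split_size:
                        # Part not full yet: pass the chunk straight through
                        fh_out.write(chunk)
                        written += len(chunk)
                        continue
                    # Part is full: cut at the last \x01 and push the
                    # tail back onto the input stream
                    cut = chunk.rfind(b"\x01") + 1
                    if cut == 0:
                        # No delimiter in this chunk: keep writing
                        # until one shows up
                        fh_out.write(chunk)
                        written += len(chunk)
                        continue
                    fh_in.seek(fh_in.tell() - (len(chunk) - cut))
                    fh_out.write(chunk[:cut])
                    break
            part += 1

split_file("input.dat", split_size=1024 * 1024 * 128)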

stovfl