
I have a file that uses \x01 as the line terminator. That is, the line terminator is NOT a newline but the byte value 001 (its ASCII representation is ^A).

I want to split the file into chunks of 10 MB each. Here is what I came up with

size = 10 * 1000 * 1000  # 10 MB
i = 0
with open("in-file", "rb") as ifile:
    ofile = open("output0.txt", "wb")
    data = ifile.read(size)
    while data:
        ofile.write(data)
        ofile.close()
        data = ifile.read(size)
        i += 1
        ofile = open("output%d.txt" % (i), "wb")
    ofile.close()

However, this results in files that are broken at arbitrary places. I want each file to end only at a byte value of 001, with the next read resuming from the following byte.
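For anyone reproducing this, a minimal sample input (the file name matches the question; the record contents are mine, purely illustrative) can be built like so:

import os

# build a small test input: records terminated by \x01 instead of \n
records = [b"alpha", b"beta", b"gamma"]
with open("in-file", "wb") as f:
    for rec in records:
        f.write(rec + b"\x01")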

brain storm

1 Answer

if it's just a one-byte terminator, you can do something like

def read_line(f_object, terminal_byte):  # it's one line, you could just as easily inline it
    # fall back to the terminator at EOF so iter() stops instead of spinning on b"" forever
    return b"".join(iter(lambda: f_object.read(1) or terminal_byte, terminal_byte))

then make a helper function that will read all the lines in a file

def read_lines(f_object, terminal_byte):
    tmp = read_line(f_object, terminal_byte)
    while tmp:  # read_line returns b"" once the file is exhausted
        yield tmp
        tmp = read_line(f_object, terminal_byte)
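Continuing the same io.BytesIO stand-in, the generator yields one record per terminator and stops cleanly at EOF:

import io

buf = io.BytesIO(b"one\x01two\x01three\x01")
print(list(read_lines(buf, b"\x01")))  # [b'one', b'two', b'three']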

then make a function that will chunk it up

def make_chunks(f_object, terminal_byte, max_size):
    current_chunk = []
    current_chunk_size = 0
    for line in read_lines(f_object, terminal_byte):
        current_chunk.append(line)
        current_chunk_size += len(line)
        if current_chunk_size > max_size:
            yield b"".join(current_chunk)
            current_chunk = []
            current_chunk_size = 0
    if current_chunk:
        yield b"".join(current_chunk)

then just do something like

with open("my_binary.dat","rb") as f_in:
    for i,chunk in enumerate(make_chunks(f_in,"\x01",1024*1000*10)):
        with open("out%d.dat"%i,"wb") as f_out:
            f_out.write(chunk)
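As the comment thread below works out, the \x01 terminators are stripped from the output files. A minimal sketch of one way to keep them (the function name is mine; it mirrors the fix the asker eventually landed on) is to re-append the byte in the line reader and use it in place of read_line:

def read_line_keep_terminator(f_object, terminal_byte):
    # hypothetical variant: same byte-at-a-time scan, but re-append the
    # terminator so it survives into the output files
    data = b"".join(iter(lambda: f_object.read(1) or terminal_byte, terminal_byte))
    return data + terminal_byte if data else data

One caveat: if the input does not end with \x01, this still appends one to the final record.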

there might be some way to do this with libraries (or even an awesome builtin way), but I'm not aware of any offhand

Joran Beasley
  • It doesn't seem to split on the terminal_byte. The terminal byte I used is `bytes(chr(1))` – brain storm Aug 25 '17 at 20:37
  • I just noticed that the terminal byte is not written to the output file. I want to join on "\x01" – brain storm Aug 25 '17 at 20:48
  • I modified `"".join(iter(lambda:f_object.read(1),terminal_byte))` to `"\x01".join(iter(lambda:f_object.read(1),terminal_byte))` and `yield "\x01".join(current_chunk)`, but that is not working – brain storm Aug 25 '17 at 21:02
  • What does "that is not working" mean? You probably just want `"\x01".join(current_chunk)+"\x01"` – Joran Beasley Aug 25 '17 at 21:11
  • That did not work. However, this worked: `def read_line(f_object,terminal_byte): return ''.join(iter(lambda:f_object.read(1),terminal_byte)) + "\x01"` – brain storm Aug 25 '17 at 21:32
  • I posted another question which is an offshoot of this one. Kindly take a look: https://stackoverflow.com/questions/45890338/memory-error-when-splitting-big-file-into-smaller-files-in-python – brain storm Aug 25 '17 at 22:55