
In the code below I read a large file that has a special structure: among other things, it contains two blocks that need to be processed at the same time. Instead of seeking back and forth in the file, I load these two blocks wrapped in memoryview calls:

with open(abs_path, 'rb') as bsa_file:
    # ...
    # load the file record block to parse later
    file_records_block = memoryview(bsa_file.read(file_records_block_size))
    # load the file names block
    file_names_block = memoryview(bsa_file.read(total_file_name_length))
    # close the file
file_records_index = names_record_index = 0
for folder_record in folder_records:
    name_size = struct.unpack_from('B', file_records_block, file_records_index)[0]
    # discard null terminator below
    folder_path = struct.unpack_from('%ds' % (name_size - 1),
        file_records_block, file_records_index + 1)[0]
    file_records_index += name_size + 1
    for __ in xrange(folder_record.files_count):
        file_name_len = 0
        for b in file_names_block[names_record_index:]:
            if b != '\x00': file_name_len += 1
            else: break
        file_name = unicode(struct.unpack_from('%ds' % file_name_len,
            file_names_block,names_record_index)[0])
        names_record_index += file_name_len + 1

The file is parsed correctly, but as this is my first use of the memoryview interface I am not sure I am doing it right. As the code shows, file_names_block consists of null-terminated C strings.

  1. Does my trick file_names_block[names_record_index:] use the memoryview magic, or do I create some n^2 slices? Would I need to use islice here?
  2. As shown, I just look for the null byte manually and then proceed to unpack_from. But I read in "How to split a byte string into separate bytes in python" that I can use cast() on the memoryview. Is there any way to use that (or another trick) to split the view into bytes? Could I just call split('\x00')? Would this preserve the memory efficiency?

I would appreciate insight on the one right way to do this (in python 2).

Mr_and_Mrs_D
  • I don't think memoryviews are getting you anything here; a memory view, like the `struct` module, has no specific facilities for null-terminated strings. – Martijn Pieters Oct 18 '16 at 16:04

1 Answer


A memoryview is not going to give you any advantage when it comes to null-terminated strings, as memoryviews have no facilities for anything but fixed-width data. You may as well use bytes.split() here instead:

file_names_block = bsa_file.read(total_file_name_length)
file_names = file_names_block.split(b'\x00')
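
One detail worth checking with split(): a block in which every name ends with a null byte produces one trailing empty string. The following is a minimal sketch of handling that, assuming the block really is nothing but null-terminated names; split_names and sample_block are hypothetical names used for illustration only.

```python
def split_names(file_names_block):
    """Split a block of null-terminated names into a list of byte strings."""
    names = file_names_block.split(b'\x00')
    # A block that ends with a terminator yields one trailing empty string.
    if names and names[-1] == b'':
        names.pop()
    return names


# Hypothetical sample data standing in for bsa_file.read(total_file_name_length)
sample_block = b'a.nif\x00b.dds\x00c.wav\x00'
file_names = split_names(sample_block)
print(file_names)
```

Note that split() builds a new bytes object for every name, so the whole block is effectively copied once; that is usually acceptable because the names are needed as separate strings anyway.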

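Slicing a memoryview doesn't use extra memory (other than the small view object itself), so file_names_block[names_record_index:] does not copy the remaining bytes. Accessing elements does produce new native objects, though: if you use a cast (cast() is only available on Python 3.3 and later), new objects are created for the parsed memory region the moment you try to access elements in the sequence.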
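
To illustrate the no-copy claim, here is a small sketch showing that a slice of a memoryview still refers to the original buffer. A bytearray is used only so the buffer can be modified for the demonstration; buf, view and tail are illustrative names.

```python
# bytearray is used here only so the buffer can be mutated for the demonstration
buf = bytearray(b'abc\x00de\x00')
view = memoryview(buf)

tail = view[4:]               # new view object, same underlying buffer, no copy
buf[4] = ord('X')             # change the buffer through the original object

tail_bytes = tail.tobytes()   # the bytes are copied only here
print(tail_bytes)
```

Because the slice sees the change made through buf, taking a slice per iteration does not duplicate the data. The per-byte Python loop over the slice is still slow, however, which is a separate cost from memory use.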

You can still use the memoryview for the file_records_block parsing; those strings are prefixed by a length, which gives you the opportunity to use slicing. Just keep slicing bytes off the memoryview as you process folder_path values; there's no need to keep an index:

for folder_record in folder_records:
    name_size = file_records_block[0]  # first byte is the length; on Python 2.7 use ord(file_records_block[0])
    folder_path = file_records_block[1:name_size].tobytes()
    file_records_block = file_records_block[name_size + 1:]  # skip the null

On Python 3, indexing a memoryview that was sourced from a bytes object gives you the integer value of a byte; on Python 2.7 it gives a one-character str instead, so wrap it in ord() there. .tobytes() on a given slice gives you a new bytes string for that section, and you can then continue to slice to leave the remainder for the next loop.
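
The loop above can be tried on sample data with the following sketch. It assumes the layout settled on in the comments (a one-byte length that includes the null terminator, then the name, then the null) and that the block holds nothing but such names back to back. first_byte, read_folder_paths and sample are hypothetical names; first_byte exists only to cope with indexing returning a one-character string (as on Python 2.7) rather than an integer.

```python
def first_byte(view):
    """Return the first byte of a memoryview as an integer."""
    b = view[0]
    # Python 3 gives an int; Python 2.7 gives a one-character str.
    return b if isinstance(b, int) else ord(b)


def read_folder_paths(file_records_block, folder_count):
    """Read folder_count length-prefixed, null-terminated names."""
    view = memoryview(file_records_block)
    paths = []
    for _ in range(folder_count):
        name_size = first_byte(view)               # length includes the null
        paths.append(view[1:name_size].tobytes())  # name without the null
        view = view[name_size + 1:]                # skip length byte, name and null
    return paths


# Hypothetical sample: two names, 'abc' and 'de'
sample = b'\x04abc\x00\x03de\x00'
print(read_folder_paths(sample, 2))
```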

Martijn Pieters
  • Thanks indeed :) I guess I could use split and then add `del file_names_block`. As for the other one, shouldn't it be `file_records_block[name_size:]`? I wonder how fast the memory (of the underlying buffer) is freed after this last operation, but this is for the C guys :) – Mr_and_Mrs_D Oct 18 '16 at 17:17
  • @Mr_and_Mrs_D: you can use `file_records_block.release()` after the loop to free the `bytes` object. – Martijn Pieters Oct 18 '16 at 17:18
  • @Mr_and_Mrs_D: I'm not sure what you are asking about regarding `file_records_block[name_size:]` though.. – Martijn Pieters Oct 18 '16 at 17:20
  • ooooops - file_records_block[name_size+1:] sorry afternoon nap ! In all do you think my approach buys me anything ? Is it a good way of reading a structured binary file ? – Mr_and_Mrs_D Oct 18 '16 at 17:24
  • @Mr_and_Mrs_D: nope, it is `name_size` due to zero-based indexing. So when `name_size` is, say, 4, at index `0` there'll be the `\x04` byte, then indices 1, 2 and 3 are the character bytes. And at index 4 the next string starts. – Martijn Pieters Oct 18 '16 at 17:37
  • Yes in this particular file the name_size includes the null terminating byte - so in your example in position 4 there is this null byte - but case closed :) – Mr_and_Mrs_D Oct 18 '16 at 17:47
  • @Mr_and_Mrs_D: ah, misunderstood the structure there then. I thought they were Pascal strings (single-byte length prefix, then length-1 bytes for the string value). Then you do have to slice the next block at `name_size + 1`. – Martijn Pieters Oct 18 '16 at 17:53