I'm working on a project where I identify all the unique occurrences of fixed-size blocks within a binary file and then save the result to a binary file (it needs to work across multiple languages).

My approach is the following: I read each block of the file, hash it, and store the unique hashes and the block's bytes in a dictionary. Each time the program sees a repeated hash, it appends the position for later reconstruction. An example of the resulting dictionary is shown below:

dict = {'d59fce39b5d8d4b278acbf2f5be0353c': [b'\xc5\xd7\x14\x84', 0, 1, 4],
        'bf937a85a0f950f431a4c9c1aeca8686': [b'\x08\xe7\x07\x8f', 2, 3, 5]}

Then, I'm using with open('out.data', 'wb') as f: to save the file to disk (f.write(dict)), but I get the following error:

TypeError: a bytes-like object is required, not 'dict'

Other solutions I found here didn't help me. I tried serializing the dictionary to JSON, as suggested here, but got:

new_dict = json.dumps(dict)

TypeError: Object of type 'bytes' is not JSON serializable

I'm working with arbitrary bytes, so encoding does not seem like a solution to this issue.

  • Those `bytes` strings like `b'\xc5\xd7\x14\x84'` can easily be converted to hex, just like the hash values have been. But I'm curious, what's the connection between those bytes strings and the fixed size blocks in your binary file? There may be a more efficient way to store this data. How many repeated positions do you typically find for each block? – PM 2Ring Aug 04 '18 at 19:49
  • BTW, it's not a good idea to use `dict` as a variable name because that shadows the built-in `dict` type. – PM 2Ring Aug 04 '18 at 19:51
  • FWIW: "byte arrays" in JSON are often Base64-encoded. This extra encoding is valid, if not always [sufficiently] efficient. – user2864740 Aug 04 '18 at 19:53
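
To illustrate the Base64 suggestion from the comments, here is a minimal sketch, assuming the dictionary layout from the question (block bytes first, positions after). The variable name `blocks` and the helpers `to_jsonable` and `from_jsonable` are hypothetical names chosen for this example, and the `"block"` / `"positions"` keys are an arbitrary choice of JSON layout.

```python
import base64
import json

# Same data as in the question, renamed to avoid shadowing the built-in dict.
blocks = {
    'd59fce39b5d8d4b278acbf2f5be0353c': [b'\xc5\xd7\x14\x84', 0, 1, 4],
    'bf937a85a0f950f431a4c9c1aeca8686': [b'\x08\xe7\x07\x8f', 2, 3, 5],
}

def to_jsonable(d):
    """Replace each bytes block with Base64 text so json.dumps accepts it."""
    return {
        digest: {
            'block': base64.b64encode(entry[0]).decode('ascii'),
            'positions': entry[1:],
        }
        for digest, entry in d.items()
    }

def from_jsonable(d):
    """Inverse of to_jsonable: decode the Base64 text back to bytes."""
    return {
        digest: [base64.b64decode(entry['block'])] + entry['positions']
        for digest, entry in d.items()
    }

text = json.dumps(to_jsonable(blocks))
restored = from_jsonable(json.loads(text))
```

Any language with a JSON parser and a Base64 decoder should be able to read the result. Hex (`bytes.hex()` and `bytes.fromhex()`) would work the same way at the cost of somewhat larger output.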

1 Answer

Have you tried pickle?

import pickle

with open('out.pickle', 'wb') as f:
    pickle.dump(dict, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('out.pickle', 'rb') as f:
    b_dict = pickle.load(f)

# Check that the dict loaded from disk equals the one in memory
print(dict == b_dict)
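
One caveat: the pickle format is specific to Python, so it may not meet the "work across multiple languages" requirement in the question. As a rough sketch of a language-neutral alternative, the entries could be written with `struct` in a fixed little-endian layout. The layout below (block size and entry count header, then digest length, digest, block, position count, and positions per entry) is an assumption made up for this example, not an established format, and `write_blocks` / `read_blocks` are hypothetical helper names.

```python
import io
import struct

blocks = {
    'd59fce39b5d8d4b278acbf2f5be0353c': [b'\xc5\xd7\x14\x84', 0, 1, 4],
    'bf937a85a0f950f431a4c9c1aeca8686': [b'\x08\xe7\x07\x8f', 2, 3, 5],
}

def write_blocks(f, d, block_size):
    # Header: block size and number of entries, both little-endian uint32.
    f.write(struct.pack('<II', block_size, len(d)))
    for digest_hex, (block, *positions) in d.items():
        digest = bytes.fromhex(digest_hex)
        f.write(struct.pack('<B', len(digest)))  # digest length in bytes
        f.write(digest)                          # raw digest
        f.write(block)                           # exactly block_size bytes
        f.write(struct.pack('<I', len(positions)))
        f.write(struct.pack('<%dI' % len(positions), *positions))

def read_blocks(f):
    block_size, count = struct.unpack('<II', f.read(8))
    result = {}
    for _ in range(count):
        (digest_len,) = struct.unpack('<B', f.read(1))
        digest = f.read(digest_len)
        block = f.read(block_size)
        (n,) = struct.unpack('<I', f.read(4))
        positions = list(struct.unpack('<%dI' % n, f.read(4 * n)))
        result[digest.hex()] = [block] + positions
    return result

# Round trip through an in-memory buffer; a file opened with 'wb' / 'rb'
# can be passed in the same way.
buf = io.BytesIO()
write_blocks(buf, blocks, block_size=4)
buf.seek(0)
restored = read_blocks(buf)
```

This assumes every block has the same length and every position fits in an unsigned 32-bit integer; larger files would need `Q` (uint64) instead of `I`.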
Kenan