I have an incredibly large dict of the shape:

{'rounding': (4, [(900, 1), (4406, 0), (5772, 1), (6210, 1)]), 'thee': (5, [(901, 1), (3452, 1), (3803, 1), (4178, 1), (5793, 1)]), 'hotdog': (13, [(902, 2), (902, 2), (996, 1), (1765, 1), (2602, 1), (3824, 1), (4701, 1), (4924, 1), (5544, 1), (5741, 1), (5984, 1), (6972, 1), (7236, 2), (7236, 2), (7469, 1)]), 'hotdogs': (9, [(902, 1), (1765, 2), (1765, 2), (4924, 0), (5110, 1), (5228, 1), (6883, 1), (7034, 1), (7236, 1), (8638, 1)]),} 

It continues like this for roughly 450k terms. I need to write this object to a binary file. I am following these resources:

The code that raises the error is below (the whole program is a few thousand lines):

inverted_index = {word:(document_frequency[word], d[word]) for word in d}
json_dict = json.dumps(inverted_index)
struct.pack(json_dict)

Which yields the error

  File "take_7.py", line 179, in <module>
    main()
  File "take_7.py", line 176, in main
    driver(sys.argv[1])
  File "take_7.py", line 164, in driver
    struct.pack('i', json_dict)
struct.error: bad char in struct format

I tried looking up struct documentation and then tried:

inverted_index = {word:(document_frequency[word], d[word]) for word in d}
json_dict = json.dumps(inverted_index)
binary_file = struct.pack('s', bytes(json_dict, 'utf-8'))

This ran without raising an error. However:

print(binary_file) yields b'{'

And

print(struct.unpack('s', binary_file)) yields (b'{',)
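
A minimal sketch of why only `b'{'` comes back, using nothing beyond the standard `struct` module: a bare `'s'` describes a one-byte field, so a longer bytes value is truncated to fit, and the field width has to be spelled out as a count (for example `'25s'`). The 4-byte length prefix and the short sample string are illustrative assumptions, not part of the original program.

```python
import struct

# Illustrative stand-in for the real JSON string (assumption, not the real data).
text = '{"thee": [5, [[901, 1]]]}'
data = text.encode("utf-8")

# A bare 's' is a 1-byte field, so everything after the first byte is dropped.
one_byte = struct.pack("s", data)

# Spell out the width, and store the length first so it can be read back.
# '<' selects little-endian with no padding; 'I' is a 4-byte unsigned int.
packed = struct.pack(f"<I{len(data)}s", len(data), data)

# Reading back: first the 4-byte length, then that many bytes.
(size,) = struct.unpack_from("<I", packed, 0)
(restored,) = struct.unpack_from(f"<{size}s", packed, 4)

print(one_byte)
print(restored.decode("utf-8") == text)
```

The format string has to describe every byte, which is why `struct` on its own is an awkward fit for a variable-sized structure such as a dict.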

How can I convert my dict (as described above) to a binary file, so that I can save it to disk, and later read it back from disk to be used?

artemis
  • `dumps` produces a large string (unicode in Py3). `struct.pack` with `'s'` and no count packs just **1** byte. Why do you need a binary file? – hpaulj Sep 17 '19 at 19:40
  • I also tried using `p` but that still didn't work. – artemis Sep 17 '19 at 19:41
  • You are entirely misunderstanding what `struct` does - it only handles data of a fixed structure (which you must completely describe via the format string), and produces output of a fixed size. The JSON string is *exactly* what you want to write to a file, and easily read back in - "binary" offers you absolutely no benefits here. – jasonharper Sep 17 '19 at 19:44
  • The requirement is somewhat vague -- just that this `inverted_index` dict structure must be written `as a binary file` to disk (and `subsequently read back in as a binary file`). Saving as a `.json` does not meet that requirement. Looking through https://docs.python.org/3/library/struct.html I did not see any options for dict, so I tried following the examples others had posted. – artemis Sep 17 '19 at 19:46
  • Well, you could use `pickle` instead of `json`, and get a file that technically fulfills this arbitrary requirement. The disadvantage is that you can no longer simply *look at the file contents* to see what's in it, `pickle` is not human-readable at all. – jasonharper Sep 17 '19 at 19:49
  • Cannot use `pickle`, and it needs to be read back in. I have also examined `bytes` and `to_bytes` I think? But could not figure out how to implement. – artemis Sep 17 '19 at 19:51
  • I give up, then, I could suggest other reasonable solutions, but your actual requirements seem to be summed up as "no reasonable solutions allowed". – jasonharper Sep 17 '19 at 19:54
  • Do you consider "write a dict to binary without using pickle that can be read back from disk" as unreasonable? – artemis Sep 17 '19 at 19:55
  • You could pack your data, e.g. `packed = zlib.compress(json.dumps(inverted_index).encode('utf-8'))`, and write to a file in binary mode. – mportes Sep 17 '19 at 21:40
  • I've never heard of `zlib` - can you post an example? – artemis Sep 17 '19 at 21:40
  • I can't, as the question has been marked as duplicate. [zlib](https://docs.python.org/3.7/library/zlib.html) is part of the standard library. It lets you compress bytes objects, and is not related to the core of your question (i.e. serialization of data structures). – mportes Sep 17 '19 at 22:13
  • Another problem is that `json.dumps` converts tuples to lists, so after subsequent `json.loads`, you will have a dictionary that is not equal to the original one. – mportes Sep 17 '19 at 22:17
  • I have another similar question, if you'd like to post there? https://stackoverflow.com/questions/57981513/how-to-convert-unicode-to-4-bit-binary-representation/57982155?noredirect=1#comment102373517_57982155 – artemis Sep 17 '19 at 22:18
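
The `zlib` suggestion from the comments could be sketched roughly as follows. This is one possible approach under stated assumptions, not a definitive answer: the function names `save_index` and `load_index`, the file name, and the 8-byte length header are all hypothetical choices. The loader also converts the lists produced by `json.loads` back into tuples, which addresses the tuple-to-list caveat raised above.

```python
import json
import os
import struct
import tempfile
import zlib


def save_index(index, path):
    """Write the dict as zlib-compressed JSON, preceded by an 8-byte length."""
    payload = zlib.compress(json.dumps(index).encode("utf-8"))
    with open(path, "wb") as f:          # 'wb' = binary mode
        f.write(struct.pack("<Q", len(payload)))
        f.write(payload)


def load_index(path):
    """Read the file back and rebuild the original tuple structure."""
    with open(path, "rb") as f:
        (size,) = struct.unpack("<Q", f.read(8))
        payload = f.read(size)
    raw = json.loads(zlib.decompress(payload).decode("utf-8"))
    # JSON has no tuple type, so convert the decoded lists back to tuples.
    return {
        word: (doc_freq, [tuple(posting) for posting in postings])
        for word, (doc_freq, postings) in raw.items()
    }


# Small sample in the shape shown in the question.
sample = {
    "rounding": (4, [(900, 1), (4406, 0), (5772, 1), (6210, 1)]),
    "thee": (5, [(901, 1), (3452, 1), (3803, 1), (4178, 1), (5793, 1)]),
}
path = os.path.join(tempfile.mkdtemp(), "inverted_index.bin")
save_index(sample, path)
restored = load_index(path)
print(restored == sample)
```

The length header is not strictly required when the file holds a single payload. It is included only to show where `struct` fits naturally: packing a fixed-size field in front of variable-sized data.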

0 Answers