
I have 30 gzip files that need to be deserialized. I used the following code to deserialize them:

def deserialize(f):
    retval = {}
    while True:
        content = f.read(struct.calcsize('L'))
        if not content: break
        k_len = struct.unpack('L', content)[0]
        k_bstr = f.read(k_len)
        k = k_bstr.decode('utf-8')
        v_len = struct.unpack('L', f.read(struct.calcsize('L')))[0]
        v_bytes = io.BytesIO(f.read(v_len))
        v = numpy.load(v_bytes, allow_pickle=True)
        retval[k] = v.item()
    return retval


for i in range(0,26):

    with gzip.open('Files/company'+str(i)+'.zip','rb') as f:
        curdic1 = deserialize(f)
    n = 0
    for key in curdic1:
        n = n + 1
        company = curdic1[key]
        if (n % 10000 == 1):
            print(i, key)

But it gives me the following exception during deserialization:

    k_bstr = f.read(k_len)
  File "/usr/lib/python3.5/gzip.py", line 274, in read
    return self._buffer.read(size)
MemoryError

In addition, each file is smaller than 4 MB. So what is the problem with this code?

Edit: sample file

Edit: this is the serialize method, in case it helps to clarify:

def serialize(f, content):
    for k,v in content.items():
        # write length of key, followed by key as string
        k_bstr = k.encode('utf-8')
        f.write(struct.pack('L', len(k_bstr)))
        f.write(k_bstr)
        # write length of value, followed by value in numpy.save format
        memfile = io.BytesIO()
        numpy.save(memfile, v)
        f.write(struct.pack('L', memfile.tell()))
        f.write(memfile.getvalue())
MSepehr
  • Probably unrelated to your problem, but the filenames suggest `zip` compression (requiring the `zipfile` module rather than `gzip`). – Seb Dec 20 '19 at 17:02
  • @Seb no its not the problem – MSepehr Dec 20 '19 at 17:05
  • Put in an `assert` sanity check after reading `k_len`. Maybe the input file is corrupt. – martineau Dec 20 '19 at 18:13
  • @martineau can you explain more? – MSepehr Dec 20 '19 at 18:49
  • `assert klen <= reasonable_upper_limit` – martineau Dec 20 '19 at 18:52
  • @martineau k_len is "3472330498737438728" – MSepehr Dec 20 '19 at 18:57
  • @martineau the problem is that I don't know how to read this file line by line and decode it as utf-8. One line cannot be converted to utf-8 – MSepehr Dec 20 '19 at 18:58
  • I have no idea what's in (or at least supposed to be in) the gzip file, so it's impossible to help. If you add a description of that to your question as well as a link to a small test file, then maybe someone can help you further. – martineau Dec 20 '19 at 19:28
  • @martineau ok i will attach a file – MSepehr Dec 20 '19 at 19:31
  • I need a description of what's in the file. Is it just a gzipped text file? If not, please [edit] your question and add that information (not just the file). – martineau Dec 20 '19 at 19:34
  • @martineau its a dictionary structure – MSepehr Dec 20 '19 at 19:45
  • You can't store Python dictionaries directly in a file, so there must be more to it. Is it a gzipped `pickle` of a dictionary or something else. – martineau Dec 20 '19 at 19:49
  • @martineau i am not sure its a dictionary ! – MSepehr Dec 21 '19 at 00:27
  • You seem to have some notion of what's in the file. How is it being created? Knowing what you're dealing with is crucial especially if you're having problems—like right now. – martineau Dec 21 '19 at 00:30
  • @martineau I can guess what it is, but that's not the problem. The problem is in this line: k_bstr = f.read(k_len). It seems k_bstr cannot hold a string with k_len=3472330498737438728! I don't know how I should handle that – MSepehr Dec 21 '19 at 00:36
  • @martineau I added the serialize method to clarify – MSepehr Dec 21 '19 at 00:39
  • Instead of `'utf-8'`, try using `'latin1'` when serializing and de-serializing. This is the proper way to handle binary data. See [Serializing binary data in Python](https://stackoverflow.com/questions/22621143/serializing-binary-data-in-python/22621777#22621777) – martineau Dec 21 '19 at 00:46
  • @martineau I don't have access to the data as it was before serialization. If I had it, I would never have needed deserialization. And my data is in Persian, so I can't serialize it with latin1 – MSepehr Dec 21 '19 at 00:53
  • The `struct` module is generally used for read and writing binary data, so using it to read text encoded in utf-8 is weird. – martineau Dec 21 '19 at 00:58
  • I think the problem is your `deserialize()` function isn't reading what the `serialize()` function produces properly. My advice is work on standalone code that does nothing but round-trip (write and then read back) the data properly. – martineau Dec 21 '19 at 01:14
  • @martineau you mean this data was not serialized with this serialize code? – MSepehr Dec 21 '19 at 01:16
  • No, I meant the `deserialize()` function isn't properly doing what it is supposed to do (independent from the fact that the data is getting gzipped). So work on just getting that part working in isolation. Specifically I don't think the `serialize()` function is writing what `deserialize()` reads in as `content`; in other words, what's being written doesn't match what is being read, which could cause the problem you're seeing. – martineau Dec 21 '19 at 01:58
  • The `serialize` function does produce output that can be read with this `deserialize` function. The problem is that the gzip files were produced on a 32-bit machine with a different size of `L`, so the data structure doesn't match up. Explicitly specifying ` – Seb Dec 21 '19 at 14:54

1 Answer


I inspected your sample file and found that the length fields were not encoded as L but as <L. My guess is that they were serialised on a 32-bit platform, where the native size of L equals the standard size of 4 bytes, whereas you are running the deserialising function on a 64-bit platform where the native size of L is 8 bytes. Each length read therefore consumes 4 bytes too many, which produces an absurdly large k_len and hence the MemoryError. So the function should be:

import struct, io
import numpy as np

def deserialize(f):
    retval = {}
    while True:
        content = f.read(struct.calcsize('<L'))
        if not content: break
        k_len = struct.unpack('<L', content)[0]
        k_bstr = f.read(k_len)
        k = k_bstr.decode('utf-8')
        v_len = struct.unpack('<L', f.read(struct.calcsize('<L')))[0]
        v_bytes = io.BytesIO(f.read(v_len))
        v = np.load(v_bytes, allow_pickle=True)
        retval[k] = v.item()
    return retval

Part of the deserialised output of your sample file:

{'12000001': {'NID': '',
  'companyid': '12000001',
  'newspaperdate': '۱۳۸۵/۶/۲۰',
  'indikatornumber': '۱۸۹۶۲',
  'newsdate': None,
  'newstitle': 'آگهی تاسیس شرکت فنی مهندسی آریا\u200cپژوه گرمسار (سهامی خاص)',
  'persons': [],
  'subjects': ['انجام',
   'کلیه',
   'خدمات',
   'ترویج',
   'آموزش',
   [...]
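
To illustrate where the huge k_len from the comments comes from, here is a minimal sketch. It assumes the writer produced a 4-byte little-endian length followed by the key, and that the reader is a little-endian 64-bit platform where native L is 8 bytes. The faulty read is simulated with '<Q' (my choice, not part of the original code) so the demo behaves the same on every platform.

```python
import struct

# Bytes as the 32-bit writer would have produced them for the first key in
# the sample output: a 4-byte little-endian length, then the UTF-8 key.
key = '12000001'.encode('utf-8')
record = struct.pack('<L', len(key)) + key

# Standard size: always 4 bytes, regardless of platform.
size_standard = struct.calcsize('<L')
# Native size: platform dependent (8 on typical 64-bit Linux, 4 on Windows).
size_native = struct.calcsize('L')
print(size_standard, size_native)

# Correct read: the length prefix is 4 bytes.
k_len_ok = struct.unpack('<L', record[:4])[0]

# Simulated faulty read: 8 bytes are consumed, so the first 4 bytes of the
# key ('1200') end up in the upper half of the length.
k_len_bad = struct.unpack('<Q', record[:8])[0]

print(k_len_ok)   # 8
print(k_len_bad)  # 3472330498737438728
```

The second value is exactly the k_len reported in the comments, which is consistent with the size mismatch being the cause: f.read(k_len) then asks gzip for roughly 3.4 exabytes.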
Seb