I have a set of 1000 documents, encoded and compressed, stored in an lsm-db on my computer. When I try to decompress and decode them, I get an "incorrect header check" error.
This is what I'm doing:
for key in my_lsm_db.keys():
    print key, zlib.decompress(my_lsm_db[key], zlib.MAX_WBITS|32).decode('utf-8')
After processing a few keys, the code throws this error: error: Error -3 while decompressing data: incorrect header check
I want to remove all such error-generating documents from the corpus. How can I identify the documents that raise the error, so that I can remove them?
def remove_docs(my_lsm_db):
    for key in my_lsm_db.keys():
        ## write code that identifies an error when generated
        if <code that identifies document generating error>:
            del my_lsm_db[key]
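One possible way to fill in the placeholder above is to wrap the decompress-and-decode step in a try/except and treat any key that raises as a bad document. The following is only a minimal sketch: the helper name `find_bad_keys` and the `sample_db` dict are hypothetical, the dict stands in for the real lsm-db object (assumed to support `.keys()`, `db[key]`, and `del db[key]` as in the code above), and catching `UnicodeDecodeError` in addition to `zlib.error` is an extra assumption beyond the error reported here.

```python
import zlib


def find_bad_keys(db):
    """Return the keys whose values cannot be decompressed and decoded."""
    bad_keys = []
    for key in db.keys():
        try:
            # Same call as in the question: auto-detect a zlib or gzip header.
            zlib.decompress(db[key], zlib.MAX_WBITS | 32).decode('utf-8')
        except (zlib.error, UnicodeDecodeError):
            # zlib.error covers "incorrect header check"; UnicodeDecodeError
            # covers data that decompresses but is not valid UTF-8 (assumption).
            bad_keys.append(key)
    return bad_keys


def remove_docs(db):
    """Delete every document that fails to decompress; return the removed keys."""
    # Collect the keys first, then delete, so the store is not modified
    # while it is being iterated over.
    bad_keys = find_bad_keys(db)
    for key in bad_keys:
        del db[key]
    return bad_keys


# Hypothetical stand-in for the database: a plain dict with one valid
# and one deliberately corrupt entry.
sample_db = {
    b'good': zlib.compress(u'hello'.encode('utf-8')),
    b'bad': b'this is not compressed data',
}
removed = remove_docs(sample_db)
```

Running `find_bad_keys` on its own first, without deleting anything, makes it possible to inspect the failing values before they are removed from the corpus.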
Here's some information on zlib and the MAX_WBITS part of the code: Zlib Compression, Stack Overflow Answer for Zlib Automatic Header Detection
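To illustrate what the `zlib.MAX_WBITS|32` argument does, here is a small sketch with made-up sample data. It shows that automatic header detection accepts both zlib-wrapped and gzip-wrapped data, while a raw deflate stream (no header at all) is rejected. Whether the failing documents in the database are actually raw deflate, truncated, or something else entirely is not known; the raw stream is used here only as one example of input without a valid header.

```python
import zlib

text = u'sample document'.encode('utf-8')

# zlib-wrapped data (what zlib.compress produces by default).
zlib_blob = zlib.compress(text)

# gzip-wrapped data: wbits = MAX_WBITS | 16 asks for a gzip header and trailer.
gz = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED,
                      zlib.MAX_WBITS | 16)
gzip_blob = gz.compress(text) + gz.flush()

# Raw deflate stream: negative wbits means no header and no checksum.
raw = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED,
                       -zlib.MAX_WBITS)
raw_blob = raw.compress(text) + raw.flush()

auto = zlib.MAX_WBITS | 32  # accept either a zlib or a gzip header

zlib_ok = zlib.decompress(zlib_blob, auto) == text
gzip_ok = zlib.decompress(gzip_blob, auto) == text

# Without a recognizable header, automatic detection raises zlib.error.
try:
    zlib.decompress(raw_blob, auto)
    raw_error = None
except zlib.error as exc:
    raw_error = str(exc)

# The same bytes decompress fine when raw deflate is requested explicitly.
raw_ok = zlib.decompress(raw_blob, -zlib.MAX_WBITS) == text
```

If some stored values turn out to be in a different format, retrying them with another `wbits` value before deleting them may recover the documents instead of losing them.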