Getting rid of error generating documents from a corpus

Question

I have a set of 1000 documents, encoded and compressed, stored in an lsm-db on my computer. When I try to decompress and decode them, I get an "incorrect header check" error.

This is what I'm doing:

import zlib

for key in my_lsm_db.keys():
    print key, zlib.decompress(my_lsm_db[key], zlib.MAX_WBITS|32).decode('utf-8')

The code processes a few keys and then fails with: error: Error -3 while decompressing data: incorrect header check

I want to remove all such error-generating documents from the corpus. How can I identify the documents that raise the error so that I can remove them?

def remove_docs(my_lsm_db):
    for key in my_lsm_db.keys():
        ## write code that identifies an error when generated
        if <code that identifies document generating error>:
            del my_lsm_db[key]

Here is some background on zlib and the MAX_WBITS part of the code: Zlib Compression, Stack Overflow Answer for Zlib Automatic Header Detection

score 0 · Accepted Answer · answered Apr 28 '17 at 15:39

I wrapped my code in a try/except block to get past the error-generating documents. The pattern is not specific to the code above; it works for other operations that may raise an exception as well.

try:
    <code to execute>
except (<list of errors>) as e:
    print e
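Applied to the remove_docs function from the question, the same try/except idea might look like the sketch below. It is written for Python 3 (the snippets above use Python 2 print syntax), and it makes a few assumptions that are not in the original post: a plain dict called demo_db stands in for the lsm-db, find_bad_keys is a hypothetical helper name, and the only failures of interest are zlib.error and UnicodeDecodeError. The bad keys are collected first and deleted afterwards, so the store is never modified while it is being iterated.

```python
import zlib


def find_bad_keys(db):
    """Return the keys whose values fail to decompress or to decode as UTF-8."""
    bad = []
    for key in list(db.keys()):
        try:
            # Same call as in the question: auto-detect a zlib or gzip header.
            zlib.decompress(db[key], zlib.MAX_WBITS | 32).decode('utf-8')
        except (zlib.error, UnicodeDecodeError):
            bad.append(key)
    return bad


def remove_docs(db):
    """Delete every document that cannot be decompressed and decoded."""
    bad = find_bad_keys(db)
    for key in bad:
        del db[key]
    return bad


# Hypothetical stand-in for the lsm-db: a dict with the same key/value access.
demo_db = {
    'good': zlib.compress('hello'.encode('utf-8')),
    'corrupt': b'not zlib data',             # fails in zlib.decompress
    'bad_utf8': zlib.compress(b'\xff\xfe'),  # decompresses, fails in decode
}
removed = remove_docs(demo_db)
```

Calling find_bad_keys on its own first gives you a chance to inspect the list of offending keys before anything is deleted.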

Minu