I have been trying to extract the text data from Common Crawl's WET files. I am currently using the warc parser by the Internet Archive: https://github.com/internetarchive/warc
import warc
w = warc.open(fileName)
for record in w:
    text = record.payload.read()
But this method returns less than half of the data that is actually in each payload. Is there a better method that returns all of the data in every record's payload in a file?
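For comparison, here is a minimal sketch of reading the records by hand with only the standard library, so the payload size can be checked against each record's Content-Length header. It assumes the WET file is a series of gzip members, each holding a `WARC/...` version line, header lines, a blank line, and then exactly Content-Length bytes of payload. The names `iter_wet_payloads` and `read_wet_texts` are made up for illustration and are not part of the warc library; this has not been confirmed as the cause of the truncation.

```python
import gzip


def iter_wet_payloads(stream):
    """Yield (headers, payload_bytes) for each record in a WARC/WET byte stream.

    Hypothetical helper: assumes every record carries a Content-Length header
    that counts payload bytes (not characters).
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        # Header lines run until the first blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.partition(b":")
            key = name.strip().decode("utf-8", errors="replace").lower()
            headers[key] = value.strip().decode("utf-8", errors="replace")

        # Read the payload by byte count rather than line by line.
        length = int(headers["content-length"])
        payload = stream.read(length)
        yield headers, payload


def read_wet_texts(path):
    """Yield the decoded text of every record in a gzipped WET file."""
    # gzip.open reads through concatenated gzip members transparently.
    with gzip.open(path, "rb") as stream:
        for headers, payload in iter_wet_payloads(stream):
            yield payload.decode("utf-8", errors="replace")
```

Calling `read_wet_texts(fileName)` and comparing `len(payload)` with the record's `content-length` value should show whether any bytes are being dropped. Reading by byte count matters because the text is UTF-8, so the number of characters can be smaller than the number of bytes.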