How can one extract every payload from warc.wet.gz?

Question

I have been trying to extract the text data from Common Crawl's WET files. I am currently using the warc parser by the Internet Archive: https://github.com/internetarchive/warc

import warc
w = warc.open(fileName)
for record in w:
  text = record.payload.read()

But this method returns less than half of the data that is in the payloads. Is there a better method that returns all the data in every payload in a file?

Suppose there are in total 100K records, this method will only give us about 45K records — lorenzofeliz, May 09 '16 at 06:14
http://stackoverflow.com/questions/36173786/python-cannot-read-warc-gz-file-completely Check this out. I'm guessing you are facing the same problem as mentioned here too - https://github.com/internetarchive/warc/issues/21 — Derek Chia, May 09 '16 at 16:24
Nope, I never got around it. Instead I cleaned the raw file reading line by line. — lorenzofeliz, Aug 15 '17 at 17:27

score 0 · Answer 1 · answered Feb 17 '22 at 09:02

The warc library has a bug in its gzip handling that causes it to fail to read the entire WET file. To work around the bug, use Python's gzip library to decompress the file stream on the fly and hand warc the already-decompressed stream, as below:

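import gzip
import warc
# Let Python's gzip module do the decompression; tell warc not to decompress again.
gzip_fobj = gzip.open(wet_file, "rb")
warc_fobj = warc.WARCFile(fileobj=gzip_fobj, compress=False)
for record in warc_fobj:
  text = record.payload.read()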
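
If you would rather not depend on the warc package at all, the line-by-line approach mentioned in the comments can be sketched with the standard library only. This is a minimal sketch, not a full WARC parser: the function name iter_wet_payloads is made up for illustration, and it assumes each record starts with a "WARC/" version line, has a Content-Length header, and is followed by blank separator lines. Python's gzip module reads files made of several concatenated gzip members as one stream, so the whole file is visited.

```python
import gzip


def iter_wet_payloads(path):
    """Yield (headers, payload_bytes) for each record in a gzipped WET file.

    Hypothetical helper: assumes every record begins with a "WARC/" line
    and declares its payload size in a Content-Length header.
    """
    with gzip.open(path, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                return  # end of file
            if not line.startswith(b"WARC/"):
                continue  # skip blank separators and anything unexpected
            # Read "Name: value" header lines up to the blank line.
            headers = {}
            while True:
                line = stream.readline()
                if not line.strip():
                    break
                name, _, value = line.decode("utf-8", "replace").partition(":")
                headers[name.strip()] = value.strip()
            # The payload is exactly Content-Length bytes long.
            length = int(headers.get("Content-Length", 0))
            yield headers, stream.read(length)
```

Usage would look like `for headers, payload in iter_wet_payloads(wet_file): text = payload.decode("utf-8", "replace")`. Counting the yielded records and comparing against the number of lines starting with "WARC/" in the decompressed file is a quick way to check that nothing was dropped.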
