Read warc file with python

Question

I want to read a WARC file. I wrote the following code based on this page, but nothing was printed:

>>> import warc
>>> f = warc.open("01.warc.gz")
>>> for record in f:
...     print record['WARC-Target-URI'], record['Content-Length']

However, when I ran the following command, I got a result:

>>> print f
<warc.warc.WARCFile instance at 0x0000000002C7DE88>

Note that my WARC file is one of the files from the ClueWeb09 dataset. I mention it because of this page.

It looks like the accepted answer to the question you linked to has a solution. Did you try that? — cco, Oct 18 '16 at 05:30

score 2 · Answer 1 · edited Mar 16 '17 at 18:28

I had the same problem as you.

After some research on the module, I found a solution.

Try using record.payload.read(). Here is a full example:

import warc
f = warc.open("01.warc.gz")
for record in f:
  print record.payload.read()

Also, you can read not only WARC files this way but WET files too. A small cheat is to rename the file so that its name contains .warc
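The rename trick could be scripted roughly like this. It is only a sketch: the helper name copy_with_warc_name is made up for illustration, it copies the file instead of renaming it so the original stays intact, and it assumes the module chooses its reader from the file name, as described above.

```python
import os
import shutil


def copy_with_warc_name(path):
    """Copy a WET file to a sibling file whose name contains '.warc'.

    Hypothetical helper: returns the path unchanged if the name already
    contains '.warc', otherwise swaps '.wet' for '.warc' in the file name.
    """
    directory, name = os.path.split(path)
    if ".warc" in name:
        return path
    if ".wet" not in name:
        raise ValueError("expected '.wet' in the file name: %r" % name)
    new_path = os.path.join(directory, name.replace(".wet", ".warc"))
    shutil.copyfile(path, new_path)
    return new_path
```

The returned path can then be passed to warc.open() in place of the original one.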

Kind regards

edited Mar 16 '17 at 18:28 by AP. · answered Mar 16 '17 at 16:14 by Oleg Mykolaichenko

score 1 · Answer 2 · answered Jan 21 '18 at 13:25

First of all, WARC (Web ARChive) is an archival format for web pages. Reading a WARC file is a bit tricky because each record starts with a special header. The code below assumes your WARC file is of this format.

You can use the following code to load, parse and return a dictionary for every record containing the metadata and the content.

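def read_header(file_handler):
    """Read 'Key: value' lines up to the blank line that ends the header."""
    header = {}
    for line in file_handler:
        if line == '\n':
            break
        key, value = line.split(': ', 1)
        header[key] = value.rstrip()
    return header


def warc_records(path):
    """Yield one dict per record that has a WARC-Refers-To header."""
    with open(path) as fh:
        # A for loop stops cleanly at end of file. Calling next() in a
        # "while True" loop inside a generator raises RuntimeError on
        # Python 3.7+ once the file is exhausted.
        for line in fh:
            if line == 'WARC/1.0\n':
                output = read_header(fh)
                if 'WARC-Refers-To' not in output:
                    continue
                # Note: only the first line of the record body is kept.
                output["Content"] = next(fh, '')
                yield output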

You can access the dictionary as follows:

>>> records = warc_records("<some path>")
>>> next_record = next(records)
>>> sorted(next_record.keys())
['Content', 'Content-Length', 'Content-Type', 'WARC-Block-Digest', 'WARC-Date', 'WARC-Record-ID', 'WARC-Refers-To', 'WARC-Target-URI', 'WARC-Type', 'WARC-Warcinfo-ID']
>>> next_record['WARC-Date']
'2013-06-20T00:32:15Z'
>>> next_record['WARC-Target-URI']
'http://09231204.tumblr.com/post/44534196170/high-res-new-photos-of-the-cast-of-neilhimself'
>>> next_record['Content'][:30]
'Side Effects high res. New pho'
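The code above reads an uncompressed text file and keeps only the first line of each record body. If you need the whole body, or want to read a .warc.gz directly, something along these lines might work. This is a sketch using only the standard library on Python 3; the name iter_warc_records is made up, and it assumes every record carries a correct Content-Length header.

```python
import gzip


def iter_warc_records(path):
    """Yield (headers, body) for each record in a WARC file.

    Sketch only: assumes each record has a correct Content-Length header.
    headers is a dict of str, body is the raw record block as bytes.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as fh:
        for line in fh:
            # Skip separator lines until a version line such as WARC/1.0.
            if not line.startswith(b"WARC/"):
                continue
            headers = {}
            for raw in fh:
                raw = raw.strip()
                if not raw:  # a blank line ends the header block
                    break
                text = raw.decode("utf-8", "replace")
                key, _, value = text.partition(":")
                headers[key.strip()] = value.strip()
            # Read the full body by length instead of one line.
            body = fh.read(int(headers.get("Content-Length", 0)))
            yield headers, body
```

Each item is a pair, so a loop such as "for headers, body in iter_warc_records(path)" gives you the metadata dictionary and the raw bytes of the record together.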
