Extracting headers from WARC.gz file

Question

I have searched the site extensively but could not find what I need. I have a web.warc.gz file and need to extract the WARC headers from it. I installed Tomcat and Wayback (1.6) and tried to do this with the ./warc-header script that ships with Wayback, but I keep getting a usage message for the arguments I pass:

Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz \r\n\ 
~/Desktop/output.csv type \r\n
      USAGE: tgtWarc fieldsSrc id
        tgtWarc is the path to the target WARC.gz
          fieldsSrc is the path to the text of the record
    make sure each line is terminated by \r\n
    and that the file ends with a blank, \r\n terminiated line
id is the XXX in:
    Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
    of the header record... header...

Or another type of error:

   Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz 
    ~/Desktop/output.csv Content-Type
    java.io.IOException: End-Of-Stream before \r\n\r\n End-Of-ANVLRecord:

at org.archive.util.anvl.ANVLRecord.load(ANVLRecord.java:163)
at org.archive.wayback.util.WARCHeader.writeHeaderRecord(WARCHeader.java:43)
at org.archive.wayback.util.WARCHeader.main(WARCHeader.java:75)

I am fairly sure the problem is the format of the arguments I am passing on the command line, but I still can't get it right. Can anyone help?

score 1 · Answer 1 · answered Apr 02 '15 at 11:23

You can extract the headers using the code in the following GitHub project:

https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java
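In case the link rots, here is a minimal sketch of the same idea using only the Python standard library. It is not the linked Java code, and it assumes a well-formed file where every record starts with a `WARC/` version line and carries a `Content-Length` header. Judging by the usage text and the `writeHeaderRecord` frame in the stack trace, `warc-header` appears to write a header record into a WARC rather than read headers out, which would explain why it does not do what the question wants. The function names `iter_warc_headers` and `write_headers_csv` are made up for this example.

```python
import csv
import gzip


def iter_warc_headers(path):
    """Yield (version, headers) for each record in a gzip-compressed WARC file.

    Assumptions: named fields fit on one line (folded continuation lines are
    not handled) and a repeated field name keeps only its last value.
    """
    # gzip.open reads concatenated gzip members (one per record) as one stream.
    with gzip.open(path, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                return  # end of file
            if not line.startswith(b"WARC/"):
                continue  # blank separator lines between records
            version = line.strip().decode("utf-8", "replace")
            headers = {}
            while True:
                line = stream.readline()
                if line in (b"\r\n", b"\n", b""):
                    break  # blank line ends the header block
                name, _, value = line.decode("utf-8", "replace").partition(":")
                headers[name.strip()] = value.strip()
            # Skip the record body so payload bytes are never parsed as headers.
            stream.read(int(headers.get("Content-Length", 0)))
            yield version, headers


def write_headers_csv(warc_path, csv_path, fields):
    """Write the chosen header fields of every record to a CSV file."""
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(fields)
        for _version, headers in iter_warc_headers(warc_path):
            writer.writerow([headers.get(field, "") for field in fields])
```

For the file in the question, something like `write_headers_csv("WEB.WARC.gz", "output.csv", ["WARC-Type", "Content-Type"])` should produce one CSV row per record. For very large records you would want to skip the body in chunks instead of a single `read` call.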

answered Apr 02 '15 at 11:23

Vanaja Jayaraman

While this link may answer the question, it might be a good idea to put the relevant code (and an explanation) rather than relying on a link that could rot. – TheTechRobo the Nerd Apr 01 '22 at 23:00
