Questions tagged [warc]

Use this tag for questions related to Web ARChive files.

WARC (Web ARChive) is an extension of the ARC file format.

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format.
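As a rough illustration of "combining multiple resources into one aggregate file", here is a minimal sketch that serializes two resources as WARC/1.0 `resource` records using only the Python standard library. The helper names `build_warc_record` and `build_archive` and the example.com URLs are made up for this example; real tools also write `warcinfo`, `request`, and `response` records and usually gzip each record.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, payload: bytes) -> bytes:
    """Serialize one 'resource' record (hypothetical helper, not a library API).

    Layout: version line, named header fields, blank line, block, two CRLFs.
    """
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/plain"),
        # Content-Length counts the bytes of the block only
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def build_archive(resources) -> bytes:
    """Concatenate records: an aggregate archive is just records back to back."""
    return b"".join(build_warc_record(uri, body) for uri, body in resources)


# Placeholder resources, assumed for illustration only
archive = build_archive([
    ("http://example.com/a.txt", b"first resource"),
    ("http://example.com/b.txt", b"second"),
])
```

Writing `archive` to a file opened in binary mode would give an uncompressed `.warc` file containing two records.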

WIKI: https://en.wikipedia.org/wiki/Web_ARChive

Python: https://warc.readthedocs.io/en/latest/

58 questions

3 answers

How can I parse a WARC file?

I downloaded the ClueWeb09_English_Sample.warc file from this page, then wrote the data of the WARC file to a text file using the code given on the following web page. I want to parse the text file to get at the content of the pages in the…

java warc

asked Nov 26 '14 at 15:24

user3487667

2 answers

Open a WARC file with Python

I'm trying to open a WARC file with Python using the toolbox from the following link: http://warc.readthedocs.org/en/latest/ When opening the file with: import warc f = warc.open("00.warc.gz") everything is fine and the f object…

python-2.7 warc

asked Sep 11 '14 at 10:16

user3383348

1 answer

How to read a .webarchive file in Android

I have a requirement like this: I want to read a .webarchive file. I have one file with the .webarchive extension and I have put it in the assets folder. I want to open that file in an Android WebView. Is it possible? I googled and found some useful links.…

android git webview webarchive warc

asked Nov 28 '13 at 06:15

Tombeau

0 answers

How can I convert a WARC file to a single page HTML file?

Is there a way to convert a WARC file to a single page HTML file similar to the end result of what monolith or SingleFile produce?

warc

asked Jan 18 '21 at 04:39

Nathan

1 answer

Downloading a webpage and associated resources to a WARC in python

I'm interested in downloading a bunch of webpages for later analysis. There are two things that I would like to do: Download the page and associated resources (images, multiple pages associated with an article, etc.) to a WARC file. change all…

python html scrape warc

asked Dec 17 '16 at 03:37

Andrew Spott

0 answers

'Search for pattern exhausted' happens when processing a WARC file in Python 3

I'm trying to fetch some plain text from a WARC dataset (Yahoo! Webscope L2), and keep getting ValueError: Search for pattern exhausted when using the load() function in the Python 3 module warcat. I have tried some random WARC example files and everything…

python python-3.x warc

asked Feb 23 '16 at 14:31

Nriuam

0 answers

Search for a word in all Common Crawl WARC files

I want to search for a word (for example, a company name) in all the WARC files (nearly 36K WARC files) from Common Crawl and get all the URLs that have that company name in their HTML source. I also want to keep those WARC files in S3 itself. Just I…

amazon-s3 solr common-crawl warc large-data

asked Jun 23 '15 at 11:45

Vanaja Jayaraman

1 answer

How to write a streaming MapReduce job for WARC files in Python

I am trying to write a MapReduce job for WARC files using the WARC library for Python. The following code works for me, but I need this code for Hadoop MapReduce jobs. import warc f = warc.open("test.warc.gz") for record in f: print…

python hadoop mapreduce hadoop-streaming warc

asked Jan 23 '14 at 06:53

zahid adeel

2 answers

wget --warc-file --recursive, prevent writing individual files

I run wget to create a warc archive as follows: $ wget --warc-file=/tmp/epfl --recursive --level=1 http://www.epfl.ch/ $ l -h /tmp/epfl.warc.gz -rw-r--r-- 1 david wheel 657K Sep 2 15:18 /tmp/epfl.warc.gz $ find…

wget warc

asked Sep 02 '16 at 13:21

David Portabella

1 answer

Streaming Pattern Matching using Regex

I want to parse a large text file formatted in WARC version 0.9. A sample of such text is here. If you take a look at it, you'll find the whole document consists of a list of the following entries. [Warc Headers] [HTTP Headers] [HTML Content] I…

java regex warc

asked Jan 14 '16 at 16:34

frogatto

1 answer

How can one extract every payload from warc.wet.gz?

I have been trying to extract the text data from Common Crawl's WET files. I am currently using the WARC parser by the Internet Archive https://github.com/internetarchive/warc import warc w = warc.open(fileName) for record in w: text =…

python common-crawl warc

asked Jan 05 '16 at 13:17

lorenzofeliz

1 answer

Extracting headers from WARC.gz file

I have been searching through the site a lot, but could not really find what I need. I have a web.warc.gz file with data in it and I need to extract the WARC headers. I have installed Tomcat and Wayback (1.6) trying to derive that with ./warc-header…

python war warc

asked Feb 21 '14 at 00:30

spashuev

1 answer

Common Crawl Request returns 403 WARC

I am trying to crawl some WARC files from the Common Crawl archives, but I do not seem to get successful requests through to the server. A minimal Python example is provided below to replicate the error. I tried adding the UserAgent in the…

python request common-crawl warc

asked Apr 30 '22 at 15:58

presa

1 answer

Which block represents a WARC-Block-Digest?

At Line 09 below there is this line: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ Line 01: WARC/1.0 Line 02: WARC-Type: request Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/ Line 04: Content-Type:…

common-crawl warc heritrix

asked Aug 13 '21 at 08:08

user16656944
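For context on the question above: as far as the WARC specification describes it, `WARC-Block-Digest` covers the full record block, meaning the `Content-Length` bytes that follow the blank line ending the WARC headers (for a `request` record, that is the captured HTTP request). A minimal sketch of recomputing such a value, assuming the common `sha1:` plus Base32 labelling; `split_record`, `block_digest`, and the sample record are hypothetical, not taken from the question:

```python
import base64
import hashlib


def split_record(record: bytes):
    """Split one uncompressed WARC record into (headers, block)."""
    # WARC headers end at the first blank line; the block follows it.
    head, _, rest = record.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(b":")
        headers[name.decode().strip().lower()] = value.decode().strip()
    # The block is exactly Content-Length bytes long.
    length = int(headers["content-length"])
    return headers, rest[:length]


def block_digest(block: bytes) -> str:
    """Return 'sha1:' followed by the Base32-encoded SHA-1 of the block."""
    return "sha1:" + base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")


# Made-up request record, for illustration only
http_request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
sample_record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: request\r\n"
    b"Content-Length: " + str(len(http_request)).encode() + b"\r\n"
    b"\r\n" + http_request + b"\r\n\r\n"
)
sample_headers, sample_block = split_record(sample_record)
sample_digest = block_digest(sample_block)
```

Comparing `sample_digest` against the record's own `WARC-Block-Digest` header is one way to check that a block was read with the right boundaries.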

1 answer

Splitting a WARC file into chunks based on the header: WARC/1.0 Python

I'm new to programming and am trying to process a WARC file by splitting it into chunks and then storing each chunk in a dictionary. Each chunk should start with the WARC/1.0 header and is separated by 3 empty lines. I also would like to remove the…

python html dictionary file-processing warc

asked Oct 06 '20 at 05:49

Tylie
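One possible way to do the split described in the question above, sketched with only the standard library: cut the text before every line that consists of `WARC/1.0`, trim the empty lines between chunks, and store the chunks in a dictionary keyed by position. The function name `split_warc_text` and the sample text are invented for this example, and it assumes Python 3.7 or newer, where `re.split` can split on empty matches.

```python
import re


def split_warc_text(text: str) -> dict:
    """Split decoded WARC text into chunks that each start with 'WARC/1.0'."""
    # The zero-width lookahead splits just before each marker line,
    # so the marker stays at the start of its chunk.
    parts = re.split(r"(?m)^(?=WARC/1\.0\r?$)", text)
    # strip() removes the empty separator lines around each chunk
    chunks = [part.strip() for part in parts if part.strip()]
    return {index: chunk for index, chunk in enumerate(chunks)}


# Made-up sample with two records separated by three empty lines
sample_text = (
    "WARC/1.0\nWARC-Type: request\n\nbody one\n\n\n\n"
    "WARC/1.0\nWARC-Type: response\n\nbody two\n"
)
warc_chunks = split_warc_text(sample_text)
```

This is fine for small, plain-text files. If a payload itself contains a line reading `WARC/1.0`, the split will be wrong, so reading `Content-Length` bytes per record is the more robust approach.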
