Questions tagged [warc]

Use this tag for questions related to Web ARChive files.

WARC (Web ARChive) is an extension of the ARC file format.

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. The WARC format is a revision of the Internet Archive's ARC File Format.

WIKI: https://en.wikipedia.org/wiki/Web_ARChive

Python: https://warc.readthedocs.io/en/latest/
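To make the "aggregate archive file" idea above concrete, here is a minimal sketch that reads records from an uncompressed WARC stream using only the Python standard library. It is not the API of the `warc` package linked above; `iter_warc_records` and `make_record` are hypothetical helper names, and the sample records are simplified (real WARC records also carry fields such as WARC-Record-ID and WARC-Date). It assumes each record is a version line, header lines, a blank line, and a block of exactly Content-Length bytes.

```python
import io


def iter_warc_records(stream):
    """Yield (version, headers, block) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration; not part of any WARC library.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The block length comes from the Content-Length header.
        block = stream.read(int(headers.get("Content-Length", 0)))
        yield version, headers, block


def make_record(record_type, uri, payload):
    """Build a simplified WARC record for the demo (assumption: minimal headers only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Two in-memory records stand in for a real .warc file.
sample = make_record("resource", "http://example.com/a", b"hello") + make_record(
    "resource", "http://example.com/b", b"web archive"
)
records = list(iter_warc_records(io.BytesIO(sample)))
for version, headers, block in records:
    print(version, headers["WARC-Target-URI"], len(block))
```

For a gzip-compressed file, the same function could in principle be fed a stream from `gzip.open(path, "rb")`, though for real archives a maintained WARC library is likely the safer choice.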

58 questions
8
votes
3 answers

How I can parse a WARC file?

I downloaded the ClueWeb09_English_Sample.warc file from this page, then wrote the data of the WARC file to a text file using the code given on the following web page. I want to parse the text file to get the content of the pages in the…
user3487667
  • 519
  • 5
  • 22
6
votes
2 answers

open warc file with python

I'm trying to open a warc file with python using the toolbox from the following link: http://warc.readthedocs.org/en/latest/ When opening the file with: import warc f = warc.open("00.warc.gz") Everything is fine and the f object…
user3383348
  • 87
  • 1
  • 7
5
votes
1 answer

how to read .webarchive file in android

I have a requirement like this: I want to read a .webarchive file. I have one file with the .webarchive extension and I have put that file in the assets folder. I want to read that file in an Android WebView. Is it possible? I googled and found some useful links.…
Tombeau
  • 353
  • 4
  • 14
4
votes
0 answers

How can I convert a WARC file to a single page HTML file?

Is there a way to convert a WARC file to a single page HTML file similar to the end result of what monolith or SingleFile produce?
Nathan
  • 7,627
  • 11
  • 46
  • 80
4
votes
1 answer

Downloading a webpage and associated resources to a WARC in python

I'm interested in downloading for later analysis a bunch of webpages. There are two things that I would like to do: Download the page and associated resources (images, multiple pages associated with an article, etc) to a WARC file. change all…
Andrew Spott
  • 3,457
  • 8
  • 33
  • 59
4
votes
0 answers

'Search for pattern exhausted' happens when processing WARC file in python3

I'm trying to fetch some plain text from a WARC dataset (yahoo!webscope L2), and keep getting ValueError: Search for pattern exhausted when using the load() function in the Python 3 module warcat. Have tried some random WARC example files and everything…
Nriuam
  • 101
  • 9
4
votes
0 answers

Search a word in all Common Crawl WARC files

I want to search for a word (for example, a company name) in all the WARC files (nearly 36K WARC files) from Common Crawl and get all the URLs that have that company name in their HTML source content. I also want to keep those WARC files in S3 itself. Just I…
Vanaja Jayaraman
  • 753
  • 3
  • 18
4
votes
1 answer

how to write a streaming mapreduce job for warc files in python

I am trying to write a MapReduce job for WARC files using the WARC library for Python. The following code works for me, but I need this code for Hadoop MapReduce jobs. import warc f = warc.open("test.warc.gz") for record in f: print…
zahid adeel
  • 123
  • 4
3
votes
2 answers

wget --warc-file --recursive, prevent writing individual files

I run wget to create a warc archive as follows: $ wget --warc-file=/tmp/epfl --recursive --level=1 http://www.epfl.ch/ $ l -h /tmp/epfl.warc.gz -rw-r--r-- 1 david wheel 657K Sep 2 15:18 /tmp/epfl.warc.gz $ find…
David Portabella
  • 12,390
  • 27
  • 101
  • 182
3
votes
1 answer

Streaming Pattern Matching using Regex

I want to parse a large text file formatted in WARC version 0.9. A sample of such text is here. If you take a look at it, you'll find the whole document consists of a list of the following entries. [Warc Headers] [HTTP Headers] [HTML Content] I…
frogatto
  • 28,539
  • 11
  • 83
  • 129
3
votes
1 answer

How can one extract every payload from warc.wet.gz?

I have been trying to extract the text data from Common Crawl's WET files. I am currently using the warc parser by the Internet Archive https://github.com/internetarchive/warc import warc w = warc.open(fileName) for record in w: text =…
lorenzofeliz
  • 597
  • 6
  • 11
3
votes
1 answer

Extracting headers from WARC.gz file

I have been searching through the site a lot, but could not really find what I need. I have web.warc.gz file with data in it and I need to extract WARC headers. I have installed Tomcat and Wayback (1.6) trying to derive that with ./warc-header…
spashuev
  • 29
  • 4
2
votes
1 answer

Common Crawl Request returns 403 WARC

I am trying to crawl some WARC files from the Common Crawl archives, but I do not seem to get successful requests through to the server. A minimal Python example is provided below to replicate the error. I tried adding the UserAgent in the…
presa
  • 85
  • 1
  • 5
2
votes
1 answer

Which block represents a WARC-Block-Digest?

At Line 09 below there is this line: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ Line 01: WARC/1.0 Line 02: WARC-Type: request Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/ Line 04: Content-Type:…
user16656944
2
votes
1 answer

Splitting a WARC file into chunks based on the header: WARC/1.0 Python

I'm new to programming and am trying to process a WARC file by splitting it into chunks and then storing each chunk in a dictionary. Each chunk should start with the WARC/1.0 header and is separated by 3 empty lines. I also would like to remove the…
Tylie
  • 21
  • 1