Questions tagged [common-crawl]

Open crawl of the web that can be accessed and analyzed by everyone.

Common Crawl is a non-profit organization that builds and maintains an open crawl of the web, crawling and archiving web pages with the intent of providing free access to everyone. The organization states that its crawler respects nofollow and robots.txt policies.

Common Crawl makes available a roughly 100 TB web archive covering about 6 billion web pages crawled between 2008 and 2012. The crawl data is kept in the Amazon Public Datasets S3 bucket and is freely downloadable. Common Crawl also publishes its crawler and an open-source library for processing the data with Hadoop.

Web site: http://commoncrawl.org/

71 questions
7
votes
3 answers

CommonCrawl: How to find a specific web page?

I am using CommonCrawl to restore pages I should have archived but have not. In my understanding, the Common Crawl Index offers access to all URLs stored by Common Crawl, so it should tell me whether a given URL has been archived. A simple script…
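A minimal sketch of the usual lookup: query the CDX index API at index.commoncrawl.org for the URL; a 404 means that crawl holds no capture of it. The crawl label CC-MAIN-2023-50 is only an example; substitute any crawl listed at https://index.commoncrawl.org/.

```python
import json
import requests

# Query one crawl's CDX index for a specific URL.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def lookup(url):
    resp = requests.get(INDEX, params={"url": url, "output": "json"})
    if resp.status_code == 404:
        return []  # no capture of this URL in this crawl
    resp.raise_for_status()
    # The API returns one JSON object per line, one line per capture.
    return [json.loads(line) for line in resp.text.splitlines()]

for cap in lookup("example.com/"):
    print(cap["timestamp"], cap["status"], cap["filename"])
```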
6
votes
2 answers

Unzipping a gz file in c# : System.IO.InvalidDataException: 'The archive entry was compressed using an unsupported compression method.'

I have followed Microsoft's recommended way to unzip a .gz file: https://learn.microsoft.com/en-us/dotnet/api/system.io.compression.gzipstream?view=netcore-3.1 I am trying to download and parse files from the CommonCrawl. I can successfully…
Burf2000
  • 5,001
  • 14
  • 58
  • 117
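The usual culprit here is the input rather than the code: WARC .gz files are concatenations of per-record gzip members, and a byte range that does not start exactly on a member boundary can trip a decompressor. A Python sketch of fetching and decompressing one record, assuming placeholder filename/offset/length values taken from a CDX index lookup:

```python
import gzip
import requests

# Placeholders: in practice these come from a CDX index lookup.
filename = "crawl-data/CC-MAIN-2023-50/segments/.../example.warc.gz"
offset, length = 1000, 2000

# Fetch exactly one record's gzip member via an HTTP Range request.
resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
)
record = gzip.decompress(resp.content)  # a well-aligned member decompresses cleanly
print(record[:200].decode("utf-8", errors="replace"))
```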
6
votes
1 answer

Common crawl - getting WARC file

I would like to retrieve a web page using Common Crawl but am getting lost. I would like to get the WARC file for www.example.com. I see that this link…
MAB
  • 61
  • 7
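A sketch of the two-step recipe, assuming the warcio library and placeholder index values: look up www.example.com in the CDX index (as in the first example above), then fetch just that record's byte range and parse it.

```python
import io
import requests
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Placeholders: filename/offset/length come from a CDX index lookup
# for www.example.com.
filename = "crawl-data/CC-MAIN-2023-50/segments/.../warc/....warc.gz"
offset, length = 1000, 2000

resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
)
for record in ArchiveIterator(io.BytesIO(resp.content)):
    if record.rec_type == "response":
        print(record.content_stream().read()[:200])
```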
6
votes
4 answers

Access a common crawl AWS public dataset

I need to browse and download a subset of common crawl's public data set. This page mentions where the data is hosted. How can I browse and possibly download the common crawl data hosted at s3://aws-publicdatasets/common-crawl/crawl-002/ ?
gibraltar
  • 1,678
  • 4
  • 20
  • 33
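Note that s3://aws-publicdatasets/common-crawl/ is the retired location; current data lives in s3://commoncrawl/. A sketch of anonymous browsing with boto3, using unsigned requests since the bucket is public:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned client: no AWS credentials needed for this public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", Delimiter="/")
for p in resp.get("CommonPrefixes", []):
    print(p["Prefix"])  # one entry per crawl, e.g. crawl-data/CC-MAIN-2023-50/
```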
4
votes
1 answer

Download small sample of AWS Common Crawl to local machine via http

I'm interested in downloading the raw text of a tiny subset, tens of megabytes at most, of the AWS Common Crawl, as a corpus for information retrieval tests. The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'm…
Russ
  • 156
  • 12
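No S3 account or Java is needed: every file is also served over plain HTTPS from data.commoncrawl.org. A sketch that pulls only the first few megabytes of one WET file (the path is a placeholder; real paths are listed in each crawl's wet.paths.gz):

```python
import requests

path = "crawl-data/CC-MAIN-2023-50/segments/.../wet/....warc.wet.gz"  # placeholder
resp = requests.get(
    "https://data.commoncrawl.org/" + path,
    headers={"Range": "bytes=0-10485759"},  # first 10 MB only
    stream=True,
)
with open("sample.warc.wet.gz", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 16):
        f.write(chunk)
# The file is cut mid-stream, so read it leniently: gzip will raise
# EOFError when it reaches the truncation point.
```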
4
votes
0 answers

Search a word in all Common Crawl WARC files

I want to search for a word (for example, a company name) in all the WARC files (nearly 36K of them) from Common Crawl and get all the URLs whose HTML source contains that company name. And I want to keep those WARC files in S3 itself. Just I…
Vanaja Jayaraman
  • 753
  • 3
  • 18
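Scanning all ~36K WARC files calls for a distributed job (EMR, Athena, or similar), but the per-file logic is small. A sketch with warcio, streaming one WARC over HTTPS and yielding the URLs whose HTML contains the search term:

```python
import requests
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

def urls_mentioning(warc_url, needle):
    """Yield the target URI of every response record whose body
    contains `needle` (a bytes pattern, e.g. b"Acme Corp")."""
    resp = requests.get(warc_url, stream=True)
    for record in ArchiveIterator(resp.raw):  # warcio detects gzip itself
        if record.rec_type != "response":
            continue
        if needle in record.content_stream().read():
            yield record.rec_headers.get_header("WARC-Target-URI")
```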
3
votes
1 answer

Common Crawl data search all pages by keyword

I am wondering if it is possible to look up a keyword using the Common Crawl API in Python and retrieve pages that contain that keyword. For example, if I look up "stack overflow" it will find the pages in which the keyword "stack overflow" is in the…
Python 123
  • 59
  • 1
  • 13
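The index is keyed by URL, not page text, so there is no server-side keyword search; the usual workaround is to enumerate captures for a URL pattern and grep the fetched bodies yourself. A sketch, with the crawl label and the 20-capture limit as arbitrary choices:

```python
import gzip
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

resp = requests.get(INDEX, params={
    "url": "stackoverflow.com/*", "output": "json", "limit": "20"})
for line in resp.text.splitlines():
    cap = json.loads(line)
    start = int(cap["offset"])
    end = start + int(cap["length"]) - 1
    r = requests.get("https://data.commoncrawl.org/" + cap["filename"],
                     headers={"Range": f"bytes={start}-{end}"})
    body = gzip.decompress(r.content)      # one WARC record
    if b"stack overflow" in body.lower():  # client-side keyword match
        print(cap["url"])
```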
3
votes
0 answers

cld2 causing invalid utf-8 character in python

I have written a small script in Python 2.7. I have also installed the cld2 module, which detects the language of a given string. When I run it on one file of Common Crawl data, it sometimes gives the following exception: Traceback (most recent call last): …
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
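cld2 rejects byte sequences that are not valid UTF-8, which raw crawl data frequently contains. A common fix is to round-trip the bytes through a lenient decode before detection; a sketch assuming the pycld2 binding (APIs differ slightly between bindings):

```python
import pycld2  # the pycld2 binding is an assumption

def detect_language(raw_bytes):
    # Replace invalid byte sequences so cld2 sees well-formed UTF-8.
    cleaned = raw_bytes.decode("utf-8", errors="replace").encode("utf-8")
    is_reliable, _, details = pycld2.detect(cleaned)
    # details[0] is (language_name, language_code, percent, score)
    return details[0][1] if is_reliable else None
```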
3
votes
0 answers

Get a WARC archive file with all files from a given domain, using commoncrawl.org

Common Crawl datasets are split into segments. How can I extract a subset of the Common Crawl dataset? I need a WARC archive file (or several archive files) with all the files from a given domain, such as example.com. Note: common_crawl_index allows…
David Portabella
  • 12,390
  • 27
  • 101
  • 182
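A sketch of one way to do this, assuming the warcio library: enumerate all captures for the domain via the index (matchType=domain), fetch each record by byte range, and re-pack the records into one local WARC file.

```python
import io
import json
import requests
from warcio.archiveiterator import ArchiveIterator  # pip install warcio
from warcio.warcwriter import WARCWriter

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

resp = requests.get(INDEX, params={"url": "example.com",
                                   "matchType": "domain", "output": "json"})
with open("example.com.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    for line in resp.text.splitlines():
        cap = json.loads(line)
        start = int(cap["offset"])
        end = start + int(cap["length"]) - 1
        r = requests.get("https://data.commoncrawl.org/" + cap["filename"],
                         headers={"Range": f"bytes={start}-{end}"})
        for record in ArchiveIterator(io.BytesIO(r.content)):
            writer.write_record(record)  # append this capture to the local WARC
```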
3
votes
1 answer

How can one extract every payload from warc.wet.gz?

I have been trying to extract the text data from Common Crawl's WET files. I am currently using the warc parser by the Internet Archive, https://github.com/internetarchive/warc import warc w = warc.open(fileName) for record in w: text =…
lorenzofeliz
  • 597
  • 6
  • 11
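With warcio, an actively maintained alternative to the old internetarchive/warc parser, WET payloads are simply the bodies of "conversion" records; the filename below is a placeholder:

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

with open("CC-MAIN-....warc.wet.gz", "rb") as stream:  # placeholder filename
    for record in ArchiveIterator(stream):             # gzip handled automatically
        if record.rec_type == "conversion":            # WET plain-text records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            print(url, len(text))
```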
3
votes
1 answer

How to open Commoncrawl.org WARC.GZ S3 Data in Spark

I want to access a Common Crawl file from the Amazon public dataset repository from the Spark shell. The files are in WARC.GZ format. val filenameList =…
Philipp
  • 535
  • 1
  • 6
  • 16
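The question uses the Scala shell, but the shape of the job is the same in PySpark: read whole files as binary (line-oriented input would corrupt the gzip/WARC framing) and parse each on the executors. A sketch assuming warcio is installed on every worker:

```python
import io
from pyspark.sql import SparkSession
from warcio.archiveiterator import ArchiveIterator  # needed on every executor

spark = SparkSession.builder.appName("cc-warc").getOrCreate()

def target_uris(pair):
    _, data = pair  # binaryFiles yields (path, whole-file bytes)
    for record in ArchiveIterator(io.BytesIO(data)):
        if record.rec_type == "response":
            yield record.rec_headers.get_header("WARC-Target-URI")

# WARC files are roughly 1 GB each; keep the glob narrow for experiments.
rdd = spark.sparkContext.binaryFiles(
    "s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/*/warc/*.warc.gz")
print(rdd.flatMap(target_uris).take(10))
```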
2
votes
1 answer

Common Crawl requirement to power a decent search engine

Common Crawl releases massive data loads every month, each nearly a hundred terabytes in size. This has been going on for the last 8-9 years. Are these snapshots independent (probably not)? Or do we have to combine all of them to be able to power a decent…
SexyBeast
  • 7,913
  • 28
  • 108
  • 196
2
votes
1 answer

Common Crawl Request returns 403 WARC

I am trying to fetch some WARC files from the Common Crawl archives, but I do not seem to get successful requests through to the server. A minimal Python example is provided below to reproduce the error. I tried adding the User-Agent in the…
presa
  • 85
  • 1
  • 5
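One frequent cause of 403s is fetching from the retired https://commoncrawl.s3.amazonaws.com/ endpoint instead of the supported HTTPS front door, data.commoncrawl.org. A minimal request that should return 206 Partial Content (path and range are placeholders):

```python
import requests

path = "crawl-data/CC-MAIN-2023-50/segments/.../warc/....warc.gz"  # placeholder
resp = requests.get("https://data.commoncrawl.org/" + path,
                    headers={"Range": "bytes=0-1023"})
print(resp.status_code)  # expect 206 Partial Content, not 403
resp.raise_for_status()
```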
2
votes
1 answer

Which block represents a WARC-Block-Digest?

At Line 09 below there is this line: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ Line 01: WARC/1.0 Line 02: WARC-Type: request Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/ Line 04: Content-Type:…
user16656944
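Per the WARC specification, WARC-Block-Digest covers the record's full block: everything after the WARC header, which for a request record is the HTTP request line, headers, and any body. The value is a SHA-1 digest encoded in base32; it can be verified as below (the block bytes are a hypothetical reconstruction):

```python
import base64
import hashlib

# Hypothetical block of a 'request' record: the raw HTTP request.
block = (b"GET /vital-signs/carbon-dioxide/ HTTP/1.1\r\n"
         b"Host: climate.nasa.gov\r\n\r\n")
digest = base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
print("sha1:" + digest)  # compare against the WARC-Block-Digest header
```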
2
votes
0 answers

Access Denied for accessing amazon s3 - common data crawl

I am trying out a sample Common Crawl example based on https://engineeringblog.yelp.com/2015/03/analyzing-the-web-for-the-price-of-a-sandwich.html I am running the command below on my local Windows PC, following the instructions. python…
Shamnad P S
  • 1,095
  • 2
  • 15
  • 43
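Access Denied here usually means the request was signed with absent or invalid credentials, or targets the retired aws-publicdatasets bucket used by that 2015 blog post. Common Crawl data is public, so unsigned requests against the commoncrawl bucket work; the key below follows the real naming pattern but is only an example:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned client: no AWS credentials needed for this public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
obj = s3.get_object(Bucket="commoncrawl",
                    Key="crawl-data/CC-MAIN-2023-50/wet.paths.gz")  # example key
print(obj["ContentLength"])
```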