Questions tagged [common-crawl]

Open crawl of the web that can be accessed and analyzed by everyone.

Common Crawl is a non-profit organization that builds and maintains an open crawl of the web, crawling and archiving web pages with the intent of providing free access to everyone. The organization states that its crawler respects nofollow and robots.txt policies.

Common Crawl makes available a roughly 100 TB web archive covering about 6 billion web pages crawled between 2008 and 2012. The crawl data is kept in the Amazon Public Datasets S3 bucket and is freely downloadable. Common Crawl also publishes its crawler and an open-source library for processing the data with Hadoop.

Web site: http://commoncrawl.org/

71 questions
7
votes
3 answers

CommonCrawl: How to find a specific web page?

I am using CommonCrawl to restore pages I should have archived but have not. In my understanding, the Common Crawl Index offers access to all URLs stored by Common Crawl, so it should tell me whether a given URL has been archived. A simple script…
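A minimal sketch of the usual lookup: query the CDX index API at index.commoncrawl.org for the URL; a 404 means that crawl holds no capture of it. The crawl label CC-MAIN-2023-50 is only an example; substitute any crawl listed at https://index.commoncrawl.org/.

```python
import json
import requests

# Query one crawl's CDX index for a specific URL.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

def lookup(url):
    resp = requests.get(INDEX, params={"url": url, "output": "json"})
    if resp.status_code == 404:
        return []  # no capture of this URL in this crawl
    resp.raise_for_status()
    # The API returns one JSON object per line, one line per capture.
    return [json.loads(line) for line in resp.text.splitlines()]

for cap in lookup("example.com/"):
    print(cap["timestamp"], cap["status"], cap["filename"])
```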
6
votes
2 answers

Unzipping a gz file in c# : System.IO.InvalidDataException: 'The archive entry was compressed using an unsupported compression method.'

I have followed Microsoft's recommended way to unzip a .gz file: https://learn.microsoft.com/en-us/dotnet/api/system.io.compression.gzipstream?view=netcore-3.1 I am trying to download and parse files from the CommonCrawl. I can successfully…
Burf2000
  • 5,001
  • 14
  • 58
  • 117
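The usual culprit here is the input rather than the code: WARC .gz files are concatenations of per-record gzip members, and a byte range that does not start exactly on a member boundary can trip a decompressor. A Python sketch of fetching and decompressing one record, assuming placeholder filename/offset/length values taken from a CDX index lookup:

```python
import gzip
import requests

# Placeholders: in practice these come from a CDX index lookup.
filename = "crawl-data/CC-MAIN-2023-50/segments/.../example.warc.gz"
offset, length = 1000, 2000

# Fetch exactly one record's gzip member via an HTTP Range request.
resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
)
record = gzip.decompress(resp.content)  # a well-aligned member decompresses cleanly
print(record[:200].decode("utf-8", errors="replace"))
```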
6
votes
1 answer

Common crawl - getting WARC file

I would like to retrieve a web page using Common Crawl but am getting lost. I would like to get the WARC file for www.example.com. I see that this link…
MAB
  • 61
  • 7
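A sketch of the two-step recipe, assuming the warcio library and placeholder index values: look up www.example.com in the CDX index (as in the first example above), then fetch just that record's byte range and parse it.

```python
import io
import requests
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

# Placeholders: filename/offset/length come from a CDX index lookup
# for www.example.com.
filename = "crawl-data/CC-MAIN-2023-50/segments/.../warc/....warc.gz"
offset, length = 1000, 2000

resp = requests.get(
    "https://data.commoncrawl.org/" + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
)
for record in ArchiveIterator(io.BytesIO(resp.content)):
    if record.rec_type == "response":
        print(record.content_stream().read()[:200])
```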
6
votes
4 answers

Access a common crawl AWS public dataset

I need to browse and download a subset of common crawl's public data set. This page mentions where the data is hosted. How can I browse and possibly download the common crawl data hosted at s3://aws-publicdatasets/common-crawl/crawl-002/ ?
gibraltar
  • 1,678
  • 4
  • 20
  • 33
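Note that s3://aws-publicdatasets/common-crawl/ is the retired location; current data lives in s3://commoncrawl/. A sketch of anonymous browsing with boto3, using unsigned requests since the bucket is public:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned client: no AWS credentials needed for this public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", Delimiter="/")
for p in resp.get("CommonPrefixes", []):
    print(p["Prefix"])  # one entry per crawl, e.g. crawl-data/CC-MAIN-2023-50/
```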
4
votes
1 answer

Download small sample of AWS Common Crawl to local machine via http

I'm interested in downloading the raw text of a tiny subset, tens of megabytes at most, of the AWS Common Crawl, as a corpus for information retrieval tests. The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'm…
Russ
  • 156
  • 12
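No S3 account or Java is needed: every file is also served over plain HTTPS from data.commoncrawl.org. A sketch that pulls only the first few megabytes of one WET file (the path is a placeholder; real paths are listed in each crawl's wet.paths.gz):

```python
import requests

path = "crawl-data/CC-MAIN-2023-50/segments/.../wet/....warc.wet.gz"  # placeholder
resp = requests.get(
    "https://data.commoncrawl.org/" + path,
    headers={"Range": "bytes=0-10485759"},  # first 10 MB only
    stream=True,
)
with open("sample.warc.wet.gz", "wb") as f:
    for chunk in resp.iter_content(chunk_size=1 << 16):
        f.write(chunk)
# The file is cut mid-stream, so read it leniently: gzip will raise
# EOFError when it reaches the truncation point.
```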
4
votes
0 answers

Search a word in all Common Crawl WARC files

I want to search for a word (for example, a company name) in all the WARC files (nearly 36K of them) from Common Crawl and get all the URLs whose HTML source contains that company name. And I want to keep those WARC files in S3 itself. Just I…
Vanaja Jayaraman
  • 753
  • 3
  • 18
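Scanning all ~36K WARC files calls for a distributed job (EMR, Athena, or similar), but the per-file logic is small. A sketch with warcio, streaming one WARC over HTTPS and yielding the URLs whose HTML contains the search term:

```python
import requests
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

def urls_mentioning(warc_url, needle):
    """Yield the target URI of every response record whose body
    contains `needle` (a bytes pattern, e.g. b"Acme Corp")."""
    resp = requests.get(warc_url, stream=True)
    for record in ArchiveIterator(resp.raw):  # warcio detects gzip itself
        if record.rec_type != "response":
            continue
        if needle in record.content_stream().read():
            yield record.rec_headers.get_header("WARC-Target-URI")
```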
3
votes
1 answer

Common Crawl data search all pages by keyword

I am wondering if it is possible to look up a keyword using the Common Crawl API in Python and retrieve pages that contain that keyword. For example, if I look up "stack overflow" it will find the pages in which the keyword "stack overflow" is in the…
Python 123
  • 59
  • 1
  • 13
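The index is keyed by URL, not page text, so there is no server-side keyword search; the usual workaround is to enumerate captures for a URL pattern and grep the fetched bodies yourself. A sketch, with the crawl label and the 20-capture limit as arbitrary choices:

```python
import gzip
import json
import requests

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

resp = requests.get(INDEX, params={
    "url": "stackoverflow.com/*", "output": "json", "limit": "20"})
for line in resp.text.splitlines():
    cap = json.loads(line)
    start = int(cap["offset"])
    end = start + int(cap["length"]) - 1
    r = requests.get("https://data.commoncrawl.org/" + cap["filename"],
                     headers={"Range": f"bytes={start}-{end}"})
    body = gzip.decompress(r.content)      # one WARC record
    if b"stack overflow" in body.lower():  # client-side keyword match
        print(cap["url"])
```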
3
votes
0 answers

cld2 causing invalid utf-8 character in python

I have written a small script in Python 2.7. I have also installed the cld2 module, which detects the language of a given string. When I run it on one file of Common Crawl data, it sometimes gives the following exception: Traceback (most recent call last): …
Hafiz Muhammad Shafiq
  • 8,168
  • 12
  • 63
  • 121
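cld2 rejects byte sequences that are not valid UTF-8, which raw crawl data frequently contains. A common fix is to round-trip the bytes through a lenient decode before detection; a sketch assuming the pycld2 binding (APIs differ slightly between bindings):

```python
import pycld2  # the pycld2 binding is an assumption

def detect_language(raw_bytes):
    # Replace invalid byte sequences so cld2 sees well-formed UTF-8.
    cleaned = raw_bytes.decode("utf-8", errors="replace").encode("utf-8")
    is_reliable, _, details = pycld2.detect(cleaned)
    # details[0] is (language_name, language_code, percent, score)
    return details[0][1] if is_reliable else None
```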
3
votes
0 answers

Get a WARC archive file with all files from a given domain, using commoncrawl.org

Common Crawl datasets are split into segments. How can I extract a subset of the Common Crawl dataset? I need a WARC archive file (or several archive files) with all the files from a given domain, such as example.com. Note: common_crawl_index allows…
David Portabella
  • 12,390
  • 27
  • 101
  • 182
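A sketch of one way to do this, assuming the warcio library: enumerate all captures for the domain via the index (matchType=domain), fetch each record by byte range, and re-pack the records into one local WARC file.

```python
import io
import json
import requests
from warcio.archiveiterator import ArchiveIterator  # pip install warcio
from warcio.warcwriter import WARCWriter

INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"

resp = requests.get(INDEX, params={"url": "example.com",
                                   "matchType": "domain", "output": "json"})
with open("example.com.warc.gz", "wb") as out:
    writer = WARCWriter(out, gzip=True)
    for line in resp.text.splitlines():
        cap = json.loads(line)
        start = int(cap["offset"])
        end = start + int(cap["length"]) - 1
        r = requests.get("https://data.commoncrawl.org/" + cap["filename"],
                         headers={"Range": f"bytes={start}-{end}"})
        for record in ArchiveIterator(io.BytesIO(r.content)):
            writer.write_record(record)  # append this capture to the local WARC
```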
3
votes
1 answer

How can one extract every payload from warc.wet.gz?

I have been trying to extract the text data from Common Crawl's WET files. I am currently using the warc parser by the Internet Archive, https://github.com/internetarchive/warc import warc w = warc.open(fileName) for record in w: text =…
lorenzofeliz
  • 597
  • 6
  • 11
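With warcio, an actively maintained alternative to the old internetarchive/warc parser, WET payloads are simply the bodies of "conversion" records; the filename below is a placeholder:

```python
from warcio.archiveiterator import ArchiveIterator  # pip install warcio

with open("CC-MAIN-....warc.wet.gz", "rb") as stream:  # placeholder filename
    for record in ArchiveIterator(stream):             # gzip handled automatically
        if record.rec_type == "conversion":            # WET plain-text records
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            print(url, len(text))
```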
3
votes
1 answer

How to open Commoncrawl.org WARC.GZ S3 Data in Spark

I want to access a Common Crawl file from the Amazon public dataset repository from the Spark shell. The files are in WARC.GZ format. val filenameList =…
Philipp
  • 535
  • 1
  • 6
  • 16
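The question uses the Scala shell, but the shape of the job is the same in PySpark: read whole files as binary (line-oriented input would corrupt the gzip/WARC framing) and parse each on the executors. A sketch assuming warcio is installed on every worker:

```python
import io
from pyspark.sql import SparkSession
from warcio.archiveiterator import ArchiveIterator  # needed on every executor

spark = SparkSession.builder.appName("cc-warc").getOrCreate()

def target_uris(pair):
    _, data = pair  # binaryFiles yields (path, whole-file bytes)
    for record in ArchiveIterator(io.BytesIO(data)):
        if record.rec_type == "response":
            yield record.rec_headers.get_header("WARC-Target-URI")

# WARC files are roughly 1 GB each; keep the glob narrow for experiments.
rdd = spark.sparkContext.binaryFiles(
    "s3a://commoncrawl/crawl-data/CC-MAIN-2023-50/segments/*/warc/*.warc.gz")
print(rdd.flatMap(target_uris).take(10))
```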
2
votes
1 answer

Common Crawl requirement to power a decent search engine

Common Crawl releases massive data loads every month, each nearly a hundred terabytes in size. This has been going on for the last 8-9 years. Are these snapshots independent (probably not)? Or do we have to combine all of them to be able to power a decent…
SexyBeast
  • 7,913
  • 28
  • 108
  • 196
2
votes
1 answer

Common Crawl Request returns 403 WARC

I am trying to fetch some WARC files from the Common Crawl archives, but I do not seem to get successful requests through to the server. A minimal Python example is provided below to reproduce the error. I tried adding the User-Agent in the…
presa
  • 85
  • 1
  • 5
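One frequent cause of 403s is fetching from the retired https://commoncrawl.s3.amazonaws.com/ endpoint instead of the supported HTTPS front door, data.commoncrawl.org. A minimal request that should return 206 Partial Content (path and range are placeholders):

```python
import requests

path = "crawl-data/CC-MAIN-2023-50/segments/.../warc/....warc.gz"  # placeholder
resp = requests.get("https://data.commoncrawl.org/" + path,
                    headers={"Range": "bytes=0-1023"})
print(resp.status_code)  # expect 206 Partial Content, not 403
resp.raise_for_status()
```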
2
votes
1 answer

Which block represents a WARC-Block-Digest?

At Line 09 below there is this line: WARC-Block-Digest: sha1:CLODKYDXCHPVOJMJWHJVT3EJJDKI2RTQ Line 01: WARC/1.0 Line 02: WARC-Type: request Line 03: WARC-Target-URI: https://climate.nasa.gov/vital-signs/carbon-dioxide/ Line 04: Content-Type:…
user16656944
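Per the WARC specification, WARC-Block-Digest covers the record's full block: everything after the WARC header, which for a request record is the HTTP request line, headers, and any body. The value is a SHA-1 digest encoded in base32; it can be verified as below (the block bytes are a hypothetical reconstruction):

```python
import base64
import hashlib

# Hypothetical block of a 'request' record: the raw HTTP request.
block = (b"GET /vital-signs/carbon-dioxide/ HTTP/1.1\r\n"
         b"Host: climate.nasa.gov\r\n\r\n")
digest = base64.b32encode(hashlib.sha1(block).digest()).decode("ascii")
print("sha1:" + digest)  # compare against the WARC-Block-Digest header
```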
2
votes
0 answers

Access Denied for accessing amazon s3 - common data crawl

I am trying out a sample Common Crawl example based on https://engineeringblog.yelp.com/2015/03/analyzing-the-web-for-the-price-of-a-sandwich.html I am running the command below on my local Windows PC, following the instructions. python…
Shamnad P S
  • 1,095
  • 2
  • 15
  • 43
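Access Denied here usually means the request was signed with absent or invalid credentials, or targets the retired aws-publicdatasets bucket used by that 2015 blog post. Common Crawl data is public, so unsigned requests against the commoncrawl bucket work; the key below follows the real naming pattern but is only an example:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned client: no AWS credentials needed for this public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
obj = s3.get_object(Bucket="commoncrawl",
                    Key="crawl-data/CC-MAIN-2023-50/wet.paths.gz")  # example key
print(obj["ContentLength"])
```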