I need to browse and download a subset of Common Crawl's public data set. This page mentions where the data is hosted.
How can I browse and possibly download the Common Crawl data hosted at s3://aws-publicdatasets/common-crawl/crawl-002/?
Just as an update: downloading the Common Crawl corpus has always been free, and you can use HTTP instead of S3. S3 also lets you access the data with anonymous credentials.
If you want to download via HTTP, get one of the file locations, such as:
common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz
and then prepend https://commoncrawl.s3.amazonaws.com/ to it, resulting in the link:
https://commoncrawl.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz
This link will work and allow you to download the data without going through S3.
To get a listing of all such files, refer to warc.paths.gz (or the equivalent for WET or WAT files) in the more recent crawls, or list the files with anonymous credentials using s3cmd or a similar tool.
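The prepend-and-download step can be sketched in Python using only the standard library. The prefix and file path are the ones from this answer; the actual download is commented out because WARC files are large:

```python
# Public HTTP prefix for Common Crawl files (no S3 credentials needed).
PREFIX = "https://commoncrawl.s3.amazonaws.com/"

def warc_url(path):
    """Prepend the public HTTP prefix to a crawl-data path."""
    return PREFIX + path

path = ("common-crawl/crawl-data/CC-MAIN-2014-23/segments/"
        "1404776400583.60/warc/"
        "CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz")
print(warc_url(path))

# To actually download it (large file!):
# import urllib.request
# urllib.request.urlretrieve(warc_url(path), "segment.warc.gz")
```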
General data access to Common Crawl crawls is discussed at: http://blog.commoncrawl.org/2015/05/april-2015-crawl-archive-available/
A useful way to get some trial data is to use the new index over the archive: http://index.commoncrawl.org/CC-MAIN-2015-18
If you query it for "www.cwi.nl", for example, you get JSON records describing the segments that contain captures from that domain:
{
  "urlkey": "nl,cwi)/",
  "timestamp": "20150505031358",
  "status": "200",
  "url": "http://www.cwi.nl/",
  "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1430455222810.45/warc/CC-MAIN-20150501044022-00044-ip-10-235-10-82.ec2.internal.warc.gz",
  "length": "5881",
  "mime": "text/html",
  "offset": "364108412",
  "digest": "DLQQ4NMJMRRZFGXSXGSFPRO3YJBKVHN5"
}
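Such a query can also be issued programmatically. A minimal sketch using only the standard library; the endpoint is the CC-MAIN-2015-18 index named above, and the output=json parameter (which asks the CDX server for one JSON record per line) is an assumption about the index server's API:

```python
import json
import urllib.parse
import urllib.request

# Index endpoint for the April 2015 crawl, as given above.
INDEX = "http://index.commoncrawl.org/CC-MAIN-2015-18-index"

def index_query_url(url):
    """Build a query URL for all captures of the given URL."""
    # "output=json" is assumed to request one JSON record per line.
    params = urllib.parse.urlencode({"url": url, "output": "json"})
    return INDEX + "?" + params

print(index_query_url("www.cwi.nl"))

# To run the query (network access required):
# with urllib.request.urlopen(index_query_url("www.cwi.nl")) as resp:
#     records = [json.loads(line) for line in resp.read().splitlines()]
```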
Prepend the S3 prefix to the filename, and you can download a data file to use as sample data: https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-18/segments/1430455222810.45/warc/CC-MAIN-20150501044022-00044-ip-10-235-10-82.ec2.internal.warc.gz
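Because the index record also carries an offset and length, you do not have to pull the whole multi-gigabyte WARC file: an HTTP Range request can fetch just the one gzipped record. A sketch using the values from the JSON record above, with the actual network call commented out:

```python
import urllib.request

BASE = "https://aws-publicdatasets.s3.amazonaws.com/"

def range_request(filename, offset, length):
    """Build a request for a single record inside a large WARC file."""
    req = urllib.request.Request(BASE + filename)
    # HTTP Range is inclusive on both ends.
    req.add_header("Range", "bytes=%d-%d" % (offset, offset + length - 1))
    return req

# filename, offset, and length come from the index record above.
filename = ("common-crawl/crawl-data/CC-MAIN-2015-18/segments/"
            "1430455222810.45/warc/"
            "CC-MAIN-20150501044022-00044-ip-10-235-10-82.ec2.internal.warc.gz")
req = range_request(filename, offset=364108412, length=5881)
print(req.get_header("Range"))

# To fetch and decompress the single record (network access required):
# import gzip
# with urllib.request.urlopen(req) as resp:
#     record = gzip.decompress(resp.read())
```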
Have fun!
To access the Common Crawl data, you need to run a map-reduce job against it, and, since the corpus resides on S3, you can do so by running a Hadoop cluster using Amazon’s EC2 service. This involves setting up a custom hadoop jar that utilizes our custom InputFormat class to pull data from the individual ARC files in our S3 bucket.
Source: http://commoncrawl.org/the-data/
Getting started: http://commoncrawl.org/the-data/get-started/
The other answers have some great informational URLs, but if you only want small parts of the actual data, this client code is pretty good for querying the index and downloading content: