I need to browse and download a subset of Common Crawl's public data set. This page mentions where the data is hosted.
How can I browse and possibly download the Common Crawl data hosted at s3://aws-publicdatasets/common-crawl/crawl-002/?
Just as an update: downloading the Common Crawl corpus has always been free, and you can use HTTP instead of S3. S3 also lets you access the data with anonymous credentials.
If you want to download via HTTP, get one of the file locations, such as:
common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz
and then prepend https://commoncrawl.s3.amazonaws.com/ to it, resulting in the link:
https://commoncrawl.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz
This link will work and allow you to download the data without going through S3.
To get a listing of all such files, refer to warc.paths.gz (or the equivalent for WET or WAT files) in the more recent crawls, or list the files with anonymous credentials using s3cmd or a similar tool.
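The prepend-and-download step can be sketched in Python using only the standard library. The prefix and file path are the ones from this answer; the actual download is commented out because WARC files are large:

```python
# Public HTTP prefix for Common Crawl files (no S3 credentials needed).
PREFIX = "https://commoncrawl.s3.amazonaws.com/"

def warc_url(path):
    """Prepend the public HTTP prefix to a crawl-data path."""
    return PREFIX + path

path = ("common-crawl/crawl-data/CC-MAIN-2014-23/segments/"
        "1404776400583.60/warc/"
        "CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz")
print(warc_url(path))

# To actually download it (large file!):
# import urllib.request
# urllib.request.urlretrieve(warc_url(path), "segment.warc.gz")
```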
General data access to Common Crawl crawls is discussed at: http://blog.commoncrawl.org/2015/05/april-2015-crawl-archive-available/
A useful way to get some trial data is to use the new index over the archive: http://index.commoncrawl.org/CC-MAIN-2015-18
If you query it for "www.cwi.nl", for example, you get JSON records describing the segments that contain captures from that domain:
{
  "urlkey": "nl,cwi)/",
  "timestamp": "20150505031358",
  "status": "200",
  "url": "http://www.cwi.nl/",
  "filename": "common-crawl/crawl-data/CC-MAIN-2015-18/segments/1430455222810.45/warc/CC-MAIN-20150501044022-00044-ip-10-235-10-82.ec2.internal.warc.gz",
  "length": "5881",
  "mime": "text/html",
  "offset": "364108412",
  "digest": "DLQQ4NMJMRRZFGXSXGSFPRO3YJBKVHN5"
}
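Such a query can also be issued programmatically. A minimal sketch using only the standard library; the endpoint is the CC-MAIN-2015-18 index named above, and the output=json parameter (which asks the CDX server for one JSON record per line) is an assumption about the index server's API:

```python
import json
import urllib.parse
import urllib.request

# Index endpoint for the April 2015 crawl, as given above.
INDEX = "http://index.commoncrawl.org/CC-MAIN-2015-18-index"

def index_query_url(url):
    """Build a query URL for all captures of the given URL."""
    # "output=json" is assumed to request one JSON record per line.
    params = urllib.parse.urlencode({"url": url, "output": "json"})
    return INDEX + "?" + params

print(index_query_url("www.cwi.nl"))

# To run the query (network access required):
# with urllib.request.urlopen(index_query_url("www.cwi.nl")) as resp:
#     records = [json.loads(line) for line in resp.read().splitlines()]
```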
Prepend the S3 prefix to the filename, and you can download a data file to use as sample data: https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-18/segments/1430455222810.45/warc/CC-MAIN-20150501044022-00044-ip-10-235-10-82.ec2.internal.warc.gz
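Because the index record also carries an offset and length, you do not have to pull the whole multi-gigabyte WARC file: an HTTP Range request can fetch just the one gzipped record. A sketch using the values from the JSON record above, with the actual network call commented out:

```python
import urllib.request

BASE = "https://aws-publicdatasets.s3.amazonaws.com/"

def range_request(filename, offset, length):
    """Build a request for a single record inside a large WARC file."""
    req = urllib.request.Request(BASE + filename)
    # HTTP Range is inclusive on both ends.
    req.add_header("Range", "bytes=%d-%d" % (offset, offset + length - 1))
    return req

# filename, offset, and length come from the index record above.
filename = ("common-crawl/crawl-data/CC-MAIN-2015-18/segments/"
            "1430455222810.45/warc/"
            "CC-MAIN-20150501044022-00044-ip-10-235-10-82.ec2.internal.warc.gz")
req = range_request(filename, offset=364108412, length=5881)
print(req.get_header("Range"))

# To fetch and decompress the single record (network access required):
# import gzip
# with urllib.request.urlopen(req) as resp:
#     record = gzip.decompress(resp.read())
```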
Have fun!
To access the Common Crawl data, you need to run a map-reduce job against it, and, since the corpus resides on S3, you can do so by running a Hadoop cluster using Amazon’s EC2 service. This involves setting up a custom hadoop jar that utilizes our custom InputFormat class to pull data from the individual ARC files in our S3 bucket.
Source: http://commoncrawl.org/the-data/
Getting started: http://commoncrawl.org/the-data/get-started/
The other answers have some great informational URLs, but if you only want small parts of the actual data, this client code is pretty good for querying the index and downloading content: