
I'm interested in downloading the raw text of a tiny subset, tens of megabytes tops, of the AWS Common Crawl, as a corpus for information retrieval tests.

The Common Crawl pages suggest I need an S3 account and/or a Java program to access it, and then I'm looking at sifting through hundreds of GBs of data when all I need is a few dozen megs.

There's some code here, but it requires an S3 account and access (although I do like Python).

Is there a way I can form an http(s) URL that will let me get a tiny cross-section of a crawl for my purposes? I believe I looked at a page that suggested a way to structure the directory with day, hour, minute, but I cannot seem to find that page again.

Thanks!

Russ

1 Answer


It's quite easy: just pick a single WARC (or WAT or WET) file at random from any monthly crawl. The crawls are announced here: https://commoncrawl.org/connect/blog/

  1. take the latest crawl (e.g. April 2019)
  2. navigate to the WARC file list and download it (same for WAT or WET)
  3. gunzip the listing and randomly select one line (a file path)
  4. prefix the path with https://commoncrawl.s3.amazonaws.com/ (or, since spring 2022, https://data.commoncrawl.org/ - there is a description in the blog post) and download it - see the sketch below
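Not from the original answer, but here is a minimal Python sketch of those steps, assuming the `requests` package, the usual `crawl-data/<crawl-id>/wet.paths.gz` listing layout, and the April 2019 crawl ID `CC-MAIN-2019-18` (adjust the crawl ID and use `warc.paths.gz`/`wat.paths.gz` for the other formats):

```python
import gzip
import random
import requests

BASE = "https://data.commoncrawl.org/"   # older crawls also resolve via commoncrawl.s3.amazonaws.com
CRAWL = "CC-MAIN-2019-18"                # assumption: the April 2019 crawl; swap in any crawl ID

# Step 2: download the WET file listing for the chosen crawl
listing_url = f"{BASE}crawl-data/{CRAWL}/wet.paths.gz"
resp = requests.get(listing_url)
resp.raise_for_status()

# Step 3: gunzip the listing and pick one path at random
paths = gzip.decompress(resp.content).decode("utf-8").splitlines()
path = random.choice(paths)

# Step 4: prefix the path with the base URL and download the (still gzipped) WET file
with open("sample.warc.wet.gz", "wb") as out:
    out.write(requests.get(BASE + path).content)
```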

You're done, because every WARC/WAT/WET file is a random sample in its own right. Need more data? Just pick more files at random.
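Since the question is about raw text, a hedged follow-up sketch: WET files already contain the extracted plain text, and a library such as `warcio` (an assumption, not mentioned in the answer; `pip install warcio`) can iterate over the records of the file downloaded above:

```python
from warcio.archiveiterator import ArchiveIterator  # assumption: warcio is installed

# Read the extracted plain text records ("conversion" records) from the WET file.
with open("sample.warc.wet.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "conversion":
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            print(url, len(text))
```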

Sebastian Nagel
  • But when I follow your method and open the link in the browser, I get AccessDenied – ben othman zied Jan 08 '23 at 23:16
  • And using Python, it doesn't recognize the gzip format of warc.gz! Could you please give clear Python code to download a simple HTML page from Common Crawl? – ben othman zied Jan 08 '23 at 23:17
  • To pick a single WARC record (an archived capture of an HTML page) and parse it, see here: https://stackoverflow.com/questions/74648756/extracting-the-payload-of-a-single-common-crawl-warc – Sebastian Nagel Jan 09 '23 at 08:03