CommonCrawl: How to find a specific web page?

Question

I am using Common Crawl to restore pages I should have archived but did not.

As I understand it, the Common Crawl Index offers access to all URLs stored by Common Crawl, so it should tell me whether a URL is archived.

A simple script downloads all indices from the available crawls:

./cdx-index-client.py -p 4 -c CC-MAIN-2016-18 *.thesun.co.uk --fl url -d CC-MAIN-2016-18
./cdx-index-client.py -p 4 -c CC-MAIN-2016-07 *.thesun.co.uk --fl url -d CC-MAIN-2016-07
... and so on

Afterwards I have 112 MB of data and simply grep:

grep "50569" * -r
grep "Locals-tell-of-terror-shock" * -r

The pages are not there. Am I missing something? The pages were published in 2006 and removed in June 2016, so I assume that Common Crawl should have archived them?

Update: Thanks to Sebastian, two links are left... Two URLs are:

They even proposed a "URL Search Tool" which answers with a 502 - Bad Gateway...

Also tried without success: http://index.commoncrawl.org/CC-MAIN-2016-07-index?url=http://www.thesun.co.uk/sol/homepage/news/50569/Locals-tell-of-terror-shock.html&matchType=exact — Maximilian Böhm, Aug 10 '16 at 10:34
Looks like these two news articles are not in the Common Crawl archives. — Sebastian Nagel, Aug 10 '16 at 11:17
The URL seems to have changed. At least, this citation points to another source: https://afspot.net/forum/topic/256740-man-shot-in-terror-raid/. — Sebastian Nagel, Aug 10 '16 at 11:23
And this URL is available via Internet Archive's Wayback Machine: http://web.archive.org/web/20060619142942/http://www.thesun.co.uk/article/0,,2-2006250464,00.html — Sebastian Nagel, Aug 10 '16 at 11:29
So, my approach was the right way to go? Thanks for the hint about the other forum; it had not occurred to me that the URL in 2006 could have been different. — Maximilian Böhm, Aug 10 '16 at 13:22
Yes. Since there is no full-text index, the only options are to check for the full URL or a prefix via index.commoncrawl.org, or to download the index files and grep for parts of the URL. Of course, if the real URL is not known, a URL index is not really sufficient. But a search over the WARC files would mean a lot more effort. — Sebastian Nagel, Aug 10 '16 at 15:51
If you post this as an answer instead of a comment, I could accept it! Thank you (or, in German: Danke :) ) — Maximilian Böhm, Aug 10 '16 at 18:56
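
The lookup described in the comments (an exact URL or a prefix, checked against index.commoncrawl.org for each crawl) could be sketched roughly as follows. This only builds the query URLs and sends no request. The helper name `build_index_query` is made up for illustration, and `output=json` is an assumption about what the index server accepts alongside the `url` and `matchType` parameters already used in the comment above.

```python
from urllib.parse import urlencode

# Crawl IDs taken from the question; extend the list as needed.
CRAWLS = ["CC-MAIN-2016-18", "CC-MAIN-2016-07"]


def build_index_query(crawl_id, url, match_type="exact"):
    """Return the index query URL for one crawl (hypothetical helper)."""
    params = {"url": url, "matchType": match_type, "output": "json"}
    return f"http://index.commoncrawl.org/{crawl_id}-index?{urlencode(params)}"


# One prefix query per crawl, since each crawl has its own index
for crawl in CRAWLS:
    print(build_index_query(crawl, "thesun.co.uk/sol/homepage/news/", "prefix"))
```

A prefix query is useful when only part of the path is known; when the path itself changed over time, as it did here, no URL-based query will find the page.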

score 4 · Accepted Answer · answered Aug 20 '19 at 10:53


You can use AWS Athena to query the Common Crawl index with SQL to find the URL, and then use the offset, length and filename to read the content in your code. See details here: http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
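
Once the index has given you a filename, offset and length, reading the content could look roughly like the sketch below. It relies on each record being stored as its own gzip member, so the byte slice can be decompressed on its own. The host in `DATA_BASE` and the helper names are assumptions for illustration, and the demonstration runs offline on two made-up records rather than on real archive data.

```python
import gzip

# Assumed public download host for Common Crawl files; verify before use.
DATA_BASE = "https://data.commoncrawl.org/"


def record_request(filename, offset, length):
    """Return (url, headers) for an HTTP range request covering one record."""
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    return DATA_BASE + filename, {"Range": f"bytes={offset}-{last_byte}"}


def extract_record(blob, offset, length):
    """Decompress the single gzip member stored at blob[offset:offset+length]."""
    return gzip.decompress(blob[offset:offset + length])


# Offline demonstration: two hypothetical records, gzipped one by one
first = gzip.compress(b"record one")
second = gzip.compress(b"record two")
archive = first + second
print(extract_record(archive, len(first), len(second)))
```

In real use, the URL and headers from `record_request` would be passed to an HTTP client, and the response body handed to `gzip.decompress`.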


Vikash Rathee


score 2 · Answer 2 · answered May 02 '18 at 07:38


The latest version of the search on the CC index can return all URLs from a particular TLD. In your case, go to http://index.commoncrawl.org and select the index of your choice, then search for http://www.thesun.co.uk/*. This should return all the URLs, and you can then filter the ones you want from the JSON response.
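
Filtering the response could be sketched as below, assuming the server returns one JSON object per line with at least a `url` field. The sample records, including their `filename`, `offset` and `length` values, are invented for illustration and are not real index entries.

```python
import json


def filter_records(ndjson_text, needle):
    """Return the parsed records whose URL contains the given substring."""
    hits = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if needle in record.get("url", ""):
            hits.append(record)
    return hits


# Hypothetical response body: one JSON object per line
SAMPLE = "\n".join([
    '{"url": "http://www.example.com/news/50569/story.html", '
    '"filename": "segment/file-1.warc.gz", "offset": "100", "length": "50"}',
    '{"url": "http://www.example.com/sport/index.html", '
    '"filename": "segment/file-2.warc.gz", "offset": "900", "length": "75"}',
])

for hit in filter_records(SAMPLE, "50569"):
    print(hit["url"], hit["filename"], hit["offset"], hit["length"])
```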


hitesh chavhan


And TLD stands for Top-Level Domain, e.g. com in example.com is the TLD. – dzieciou Mar 05 '21 at 13:15

score 0 · Answer 3 · answered May 28 '19 at 01:44

AFAIK pages are crawled once and only once, so the pages you're looking for could be in any of the archives.

I wrote a small tool that can be used to search all archives at once (there is also a demonstration showing how to do this). For your case I searched all archives (2008 to 2019), typed your URLs into the Common Crawl editor, and found these results for your first URL (I couldn't find the second, so I guess it is not in the database):

                           FileName                              Offset    Length  
 ------------------------------------------------------------- ---------- -------- 
  parse-output/segment/1346876860877/1346943319237_751.arc.gz    7374762    12162  
  crawl-002/2009/11/21/8/1258808591287_8.arc.gz                 87621562    20028  
  crawl-002/2010/01/07/5/1262876334932_5.arc.gz                 80863242    20075

Not sure why there are three results. I guess they do re-scan some URLs.

Or, if you open any of these URLs in the application I linked, you should be able to see the pages in a browser (this is a custom scheme that includes the filename, offset and length in order to load HTML from the Common Crawl database):

crawl://page.common/parse-output/segment/1346876860877/1346943319237_751.arc.gz?o=7374762&l=12162&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
crawl://page.common/crawl-002/2009/11/21/8/1258808591287_8.arc.gz?o=87621562&l=20028&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
crawl://page.common/crawl-002/2010/01/07/5/1262876334932_5.arc.gz?o=80863242&l=20075&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569%2FLocals-tell-of-terror-shock.html
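
To pull the filename, offset and length back out of one of these links, something like the following sketch may help. It assumes the layout visible above (path is the archive filename, `o` is the offset, `l` is the length, `u` is the percent-encoded page URL); the function name is hypothetical and is not part of the application itself.

```python
from urllib.parse import parse_qs, urlsplit


def parse_crawl_url(crawl_url):
    """Split a crawl:// link into filename, offset, length and page URL."""
    parts = urlsplit(crawl_url)
    query = parse_qs(parts.query)  # also decodes the percent-encoded page URL
    return {
        "filename": parts.path.lstrip("/"),
        "offset": int(query["o"][0]),
        "length": int(query["l"][0]),
        "url": query["u"][0],
    }


link = (
    "crawl://page.common/crawl-002/2009/11/21/8/1258808591287_8.arc.gz"
    "?o=87621562&l=20028"
    "&u=http%3A%2F%2Fwww.thesun.co.uk%2Fsol%2Fhomepage%2Fnews%2F50569"
    "%2FLocals-tell-of-terror-shock.html"
)
print(parse_crawl_url(link))
```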
