Is it possible to look up a keyword using the Common Crawl API in Python and retrieve the pages that contain it? For example, if I look up "stack overflow", it would find the pages whose HTML contains the keyword "stack overflow". I have looked at the APIs, but they only seem to support URL lookup, not keyword lookup. Thank you in advance for any responses!
1 Answer
If I were you, I would not use Common Crawl for this. It only offers lookup by URL, not by page content, so to find pages by keyword you would have to iterate over the entire Common Crawl dataset. That's 2.8 billion web pages!
My suggested alternative would be Microsoft's Bing Web Search API. You get an easy-to-use API with 1,000 free calls per month.
Searching through this API would yield web pages containing the queried keyword. From there, you could download the HTML source of each page and search it again in Python to find every occurrence of your keyword.
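As a rough illustration of that second step, here is a minimal sketch using only the standard library. The names `fetch_html` and `find_keyword` are my own, hypothetical helpers, and the sketch assumes you already have the result URLs from whatever search API you end up using. It also assumes pages are UTF-8 and that you only care about visible text, so the contents of `<script>` and `<style>` are skipped.

```python
import re
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def fetch_html(url, timeout=10):
    """Download a page; assumes UTF-8 and replaces undecodable bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def find_keyword(html, keyword):
    """Return character offsets of case-insensitive matches in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so "stack   overflow" still matches.
    text = " ".join(" ".join(parser.chunks).split())
    return [m.start() for m in re.finditer(re.escape(keyword), text, re.IGNORECASE)]
```

You would then call something like `find_keyword(fetch_html(url), "stack overflow")` for each URL returned by the search. An empty list means the keyword does not appear in the visible text of that page.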

NameKhan72
- I was going to try to avoid using existing search engines for this project, but that is a start. – Python 123 Apr 01 '21 at 05:44
- I can see where you're coming from. Since Common Crawl unfortunately does not have this function, you pretty much only have one more option: you could download the whole dataset (200 TB) and build an index of every word used in it. Sure, you would need 200 TB of storage and a couple of weeks of compute time, but if you have these resources, go for it! – NameKhan72 Apr 01 '21 at 12:34
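For what it's worth, the "index of every word" idea is usually called an inverted index. Below is a toy sketch of it, assuming the page text has already been extracted into a plain dict of URL to text. The function names `build_index` and `search` are hypothetical, and an in-memory dict like this would not come close to scaling to the full dataset; it only shows the shape of the data structure.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index


def search(index, phrase):
    """Return URLs containing every word of the phrase (not necessarily adjacent)."""
    words = re.findall(r"\w+", phrase.lower())
    if not words:
        return set()
    # Copy the first posting set so the intersections do not mutate the index.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Note that `search` only checks that all the words occur somewhere on the page. To match the exact phrase "stack overflow", you would still have to re-check the candidate pages, or store word positions in the index.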