Search a word in all Common Crawl WARC files

Asked Jun 23 '15 at 11:45

Active Sep 21 '17 at 22:50

Viewed 1,123 times

I want to search for a word (for example, a company name) in all the WARC files (nearly 36K of them) from Common Crawl and get every URL whose HTML source contains that word.

I also want to keep the WARC files in S3 itself. All I need as a result is the list of URLs from those WARC files.

Are there any modules or pre-built packages available for this?

Could I use Solr indexing for this? (It may need more memory, though.)

Thanks in advance.
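
For reference, the per-file scan described above could look roughly like the following Python sketch. It is only an illustration under stated assumptions, not a tested solution for the full crawl: the function names (`open_warc_gz`, `iter_warc_records`, `urls_containing`) are hypothetical, the parser handles only the basic WARC record layout (version line, headers, blank line, `Content-Length` bytes of payload), and the match is a plain case-insensitive byte search over the raw HTTP response rather than over decoded HTML.

```python
import gzip
import io


def open_warc_gz(fileobj):
    """Wrap a gzip-compressed WARC file object; Python's gzip module
    reads concatenated gzip members as one continuous stream."""
    return gzip.GzipFile(fileobj=fileobj)


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % line[:40])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # end of the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        body = stream.read(int(headers.get("content-length", "0")))
        yield headers, body


def urls_containing(stream, word):
    """Return target URIs of 'response' records whose raw bytes contain `word`.

    bytes.lower() only folds ASCII letters, so this is an ASCII
    case-insensitive match (an assumption made for simplicity).
    """
    needle = word.lower().encode("utf-8")
    return [
        headers["warc-target-uri"]
        for headers, body in iter_warc_records(stream)
        if headers.get("warc-type") == "response"
        and "warc-target-uri" in headers
        and needle in body.lower()
    ]


if __name__ == "__main__":
    # Tiny hand-built record standing in for a real crawl file.
    http = b"HTTP/1.1 200 OK\r\n\r\n<html>Acme Corp</html>"
    record = (
        b"WARC/1.0\r\nWARC-Type: response\r\n"
        b"WARC-Target-URI: http://example.com/\r\n"
        + b"Content-Length: %d\r\n\r\n" % len(http)
        + http
        + b"\r\n\r\n"
    )
    sample = io.BytesIO(gzip.compress(record))
    print(urls_containing(open_warc_gz(sample), "acme"))
```

To leave the files in S3, the same function could presumably be given a streaming object body (for example, the `Body` returned by boto3's `get_object`) wrapped with `open_warc_gz`, but that part is an untested assumption. Note also that this is a sequential scan of every record, so running it over all ~36K files would need to be distributed across many workers.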

edited Sep 22 '17 at 17:44

Community

asked Jun 23 '15 at 11:45

Vanaja Jayaraman

If you just search the web for warc and Solr, you get at least one answer (e.g. [webarchive-discovery](https://github.com/ukwa/webarchive-discovery)). Have you tried that first? – Alexandre Rafalovitch Jun 25 '15 at 13:45
I will give it a try. Thank you. – Vanaja Jayaraman Jun 26 '15 at 04:32
With https://github.com/ukwa/webarchive-discovery, we can index WARC files stored on our local system, but not ones stored in S3. Is that right? – Vanaja Jayaraman Jul 07 '15 at 05:15

0 Answers