
I want to search for a word (for example, a company name) in all the WARC files from Common Crawl (nearly 36K WARC files) and get all the URLs whose HTML source contains that company name.

I want to keep the WARC files in S3 itself; I only need the URLs from those WARC files as the result.
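To make the task concrete, the per-file scan could be sketched roughly as below. This is only an illustration under stated assumptions, not an existing package: the function names (`iter_warc_records`, `urls_containing`, `scan_gzipped_warc`) are hypothetical, the parser handles only the basic WARC record layout (version line, headers, `Content-Length` bytes of content), and the match is a plain ASCII case-insensitive byte search. Fetching each file from S3 as a stream is left out.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, content) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:40])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", 0))
        yield headers, stream.read(length)


def urls_containing(stream, term):
    """Yield target URIs of response records whose payload contains term."""
    needle = term.lower().encode("utf-8")
    for headers, content in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue  # skip request, metadata, and warcinfo records
        # Drop the HTTP response headers so only the page source is searched.
        _, _, payload = content.partition(b"\r\n\r\n")
        # bytes.lower() only folds ASCII letters; good enough for a sketch.
        if needle in payload.lower():
            yield headers.get("warc-target-uri")


def scan_gzipped_warc(fileobj, term):
    """Scan a .warc.gz file object; gzip reads concatenated members as one stream."""
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as stream:
        yield from urls_containing(stream, term)
```

Only the matching URLs are kept in memory, so the WARC files themselves can stay in S3 as long as each one can be opened as a readable file object and passed to `scan_gzipped_warc`.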

Are there any modules or pre-built packages available for this?

Could I use Solr indexing for this? (It may need more memory, though.)

Thanks in advance.

Vanaja Jayaraman

0 Answers