
I am running a Common Crawl example and extracting URLs and emails from a WARC file. I have just one doubt: I cannot tell whether an email I have extracted belongs to the URL it was extracted alongside or to some other website.
Kindly help me. How can I resolve this confusion?
What I have done is this: starting from the Common Crawl WordCount example, I have set it up to extract the URL and then the email. After extraction, it stores them in a file.

That's it, a simple logic for extraction. But I would like to know: how can I be sure that the URL found and the email found correspond to each other?
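To make the question concrete, here is a minimal sketch of that kind of extraction, written in Python rather than the Java of the WordCount example. It assumes each WARC response record has already been reduced to a `(target_uri, payload_text)` pair, where the URI would come from the record's `WARC-Target-URI` header; the function name `pair_emails_with_url`, the tuple shape, and the simplified email regex are all hypothetical choices for illustration. Emitting the pair from within a single record only shows that the email appeared on that page. The `same_site` flag is a rough heuristic (the email's domain matches the page's host), not proof of ownership.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple email pattern; an assumption, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pair_emails_with_url(records):
    """Yield (url, email, same_site) for every email found in each record.

    `records` is an iterable of (target_uri, payload_text) tuples. This is a
    hypothetical shape: in a real job the URI would be read from the record's
    WARC-Target-URI header and the text from the same record's payload.
    """
    for target_uri, payload in records:
        host = (urlsplit(target_uri).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        # De-duplicate per record so each (url, email) pair is emitted once.
        for email in sorted(set(EMAIL_RE.findall(payload))):
            domain = email.rsplit("@", 1)[1].lower()
            # Heuristic only: the email domain equals the host or a parent of it.
            same_site = domain == host or host.endswith("." + domain)
            yield target_uri, email, same_site
```

Because the URL and the email are taken from the same record in the same loop iteration, the output file can store them on one line, and the third field gives a first hint as to whether the address plausibly belongs to the site itself or was merely mentioned on the page.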

Jaffer Wilson
  • Are you trying to collect emails to spam website owners? – Greg Lindahl Nov 20 '16 at 00:42
  • @GregLindahl No, not at all. I am doing this just to learn how I can extract complex data from a WARC file, because I want to make an analyzer for my business. There are various parameters that I will extract if this goes successfully. So for now I just wanted to have this small query resolved. – Jaffer Wilson Nov 21 '16 at 07:12

0 Answers