Common crawl example having doubts

Asked Nov 18 '16 at 08:58

I am trying to run a Common Crawl example that extracts URLs and email addresses from a WARC file. I have just one doubt: I cannot tell whether an email address I have extracted belongs to the URL it was found with or to some other website.
Kindly help me. How can I resolve this confusion?
What I have done is this: starting from the Common Crawl WordCount example, I have set it up to extract the URL and then the email addresses. After extraction, it stores them in a file.

That's it, a simple extraction logic. But I would like to know: how can I be sure that the URL found and the email found correspond to each other?
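
To make the question concrete, here is a minimal Python sketch of the pairing I have in mind. It is not the WordCount example itself. It assumes each WARC response record has already been reduced to a `(target_uri, page_text)` tuple, with the URL taken from the record's `WARC-Target-URI` header, and the function name `pair_emails_with_url` and the `same_site` flag are my own hypothetical names. Emails found in a record are kept together with that record's URL, and the flag only says whether the email's domain matches the page's host. A mismatch does not prove the address belongs to another site, since a page may simply list a third-party address.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple pattern; it will miss some valid addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pair_emails_with_url(records):
    """Yield (url, email, same_site) for each email found in each record.

    `records` is assumed to be an iterable of (target_uri, page_text)
    tuples, one per WARC response record (hypothetical input shape).
    """
    for target_uri, page_text in records:
        host = (urlsplit(target_uri).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        # De-duplicate within the record; sort for a stable order.
        for email in sorted(set(EMAIL_RE.findall(page_text))):
            domain = email.rsplit("@", 1)[1].lower()
            # True when the email domain equals the host or is a parent of it.
            same_site = domain == host or host.endswith("." + domain)
            yield target_uri, email, same_site
```

Because the URL and the emails come from the same record, every output row is tied to the page the address appeared on, even when the address itself belongs to a different domain.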

asked Nov 18 '16 at 08:58

Jaffer Wilson


Are you trying to collect emails to spam website owners? – Greg Lindahl Nov 20 '16 at 00:42
@GregLindahl No, not at all. I am doing this just to learn how I can extract complex data from a WARC file, because I want to build an analyzer for my business. There are various parameters that I will extract if this works. So for now I just wanted to have this small query resolved. – Jaffer Wilson Nov 21 '16 at 07:12
