I am trying to run a Common Crawl example that extracts URLs and email addresses from a WARC file. I have just one doubt: how do I know whether an email address I have extracted belongs to the URL it was extracted with, or to some other website? This is the part that confuses me.
Kindly help me. How can I resolve this confusion?
What I have done is this:
Using the Common Crawl WordCount example, I have set it up to extract the URL and then the email addresses. After extraction, it stores them in a file.
That's it, a simple extraction logic. But I would like to know how I can be sure that the URL found and the email address found correspond to each other.
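
To make the pairing concrete, here is a minimal sketch in Python (not the actual Hadoop job). It assumes that each WARC response record carries its own WARC-Target-URI header, so any email found in that record's payload was seen on that page. The names emails_for_record, same_site and SAMPLE_RECORD are hypothetical, the email regex is deliberately simple, and the same-site check is only a rough heuristic comparing the email domain with the page host.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple pattern; real-world email matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def parse_warc_headers(header_block):
    """Turn the WARC header lines of one record into a dict."""
    headers = {}
    for line in header_block.splitlines()[1:]:  # skip the "WARC/1.0" version line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


def same_site(url, email):
    """Heuristic: does the email's domain match the page's host (or a parent of it)?"""
    host = (urlsplit(url).hostname or "").lower()
    domain = email.rsplit("@", 1)[1].lower()
    return host == domain or host.endswith("." + domain)


def emails_for_record(record):
    """Return (url, email, same_site) tuples for a single WARC record.

    Both the URL and the emails come from the same record, so an email
    can never be attached to the URL of a different record.
    """
    header_block, _, payload = record.partition("\r\n\r\n")
    headers = parse_warc_headers(header_block)
    url = headers.get("WARC-Target-URI")
    if headers.get("WARC-Type") != "response" or url is None:
        return []
    emails = sorted(set(EMAIL_RE.findall(payload)))
    return [(url, email, same_site(url, email)) for email in emails]


# Hypothetical record, made up for illustration only.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://example.com/contact\r\n"
    "Content-Type: application/http; msgtype=response\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Mail info@example.com or sales@example.org</body></html>"
)

if __name__ == "__main__":
    for row in emails_for_record(SAMPLE_RECORD):
        print(row)
```

In this sketch, both addresses are paired with http://example.com/contact because they appear in that page's record, but only info@example.com is flagged as matching the page's host. A page can list addresses from other domains, so "found on this URL" and "belongs to this website" are two separate questions.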