I'm trying to use a regex in Scrapy to find all email addresses on a page.
I'm using this code:
item["email"] = re.findall(r'[\w\.-]+@[\w\.-]+', response.body)
This works almost perfectly: it grabs all the emails and gives them to me. However, what I want is for it not to give me any repeats before it actually parses, even if the same email address appears more than once.
I'm getting responses like this (which is correct):
{'email': ['billy666@stanford.edu',
'cantorfamilies@stanford.edu',
'cantorfamilies@stanford.edu',
'cantorfamilies@stanford.edu',
'footer-stanford-logo@2x.png']}
However, I only want to show the unique addresses, which would be:
{'email': ['billy666@stanford.edu',
'cantorfamilies@stanford.edu',
'footer-stanford-logo@2x.png']}
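To make the goal concrete, here is a minimal sketch of the behavior I'm after, assuming the page text is already available as a string and that first-seen order should be kept. The names EMAIL_RE and unique_emails are made up for illustration and are not part of Scrapy.

```python
import re

# Same pattern as in the question, written as a raw string.
EMAIL_RE = re.compile(r'[\w\.-]+@[\w\.-]+')

def unique_emails(text):
    """Return every regex match once, in order of first appearance."""
    matches = EMAIL_RE.findall(text)
    # dict.fromkeys keeps only the first occurrence of each key and,
    # on Python 3.7+, preserves insertion order.
    return list(dict.fromkeys(matches))
```

If order doesn't matter, list(set(matches)) would also remove the repeats, but the resulting order is arbitrary.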
If you want to throw in how to collect only the emails and not that
'footer-stanford-logo@2x.png'
that would be helpful too.
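One possible way to express that filter, as a rough sketch only: drop any match that ends in an image file extension. The extension list and the helper name drop_asset_names are my own assumptions, so it may miss other kinds of false positives.

```python
# Assumed list of file extensions that mark a match as an asset name
# rather than an email address; extend as needed.
ASSET_EXTENSIONS = ('.png', '.jpg', '.jpeg', '.gif', '.svg', '.webp')

def drop_asset_names(matches):
    """Keep only matches that do not end in a known asset extension."""
    # str.endswith accepts a tuple of suffixes.
    return [m for m in matches if not m.lower().endswith(ASSET_EXTENSIONS)]
```

This would be applied to the list that re.findall returns, before it is assigned to item["email"].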
Thanks everyone!