Erase duplicate emails

Question

I'm trying to use regex in scrapy to find all email addresses on a page.

I'm using this code:

    item["email"] = re.findall('[\w\.-]+@[\w\.-]+', response.body)

This works almost perfectly: it grabs all the emails on the page and gives them to me. However, I don't want any repeats in the result, even when the same email address appears more than once on the page.

I'm getting responses like this (which is correct):

{'email': ['billy666@stanford.edu',
           'cantorfamilies@stanford.edu',
           'cantorfamilies@stanford.edu',
           'cantorfamilies@stanford.edu',
           'footer-stanford-logo@2x.png']}

However, I want to show only the unique addresses, which would be:

{'email': ['billy666@stanford.edu',
           'cantorfamilies@stanford.edu',
           'footer-stanford-logo@2x.png']}

If you can also show how to collect only the email addresses, and not entries like

'footer-stanford-logo@2x.png'

that would be helpful too.

Thanks everyone!

Why are you using a regex to parse the response? It seems better suited to an XPath or CSS selector. Parsing HTML with a regex is usually a bad idea. – Padraic Cunningham Apr 16 '16 at 01:01
Because this is being used in a broad crawler, where the data would be stored in different places. So no, an XPath wouldn't work. – Max Uland Apr 17 '16 at 04:36

score 2 · Answer 1 · answered Apr 15 '16 at 23:38

item["email"] = set(re.findall(r'[\w\.-]+@[\w\.-]+', response.body))

Joran Beasley


Extra brownie points for ignoring that `'footer-stanford-logo@2x.png'`. :) +1 though – idjaw Apr 15 '16 at 23:39

No need to escape a `.` inside a character class. And this really does not help ignore those PNGs. If this one or Thomas' is accepted, the question would be a dupe of [Returning unique matches using regex in python](http://stackoverflow.com/questions/32083145/returning-unique-matches-using-regex-in-python). @idjaw: check my answer where I suggest a way to ignore PNGs. – Wiktor Stribiżew Apr 15 '16 at 23:45
Thanks Wiktor, and if it is a dupe I'm very sorry. I don't fully understand regex, so if it was already answered, I apologize; I must not have understood it. – Max Uland Apr 15 '16 at 23:56
Also, I'm not exactly sure about the `.` part. I got this "section" of the code from someone on SO, so if it's incorrect, thanks for letting me know! – Max Uland Apr 16 '16 at 00:02

Wiktor Stribiżew · Accepted Answer · 2016-04-16T00:03:58.233

2

Here is how you can get rid of the dupes and 'footer-stanford-logo@2x.png'-like thingies in your output:

import re
p = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')
test_str = "{'email': ['billy666@stanford.edu',\n           'cantorfamilies@stanford.edu',\n           'cantorfamilies@stanford.edu',\n           'cantorfamilies@stanford.edu',\n           'footer-stanford-logo@2x.png']}"
print(set(p.findall(test_str)))

See the Python demo

The regex will look like

[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       ^^

See demo

The negative lookahead (?![\w.-]*\.(?:png|jpe?g|gif)\b) will disallow all matches with png, jpg, etc. extensions at the end of the word (\b is a word boundary, and in this case, it is a trailing word boundary).
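
To see the lookahead in isolation, here is a small sketch that runs the same pattern against a few strings. Only the first two come from the question; the other samples and the names `EMAIL_RE`, `samples` and `results` are made up for illustration.

```python
import re

# Same pattern as above: the lookahead rejects anything after "@" that
# contains an image extension followed by a word boundary.
EMAIL_RE = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')

samples = [
    'billy666@stanford.edu',        # ordinary address: accepted
    'footer-stanford-logo@2x.png',  # retina image name: rejected
    'photo@3x.jpeg',                # hypothetical sample: rejected
    'someone@png.example.org',      # hypothetical: "png" is not preceded by a dot, so accepted
    'user@mail.png.example.com',    # hypothetical: ".png" mid-domain is rejected as well
]

# True where the whole string is accepted as an email by the pattern.
results = {s: bool(EMAIL_RE.fullmatch(s)) for s in samples}

for sample, accepted in results.items():
    print(sample, accepted)
```

Note the last sample: because `\b` also matches between `png` and a following dot, a domain with `.png` in the middle is filtered out too. That is probably acceptable for this use case, but it is a limitation of the pattern.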

Dupes can easily be removed with a set - it is the least troublesome part here.

FINAL SOLUTION:

item["email"] = set(re.findall(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))
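
One caveat if you run this on Python 3: there `response.body` is `bytes`, and `re` raises a `TypeError` when a `str` pattern is applied to `bytes`. Scrapy text responses expose the decoded page as `response.text`, so a rough sketch could look like the following. The helper name `extract_emails` and the sample `page` string are assumptions for illustration, not part of the original code.

```python
import re

EMAIL_RE = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')


def extract_emails(text):
    """Return the unique email-like matches in `text`, sorted for stable output."""
    return sorted(set(EMAIL_RE.findall(text)))


# Inside a spider callback this might be used as (assuming a text response):
#     item["email"] = extract_emails(response.text)

# Made-up page fragment reusing the addresses from the question.
page = (
    "Contact billy666@stanford.edu or cantorfamilies@stanford.edu, "
    "cantorfamilies@stanford.edu <img src='footer-stanford-logo@2x.png'>"
)
emails = extract_emails(page)
print(emails)
```

Sorting is optional; it only makes the output deterministic, since a plain `set` has no defined order.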

edited Apr 16 '16 at 00:03

answered Apr 15 '16 at 23:40

Wiktor Stribiżew


Nice touch with the `(?:png|jpe?g|gif)` – idjaw Apr 15 '16 at 23:48
Not sure why, but when I use this code it doesn't give me any emails, while it works with just item["email"] = set(re.findall('[\w\.-]+@[\w\.-]+', response.body)), which deletes duplicates. I'm very interested to know why nothing shows up in my results, since I followed that demo page (AWESOME BTW) and it worked as expected :/ – Max Uland Apr 15 '16 at 23:59
Sorry, I have added the `r` prefix to mark the string as a raw string literal. Now, `\b` is treated as a word boundary, not as a backspace character. Use `item["email"] = set(re.findall(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))` – Wiktor Stribiżew Apr 16 '16 at 00:04
Got it! Thanks man!!!!! Good to know how that r affected it. Thank you for also explaining it. – Max Uland Apr 16 '16 at 00:07
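
For anyone who hits the same problem, the effect of the `r` prefix can be checked with a short sketch like this one. The variable names and the sample sentence are made up for illustration.

```python
import re

plain = '\b'   # a single backspace character (U+0008)
raw = r'\b'    # two characters: a backslash and "b", which re reads as a word boundary

text = 'mail me at billy666@stanford.edu today'

# Raw string: the trailing \b is a word boundary, so the address is found.
with_raw = re.findall(r'[\w.-]+@[\w.-]+\b', text)

# Non-raw string: the trailing \b becomes a literal backspace, which the
# text does not contain, so nothing matches. (\w is written as \\w here
# to keep the comparison focused on \b.)
without_raw = re.findall('[\\w.-]+@[\\w.-]+\b', text)

print(len(plain), len(raw))
print(with_raw, without_raw)
```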

score 1 · Answer 3 · answered Apr 15 '16 at 23:38

Can't you just use a set instead of a list?

item["email"] = set(re.findall(r'[\w\.-]+@[\w\.-]+', response.body))

And if you really want a list then:

item["email"] = list(set(re.findall(r'[\w\.-]+@[\w\.-]+', response.body)))
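
Keep in mind that a `set` does not preserve the order in which the addresses were found. If the order matters, one alternative (not part of the answer above, just a sketch assuming Python 3.7+, where dicts keep insertion order) is `dict.fromkeys`:

```python
# Matches as re.findall returned them in the question, duplicates included.
matches = [
    'billy666@stanford.edu',
    'cantorfamilies@stanford.edu',
    'cantorfamilies@stanford.edu',
    'cantorfamilies@stanford.edu',
    'footer-stanford-logo@2x.png',
]

# dict keys are unique and keep insertion order, so this drops repeats
# while keeping the first occurrence of each address in place.
unique_in_order = list(dict.fromkeys(matches))
print(unique_in_order)
```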

Thomas Reynaud
