2

I'm trying to use a regex in Scrapy to find all email addresses on a page.

I'm using this code:

    item["email"] = re.findall('[\w\.-]+@[\w\.-]+', response.body)

This works almost perfectly: it grabs all the emails and gives them to me. However, I don't want repeats in the result, even if the same email address appears more than once on the page.

I'm getting responses like this (which is correct):

{'email': ['billy666@stanford.edu',
           'cantorfamilies@stanford.edu',
           'cantorfamilies@stanford.edu',
           'cantorfamilies@stanford.edu',
           'footer-stanford-logo@2x.png']}

However, I only want to show the unique addresses, which would be:

{'email': ['billy666@stanford.edu',
           'cantorfamilies@stanford.edu',
           'footer-stanford-logo@2x.png']}

If you also want to throw in how to collect only the emails and skip things like

'footer-stanford-logo@2x.png'

that would be helpful too.

Thanks everyone!

Peter David Carter
Max Uland
  • Why are you using a regex to parse the response? Seems like it might be better suited to an xpath or css selector. Parsing HTML with a regex is usually a bad idea – Padraic Cunningham Apr 16 '16 at 01:01
  • Because this is being used in a broad crawler, where the data would be stored in different places on each site. So no, an xpath wouldn't work. – Max Uland Apr 17 '16 at 04:36

3 Answers

2
item["email"] = set(re.findall('[\w\.-]+@[\w\.-]+', response.body))
Joran Beasley
  • extra brownie points to ignore that `'footer-stanford-logo@2x.png'`. :) +1 though – idjaw Apr 15 '16 at 23:39
  • No need to escape a `.` inside a character class. And it really does not help ignore those PNGs. If this one or Thomas' is accepted, the question would be a dupe of [Returning unique matches using regex in python](http://stackoverflow.com/questions/32083145/returning-unique-matches-using-regex-in-python). @idjaw: check my answer where I suggest a way to ignore PNGs. – Wiktor Stribiżew Apr 15 '16 at 23:45
  • Thanks Wiktor, and if it is a dupe I'm very sorry. I don't fully understand regex, so if it was already answered I apologize; I must not have understood it. – Max Uland Apr 15 '16 at 23:56
  • Also, as for the `.` part, I got this "section" of the code from someone on SO, so if it's incorrect then thanks for letting me know! – Max Uland Apr 16 '16 at 00:02
2

Here is how you can get rid of the dupes and 'footer-stanford-logo@2x.png'-like thingies in your output:

import re
p = re.compile(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b')
test_str = "{'email': ['billy666@stanford.edu',\n           'cantorfamilies@stanford.edu',\n           'cantorfamilies@stanford.edu',\n           'cantorfamilies@stanford.edu',\n           'footer-stanford-logo@2x.png']}"
print(set(p.findall(test_str)))

See the Python demo

The regex will look like

[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^       ^^

See demo

The negative lookahead (?![\w.-]*\.(?:png|jpe?g|gif)\b) will disallow all matches with png, jpg, etc. extensions at the end of the word (\b is a word boundary, and in this case, it is a trailing word boundary).

Dupes can easily be removed with a set - it is the least troublesome part here.

FINAL SOLUTION:

item["email"] = set(re.findall(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))
Wiktor Stribiżew
  • Nice touch with the `(?:png|jpe?g|gif)` – idjaw Apr 15 '16 at 23:48
  • Not sure why, but when I use this code it doesn't give any emails, yet it works with just item["email"] = set(re.findall('[\w\.-]+@[\w\.-]+', response.body)), which deletes duplicates. I'm very interested to know why it doesn't show anything in my results, since I followed that demo page (AWESOME BTW) and it worked as expected :/ – Max Uland Apr 15 '16 at 23:59
  • Sorry, I have added the `r` prefix to mark the string as a raw string literal. Now, `\b` is treated as a word boundary, not as a backspace character. Use `item["email"] = set(re.findall(r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b', response.body))` – Wiktor Stribiżew Apr 16 '16 at 00:04
  • Got it! Thanks man!!!!! Good to know how that r affected it. Thank you for also explaining it. – Max Uland Apr 16 '16 at 00:07
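The effect of the `r` prefix discussed in the comments above can be checked with a small sketch. The sample string and the names `plain`, `raw`, `plain_matches` and `raw_matches` are made up for illustration; `plain` doubles every backslash except the ones in `\b`, which mimics what the regex engine received from the non-raw version of the pattern.

```python
import re

# Hypothetical sample text, not taken from a real page.
text = "contact: billy666@stanford.edu, logo: footer-stanford-logo@2x.png"

# In a normal string literal Python converts \b to a backspace character
# (\x08) before the regex engine ever sees the pattern.
plain = '[\\w.-]+@(?![\\w.-]*\\.(?:png|jpe?g|gif)\b)[\\w.-]+\b'

# In a raw string literal \b stays as backslash + b, which the regex
# engine reads as a word boundary.
raw = r'[\w.-]+@(?![\w.-]*\.(?:png|jpe?g|gif)\b)[\w.-]+\b'

print(len('\b'), len(r'\b'))        # 1 2

# The plain pattern demands a literal backspace after each address,
# so nothing in ordinary text matches.
plain_matches = re.findall(plain, text)
print(plain_matches)                # []

raw_matches = re.findall(raw, text)
print(raw_matches)                  # ['billy666@stanford.edu']
```

This is consistent with the symptom reported above: the non-raw pattern silently returns no emails instead of raising an error.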
1

Can't you just use a set instead of a list?

item["email"] = set(re.findall('[\w\.-]+@[\w\.-]+', response.body))

And if you really want a list then:

item["email"] = list(set(re.findall('[\w\.-]+@[\w\.-]+', response.body)))
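Note that a set has no defined order, so `list(set(...))` may return the addresses in a different order than they appear on the page. If the order matters, one possible sketch is to dedupe through dictionary keys instead. It assumes Python 3.7 or later, where dicts keep insertion order; `EMAIL_RE`, `unique_in_order` and `sample` are hypothetical names used only for illustration.

```python
import re

# Same pattern as above, written as a raw string without the
# unnecessary escape of "." inside the character class.
EMAIL_RE = re.compile(r'[\w.-]+@[\w.-]+')


def unique_in_order(matches):
    """Drop repeats while keeping the first occurrence of each item."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(matches))


# Hypothetical stand-in for the page text.
sample = (
    "billy666@stanford.edu cantorfamilies@stanford.edu "
    "cantorfamilies@stanford.edu footer-stanford-logo@2x.png"
)

emails = unique_in_order(EMAIL_RE.findall(sample))
print(emails)
# ['billy666@stanford.edu', 'cantorfamilies@stanford.edu', 'footer-stanford-logo@2x.png']
```

In the spider this would become something like `item["email"] = unique_in_order(EMAIL_RE.findall(response.body))`. On Python 3, `response.body` is bytes, so a str pattern like this one would need the decoded `response.text` instead.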
Thomas Reynaud