I'm new to Regex and currently writing a Scrapy crawler to collect e-mail addresses.

I want to match e-mail addresses written in several different formats when I crawl. Right now I just match anything containing an @ sign, but I want to be a bit smarter.

How do I select e-mails with the following formats?

  • info@example.com
  • info [at] example [dot] com
  • info at example.com info
  • info at example dot com

Here is what I currently have:

item['mail'] = hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+')
wint_3r
  • It's hard to write a correct regex for email addresses. See [Using a regular expression to validate an email address](http://stackoverflow.com/q/201323/1281433). Matching even more formats, like you're asking for, will be even harder. Since people usually use formats like your last three to **avoid** scrapers (though it's not particularly effective), you may meet some resistance to this question. – Joshua Taylor Feb 23 '15 at 22:10
  • I know, that's why I need help with it. I'm very new to this field, and my research and attempts haven't worked so far. That's why I need someone who is experienced at this. – wint_3r Feb 23 '15 at 22:12
  • Regarding the link above - I'm not trying to validate the e-mail. I never said that, I'm just trying to find a pattern that matches those above on the page and collect them. – wint_3r Feb 23 '15 at 22:14
  • Yes, but my point was that it's a very hard task to write a regex that matches all emails, and you're asking for something even more powerful than that. – Joshua Taylor Feb 23 '15 at 22:20
  • Makes sense, read more deeply into it - and it does seem like a complex issue. – wint_3r Feb 23 '15 at 22:30
  • It seems unethical to help you scrape email addresses from people who are trying to keep them private. – Jeremy Stein Feb 23 '15 at 22:52
  • That really depends on how you use it. I'm gathering it for my own purposes in a strategic manner. The reason they do that is so people don't spam them - not for people not to contact them at all. – wint_3r Feb 24 '15 at 00:49

1 Answer

This is the best I can come up with, but I really don't know if it is going to work for you unless you provide more examples.

With the current examples in your question, it works. If you don't care about email addresses that are more complex than that, then this should be fine for you.

[\w.-]+ ?(?:@|\[?at]?) ?[\w.-]+(?: ?\[?dot]? ?[\w.-]+)?

So what did I do here?
I put an alternation where the @ symbol goes, so that the pattern also accepts [at] or at. The surrounding spaces and the brackets are each made optional with the ? quantifier (zero or one occurrence):

 ?(?:@|\[?at]?) ?
^              ^
   optional spaces

I did something similar towards the end of the expression for [dot], but I made the entire non-capturing group optional, since otherwise the pattern would fail on the examples that use a literal dot (such as info@example.com).

https://regex101.com/r/aC4kW3/1
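
To check the pattern outside regex101, here is a minimal Python sketch that runs it over the four example strings from the question. The names PATTERN, samples and results are made up for this illustration; only the expression itself comes from the answer above.

```python
import re

# The expression from the answer, copied unchanged.
PATTERN = re.compile(
    r'[\w.-]+ ?(?:@|\[?at]?) ?[\w.-]+(?: ?\[?dot]? ?[\w.-]+)?'
)

# The four example strings from the question.
samples = [
    "info@example.com",
    "info [at] example [dot] com",
    "info at example.com info",
    "info at example dot com",
]

# The pattern has no capturing groups, so findall() returns whole matches.
results = {text: PATTERN.findall(text) for text in samples}

for text, found in results.items():
    print(repr(text), "->", found)
```

Each sample yields exactly one match. In the third one the match stops at example.com and leaves out the trailing word.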

Vasili Syrakis
  • Thanks so much! I was close, I didn't use the ? in the places where you have them. Lesson learned. – wint_3r Feb 24 '15 at 00:50
  • It seems to work in the program you linked but not in practice. It was simply selecting words with at or those with spaces. – wint_3r Feb 24 '15 at 03:04
  • It all depends on your data, your options, what language you are using, etc... I'd need more details – Vasili Syrakis Feb 24 '15 at 05:26
  • Well, they are basic pages with contact information in them, for example http://decorchick.com/advertise/ or http://lifeinsketch.com/advertise/ – wint_3r Feb 24 '15 at 13:19
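
The behaviour described in the comments (ordinary words containing "at" being selected) can be reproduced with the answer's expression: because every space and bracket around at is optional, any word with "at" in the middle qualifies. The sketch below shows this, together with one possible stricter variant. STRICTER is a hypothetical pattern written for this illustration, not something from the answer, and the test strings are made up rather than taken from the linked pages.

```python
import re

# The answer's expression, unchanged.
ORIGINAL = re.compile(
    r'[\w.-]+ ?(?:@|\[?at]?) ?[\w.-]+(?: ?\[?dot]? ?[\w.-]+)?'
)

# Hypothetical stricter variant (not from the answer): a bare "at" or "dot"
# must have a space on each side, and the domain needs at least two labels.
STRICTER = re.compile(
    r'[\w.-]+'                               # local part
    r'(?:@| ?\[at\] ?| at )'                 # @, [at], or " at "
    r'[\w-]+'                                # first domain label
    r'(?:(?:\.| ?\[dot\] ?| dot )[\w-]+)+'   # one or more further labels
)

# Made-up text, not taken from the linked pages.
noise = "category"
sentence = "Contact us at home"
address = "info at example dot com"

print(ORIGINAL.findall(noise))     # ['category']
print(STRICTER.findall(noise))     # []
print(ORIGINAL.findall(sentence))  # ['us at home']
print(STRICTER.findall(sentence))  # []
print(STRICTER.findall(address))   # ['info at example dot com']
```

Even the stricter variant is only a heuristic: a phrase such as "look at example.com" would still match, since a regex alone cannot tell that kind of prose from an obfuscated address.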