
I use the following code to find an e-mail address on a downloaded page:

page = urlfetch.Fetch(url = 'http://www.toyotabc.ru/vacancy/', deadline = 60)
if page.status_code == 200 and page.content:
    regexp = re.compile(
        r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"
        r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"'
        r')@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$', re.IGNORECASE)
    email = regexp.findall(page.content)
    if email:
        email = email[0]
        self.response.out.write('e-mail found: %s<br>' % (email))

But it returns nothing (`email` is an empty list), even though an e-mail address exists on the sample page given in the code. What is wrong with my code?

LA_
  • The expression `r'\w+@\w+\.\w{2,6}'` seems to work for me - will that work in your case? – RocketDonkey Jan 07 '13 at 07:33
  • @RocketDonkey, yes, it works. But it doesn't capture e-mails with dot - [example](http://pythonre.appspot.com/?pattern=\w%2B%40\w%2B\.\w{2%2C6}&string=test.email%40example.com&function=findall&flags=IGNORECASE). – LA_ Jan 07 '13 at 07:46
  • @LA_: It's simple to modify RocketDonkey's expression to use the right characters instead of `\w`. A lot easier than debugging that huge mess of a regexp you're starting with. Where did you get that from, and why all that stuff with control characters, etc.? – abarnert Jan 07 '13 at 07:49
  • Good point - I vote for @abarnert's answer anyway as it is very readable and should capture your targets. – RocketDonkey Jan 07 '13 at 08:03

1 Answer


I'm not sure why you've started with an expression full of control characters and other stuff, or even what that expression is supposed to mean. Maybe if you told us where you got it, or explained it, we could help you debug it. But otherwise, it's much easier to throw it away and give you a simpler one.

You say you took it from this answer, but the string in that answer is 29 characters longer than the one you gave, so apparently you copy-pasted it wrong, or modified it after the fact in some way. At any rate, according to the question, that regexp is intended to validate email addresses against a domain, not to find all email addresses. It also seems to handle quoted (maybe even encoded?) names. The fact that it starts with ^ and ends with $ is a clear sign that it can't be used to find addresses in the middle of a string, but only to match the entire string. So, it's not what you want. You can't just pick up a regexp from one problem and hope it works for a vaguely-related problem without understanding what it's doing.
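The effect of the anchors is easy to check in isolation. Here is a minimal sketch, assuming a made-up sample sentence and using the short expression from RocketDonkey's comment rather than the Django one, so only the `^` and `$` differ between the two patterns:

```python
import re

# Hypothetical sample text standing in for the downloaded page.
text = "Contact us at hr@example.com or call the office."

# Same expression twice; the only difference is the ^ and $ anchors.
anchored = re.compile(r'^\w+@\w+\.\w{2,6}$')
unanchored = re.compile(r'\w+@\w+\.\w{2,6}')

# The anchored pattern must match the whole string, so it finds nothing here.
anchored_hits = anchored.findall(text)

# Without anchors, findall can pick the address out of the surrounding text.
unanchored_hits = unanchored.findall(text)

# The anchored pattern only succeeds when the string is just an address.
anchored_whole = anchored.findall("hr@example.com")

print(anchored_hits, unanchored_hits, anchored_whole)
```

That is the difference between validating one string and searching inside a larger one.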

You complained that RocketDonkey's expression doesn't work for addresses with dots in them. That's true, and it also doesn't handle a few other characters that are valid in an address. You could go read the appropriate RFCs, but it's a lot faster to do a quick search online for pre-made regular expressions for email addresses.
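To see the dot problem concretely, here is a small sketch using the expression from the comments; the sample sentence is invented for illustration:

```python
import re

# RocketDonkey's expression: \w does not include '.', so the local part
# cannot span a dot.
simple = re.compile(r'\w+@\w+\.\w{2,6}')

# Hypothetical input containing a dotted local part.
hits = simple.findall("write to test.email@example.com today")

# The match starts after the dot, so "test." is lost.
print(hits)
```

The address is found, but only from `email@` onward, which is the truncation you ran into.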

You may want to see this question, which includes a link to a fully RFC-822-compliant regexp, and explains how to get an RFC-5322-compliant one if you need to.

But depending on your uses, you may want something simpler, which can be tweaked to match not-valid-but-working addresses, or not match valid-but-useless addresses, or match native-Unicode instead of IDN-mangled Unicode, or…

Here's the first one I found in a Google search:

regexp = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}', re.IGNORECASE)

Is it correct? At a glance, it looks like it should handle all and only valid email addresses that use DNS names, but that's not all valid addresses. Maybe you need to handle dotted-IP mail domains, or pre-Internet email addresses, or you want to be looser in some ways or stricter in others, or whatever. If so, you'd have to explain what exactly you want. But you should be able to go from here yourself: Try it on your test cases and see. If it isn't right, it's very simple to read, and should be easy to modify.
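Trying it on test cases could look like the sketch below. The HTML snippet and the addresses in it are made up; in your handler you would pass `page.content` instead:

```python
import re

regexp = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}', re.IGNORECASE)

# Hypothetical page fragment: one address appears twice (href and link text),
# another uses a subdomain.
html = ('<p>HR: <a href="mailto:test.email@example.com">'
        'test.email@example.com</a>, sales@mail.example.org</p>')

# findall returns every match in order, duplicates included.
found = regexp.findall(html)

# Deduplicate, since the same address often shows up in both href and text.
unique = sorted(set(found))

print(found)
print(unique)
```

One thing to watch when you adapt it: the `{2,6}` bound caps the last label at six letters, so an address under a longer top-level domain would come back cut short rather than in full. Widen the bound if your test cases need that.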

abarnert
  • I took that regex from this answer: http://stackoverflow.com/a/2640791/604388. I've seen somewhere that it comes from the Django source code. – LA_ Jan 07 '13 at 07:55
  • @LA_: Yes, according to the answer you linked, it's taken from the Django source code. But it's not used for extracting all addresses from an HTML page, it's used for validating a single address against a specific domain. That's not what you want to do, so it's not going to do any good. Even if you did copy it right, and remove the `^` and `$`. – abarnert Jan 07 '13 at 08:04