Solution: [a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)* is a good choice.

I am using a regular expression like the below to match email addresses in a file:

email = re.search(r'(\w+-*[.|\w]*)*@(\w+[.])*\w+', line)

When used on a file like the following, my regular expression works well:

mlk407289715@163.com    huofenggib  wrong in get_gsid
mmmmmmmmmm776@163.com   rouni816161 wrong in get_gsid

But when I use it on a file like below, my regular expression runs unacceptably slowly:

9b871484d3af90c89f375e3f3fb47c41e9ff22  mingyouv9gueishao@163.com
e9b845f2fd3b49d4de775cb87bcf29cc40b72529e   mlb331055662@163.com

And when I use the regular expression from this website, it still runs very slowly.

I need a solution and want to know what's wrong.

kuafu

3 Answers


That's a problem with backtracking. Read this article for more information.

You might want to split the line and work with the part containing an @:

import re

pattern = r'(\w+-*[.|\w]*)*@(\w+[.])*\w+'
line = '9b871484d3af90c89f375e3f3fb47c41e9ff22  mingyouv9gueishao@163.com'
for element in line.split():
    # Only run the regex on tokens that can contain an address.
    if '@' in element:
        g = re.match(pattern, element)
        if g:
            print(g.groups())
Matthias

Generally, when regular expressions are slow, it is due to catastrophic backtracking. This can happen in your regex because of the nested repetition in the following section:

(\w+-*[.|\w]*)*

If you can work on this section of the regex to remove the repetition from within the parentheses you should see a substantial speed increase.
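
As a rough sketch of what removing the nested repetition could look like: when no @ follows a long run of word characters, the engine can split that run among iterations of the outer group in a huge number of ways before giving up. Replacing the repeated group with a single repeated character class leaves only one way to consume the run. The pattern and the name find_email below are my own illustration, not a drop-in equivalent; the class accepts roughly the same characters as the original group but not exactly the same strings.

```python
import re

# Hypothetical flattened pattern (an assumption, not the asker's regex):
# one character class repeated once, instead of a repeated group that
# itself contains repetitions.
FLAT_PATTERN = re.compile(r'\w[\w.|-]*@(?:\w+\.)*\w+')


def find_email(line):
    """Return the first email-like substring in line, or None."""
    match = FLAT_PATTERN.search(line)
    return match.group(0) if match else None


line = '9b871484d3af90c89f375e3f3fb47c41e9ff22  mingyouv9gueishao@163.com'
print(find_email(line))
```

With this shape, a failed attempt on the long hex string only backs off one character at a time, so the cost grows polynomially with line length rather than exponentially.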

However, you are probably better off just searching for an email regex and seeing how other people have approached this problem.

Andrew Clark

It's always a good idea to search Stack Overflow to see if your question has already been discussed.

Using a regular expression to validate an email address

This one, from that discussion, looks like a good one to me:

[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*
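
A minimal sketch of applying that pattern to the sample lines from the question, assuming the lines are already in a list (the names EMAIL_RE, extract_emails and sample_lines are made up for illustration). The pattern is written as a raw string in double quotes because it contains a single quote. One thing to be aware of: inside the first character class, +-/ is read as a range, which also admits a comma, so the pattern is slightly more permissive than it looks.

```python
import re

# The pattern quoted above, compiled once so it can be reused per line.
EMAIL_RE = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+-/=?\^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*"
)


def extract_emails(lines):
    """Return the first email-like match from each line that has one."""
    found = []
    for line in lines:
        match = EMAIL_RE.search(line)
        if match:
            found.append(match.group(0))
    return found


# Sample data copied from the question.
sample_lines = [
    'mlk407289715@163.com    huofenggib  wrong in get_gsid',
    '9b871484d3af90c89f375e3f3fb47c41e9ff22  mingyouv9gueishao@163.com',
]
print(extract_emails(sample_lines))
```

Because this pattern has no repetition nested inside another repetition, a long token without an @ fails quickly instead of triggering catastrophic backtracking.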
steveha