I am trying to find all the email addresses in a given string.

Here's the regex:

r"([a-zA-Z0-9_.+-]+(?:\@|\[(?:(?:at|AT|@))\])+[a-zA-Z0-9-]+(?:\.|\[(?:[dtoDTO0\.])+\])[a-zA-Z0-9]+)"

The string is very long (~2L), and finding all the matching emails takes too long. I would like to apply a timeout if the regex takes too long. Any suggestions?

Thomas Ayoub
karthick
  • One option might be to use a better regex engine like re2. For timeouts, the best approach would probably be to spawn a process for the regex and then kill that process if too much time passes. –  May 01 '17 at 12:58
  • Could you share your input? – Thomas Ayoub May 01 '17 at 13:00
  • Instead of `findall` you can use `finditer`. What is the size of your input? – Laurent LAPORTE May 01 '17 at 13:10
  • @Chris: The accepted answer to that is less sophisticated than the OP's current solution, so I don't think it's a good dupe target. – zondo May 01 '17 at 13:36
  • How about using a simpler regex pattern (maybe `emails_re = re.compile(r'\b[a-zA-Z0-9_\.]+@[a-zA-Z0-9.]+\b')`) to first narrow down your text to only few lines and then use your regex to filter out the correct emails. Also try it using `finditer` instead of `findall`. – pratpor May 01 '17 at 15:15
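The suggestion above to run the regex in a separate process and kill it after a deadline can be sketched roughly as follows. This is a minimal sketch, not a tested solution for the input in the question: the helper name `find_emails_with_timeout`, the `_WORKER` script, and the choice of JSON over stdin/stdout for passing data are all assumptions made for illustration. It relies on `subprocess.run(..., timeout=...)`, which kills the child and raises `subprocess.TimeoutExpired` when the deadline passes.

```python
import json
import os
import subprocess
import sys

# The pattern from the question, unchanged.
EMAIL_PATTERN = (
    r"([a-zA-Z0-9_.+-]+(?:\@|\[(?:(?:at|AT|@))\])+[a-zA-Z0-9-]+"
    r"(?:\.|\[(?:[dtoDTO0\.])+\])[a-zA-Z0-9]+)"
)

# Hypothetical child program: takes the pattern from argv, reads the text
# from stdin, and prints the list of matches as JSON.
_WORKER = (
    "import json, re, sys\n"
    "text = sys.stdin.read()\n"
    "print(json.dumps(re.findall(sys.argv[1], text)))\n"
)


def find_emails_with_timeout(text, timeout_seconds, pattern=EMAIL_PATTERN):
    """Return the list of matches, or None if the search ran out of time."""
    # Force UTF-8 on both ends so the parent and child agree on the encoding.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    try:
        done = subprocess.run(
            [sys.executable, "-c", _WORKER, pattern],
            input=text,
            capture_output=True,
            encoding="utf-8",
            env=env,
            timeout=timeout_seconds,  # child is killed when this expires
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
    return json.loads(done.stdout)


if __name__ == "__main__":
    sample = "contact alice@example.com or bob[at]example[dot]com"
    print(find_emails_with_timeout(sample, timeout_seconds=30))
```

Starting a new interpreter for every call adds some overhead, so this probably only pays off for large inputs like the one described. A `None` result means the deadline was hit, in which case the caller could retry with a simpler pattern or on smaller chunks of the text.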

1 Answer

Side note: It frustrates me that, as a newer user, I can't add comments. I don't think my reply warrants a full answer, but it might be helpful.

Reply: In my experience, the OS is often a factor here. On each of my Windows systems, text processing in Python is extremely slow, while on my CentOS and Kali installations it is immediate. Would you mind mentioning which OS you're using?

FailSafe