1

I'm opening this thread that is really similar to another one but I cannot figure out an issue : I have an input field that allow a alphanumeric string with an optional unique space as a separator, then an optional other alphanumeric string etc.... I find this regex :

^([0-9a-zA-z]+ ?)*$

It works ! But the performance is really bad as soon as I have 2 consecutives spaces in a long sentence and those 2 spaces are located far in the sentence. In the example below, the result is ok in a half of second if I put the 2 spaces at the beginning of the sentence. But it lasts 10 seconds or more if located far.

dzdff5464zdiophjazdioj ttttttttt zoddzdffdziophjazdioj ttttttttt zoddzdffdzdff ttttt zoddzdfff ttttt zoddzdfff ttttt zoddzdfff ttttt zoddzdfff ttttt zoddzdfff ttttt zoddzdfff ttttt zoddzdfff ttttt zoddzdfff ttttt zo999 ddzdfff ttttt zoddzdfff ttttt zoddzdff

The 2 spaces are after the 999. Do you have any idea or suggestion to improve this regex ?

Thanks and regards

PF

ps: you can check the issue as soon as you enter an invalid character far in the string, not specifically 2 spaces.

EDIT : another example : 12345678901234567890' ==> 20 char. + 1 invalid char. => result is immediate Add 5 valid char. and it lasts 5 seconds to perform the regex ! 1234567890123456789012345'

Ro Yo Mi
  • 14,790
  • 5
  • 35
  • 43
Pierrot
  • 85
  • 1
  • 1
  • 9

2 Answers2

1

I suggest changing the expression to something like this:

(?i)^[0-9a-z]+(?:\s[0-9a-z]+)*$

enter image description here

This is functionally similar in that it'll match all alphanumeric characters which are delimited by a single space. A major difference is that I moved the initial word check to the front of the expression, then made a non capture group (?:...) for the remaining space delimited words.

Non capture groups (?:...) are faster then capture groups (...) because the regex engine doesn't need to retain matched values. And by moving the space \s to the front of the word group on repeat words the engine doesn't need to validate the first character in the group is included in the character class.

You also have a typo in your character class [0-9a-zA-z] the last z should probably be upper case. This A-z format will likely have some odd unexpected results. In my expression I simply added a (?i) to the beginning to force the regex engine to shift into case insensitive mode, and I dropped the character class to [0-9a-z].

In my testing I see that your expression ^([0-9a-z]+ ?)*$ takes about 0.03 seconds to process your sample text with 2 extra spaces toward the end. My recommended expression completes the same test in about 0.000022 seconds. WOW that's an amazing delta.

Ro Yo Mi
  • 14,790
  • 5
  • 35
  • 43
  • Hi Denomales, Amazing answer. I have to test it tomorrow at work. Your improvement is huge but the 0.03sec is still really low and I don't understand why I had 5 sec in javascript... BTW, could you tell me what software do you use to generate your picture ? Really interesting, Thanks ! – Pierrot Jun 20 '13 at 21:42
  • I couldn't wait so I tested in http://regexpal.com/. Your regex works perfectly and mine crashed my Chrome window!!! I have still to test at work but the question here is : how do you perform my regex in 0.03sec with my sample ? If I understand your explanation I can't understand why my regex is so bad : Browser crashed ? Are you kidding me ? I precise that I have the issue on 2 completely different computer (W7 vs MacOX MLion). – Pierrot Jun 20 '13 at 22:00
  • You know that's a great question. I think the problem is that when your capture group fails a match it needs to back track, and it might be going all the way back to the beginning. It must be stopping at each character in the string and testing all the characters afterwards until it reaches a match. IF it doesn't match then repeats the process iterating through all the characters in the string. I was just playing around with your expression and found that it runs noticeably faster if you include word break test `^(\b[0-9a-zA-z]+\s?)*$`. This forces the engine to not test every letter. – Ro Yo Mi Jun 21 '13 at 02:04
  • Sorry I completely missed answering your question of how. I like using this website http://www.myregextester.com/index.php#sourcetab. It is no frills and takes a bit to get used to, but I like the straight forward layout. When the you fill in the your expression and source text, on the results tab there is an execution time. If you click submit a several times in a row you can kinda gauge an average. – Ro Yo Mi Jun 21 '13 at 02:16
  • Thanks a lot! Your regex is perfect and I learned a lot by the way. One more question : how do you do your picture ? http://i.stack.imgur.com/H6A92.png – Pierrot Jun 21 '13 at 07:25
  • You can try [Regexper](http://www.regexper.com/), doesn't support the `(?i)`, try with `/^[0-9a-z]+(?:\s[0-9a-z]+)*$/` – Édouard Lopez Jun 21 '13 at 10:06
  • I'm using debuggex.com. Although it doesn't support lookbehinds, named capture groups, or atomic groups it's still handy for understanding the expression flow. There is also regexper.com. They do a pretty good job too, but it's not real time as you're typing. – Ro Yo Mi Jun 21 '13 at 12:01
  • Thanks both of you! I just have a look and it's really interesting. Have a great weekend :) – Pierrot Jun 22 '13 at 11:26
0

This is a simpler regex using \w (word class):

^([\w]+(\s*))$

Test

It's instantaneous in JavaSript

var input = "dzdff5464zdiophjazdioj ttttttttt zoddzdffdziophjazdioj ttttttttt  zoddzdffdzdff ttttt zoddzdfff ttttt zoddzdfff ttttt zoddzdfff ttttt  zoddzdfff ttttt zoddzdfff ttttt zoddzdfff ttttt zoddzdfff ttttt  zoddzdfff ttttt zo999  ddzdfff ttttt zoddzdfff ttttt zoddzdff";

var re = /([\w]+(\s*))/g;

console.log(input.replace(re, "boo"));
Édouard Lopez
  • 40,270
  • 28
  • 126
  • 178
  • Thanks Edouard but this regex doesn't fit my needs. I'm going to test directly in .Net to see if the performance is better. – Pierrot Jun 20 '13 at 16:06
  • you didn't specify any language in your post nor related-tag – Édouard Lopez Jun 20 '13 at 16:16
  • Caution should be used with the word class `\w` as it includes the expected `a-z0-9A-Z` but also the `_` character which is different then the defined class `[a-z0-9A-Z]`. – Ro Yo Mi Jun 20 '13 at 19:50
  • Edouard : thanks for your time. In fact I used it in Bind in XPATH through a framework build in my company in based on .Net. At the beginning I thought that XPATH was not performant enough but I realize that I have the same issue in javascript (see below). – Pierrot Jun 20 '13 at 21:55