I am tired of being frightened of regular expressions. This post is limited to PHP's implementation of regular expressions, but general regular expression advice is welcome as long as it applies to PHP (i.e. don't confuse me with features that PHP does not support).

The following (I believe) will remove any whitespace between numbers. Maybe there is a better way to do so, but I still want to understand what is going on.

$pat = '/\b(\d+)\s+(?=\d+\b)/';
$sub = '123 345';
$string = preg_replace($pat, '$1', $sub);

Going through the pattern, my interpretation is:

  • \b A word boundary
  • \d+ A subpattern of 1 or more digits
  • \s+ One or more whitespaces
  • (?=\d+\b) Lookahead assertion of one or more digits followed by a word boundary?
  • Putting it all together, search for any word boundary followed by one or more digits and then some whitespace, and then do some sort of lookahead assertion on it, and save the results in $1 so it can replace the pattern?

Questions:

  • Is my above interpretation correct?
  • What is that lookahead assertion all about?
  • What is the purpose of the leading / and trailing /?
Ben Swinburne
user1032531
  • For reference and tutorials: http://regular-expressions.info – deceze Nov 30 '12 at 13:32
  • See also [Open source RegexBuddy alternatives](http://stackoverflow.com/questions/89718/is-there) and [Online regex testing](http://stackoverflow.com/questions/32282/regex-testing) for some helpful tools, or the mentioned [RegExp.info](http://regular-expressions.info/). – mario Nov 30 '12 at 13:34
  • Thanks. Just looking over deceze's recommendation right now. Will look at the others. – user1032531 Nov 30 '12 at 13:37
  • Though based on Ruby's regex (not sure if there's much difference), I've also found http://www.rubular.com/ to be one of the easier online regex testers to play with. – RhodriM Nov 30 '12 at 13:41
  • I highly recommend using [Kodos](http://kodos.sourceforge.net/) to build up your regexp skills (it also makes it much easier to debug regular expressions and test modifications of your regexp statements). – tulvit Nov 30 '12 at 13:43

2 Answers

Is my above interpretation correct?

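Yes, your interpretation is correct. The match consists of the digits plus the whitespace that follows them, but only the digits are captured in $1, so replacing the match with $1 drops the whitespace. The lookahead only checks that more digits follow; it does not consume them.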
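To see this in action, here is a minimal sketch using the pattern from the question. The variable names ($m, $result, $multi) and the second test string are mine, chosen only for illustration.

```php
<?php
// The pattern from the question, in single quotes so the backslashes
// reach the regex engine untouched.
$pat = '/\b(\d+)\s+(?=\d+\b)/';

// preg_match() shows what a single match consumes.
preg_match($pat, '123 345', $m);
// $m[0] is the whole match: the digits plus the whitespace.
// $m[1] is capture group 1: the digits only.

// Replacing the whole match with $1 keeps the digits and drops the whitespace.
$result = preg_replace($pat, '$1', '123 345');

// Because the lookahead does not consume the digits it checks,
// "34" is still available to start the next match.
$multi = preg_replace($pat, '$1', '12 34 56');

var_dump($m[0], $m[1], $result, $multi);
```

If the lookahead were replaced by a plain (\d+), the second number would be consumed by the first match, and in a string with three or more numbers every other gap would be left in place.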

What is that lookahead assertion all about?

That lookahead assertion is a way for you to match characters only when they are immediately followed by a certain pattern, without that pattern becoming part of the match.

So basically, using the regex abcd(?=e) to match the string abcde will give you the match: abcd.

The reason that this matches is that the string abcde does in fact contain:

  1. An a
  2. Followed by a b
  3. Followed by a c
  4. Followed by a d that has an e after it (the match still ends at the d; the e is only checked)

It is important to note that after the 4th item the string also contains an actual "e" character, which the lookahead inspected but which is not part of the match.

On the other hand, trying to match the same string abcde against the regex abcd(?=f) will fail, since the sequence:

"a", followed by "b", followed by "c", followed by "d that has an f after it"

is not found.
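The two cases above can be checked with a short sketch. The variable names are illustrative only, and the plain abcde pattern and the preg_replace() call are added by me for comparison.

```php
<?php
$subject = 'abcde';

// Lookahead (?=e): "abcd" matches only because an "e" comes next,
// but the "e" itself is not part of the match.
$found = preg_match('/abcd(?=e)/', $subject, $m);

// Without the lookahead, the "e" is consumed as part of the match.
preg_match('/abcde/', $subject, $plain);

// Lookahead (?=f): no "f" follows the "d", so there is no match at all.
$missing = preg_match('/abcd(?=f)/', $subject);

// In a replacement, only the matched part is replaced; the "e" survives.
$replaced = preg_replace('/abcd(?=e)/', 'X', $subject);

var_dump($found, $m[0], $plain[0], $missing, $replaced);
```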

What is the purpose of the leading / and trailing /?

Those are delimiters, and PHP's preg_* functions require them in order to distinguish the pattern part of your string from the modifier part (anything after the closing delimiter). A delimiter can be any character that is not alphanumeric, a backslash, or whitespace; I prefer @ signs myself. Bracket pairs such as ( ), [ ], { } and < > can also serve as delimiters. Remember that the character you are using as a delimiter needs to be escaped if it is used in your pattern.
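A short sketch of how delimiters behave in practice follows. The subject strings and variable names are made up for illustration, except for the square-bracket pattern, which is the kind of call that often trips people up.

```php
<?php
$subject = 'path/to/File';

// Slash delimiters: a literal slash inside the pattern must be escaped.
// The "i" after the closing delimiter is a modifier (case-insensitive).
$withSlash = preg_match('/to\/file/i', $subject);

// With @ as the delimiter, the slash needs no escaping.
$withAt = preg_match('@to/file@i', $subject);

// preg_quote() can escape the delimiter for you when the text is dynamic.
$quoted = preg_quote('to/File', '/');

// Square brackets are a valid delimiter pair, so this pattern is really
// ^a-zA-Z0-9_ (literal text anchored at the start), not a character class.
$unchanged = preg_replace('[^a-zA-Z0-9_]', '', 'Hello[$&goodby');

// Adding real delimiters around the character class gives the intended result.
$cleaned = preg_replace('/[^a-zA-Z0-9_]/', '', 'Hello[$&goodby');

var_dump($withSlash, $withAt, $quoted, $unchanged, $cleaned);
```

Delimiters are needed whether or not you use modifiers; the modifier part may simply be empty.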

Asad Saeeduddin
  • But "a", followed by "b", followed by "c", followed by "d that has an e in front of it (not e not f) doesn't match. Can you please elaborate. – user1032531 Nov 30 '12 at 13:42
  • @user1032531 Sure, give me a minute. Marten, sorry about that, I guess I accidentally overrode your suggested edit, but that information is present in the question now. – Asad Saeeduddin Nov 30 '12 at 13:43
  • Ah ha, delimiters! So I only need them if I have a modifier, correct? – user1032531 Nov 30 '12 at 13:46
  • Look-aheads are 'zero-width' assertions, meaning they are not included in the match. So `abcd(?=e)` is saying match `abcd` **only** if it is followed by an `e`. – garyh Nov 30 '12 at 13:52
  • Regarding delimiters, preg_replace('[^a-zA-Z0-9_]', '','Hello[$&goodby') doesn't have any (unless the square brackets are acting as them?). Is it not valid? – user1032531 Nov 30 '12 at 13:52
It would be a good idea to watch this video and the four that follow it: http://blog.themeforest.net/screencasts/regular-expressions-for-dummies/ The rest of the series can be found here: http://blog.themeforest.net/?s=regex+for+dummies

A colleague sent me the series and after watching them all I was much more comfortable using Regular Expressions.

Another good idea would be to install RegexBuddy or Regexr. RegexBuddy in particular is very useful for understanding how a regular expression works.

Maarten00