
I'm working on a project where I need a regex that matches a * followed by 1-4 spaces or tabs and then a row of text. Right now I'm using .* after the lookbehind for testing purposes. I can get it to match exactly 1, 2, 3 or 4 spaces/tabs, but not a range of 1-4. I'm testing against the following block:

*    test line here
*   Second test
*  Third test
* Another test

These are the two patterns I'm testing. (?<=(\*[ \t]{3})).* works just as expected and matches the 2nd line, and the same holds if I replace 3 with 1, 2 or 4. However, if I replace it with 1,4, forming the pattern (?<=(\*[ \t]{1,4})).* it no longer matches any of the rows, and I honestly can't understand why. I've tried googling without success. I'm using the g (global) flag.

Hultner

1 Answer


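PHP, like many regex flavors, doesn't support variable-length lookbehind. The only variation it allows is alternation (|) at the top level of the lookbehind, where each branch has its own fixed length. Even a ? can break the pattern. An alternative is to spell out each length as a separate branch: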

(?<=\*[ \t]|\*[ \t]{2}|\*[ \t]{3}|\*[ \t]{4}).*
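A minimal sketch of how this alternation behaves on the test block from the question. The variable names are only illustrative. Because the engine tries start positions from left to right, the one-character branch succeeds first, so the extra whitespace ends up inside the match and has to be trimmed afterwards:

```php
<?php
// Test block from the question.
$text = "*    test line here\n*   Second test\n*  Third test\n* Another test";

// Lookbehind with one fixed-length branch per allowed prefix width.
$pattern = '/(?<=\*[ \t]|\*[ \t]{2}|\*[ \t]{3}|\*[ \t]{4}).*/';

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the full matches. Each one starts right after
// "*" plus ONE space/tab, so any remaining leading whitespace is
// still part of the match.
$raw = $matches[0];

// Strip the leftover leading whitespace from every match.
$lines = array_map('ltrim', $raw);

print_r($raw);
print_r($lines);
```

Here the first raw match should still begin with three spaces, which is why the trimming step is needed.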

Or better, drop the lookbehind in favor of a capturing group:

\*[ \t]{1,4}(.*)

This should work well for you, since your matches don't seem to overlap anyway.
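As a rough illustration of the capturing-group approach, the sketch below runs the pattern over the test block from the question with preg_match_all and PREG_SET_ORDER. The variable names are assumptions made for the example:

```php
<?php
// Test block from the question.
$text = "*    test line here\n*   Second test\n*  Third test\n* Another test";

// Consume the asterisk and 1-4 spaces/tabs, then capture the rest of the line.
preg_match_all('/\*[ \t]{1,4}(.*)/', $text, $matches, PREG_SET_ORDER);

// With PREG_SET_ORDER each element of $matches is one match:
// index 0 is the full match, index 1 is the first capturing group.
$lines = array_column($matches, 1);

print_r($lines);
```

The prefix is still required for a match but is left out of group 1, which gives the same effect the lookbehind was meant to provide. Note that a line with more than four spaces after the asterisk would still match, with the extra spaces landing in the captured text.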

From the manual:

The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several alternatives, they do not all have to have the same fixed length. Thus (?<=bullock|donkey) is permitted, but (?<!dogs?|cats?) causes an error at compile time. Branches that match different length strings are permitted only at the top level of a lookbehind assertion.

Source: http://www.php.net/manual/en/regexp.reference.assertions.php

Kobi
  • It might also be worth mentioning that the regex will still not do what the OP probably wants - it will gladly match more than 4 spaces because `.*` will match spaces just fine. – Tim Pietzcker Feb 10 '11 at 11:53
  • @Tim - That's a good point, but I think `.*` is just a simplified example of what the OP sees as an odd behavior - the interesting part is the look behind. – Kobi Feb 10 '11 at 11:56
  • Thanks, I overlooked that. By the way, RegexBuddy doesn't complain about `{1,4}` (it balks at infinite quantifiers, but not at this finite quantifier). – Tim Pietzcker Feb 10 '11 at 12:00
  • After some testing with alternation, it seems it will always match only * followed by one space, so the matched area starts with tabs or spaces. I guess I could just drop the lookbehind for a group as you said and then remove the unnecessary space with string manipulation: use substring to remove the first character and then ltrim. And yeah, `.*` was just a simplification because the lookbehind was what I wanted help with. – Hultner Feb 10 '11 at 12:01
  • @Tim - I suppose it depends on the implementation: `{1,4}` can be expanded to a legal alternation, but PHP doesn't do it (which is better, of course, it might create a monstrosity). I check my PHP patterns at http://www.pagecolumn.com/tool/pregtest.htm , and sometimes on ideone, which I guess are closer to the real thing :) – Kobi Feb 10 '11 at 12:04
  • @Hultner - note the capturing group I've added on the second regex - you can still get the line without the prefix. Try using `preg_match_all("/\*[ \t]{1,4}(.*)/", $str, $matches, PREG_SET_ORDER);` , and `$matches` will contain an array of arrays. The second item in each is the line without the asterisk and the leading spaces. – Kobi Feb 10 '11 at 12:06
  • Hmm, now I think I missed something. How would PHP know that it should remove the asterisk and the leading spaces and not something else? Right now I'm using `preg_match_all("/^[*][ \t]{1,4}[ \w]{3,50}$/m")` – Hultner Feb 10 '11 at 12:44
  • Never mind, now I understand: by setting PREG_SET_ORDER it puts all the groups in their own element in the array. That's actually really smart, that way everything outside of the group kind of works like lookbehinds and lookaheads. – Hultner Feb 10 '11 at 13:06
  • @Hultner - Great to see you figured this out before I even had time to comment. Good luck! `:)` – Kobi Feb 10 '11 at 13:14