
I cannot find a way to reject a match when a string exists somewhere before another string, rather than immediately before it.

I can reject a match when a string exists immediately before another string, using the following:

$string = 'Stackoverflow hello world foobar test php';

$regex = "~(Stackoverflow).*?(?<!(test\s))(php)~i";

if(preg_match_all($regex,$string,$match))
    print_r($match);

In this example, we want to return a match if we have the words Stackoverflow and php, but only if the word test (followed by a space character) does not appear immediately before the word php.

This doesn't return any result, which is good.
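A quick way to probe that result is to run the same pattern against a second input. The sketch below reuses the pattern from above; the second string is a made-up example, not part of the original question. Because `.*?` can keep expanding, the lookbehind only guards the position directly before whichever php ends the match.

```php
<?php
// Pattern from the question above (single-quoted so \s reaches PCRE unchanged).
$regex = '~(Stackoverflow).*?(?<!(test\s))(php)~i';

// Original input: the only "php" is directly preceded by "test ".
$original = 'Stackoverflow hello world foobar test php';

// Hypothetical input: a second "php" that is not preceded by "test ".
$twoPhp = 'Stackoverflow test php and php';

$originalCount = preg_match($regex, $original); // 0: no match
$twoPhpCount   = preg_match($regex, $twoPhp);   // 1: .*? runs on to the second "php"

var_dump($originalCount, $twoPhpCount);
```

So the first pattern rejects "test php" only when no later php is available to match instead.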

Let's now say I want to match php, but only if the word foobar doesn't appear anywhere between Stackoverflow and php. I assumed I could do the following.

$string = 'Stackoverflow hello world foobar test php';

$regex = "~(Stackoverflow).*?(?<!(foobar)).*?(php)~i";

if(preg_match_all($regex,$string,$match))
    print_r($match);

(I have changed the negative lookbehind string to (foobar) and added .*? after it.)

I should add that I cannot always know which words will appear between foobar and php: sometimes there will be none, sometimes 200. I do have some positioning information, though (after Stackoverflow and before php).

GraphicsMuncher
cecilli0n
  • Either of the `.*?` will be able to bypass the assertion. You need to mask every possible position of the `.` anything placeholder with a negative lookahead. – mario Mar 14 '14 at 00:35
  • possible duplicate of [Regular expression to match string not containing a word?](http://stackoverflow.com/questions/406230/regular-expression-to-match-string-not-containing-a-word) – mario Mar 14 '14 at 00:36
  • @mario So I must repeat this (?<!(foobar)) for every character after Stackoverflow? That doesn't seem like a good option. The answer in the other thread you linked to seems to work; is this what you were referring to as masking every possible position of the .? – cecilli0n Mar 14 '14 at 00:49
  • "This doesn't return any result which is good." == "a too fast assumption" – Casimir et Hippolyte Mar 14 '14 at 00:55
  • @CasimiretHippolyte I thought I could possibly be wrong when originally writing that statement; I will have to watch out for declarative statements next time. However, as you seem to have noticed something, what exactly is the problem at that referenced point? Thanks – cecilli0n Mar 14 '14 at 00:59
  • Yes, basically you split the period out of `.*?` into a masked match-all `((?!foobar).)` and have that repeatedly tested with `((?!xxx).)*?`. It's often considered wasteful because the assertion runs for every character in between. But for simple cases it's quite workable and PCRE optimizes it. – mario Mar 14 '14 at 00:59
  • @mario Thanks for clarifying. At least the fix is still clean and readable, and luckily I do not have to run this regex often. If you want to put this method in an answer, I will mark it as accepted. – cecilli0n Mar 14 '14 at 01:06
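The masking approach described in the comments above can be sketched as follows. This is a minimal illustration, assuming the question's `~` delimiters and `i` flag; the second test string is a hypothetical input added here to show the positive case.

```php
<?php
// Each character between the two words is consumed only if "foobar"
// does not start at that position.
$regex = '~(Stackoverflow)(?:(?!foobar).)*?(php)~i';

// The string from the question: "foobar" sits between the two words.
$withFoobar = 'Stackoverflow hello world foobar test php';

// Hypothetical input without "foobar".
$withoutFoobar = 'Stackoverflow hello world test php';

$withCount    = preg_match($regex, $withFoobar);        // 0: blocked at "foobar"
$withoutCount = preg_match($regex, $withoutFoobar, $m); // 1: matches through to "php"

var_dump($withCount, $withoutCount, $m[2]); // $m[2] is "php"
```

Note that `.` does not match a newline by default, so multi-line input would need the `s` flag as well.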

2 Answers


Your second regex still matches because "foobar" can simply be consumed by one of the .*? parts. Specifically, the first .*? matches the empty string "", so the lookbehind is tested right after "Stackoverflow", a position that is indeed not preceded by "foobar". The second .*? then matches " hello world foobar test ".
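To see this directly, each `.*?` can be wrapped in its own capturing group. The sketch below is the question's second pattern with two changes made here purely for inspection: the extra groups around `.*?`, and the lookbehind left uncaptured so the group numbers stay simple.

```php
<?php
$string = 'Stackoverflow hello world foobar test php';

// Group 2 is the first .*?, group 3 is the second .*?.
$regex = '~(Stackoverflow)(.*?)(?<!foobar)(.*?)(php)~i';

$matched = preg_match($regex, $string, $m);

var_dump($matched); // 1
var_dump($m[2]);    // "" (the lookbehind is checked right after "Stackoverflow")
var_dump($m[3]);    // " hello world foobar test "
```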

To obtain the desired result, one way is to look at every character and make sure that it isn't an "f"; or, if it is an "f", that it isn't followed by an "o"; or, if it is an "f" followed by an "o", that it isn't followed by another "o"; and so on.

This will leave you with:

$string = 'Stackoverflow hello world foobar test php';

$regex = "~(Stackoverflow)(?:[^f]|f[^o]|fo[^o]|foo[^b]|foob[^a]|fooba[^r])*?(php)~i";

if(preg_match_all($regex,$string,$match))
    print_r($match);
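One caveat worth checking with this style of pattern is input where an extra "f" directly precedes "foobar". The sketch below compares the alternation pattern above with the lookahead-per-character form mentioned in the comments on the question. The test string is a hypothetical edge case, not from the original post.

```php
<?php
// Alternation pattern from this answer.
$alternation = '~(Stackoverflow)(?:[^f]|f[^o]|fo[^o]|foo[^b]|foob[^a]|fooba[^r])*?(php)~i';

// Lookahead-per-character form, for comparison.
$lookahead = '~(Stackoverflow)(?:(?!foobar).)*?(php)~i';

// Hypothetical edge case: "ff" lets the f[^o] branch consume the first
// letter of "foobar", so the rest of the word is never recognised.
$edge = 'Stackoverflow ffoobar php';

$alternationCount = preg_match($alternation, $edge); // 1: "foobar" slips through
$lookaheadCount   = preg_match($lookahead, $edge);   // 0: blocked at "foobar"

var_dump($alternationCount, $lookaheadCount);
```

On the string from the question, both patterns return no match, so the difference only shows up on inputs like this one.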

Performance update

I have benchmarked my suggestion and Ron's and found that, while there is no significant difference in Perl, his is faster by almost 50% in PCRE.

scozy
  • This is an interesting answer, but I think the answer in the thread mario linked to may be more practical. However, I am still testing it to make sure I haven't overlooked anything. Thanks. – cecilli0n Mar 14 '14 at 00:54
  • Hello, I tested your answer and see that it works, but I don't understand why it works. After (Stackoverflow) I thought we would have to tell regex that some unknown characters may appear, so directly after (Stackoverflow) we would place ".*". I don't understand how regex is skipping " hello world " in between (SO).... and (FB).... I am wrong in my assumption, but I don't know why. – cecilli0n Mar 15 '14 at 21:19
  • Never mind, I found the reason why it's working from another user. Thanks – cecilli0n Mar 15 '14 at 21:58

I would use a negative lookahead to ensure the string 'foobar.*php' does not exist after 'Stackoverflow'. And since you wanted to capture php, I'd put that into a capturing group. Something like:

Stackoverflow(?:(?!foobar.*php).)*(php)

Note that this runs the lookahead check at every character position.
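As a rough check, the pattern can be dropped into `preg_match` like this. The `~` delimiters and the `i` flag are assumptions carried over from the question, since the pattern above is given without them, and the second string is a hypothetical input without foobar.

```php
<?php
// Pattern from this answer, wrapped in the question's delimiters and flag.
$regex = '~Stackoverflow(?:(?!foobar.*php).)*(php)~i';

// The string from the question.
$withFoobar = 'Stackoverflow hello world foobar test php';

// Hypothetical input without "foobar".
$withoutFoobar = 'Stackoverflow hello world test php';

$withCount    = preg_match($regex, $withFoobar);        // 0: "foobar ... php" follows
$withoutCount = preg_match($regex, $withoutFoobar, $m); // 1: matches

var_dump($withCount, $withoutCount, $m[1]); // $m[1] is "php"
```

Here "Stackoverflow" is not captured, so php is group 1 rather than group 2 as in the question's patterns.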

Ron Rosenfeld