I have content like foo == 'bar test baz' and test.asd = "buz foo". I need to match the "identifiers": the names on the left-hand side that are not inside single or double quotes. This is what I have now:

preg_replace_callback('#([a-zA-Z\\.]+)#', function($matches) {
    var_dump($matches);
}, $subject);

It currently also matches identifiers inside strings. How would I write a pattern that skips those?

Another example: foo == 5 AND bar != 'buz' OR fuz == 'foo bar fuz luz'. So in essence, match a-zA-Z that are not inside strings.

Tower
  • why not `explode(" =",$subject);`? – k102 Dec 08 '11 at 08:40
  • @k102: It's not that simple. I can't make up every possible variation, but the subject can vary a lot in structure. For example: `foo = 'bar' AND baz = 'foo'`. – Tower Dec 08 '11 at 09:29

3 Answers

/^[^'"=]*/

would work on your examples. It matches any number of characters (starting at the start of the string) that are neither quotes nor equals signs.

/^[^'"=\s]*/

additionally avoids matching whitespace, which may or may not be what you need.
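A quick way to see the difference between the two patterns is to run both against one of the examples. This is only a sketch; the variable names `$m1` and `$m2` are made up for illustration.

```php
<?php
$subject = "foo == 'bar test baz'";

// First pattern: stops at the first quote or equals sign, keeps whitespace.
preg_match('/^[^\'"=]*/', $subject, $m1);

// Second pattern: additionally stops at the first whitespace character.
preg_match('/^[^\'"=\s]*/', $subject, $m2);

var_dump($m1[0]); // string(4) "foo "
var_dump($m2[0]); // string(3) "foo"
```

Note that both patterns only find the first identifier, because they are anchored at the start of the string.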

Edit:

You're asking how to match letters (and possibly dots?) outside of quoted sections anywhere in the text. This is more complicated. A regex that can correctly identify whether it's currently outside of a quoted string (by making sure that the number of quotes, excluding escaped quotes and nested quotes, is even) looks like this as a PHP regex:

'/(?:
 (?=      # Assert even number of (relevant) single quotes, looking ahead:
  (?:
   (?:\\\\.|"(?:\\\\.|[^"\\\\])*"|[^\\\\\'"])*
   \'
   (?:\\\\.|"(?:\\\\.|[^"\'\\\\])*"|[^\\\\\'])*
   \'
  )*
  (?:\\\\.|"(?:\\\\.|[^"\\\\])*"|[^\\\\\'])*
  $
 )
 (?=      # Assert even number of (relevant) double quotes, looking ahead:
  (?:
   (?:\\\\.|\'(?:\\\\.|[^\'\\\\])*\'|[^\\\\\'"])*
   "
   (?:\\\\.|\'(?:\\\\.|[^\'"\\\\])*\'|[^\\\\"])*
   "
  )*
  (?:\\\\.|\'(?:\\\\.|[^\'\\\\])*\'|[^\\\\"])*
  $
 )
 ([A-Za-z.]+) # Match ASCII letters/dots
)+/x'

An explanation can be found here. But probably a regex isn't the right tool for this.
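If a regex turns out not to be the right tool, one alternative is a small hand-written scanner that walks the string once and keeps track of whether it is inside a quoted section. The following is a minimal sketch under a few assumptions: the helper name `replaceOutsideQuotes` is hypothetical, identifiers consist of ASCII letters and dots only, and a backslash escapes the next character inside a quoted string.

```php
<?php
// Hypothetical helper: applies $fn to every run of ASCII letters/dots
// that lies outside single- or double-quoted sections.
function replaceOutsideQuotes(string $subject, callable $fn): string
{
    $identChars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.';
    $out = '';
    $len = strlen($subject);
    $i = 0;
    while ($i < $len) {
        $ch = $subject[$i];
        if ($ch === '"' || $ch === "'") {
            // Copy the quoted string verbatim, skipping backslash-escaped characters.
            $start = $i++;
            while ($i < $len && $subject[$i] !== $ch) {
                $i += ($subject[$i] === '\\') ? 2 : 1;
            }
            $i = min($i + 1, $len); // step past the closing quote, if there is one
            $out .= substr($subject, $start, $i - $start);
        } elseif (($n = strspn($subject, $identChars, $i)) > 0) {
            // A run of identifier characters outside quotes: hand it to the callback.
            $out .= $fn(substr($subject, $i, $n));
            $i += $n;
        } else {
            // Anything else (operators, digits, whitespace) is copied unchanged.
            $out .= $ch;
            $i++;
        }
    }
    return $out;
}

echo replaceOutsideQuotes('foo == \'bar test baz\' and test.asd = "buz foo"', 'strtoupper'), "\n";
// Prints: FOO == 'bar test baz' AND TEST.ASD = "buz foo"
```

The `strtoupper` callback is just a placeholder for whatever replacement is actually needed.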

Tim Pietzcker

You could also try this:

preg_match_all('/[\w.]+(?=(?:[^\'"]|[\'"][^\'"]*["\'])*$)/', $subject, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
    # Matched text = $result[0][$i];
}

This matches all letters, digits, underscores and dots outside your quotes. You can extend the set of allowed characters by adding them to [\w.].
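Since the question uses preg_replace_callback, the same pattern could presumably be dropped in there as well. This is only a sketch: the uppercase transformation stands in for the real replacement, and the subject is assumed to be a single line with balanced quotes.

```php
<?php
$subject = 'foo == \'bar test baz\' and test.asd = "buz foo"';

// Same pattern as above: a run of [\w.] followed only by complete quoted sections.
$pattern = '/[\w.]+(?=(?:[^\'"]|[\'"][^\'"]*["\'])*$)/';

$result = preg_replace_callback($pattern, function ($matches) {
    // Placeholder transformation; only text outside quotes reaches this callback.
    return strtoupper($matches[0]);
}, $subject);

echo $result, "\n";
// Prints: FOO == 'bar test baz' AND TEST.ASD = "buz foo"
```

Because `\w` includes digits, a bare number such as the `5` in `foo == 5` would also be passed to the callback.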

FailedDev
  • I'm doing a preg_replace_callback, because I need to replace those matches. If I just do a match, how do I know I'm replacing the right content (e.g., `foo = 'foo'` should match the first `foo` and replace it with my custom stuff, but it should not affect the one in the strings)? – Tower Dec 08 '11 at 09:46
  • @rFactor You could use the same regex with preg_replace_callback. Point is, with your edited question, it is unclear to me which characters you want to catch. – FailedDev Dec 08 '11 at 09:52

The trick I use here is to force the regex into a separate branch whenever it encounters a quote, so that the whole quoted string is consumed there; the callback then ignores matches from that branch.

$subject = <<<END
foo == 'bar test baz' and test.asd = "buz foo"
foo == 5 AND bar != 'buz' OR fuz == 'foo bar fuz luz'
END;

$regexp = '/(?:["\'][^"\']+["\']|([a-zA-Z\\.]+\b))/';

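preg_replace_callback($regexp, function($matches) {
    // Group 1 is only present when the identifier branch matched
    if( count($matches) >= 2 ) {
        print trim($matches[1]).' ';
    }
    return $matches[0]; // leave the matched text unchanged
}, $subject);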

// Output: 'foo and test.asd foo AND bar OR fuz '

The main part of the regexp is

(?: anything between quotes | any word consisting of a-zA-Z and dots )
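To actually replace the identifiers rather than print them, the callback can return the quoted branch untouched and transform only the captured group. A minimal sketch, where wrapping the identifier in square brackets is a placeholder for the real replacement:

```php
<?php
$subject = "foo == 5 AND bar != 'buz' OR fuz == 'foo bar fuz luz'";

$regexp = '/(?:["\'][^"\']+["\']|([a-zA-Z\\.]+\b))/';

$result = preg_replace_callback($regexp, function ($matches) {
    if (isset($matches[1])) {
        // Identifier branch matched: apply the (placeholder) replacement.
        return '[' . $matches[1] . ']';
    }
    // Quoted-string branch matched: put the text back unchanged.
    return $matches[0];
}, $subject);

echo $result, "\n";
// Prints: [foo] == 5 [AND] [bar] != 'buz' [OR] [fuz] == 'foo bar fuz luz'
```

One caveat: the quoted branch uses `+`, so it does not consume an empty string such as `''`.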
Marijn van Vliet
  • Sorry if I was too imprecise. I believe that does not exactly match the case I added in the question? Or maybe I don't fully understand the regex, but by the look of it, it assumes there are no characters besides a-z and quotes? – Tower Dec 08 '11 at 09:41
  • I've modified my answer to only match the a-zA-Z case. Ask if the regexp is unclear to you. – Marijn van Vliet Dec 08 '11 at 14:21