0

I have this regex pattern:

/(?J){% *(?P<tag>[a-zA-Z_]+) *(?P<args>[a-zA-Z0-9 _-]+) *%}(?P<block>.*){% *end(?P<tag>[a-zA-Z_]+) *%}/s

And this search string:

{% import add %}{% endimport %}
{% extends base.html %}{% endextends %}
{%       block              title %}
Changed
{% endblock %}
{% block content %}
Yay!
{% endblock %}

When I run this through `preg_match_all`, it returns the entire search string as a single match rather than stopping at the first `{% import add %}{% endimport %}`. Why, and how do I fix it?

IMSoP
Munavir Chavody
  • Parsing these strings with a single regex is something I'd never recommend. Even if you [change `.*` to `.*?`](https://regex101.com/r/lFtzEo/1) you will most likely encounter other situations where the pattern will match incorrectly. Using a dedicated parser for this format is a better idea. – Wiktor Stribiżew Jan 24 '18 at 12:33
  • @WiktorStribiżew What if it's proprietary and he's building the parser? ;) – SamWhan Jan 24 '18 at 12:35
  • @ClasG Then the recommendation against a *single* regex would still apply. A full parser would iterate through the string matching tokens and then work on this token stream, rather than matching whole chunks in one go. – IMSoP Jan 24 '18 at 12:41
  • @ClasG I have already tried my hand at such string parsing, I am really against writing single regexes that handle such difficult mark-up/code. Using `.*` inside such patterns is almost always a bottleneck leading to issues later. – Wiktor Stribiżew Jan 24 '18 at 12:52

2 Answers

0

You have a named group: `(?P<block>.*)`.

Change it to `(?P<block>.*?)` (add `?` after the star) to make it lazy.

General remark: greedy patterns like `.*` should be used with extreme care, as they are likely to consume far more than you intended.

You can also make further improvements to your regex:

  • Change the second instance of `(?P<tag>[a-zA-Z_]+)` to `(?P=tag)`, a backreference to the first `tag` group. I assume that the text after `end` must be the same as what the first `tag` group captured.
  • Then you can remove `(?J)`, as no group name occurs more than once any more.
  • Maybe you should also change `(?P<args>[a-zA-Z0-9 _-]+)` to `(?P<args>[a-zA-Z0-9\. _-]+)` (add a literal dot to the set of allowed characters), or change the set of allowed characters to `[^%]`. Then the pattern will also match `{% extends base.html %}{% endextends %}` (the second line of your sample).
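
Putting the suggestions above together, a minimal sketch might look like the following. It assumes the `[^%]` variant for `args`, which captures trailing spaces, so the values are trimmed afterwards. The variable names are illustrative only.

```php
<?php
// Sample input from the question.
$input = <<<'TPL'
{% import add %}{% endimport %}
{% extends base.html %}{% endextends %}
{%       block              title %}
Changed
{% endblock %}
{% block content %}
Yay!
{% endblock %}
TPL;

// Lazy block (.*?), [^%] for args, and a (?P=tag) backreference in the
// end tag, so (?J) is no longer needed.
$pattern = '/{% *(?P<tag>[a-zA-Z_]+) *(?P<args>[^%]+) *%}(?P<block>.*?){% *end(?P=tag) *%}/s';

// PREG_SET_ORDER gives one array per match, with named keys.
preg_match_all($pattern, $input, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // args and block may carry surrounding whitespace, so trim for display.
    printf("%s | %s | %s\n", $m['tag'], trim($m['args']), trim($m['block']));
}
```

With the sample input this should yield four separate matches, one per tag pair, instead of one match spanning the whole string.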
Valdi_Bo
0

Quantifiers in regular expressions are "greedy" by default: they match as much as possible, not as little as possible.

In this case, the problem is the `.*` token, which means "match any characters, as many as possible" (with the `/s` flag, newlines included). It works by first consuming the entire remainder of the string and then backtracking until the rest of the regex can be satisfied. As a result, everything from the first opening tag through the last `{% endsomething %}` tag becomes a single match.

The simplest solution is to use `.*?`, which means "match anything, but don't be greedy about it". This starts by matching nothing and then extends forwards one character at a time until the rest of the pattern can be matched, which likely gives you the result you wanted.
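
To see the difference in isolation, here is a small sketch using a simplified pattern and a made-up one-line input (neither is the pattern nor the data from the question):

```php
<?php
// Two tag pairs on one line (illustrative input).
$input = '{% a %}one{% enda %} {% b %}two{% endb %}';

// Identical patterns except for the quantifier inside the group.
$greedy = '/{% \w+ %}(.*){% end\w+ %}/s';
$lazy   = '/{% \w+ %}(.*?){% end\w+ %}/s';

// preg_match_all returns the number of full matches found.
$greedyCount = preg_match_all($greedy, $input, $g);
$lazyCount   = preg_match_all($lazy, $input, $l);

echo "greedy: $greedyCount match(es)\n"; // .* runs on to the last end tag
echo "lazy:   $lazyCount match(es)\n";   // .*? stops at the nearest end tag
```

The greedy version finds one match covering the whole string; the lazy version finds each pair separately.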

However, as noted in comments, a tokenising parser might be more appropriate for this kind of task: track through the string, dividing it up into a sequence of tag, not-tag, tag, not-tag, then match up the tags afterwards. This will allow you more flexibility in your syntax, and less head-scratching with complexities like nested tags, or detecting incorrectly formatted input.
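
A rough sketch of that tokenising approach follows. The helpers `tokenize` and `checkBalanced` are hypothetical names invented for this example, and the sketch assumes that any tag whose name starts with `end` is a closing tag. A real parser would build a tree and report errors rather than return a boolean.

```php
<?php
// Hypothetical helper: split a template into a flat list of tag and text tokens.
function tokenize(string $src): array
{
    // Split on {% ... %} and keep the delimiters (the tags) in the result.
    $parts = preg_split('/({%.*?%})/s', $src, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $tokens = [];
    foreach ($parts as $part) {
        if (preg_match('/^{%\s*(\w+)\s*(.*?)\s*%}$/s', $part, $m)) {
            $tokens[] = ['type' => 'tag', 'name' => $m[1], 'args' => $m[2]];
        } else {
            $tokens[] = ['type' => 'text', 'value' => $part];
        }
    }
    return $tokens;
}

// Hypothetical helper: pair each end tag with the most recent open tag.
function checkBalanced(array $tokens): bool
{
    $stack = [];
    foreach ($tokens as $t) {
        if ($t['type'] !== 'tag') {
            continue;
        }
        if (strpos($t['name'], 'end') === 0) {
            // Assumption: "endfoo" closes "foo"; anything else is a mismatch.
            if (array_pop($stack) !== substr($t['name'], 3)) {
                return false;
            }
        } else {
            $stack[] = $t['name'];
        }
    }
    return $stack === []; // unclosed tags left over means malformed input
}

$tokens = tokenize('{% block title %}Changed{% endblock %}');
var_dump(checkBalanced($tokens));
```

Because the tags are matched up with a stack after tokenising, nested tags and mismatched end tags are handled without making the regex itself any more complicated.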

IMSoP