
I have user input where some tags are allowed inside square brackets. I've already written the regex pattern to find and validate what's inside the brackets.

In the user input field, an opening bracket ([) can be escaped with a backslash, and a backslash can itself be escaped with another backslash (\). I need a look-behind sub-pattern that rejects an odd number of consecutive backslashes before the opening bracket.

At the moment I must deal with something like this:

(?<!\\)(?:\\\\)*\[(?<inside_brackets>.*?)]

It works fine, but the problem is that this pattern still consumes any pairs of consecutive backslashes in front of the bracket as part of the match (even though they are not captured), and the look-behind only checks whether another single backslash precedes those pairs (or the opening bracket directly). I want to move all of the backslashes into the look-behind group, if possible.

Example:

my [test] string is ok
my \[test] string is wrong
my \\[test] string is ok
my \\\[test] string is wrong
my \\\\[test] string is ok
my \\\\\[test] string is wrong
...
etc
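
The parity rule in the examples above can be checked with a small script. This is only a sketch: `findTag` is a hypothetical helper, the pattern is the one from the question with the group named `inside_brackets`, and the subjects are generated with 0 to 5 backslashes before the bracket.

```php
<?php
// Pattern from the question, written as a nowdoc so the backslashes appear
// exactly as the regex engine sees them. The group is named inside_brackets,
// since PCRE group names cannot contain spaces.
$pattern = <<<'REGEX'
/(?<!\\)(?:\\\\)*\[(?<inside_brackets>.*?)]/
REGEX;

// Hypothetical helper: returns the tag name if the opening bracket is
// unescaped (even number of backslashes before it), otherwise null.
function findTag(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m['inside_brackets'] : null;
}

// Build "my " + N backslashes + "[test] string" for N = 0..5.
$results = [];
for ($n = 0; $n <= 5; $n++) {
    $subject = 'my ' . str_repeat('\\', $n) . '[test] string';
    $results[$n] = findTag($pattern, $subject);
    printf("%d backslashes: %s\n", $n, $results[$n] ?? '(no match)');
}
```

Even counts should yield the tag name and odd counts no match, although the whole match ($0) still includes the backslash pairs.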

I'm working with PHP (PCRE).

Wh1T3h4Ck5
  • Is there a finite limit to how many odd ones? Would 1, 3, 5, and 7 be enough to avoid? I assume you will let through 2, 4, 6, 8 though? – tchrist Mar 08 '12 at 06:04
  • @tchrist Unfortunately no, it's almost infinite. I found some examples in my database with 40+ consecutive slashes. Some guys are using them to make ASCII 'drawings' then use tags to color some elements or make hyperlinks. – Wh1T3h4Ck5 Mar 08 '12 at 06:39

2 Answers


Last time I checked, PHP did not support variable-length lookbehinds. That is why you cannot use the trivial solution (?<![^\\](?:\\\\)*\\).
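
One way to see this on a given installation is to ask PHP whether each pattern compiles. The following is a rough sketch, not a definitive test; the variable names are made up for illustration, and the result depends on the PCRE library your PHP build links against.

```php
<?php
// The variable-length look-behind mentioned above, as a nowdoc (no PHP escaping).
$variable = <<<'REGEX'
/(?<![^\\](?:\\\\)*\\)\[/
REGEX;

// A fixed-length look-behind for comparison.
$fixed = <<<'REGEX'
/(?<!\\)\[/
REGEX;

// preg_match() returns false when a pattern fails to compile;
// the @ operator only hides the resulting warning.
$variableCompiles = @preg_match($variable, 'x') !== false;
$fixedCompiles = @preg_match($fixed, 'x') !== false;

var_dump($variableCompiles, $fixedCompiles);
```

The first pattern is expected to be rejected because the `*` inside the look-behind has no upper bound, while the fixed-length one should compile.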

The simplest workaround is to match the entire thing, backslashes included, not just the brackets part:

(?<!\\)((?:\\\\)*)\[(?<inside_brackets>.*?)]

The difference is that now, if you use that regex in preg_replace, you have to remember to prefix the replacement string with $1 to restore the backslashes that the match consumed.
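
A minimal sketch of that workaround, assuming a made-up subject string and a hypothetical replacement format that wraps each tag in angle brackets:

```php
<?php
// Workaround pattern: group 1 captures the escaped-backslash pairs that
// precede the opening bracket, group 2 (inside_brackets) is the tag name.
$pattern = <<<'REGEX'
/(?<!\\)((?:\\\\)*)\[(?<inside_brackets>.*?)]/
REGEX;

$bs = '\\'; // a single literal backslash
// Subject reads: a [b] c \\[i] d \[u] e
$subject = 'a [b] c ' . $bs . $bs . '[i] d ' . $bs . '[u] e';

// Collecting tag names: the backslashes land in group 1, not in the name.
preg_match_all($pattern, $subject, $matches);
$tags = $matches['inside_brackets'];

// Replacing: $1 puts the consumed backslash pairs back, $2 is the tag name.
$replaced = preg_replace($pattern, '$1<$2>', $subject);

print_r($tags);
echo $replaced, "\n";
```

With preg_match_all() nothing has to be restored, since only the named group is read; the `$1` prefix matters only when the matched text is being replaced.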

Tim Pietzcker
Etienne Perot
  • +1 I found in the manual that there are some limitations inside look-behind sub-patterns, so I guess you're right about variable length. Matching the entire string and pulling out just what's inside the brackets is not a problem; I'm doing that at the moment. Some regex flavors, such as .NET, allow a full pattern in look-behinds, but I was wondering whether it is possible in PCRE. Btw, I'm using that pattern in preg_match_all(). Anyway, thanks for your answer. – Wh1T3h4Ck5 Mar 08 '12 at 06:56
  • No, it is not possible in PCRE; the whole-string-matching thing is simply a workaround for that. It provides the same functionality, at the cost of having to re-add those characters yourself, and excluding the extra matched region from the possible matches. This is not a problem here since the part of the string in question can only contain backslashes, so there cannot be a brackets match there. – Etienne Perot Mar 08 '12 at 07:03
  • @Wh1T3h4Ck5: The regex you accepted `(?<![^\\])` was incorrect. It was doing a *negative* lookbehind for a *negated* character class (containing a backslash), thereby making it a *positive* lookbehind for a backslash. You need to use `(?<!\\)` instead! I took the liberty to edit this answer. – Tim Pietzcker Mar 08 '12 at 11:24
  • @TimPietzcker - yes, I saw that earlier, but I accepted this answer because there's no solution for my problem in PCRE and opening sentence of this answer explains why. – Wh1T3h4Ck5 Mar 08 '12 at 11:36
  • @Tim, `(?<![^\\])` is not equivalent to `(?<=\\)`. The former will match at the beginning of the string if there's a match to be had there, while the latter requires the presence of at least one intervening character (i.e., a backslash). And yes, I know you're actually using `(?<!\\)` and not `(?<=\\)` (and correctly so, IMHO), but I couldn't let that remark go unchallenged. ;) – Alan Moore Mar 08 '12 at 22:01

You could do it without any look-behinds at all (the (\\\\|[^\\]) alternation eats anything but a single backslash):

^(\\\\|[^\\])*\[(?<brackets>.*?)\]
Scott Weaver
  • I need the backslashes as part of the look-behind group. I already have plenty of solutions without look-behinds; one of them, which works perfectly, is posted in the question above. I don't need alternative ways of doing the same job. Etienne Perot says in his answer that what I'm looking for is impossible with PCRE, so I either have to believe he's wrong (which I highly doubt) or rewrite the entire project in .NET, because so far .NET is the only one whose regex flavor supports a full pattern in look-behinds. – Wh1T3h4Ck5 Mar 08 '12 at 11:19
  • Btw, your example has two huge mistakes: 1. the anchor ^ searches at the beginning of the string only, 2. the group (\\\\|[^\\]) requires at least one character before the opening bracket, and that doesn't work if the document starts with a tag. – Wh1T3h4Ck5 Mar 08 '12 at 11:33
  • @Wh1T3h4Ck5: Change the + to an * asterisk, and it works at the start of the string too. Pretty obvious. – Scott Weaver Mar 08 '12 at 21:26
  • @Wh1T3h4Ck5: And the answer posted above DOES have lookbehinds in it, what do you think this is: (?<!\\\\) ? – Scott Weaver Mar 08 '12 at 21:32
  • Yes mate, it's one of the reasons why I accepted that answer. Btw, the pattern from that answer is an exact copy of the one I originally posted in the question. Look at this example `"This [is] my [test][string]"` and tell me, does your pattern match all the tags - `is`, `test` and `string`? Also, my question says **"I need to avoid them all (backslashes) inside look-behind group if possible"** and your answer just doesn't do that. Given the original question, I expected an answer like "Yes, that's possible", followed by a look-behind pattern, or "No, it's not possible". Simple as that. – Wh1T3h4Ck5 Mar 08 '12 at 22:12
  • My pattern would've matched the word "is" if run on that test string, which is exactly what it tries to do - I can only read what you actually wrote down, not what's in your head. And another thing: my answer is not a critique of Etienne's answer in any way, nor a proposal that you should not use look-behinds; it merely offers a different perspective. – Scott Weaver Mar 08 '12 at 22:34
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/8689/discussion-between-sweaver2112-and-wh1t3h4ck5) – Scott Weaver Mar 08 '12 at 22:43