I have a simple string:

$string = '--#--%--%2B--';

I want to percent-encode all characters (including the "lonely" %), except the - character and triplets of the form %xy. So I wrote the following two alternative patterns:

$pattern1 = '/(?:[\-]+|%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us';
$pattern2 = '/(?:[\-]+)(*SKIP)(*FAIL)|(?:%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us';

Please notice the use of (multiple) (*SKIP)(*FAIL) and of (?:).

Matching and replacing gives the same result with both patterns, and it is the correct one:

--%23--%25--%2B--

I would like to ask:

  • Are the two patterns equivalent? If not, which one would be the proper one to use for URL-encoding? Could you please explain briefly why?
  • Would you suggest other alternatives (involving backtracking control verbs), or are my patterns a good choice?
  • Can I apply only one (?:) around the whole (chosen) pattern, even if the (multiple) (*SKIP)(*FAIL) will be inside it?

I know that I request a little too much from you by asking more questions at once. Please accept my apology! Thank you very much.


P.S: I've tested with the following PHP code:

$result = preg_replace_callback($patternX, function($matches) {
    return rawurlencode($matches[0]);
}, $string);
echo $result;
  • The patterns are based on my former suggestion (so, yeah, a good choice :)). They both work, but you do not need a grouping construct at all if you are not quantifying it and have no alternation inside it. Remove the `(?:...)` from the second pattern, as they are redundant. – Wiktor Stribiżew Nov 11 '17 at 17:31
  • @WiktorStribiżew Indeed, I based my patterns on your suggestion. Thanks again, I am happy that you gave me this option! I found it very elegant, so I researched and studied it. Unfortunately, examples with multiple skip-fail are nowhere to be found... By grouping construct you mean the `(?:)`, right? If that's the case: if I choose pattern 1 (being a bit easier to understand), I should keep the `(?:)`, because there's an alternation inside, right? And if I choose pattern 2, I should remove both `(?:)`, because they have no alternation inside, right?... –  Nov 11 '17 at 17:49
  • @WiktorStribiżew ...Could you please tell me, which one you'd recommend? Thank you very much. P.S: Would you write an answer too? It would be great, because the other users would benefit from your suggestions too. –  Nov 11 '17 at 17:49
  • You are right. Use either of them. But regarding an answer, isn't [this](https://stackoverflow.com/a/47190672/3832970) the *real* answer? – Wiktor Stribiżew Nov 11 '17 at 17:51
  • @WiktorStribiżew Then I will choose the first pattern... Well, that answer is a correct answer too. I tested it thoroughly and it really does what I was asking. A possible answer based on your suggestion would have been correct too. And I saw that some frameworks are using a third correct method. So, it seems that there are multiple answers to the same question. I didn't want to give up until I understood your recommendation, the backtracking verbs solution. So I studied it, I liked it, and I decided to choose it for my PSR-7. So yes, an answer from you would be really welcome for all users. –  Nov 11 '17 at 18:07
  • @aendeerei: to be clear, you can do the same thing like this: `$result = rawurlencode(rawurldecode($str));` – Casimir et Hippolyte Nov 11 '17 at 18:10
  • Hi, @CasimiretHippolyte. I appreciate it. Actually, I know that it works, but the problem is that I'm... somehow afraid of the "indistinct" characters which appear right after decoding with `rawurldecode` and before passing them to `rawurlencode`. I have the feeling that there could be some effects of which I'm not yet aware. In my tests I discovered that, with the pattern-based method, I can control all possible... strange situations. That's the only reason for choosing the method presented here. –  Nov 11 '17 at 18:20

1 Answer

First of all, both patterns rely on the (*SKIP)(*FAIL) PCRE verb sequence, a well-known "trick" for matching some text and then skipping it. See How do (*SKIP) or (*F) work on regex? for more details.
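
To see what (*SKIP) contributes on its own, here is a small sketch that compares a skip-fail alternative with the same alternative using only (*FAIL). The simplified pattern and the variable names `$withSkip` and `$withoutSkip` are assumptions made for illustration; they are not taken from the question.

```php
<?php
$string = '--#--%--%2B--';

// With (*SKIP): after "%2B" has been consumed, (*FAIL) rejects the match and
// the next attempt starts right after "%2B", so none of its chars is matched.
preg_match_all('/%[A-Fa-f0-9]{2}(*SKIP)(*FAIL)|[^\-]/us', $string, $withSkip);

// Without (*SKIP): (*FAIL) only rejects the first alternative, so the second
// alternative is tried at the same position and matches the "%" of "%2B".
preg_match_all('/%[A-Fa-f0-9]{2}(*FAIL)|[^\-]/us', $string, $withoutSkip);

echo implode(' ', $withSkip[0]), PHP_EOL;    // # %
echo implode(' ', $withoutSkip[0]), PHP_EOL; // # % % 2 B
```

Without (*SKIP), the `%` of `%2B` would reach the callback and be encoded a second time, giving `%252B`.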

The two patterns yield the same results. (?:[\-]+|%[A-Fa-f0-9]{2})(*SKIP)(*FAIL) matches either [\-]+ or %[A-Fa-f0-9]{2} and then skips the match. (?:[\-]+)(*SKIP)(*FAIL)|(?:%[A-Fa-f0-9]{2})(*SKIP)(*FAIL) first tries to match [\-]+ and skips it if found; otherwise it tries to match %[A-Fa-f0-9]{2} and skips that match if found. The (?:...) non-capturing groups in the second pattern are redundant, as there is no alternation inside them and the groups are not quantified. You may use any number of (*SKIP)(*FAIL) sequences in your pattern; just make sure each one comes before the | that ends its alternative, so that it skips the relevant match.
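
As a quick check of the equivalence, the following sketch runs the two patterns from the question plus a third variant with the redundant groups removed. The `simplified` pattern and the names `$patterns`, `$encode` and `$results` are assumptions for illustration only.

```php
<?php
$string = '--#--%--%2B--';

$patterns = [
    // The two patterns from the question, unchanged.
    'pattern1'   => '/(?:[\-]+|%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us',
    'pattern2'   => '/(?:[\-]+)(*SKIP)(*FAIL)|(?:%[A-Fa-f0-9]{2})(*SKIP)(*FAIL)|./us',
    // Hypothetical variant of pattern2 without the redundant (?:...) groups.
    'simplified' => '/-+(*SKIP)(*FAIL)|%[A-Fa-f0-9]{2}(*SKIP)(*FAIL)|./us',
];

// Percent-encode every character the pattern actually matches.
$encode = function (array $matches): string {
    return rawurlencode($matches[0]);
};

$results = [];
foreach ($patterns as $name => $pattern) {
    $results[$name] = preg_replace_callback($pattern, $encode, $string);
    echo $name, ': ', $results[$name], PHP_EOL; // each line ends with --%23--%25--%2B--
}
```

This only shows that the three variants agree on the sample string; it is not a proof for every input.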

The SKIP-FAIL technique is useful when you want to match some text only in a specific context: when a char should be skipped/"avoided" if it is preceded and followed by certain chars, or when you need to "avoid" matching a whole sequence of chars, as in this scenario. So SKIP-FAIL is a good fit here.

Wiktor Stribiżew