I want to replace all characters in a string with their percent-encoding representation (%xy), but only the ones that are not already percent-encoded.

For example, in the string abc#%2Bdef, the %2B part is already a percent-encoded representation. So it should not be re-encoded. The correct result after encoding should be: abc%23%2Bdef.

This is what I have tried - but the result is still abc#%2Bdef:

// Pattern: read all characters except the percent-encoded ones (%xy).
$pattern = '/(?!%[a-fA-F0-9]{2})/';
$string = 'abc#%2Bdef';

$result = preg_replace_callback($pattern, function($matches) {
    return rawurlencode($matches[0]);
}, $string);

var_dump($result);

I think it's just the $pattern value that should be changed, but I'm not sure. And with the current pattern the rawurlencode() inside the callback is not called.

Encoding legend: %23 -> #, %2B -> +

I spent many hours today trying to find the right pattern. And it seemed very simple in the beginning... I would really appreciate any advice or solution.

Thank you very much.

  • This doesn't really make sense. What if the percents are intended to be literal, not part of a percent-encoding? How are you getting a string where parts of it are url-encoded, not all of it? – Barmar Nov 08 '17 at 22:10
  • Try https://ideone.com/4o9ceO – Wiktor Stribiżew Nov 08 '17 at 22:11
  • Thank you, @Barmar. Well, you asked me a good question. Give me a little time to think about it. –  Nov 08 '17 at 22:12
  • Oh, thanks @WiktorStribiżew. Let me check what it does - those pattern verbs (`SKIP` and `F`) are unknown to me. I'll look into the docs right now. –  Nov 08 '17 at 22:15
  • Just see https://stackoverflow.com/questions/24534782/how-do-skip-or-f-work-on-regex – Wiktor Stribiżew Nov 08 '17 at 22:16
  • @Barmar I'm providing a URI string to my UriInterface implementation (e.g. to a Uri class) as part of PSR-7. For processing the URI string correctly, the following is specified: _If a value in a key/value pair of the query string should include an ampersand ("&") not intended as a delimiter between values, that value MUST be passed in encoded form (e.g., "%26") to the instance_. So, if I pass a `%` to the Uri instance, followed by 2 hex chars ([0-9A-Fa-f]), then it's seen as the encoded value of some char. Otherwise it is not, and it will be encoded. –  Nov 08 '17 at 22:39
  • @aendeerei I think you should just be able to call `urlencode()` all the time. If the input to your code is already encoded, it's because you're trying to pass a nested URL-encoded string through. For instance, a URL parameter could be another URL, and that inner URL might include URL-encoded parameters. They need to be double-encoded. – Barmar Nov 08 '17 at 22:46
  • The main point is that you shouldn't make any assumptions about the input, just treat it as raw data that needs to pass through the API. You can't treat URL-encoded input differently, because you don't know whether they intend it to pass through literally or be decoded. – Barmar Nov 08 '17 at 22:56
  • I can't really commit to a detailed chat like that. Let me give a simple example. When you go to a web page, it will often redirect you to a login page, and after you log in it redirects back to the original URL. This is frequently done by redirecting you to something like `http://signon.company.com?url=`. If the original URL contained percent-encoded parameters, the parameter in `url=` will have them double-encoded, so that when it redirects back there the parameters will still be percent-encoded. – Barmar Nov 08 '17 at 23:14
  • @Barmar I understand. No problem. I just wanted to ask you what you meant exactly with the content of your comments. But I think I'll understand all, based on your last example too. Again, thank you for your patience! Good luck. –  Nov 08 '17 at 23:26
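
For readers following the `(*SKIP)(*F)` hint in the comments above, here is a minimal sketch of how those verbs could be used for this task. It is not necessarily the pattern behind the ideone link; the function name `encodeUnlessEncoded` is made up for illustration, and the choice of which characters to leave untouched (the RFC 3986 unreserved set) is an assumption.

```php
<?php
// Hypothetical helper, not taken from the original post.
function encodeUnlessEncoded(string $input): string
{
    // First alternative: an existing %XX triplet is matched and then
    // rejected with (*SKIP)(*F), so the engine resumes right after it.
    // Second alternative: any single byte that is not an unreserved
    // character (letters, digits, "-", ".", "_", "~").
    $pattern = '/%[A-Fa-f0-9]{2}(*SKIP)(*F)|[^A-Za-z0-9_.~-]/';

    return preg_replace_callback(
        $pattern,
        static function (array $m): string {
            return rawurlencode($m[0]);
        },
        $input
    );
}

var_dump(encodeUnlessEncoded('abc#%2Bdef')); // string(12) "abc%23%2Bdef"
var_dump(encodeUnlessEncoded('100%'));       // string(6) "100%25"
```

Note that a `%` that is not followed by two hex digits falls into the second alternative and is encoded as `%25`. Barmar's objection still applies: the function cannot know whether a `%2B` in the input was meant literally.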

1 Answer

The easy way is to decode the previously encoded characters first and then re-encode the whole string.

$string = 'abc#%2Bdef';
$string = rawurlencode(rawurldecode($string));

This would give you the expected result.

abc%23%2Bdef
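
A few side effects of the round trip are worth seeing before relying on it. The sketch below wraps the two calls in a helper (the name `reencode` is hypothetical) and runs it on some extra inputs that are not from the question.

```php
<?php
// Hypothetical wrapper around the decode-then-encode round trip.
function reencode(string $s): string
{
    return rawurlencode(rawurldecode($s));
}

var_dump(reencode('abc#%2Bdef')); // string(12) "abc%23%2Bdef"
var_dump(reencode('a%2bb'));      // hex digits come back uppercase: "a%2Bb"
var_dump(reencode('%41'));        // an encoded unreserved char stays decoded: "A"
var_dump(reencode('a/b?c=d'));    // delimiters are encoded too: "a%2Fb%3Fc%3Dd"
```

Because `/`, `?`, `=` and `&` are encoded as well, this is probably best applied to a single component value rather than to a complete URI.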
Chin Leung
  • Chin, it will take more time than expected. I don't think that I'll manage to check it today anymore. May I give you my feedback tomorrow? Thank you. –  Nov 08 '17 at 23:16
  • Hello Chin. I did some research on the web. Your solution is elegant and intuitive. Unfortunately, it cannot be applied in all situations. For example, if it's applied to a string coming from the `$_GET` variable, which is already decoded by default, it can have unexpected results. This is specified here: [Notes](http://php.net/manual/en/function.urldecode.php#refsect1-function.urldecode-notes). Also, [RFC3986](https://tools.ietf.org/html/rfc3986#section-2.4) states that –  Nov 09 '17 at 14:33
  • @aendeerei I think the unexpected result is that it decodes the `+` to space in already decoded strings. What if you use `rawurldecode` instead of `urldecode`? Wouldn't it be fine? – Chin Leung Nov 09 '17 at 14:37
  • "_Implementations must not decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string._". So, in principle, one must know beforehand whether the string is already decoded before using urldecode on it. So, the only remaining option is to "manually" parse the string via regex and to discover the chars which are to be interpreted as being part of a percent-encoded octet substring (`%xy`). –  Nov 09 '17 at 14:39
  • Hm, good question. I tested it. It works perfectly with rawurldecode/rawurlencode and `+`. I tested it on other chars and %xy pairs too. I discovered that the decode gives funny chars back. The encode, though, transforms those funny results back into good chars. That said, your solution can solve my problem without a doubt. It's not clear to me yet whether it fits the specs of my project as a whole, but it does what I asked you to help me with. Thank you very much for your time, Chin! I appreciate it. Good luck ;-) –  Nov 09 '17 at 15:01
  • I missed the rawurlencode/decode part of your question :-)))) Sorry. I re-edited my last comment. Thanks. You too! –  Nov 09 '17 at 15:03
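
The RFC 3986 warning quoted in the comments above can be illustrated with a small sketch. The input value is made up for the example: a raw string that happens to contain a literal `%41`.

```php
<?php
// Raw data that literally contains the characters "%41" (illustrative value).
$original = '100%41';

$encoded = rawurlencode($original); // "100%2541"
$once    = rawurldecode($encoded);  // "100%41", the original again
$twice   = rawurldecode($once);     // "100A", the literal "%41" was misread

var_dump($encoded, $once, $twice);
```

Decoding once restores the data; decoding a second time silently changes it, which is the misinterpretation the RFC describes.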