Well - the title basically says it.
I want to look at the URL's query string and search it for specific values (a single character, or a small string).

I can do this successfully - so long as I'm only looking for "normal" characters (those often termed "safe": [a-zA-Z0-9-_.~]).
As soon as I start looking for 'unsafe' or 'foreign' characters - it gets ugly.
I've spent the entire day (and part of yesterday too) attempting to figure this out.
I've read tons ... RFCs, php.net for the encoding functions, encoding detection etc. too.
I've even attempted to set the encoding/charset at the top of the script.
I've gone through various encoding options, setting them dynamically, manually etc.
Nothing has worked.

Try the little script below.
Slap it into a file and access it, appending the query string below:
?q=a1-.<^ËàÜ

See what results you get.

function curPageURL() {
    $pageURL = 'http';
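    // isset() avoids an undefined-index notice when the request is plain HTTP
    if (isset($_SERVER["HTTPS"]) && $_SERVER["HTTPS"] == "on") {$pageURL .= "s";}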
    $pageURL .= "://";
    if ($_SERVER["SERVER_PORT"] != "80") {
        $pageURL .= $_SERVER["SERVER_NAME"].":".$_SERVER["SERVER_PORT"].$_SERVER["REQUEST_URI"];
    } else {
        $pageURL .= $_SERVER["SERVER_NAME"].$_SERVER["REQUEST_URI"];
    }
    return $pageURL;
}


// Keys are quoted: bare words would be treated as undefined constants
$needles = array(
    'needle1' => 'a',
    'needle2' => '1',
    'needle3' => '-',
    'needle4' => '.',
    'needle5' => '<',
    'needle6' => '^',
    'needle7' => 'Ë',
    'needle8' => 'à',
    'needle9' => 'Ü'
);

$haystack = parse_url(curPageURL(), PHP_URL_QUERY);


if (!empty($haystack)) {
    $needlelist = implode(' | ',$needles);

    echo "We are looking for some needles ( ".$needlelist." ) in a haystack    (".$haystack.")<br/>";

    foreach ($needles as $key=>$needle) {

        echo "We are looking for ".$key."<br/>";
        $check = strpos($haystack,$needle);
        if ($check !== false) {
            echo " - Yes : we found a needle (".$needle.") in the haystack";
        } else {
            echo " - No : we failed to find the needle (".$needle.") in the haystack";
        }
        echo "<br/>";

    }



    echo "--------------<br/>now lets try it with a little basing?<br/>";



    foreach ($needles as $key=>$needle) {

        echo "We are looking for ".$key."<br/>";

        // Basing - encode the searched for value, and replace any double-encoded % chars
        $needle = str_replace('%25','%',rawurlencode($needle));

        $check = strpos($haystack,$needle);
        if ($check !== false) {
            echo " - Yes : we found a needle (".$needle.") in the haystack";
        } else {
            echo " - No : we failed to find the needle (".$needle.") in the haystack";
        }
        echo "<br/>";

    }
}

I don't know about you, but instead of the strange characters, or their correct hex codes (as per the various lists/tables for urlencoded chars), I get the following ([Searched for] (1st results) (2nd results));

/a a a
/1 1 1
/- - -
/. . .
/< < %3C
/^ ^ %5E
/Ë Ã‹ %C3%8B
/à Ã %C3%A0
/Ü Ãœ %C3%9C

(/ added to prevent line insertion + the encoding here makes this Very difficult to post!)

the problem is - for example, the last one ... Ü should become %DC (as far as I can tell) - so why the paired hex?

I've tried reading up on multibyte stuff ... but I fail to see how the Browsers are encoding the chars in the URL, but the script won't.

So - anyone see what I'm doing wrong, or not doing, or figured this out already?

.

For the sake of Clarity...
... I am NOT asking how to replace the characters (I do Not want to turn Ü into U). Simply take a given string and see if it is in the URL (straight, or encoded for the URL).

Thanks, and I hope someone can help.

Ibu
theclueless1
  • Similar solved http://stackoverflow.com/questions/1371216/php-explode-using-special-characters/1371221 – Havenard May 09 '11 at 17:10
  • Similar question: [Url encoding PHP](http://stackoverflow.com/questions/5921090/url-encoding-php) – Gumbo May 09 '11 at 17:12

1 Answer

The different results are due to different character encodings. Today's browsers usually use UTF-8 to encode text entered directly into the location bar. Ü is encoded in UTF-8 as the two bytes 0xC3 0x9C, which are percent-encoded as %C3%9C because neither 0xC3 nor 0x9C is a valid byte in a URL. If you instead interpret 0xC39C with a single-byte character encoding like Windows-1252, you’ll get the two characters Ã (0xC3) and œ (0x9C). The %DC you expected is Ü in a single-byte encoding such as ISO-8859-1 or Windows-1252, where it is the single byte 0xDC.
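
To make the byte-level difference concrete, here is a minimal sketch. It assumes the script file is saved as UTF-8, PHP 7 or later (for the `\u{...}` escape), and that the mbstring extension is loaded; the variable names are only illustrative.

```php
<?php
// Assumption: mbstring is available and this file is saved as UTF-8.
$char = "\u{00DC}"; // Ü, held by PHP as the UTF-8 bytes C3 9C

echo bin2hex($char), "\n";        // c39c
echo rawurlencode($char), "\n";   // %C3%9C  (one %XX per byte)

// The same character converted to a single-byte encoding
$latin1 = mb_convert_encoding($char, 'ISO-8859-1', 'UTF-8');
echo bin2hex($latin1), "\n";      // dc
echo rawurlencode($latin1), "\n"; // %DC

// Reading the UTF-8 bytes as if they were Windows-1252 yields two characters
$mojibake = mb_convert_encoding($char, 'UTF-8', 'Windows-1252');
echo $mojibake, "\n";             // Ãœ
```

So `rawurlencode()` is not "pairing" anything: it percent-encodes whatever bytes the string already contains, and a UTF-8 Ü simply has two of them.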

Gumbo
  • That's brilliant - and I almost understood it :D Thank you. I don't suppose you have a solution as to how to search for chars like "Ü" and match them the same as the URL encoding? (URL encodes "Ü" as "%DC" - and nothing I do seems to get that result) Thanks again. – theclueless1 May 09 '11 at 17:46
  • @theclueless1: Check what character encoding might have been used and convert it if necessary. – Gumbo May 09 '11 at 17:48
  • As I've said - I've tried detecting the encoding type etc.... I get either ASCII (safe chars) or UTF-8 (unsafe/foreign). -------- But ------- I may have just had a success... see answer below? – theclueless1 May 09 '11 at 17:58
  • I may have just had a breakthrough. Using the little script supplied above, in the 2nd half ... change `// Basing - encode the searched for value, and replace any double-encoded % chars $needle = str_replace('%25','%',rawurlencode($needle));` to `$needle = urlencode(mb_convert_encoding($needle, "ISO-8859-1", "UTF-8"));` that seems to work. – theclueless1 May 09 '11 at 18:04
  • I haven't seen anywhere that I need to convert to a completely different encode type, then re-encode it ... nor how/why ... I'm esp. confused by the fact I have to break it away from UTF-8 ... as I thought that is what URLs were meant to be? But - Thank You -Gumbo- ... would never have managed it (nor thought of it) – theclueless1 May 09 '11 at 18:05
  • @theclueless1: I recommend you to use UTF-8 in your PHP files as well and convert/redirect the URI or respond with an error if a different character encoding is detected. You can try `mb_detect_encoding` to do so (e.g. `mb_detect_encoding($query, 'UTF-8,ISO-8859-1,Windows-1252,auto', true)`). – Gumbo May 09 '11 at 18:13
  • I'm confused. URLs seem to permit various chars (such as ":" / "/" / "?" / "#" / "[" / "]" / "@" etc.). These are reserved for special purposes. Am I meant to encode those as well? Will they function if converted to %nn, or do I need to try and figure out whether they are being used for the purpose intended (as opposed to ordinary values rather than delimiters etc.)? I've gone over the RFCs etc. - and I'm not seeing it mentioned. -- A fine example is the "!" character. That does Not seem to be encoded by browsers. I can use it in a filepath, a filename or in the query - it remains unencoded? – theclueless1 May 10 '11 at 14:29
  • Unless ... I only apply the check/encode to specific parts (parse/breakdown the URL first, and apply the smallest/specific parts - thus I encode the Parameters and the Values, Not the Query)? – theclueless1 May 10 '11 at 15:18
  • @theclueless1: Although these characters are delimiters and special characters in certain contexts, they may appear in plain in other contexts. But you can encode any character using the percent-encoding although you don’t need it. For example, `/%66%6F%6F` is equivalent to `/foo`. – Gumbo May 10 '11 at 17:30
  • @Gumbo : thanks for the reply. What's confusing (and concerning me) is detecting Valid uses of some chars, and invalid uses of others. According to the Specs (RFCs etc.) things like "(:)" are a form of reserved char - and can be used for special meaning. If they are Not being used for that specific function - they should be encoded. - so, my problem is - how do I detect if it is being used correctly or not? -- OR -- does it not matter, and even if encoded, it would be understood? (Example = "/" cannot be encoded in the Filepath of a URL and function as a directory indicator). – theclueless1 May 11 '11 at 12:12
  • @theclueless1: You can encode any character and it’s getting interpreted as the character it represents: `/%66%6F%6F` is equivalent to `/foo`. Only plain characters can be interpreted as delimiters: a URI reference `http://example.com/` is clearly an absolute URI but `http%3A//example.com/` is a relative URI (i.e. URI path). – Gumbo May 11 '11 at 12:23
  • @Gumbo : see - this is why I'm confused. I can use (:) in a Query. It can have a special meaning (sub parameter/value? e.g. ?q=(this:that)&y=(12:x)) ... or it could be part of a basic string, say an article about football scores "wolves-vs-redgiants(1:12)". So - if I understand it correctly - it would be application dependent as to how to interpret the data in the URL, and in such cases, a generic URL examination is basically useless? ++ Thanks for the additional info ;) – theclueless1 May 11 '11 at 12:31
  • @theclueless1: You can use `:` in the [query](http://tools.ietf.org/html/rfc3986#section-3.4): `query = *( pchar / "/" / "?" )` and *pchar* is `ALPHA / DIGIT / "-" / "." / "_" / "~" / "%" HEXDIG HEXDIG / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / ":" / "@"`. – Gumbo May 11 '11 at 12:45
  • @Gumbo: okay ... that looks like the stuff from the RFC. So, should I be doing a `preg_match` to see if the char is Not one of them, then encode if not ... or should I simply encode any non-standard char? That's basically the issue I'm facing: (1) should I be handling chars like ":" as ":", or (2) automatically converting them to "%nn", or (3) attempting to figure out their usage, and only encode to "%nn" if they seem to be non-operators, else leave them as is if they seem to be operators ??? (I hope I'm being clear and not simply confusing more) – theclueless1 May 11 '11 at 13:03
  • [http://www.faqs.org/rfcs/rfc3986.html] -6.2.2.2. Percent-Encoding Normalization- ::::: In addition to the case normalization issue noted above, some URI producers percent-encode octets that do not require percent-encoding, resulting in URIs that are equivalent to their non-encoded counterparts. These URIs should be normalized by decoding any percent-encoded octet that corresponds to an unreserved character, as described in Section 2.3. ::::: This suggests that I should hex-encode the characters you listed earlier, but ensure that ALPHA / DIGIT / "-" / "." / "_" / "~" are Not – theclueless1 May 11 '11 at 13:42
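
Pulling the comments above together, one possible approach is to split the query on its delimiters first, decode each piece, and only then compare, rather than trying to encode the needle the same way the browser did. The sketch below is an illustration under stated assumptions, not code from the thread: `queryContains` is a hypothetical helper name, the needle is assumed to be UTF-8, mbstring is assumed to be loaded, and any decoded piece that is not valid UTF-8 is assumed to be ISO-8859-1.

```php
<?php
// Hypothetical helper: does $needle occur in any parameter name or value
// of a raw (still percent-encoded) query string?
// Assumptions: $needle is UTF-8; mbstring is loaded; non-UTF-8 input is ISO-8859-1.
function queryContains(string $query, string $needle): bool
{
    if ($needle === '') {
        return false; // nothing to search for
    }

    // Split on the plain delimiters before decoding, so an encoded "&" or "="
    // inside a value (%26, %3D) is treated as data, not as a delimiter.
    foreach (explode('&', $query) as $pair) {
        foreach (explode('=', $pair, 2) as $part) {
            // %66%6F%6F and foo decode to the same bytes
            $decoded = rawurldecode($part);

            // A lone byte such as 0xDC is not valid UTF-8: treat it as ISO-8859-1
            if (!mb_check_encoding($decoded, 'UTF-8')) {
                $decoded = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
            }

            if (strpos($decoded, $needle) !== false) {
                return true;
            }
        }
    }
    return false;
}

var_dump(queryContains('q=a1-.%3C%5E%C3%8B%C3%A0%C3%9C', "\u{00DC}")); // bool(true)
var_dump(queryContains('q=%DC', "\u{00DC}"));                          // bool(true)
var_dump(queryContains('a=1&b=2', '&'));                               // bool(false)
```

Decoding before comparing means `%C3%9C`, `%DC`, and a raw Ü all match the same needle, and reserved characters such as ":" need no special handling because they are compared as plain data within each piece. Note that `rawurldecode()` leaves "+" untouched; if the query comes from an HTML form, where "+" means a space, `urldecode()` may be the better fit.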