
I get some users requesting pages with encoded URLs that just don't make it through the $_GET['tag'] decoding.

The worst offender in my mind is %5Cu003d, but there are others. In this example, page.php?tag%5Cu003d44 should be page.php?tag=44: %5C is a backslash, so \u003d is the Unicode escape for U+003D, or "=".
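To make the failure concrete, here is a small sketch of what PHP's ordinary query-string parsing does with this input. `parse_str()` is used as a stand-in for the parsing that fills `$_GET`, and the variable names are only for illustration.

```php
<?php
// Percent-decoding turns %5C into a backslash, but the u003d that follows
// is not URL encoding, so PHP leaves it as literal text.
$raw = 'tag%5Cu003d44';

parse_str($raw, $parsed);   // the same kind of percent-decoding that fills $_GET
var_export($parsed);        // one key, 'tag\u003d44', with an empty value
echo "\n";

// There is no key named 'tag', so $_GET['tag'] would not be set.
var_dump(isset($parsed['tag']));   // bool(false)
```

The whole string ends up as a parameter name, which is why no header or setting on the response side can make `$_GET['tag']` appear.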

I have no idea what website has encoded this URL but I am trying to give people what they want without manually decoding the thing. Is there some switch or way to do this so that $_GET works? Probably not huh?

I tried sending this header, per another discussion on SO, but it didn't help: header('Content-type: text/html; charset=utf-8');

EDIT*****************************

Here are more examples of bad URLs:

page.php?lat=25.79&amp%3Blon=-80.16
page.php?lat=41.46u0026lon%3D-82.71
page.php?lat%5Cu003d30.31%5Cu0026lon%5Cu003d-89.33
page.php?lat=28.94-89.4&lon
Allen Edwards
  • I think we will need to see a collection of your mangled querystrings to get a better understanding of what you are dealing with. What framework is being used? Are there any redirects happening? With only one sample string, this is the best I can offer https://3v4l.org/8umOP but I am reluctant to post an answer because there are just too many ways that your data can vary. – mickmackusa Sep 02 '20 at 09:00
  • @mickmackusa Thank you for your reply. I was looking for a solution where $_GET or equivalent can do the decoding by running some header or turning on some switch or using a different function. That said, I will edit my post with more examples. – Allen Edwards Sep 02 '20 at 13:55
  • Yes, having to perform surgery on your url is not a stable/professional solution. You shouldn't need to do any of this monkey business. Please be clearer and more consistent about the urls that you are receiving. The first one that you posted starts with `page.php`, the second one starts with what I would expect to see after the `?` (start of the query string). Again... any frameworks or redirects in play? What are the potential sources of these urls? If they are from within your application, then you need to be able to trace them back to a js or html source. – mickmackusa Sep 02 '20 at 14:21
  • I wish I knew what the source of the URLs was. And yes, the examples are after the ? in the URL. Sorry for the confusion. I have tried to search for them unsuccessfully. There is nothing returned by getenv( "HTTP_REFERER" ) and they are certainly not coming from where they should, which is another page from my site. (they are php images) It is not unreasonable to ignore them, but I was wondering if there was a simple solution. – Allen Edwards Sep 02 '20 at 15:20

2 Answers


If this was my project, I probably wouldn't be dignifying these urls -- even if stakeholders asked nicely. It really is a mess and there is a high likelihood that data will get corrupted in the decoding process. ...but if you want to have a go, you can start with something like this:

Code: (Demo)

// this is a hack until you can manage to resolve the encoding issue in a more professional manner
// use $_SERVER['QUERY_STRING'] to extract the query string from the url

$queryStrings = [
    'lat=25.79&amp%3Blon=-80.16',
    'lat=41.46u0026lon%3D-82.71',
    'lat%5Cu003d30.31%5Cu0026lon%5Cu003d-89.33',
    'lat=28.94-89.4&lon',
    'tag%5Cu003d44'
];

foreach ($queryStrings as $queryString) {

    // replace unicode-like substrings
    $queryString = preg_replace_callback('/u([\da-f]{4})/i', function ($match) {
        return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
    }, urldecode($queryString));
    // courtesy of Gumbo: https://stackoverflow.com/a/2934602/2943403
    
    // replace HTML-encoded ampersands (&amp;) with plain ones and remove backslashes
    $queryString = strtr($queryString, ['&amp;' => '&', '\\' => '']);
    
    // parse the decoded query string back into the GET superglobal so that regular processing can resume
    parse_str($queryString, $_GET);
    var_export($_GET);
    echo "\n";
}

Output:

array (
  'lat' => '25.79',
  'lon' => '-80.16',
)
array (
  'lat' => '41.46',
  'lon' => '-82.71',
)
array (
  'lat' => '30.31',
  'lon' => '-89.33',
)
array (
  'lat' => '28.94-89.4',    // <-- I guess you'll need to massage this into the correct shape too
  'lon' => '',
)
array (
  'tag' => '44',
)
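For the fourth query string, where the longitude ended up glued onto the latitude, one possible clean-up is sketched below. It assumes the separator was simply lost and that the minus sign of the longitude marks where the second number starts, which only holds for negative longitudes. `splitGluedCoordinates()` is a hypothetical helper name, not part of the code above.

```php
<?php
/**
 * Hypothetical helper: split a value such as '28.94-89.4' into two numbers.
 * Assumption: the second number is negative, so its minus sign is the boundary.
 * Returns [lat, lon] as floats, or null when the value does not match.
 */
function splitGluedCoordinates(string $value): ?array
{
    // first number (optionally signed), then a second number that must start with '-'
    if (preg_match('/^(-?\d+(?:\.\d+)?)(-\d+(?:\.\d+)?)$/', $value, $m)) {
        return [(float) $m[1], (float) $m[2]];
    }
    return null;
}

$params = ['lat' => '28.94-89.4', 'lon' => ''];

// only attempt the repair when lon is empty and lat is not a plain number
if ($params['lon'] === '' && !is_numeric($params['lat'])) {
    $pair = splitGluedCoordinates($params['lat']);
    if ($pair !== null) {
        [$params['lat'], $params['lon']] = $pair;
    }
}
var_export($params);
echo "\n";
```

A value with a positive longitude cannot be split this way because there is no sign to act as a boundary, so rejecting the request is the safer option in that case.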
mickmackusa
  • I am going to accept this answer and the advice "If this was my project, I probably wouldn't be dignifying these urls -- even if stakeholders asked nicely." Therefore I won't be changing my code but just offering a nice message instead about how to click on the setup page to get a proper URL. If they came from somewhere else on the web, too bad. – Allen Edwards Sep 02 '20 at 17:05

I decided to try to decode the bad URLs after all because, for some unknown reason, they were showing up as coming from my own page. I was worried that some device was encoding the calls, perhaps Android, perhaps some new browser. I have no idea what is encoding them, but as some seem to be coming from my website I thought I should fix them. Just to clarify, this is a PHP image embedded in one of my sites. So far this has caught all the instances over the last few days. The idea is to take the query string, decode it step by step, and then manually extract the two variables, but only if they have not already been decoded successfully by the normal process. That way I am only dealing with calls I would otherwise have rejected, so any unintended consequences should be minor.

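<?php
// values from the normal $_GET decoding; empty string when the key is missing
$latitude = trim(strip_tags($_GET['lat'] ?? ''));
$longitude = trim(strip_tags($_GET['lon'] ?? ''));
$request = getenv("QUERY_STRING");
$request = urldecode($request); // get rid of %5C type conversions
// unicode_decode() is not a PHP built-in; it is assumed to be a user-defined
// helper (not shown) that converts \uXXXX escapes into characters
$request = unicode_decode($request); // with the %5C stuff removed, convert any unicode
$i = strpos($request, "lon");
$j = strpos($request, "lat");
// only decode things that didn't work with normal $_GET
// strpos() returns false when nothing is found and 0 for a match at the very
// start, so compare against false explicitly
if ($i !== false && $longitude === "") $longitude = (float) substr($request, $i + 4);
if ($j !== false && $latitude === "") $latitude = (float) substr($request, $j + 4);
?>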
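`unicode_decode()` is not a PHP built-in, and its definition is not shown above. A minimal sketch of what such a helper might look like follows, assuming it only needs to turn `\uXXXX` escapes (with the backslash present) into UTF-8 characters. It borrows the `mb_convert_encoding()` approach from the other answer, needs the mbstring extension, and is not necessarily the helper actually used here.

```php
<?php
// Hypothetical stand-in for the unicode_decode() helper called above.
// Assumption: escapes look like \u003d (backslash, 'u', four hex digits).
function unicode_decode(string $text): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-f]{4})/i',
        function (array $match): string {
            // treat the four hex digits as one big-endian UCS-2 code unit
            return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
        },
        $text
    ) ?? $text; // fall back to the input if the regex engine reports an error
}

echo unicode_decode(urldecode('lat%5Cu003d30.31%5Cu0026lon%5Cu003d-89.33')), "\n";
// prints: lat=30.31&lon=-89.33
```

Because the pattern requires the backslash, a string such as `lat=41.46u0026lon%3D-82.71` would pass through unchanged; matching a bare `u0026` as well is possible but risks rewriting legitimate text.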
Allen Edwards