
I've found that the PHP functions basename() and pathinfo() behave strangely with multibyte UTF-8 names. They strip all non-Latin characters up to the first Latin character or punctuation sign. After that point, however, non-Latin characters are preserved.

basename("àxà"); // returns "xà", I would expect "àxà" or just "x" instead
pathinfo("àyà/àxà", PATHINFO_BASENAME); // returns "xà", same as above

but curiously the dirname part of pathinfo() works fine:

pathinfo("àyà/àxà", PATHINFO_DIRNAME); // returns "àyà"

The PHP documentation warns that basename() and pathinfo() are locale-aware, but this does not justify the inconsistency between pathinfo(..., PATHINFO_BASENAME) and pathinfo(..., PATHINFO_DIRNAME), not to mention the fact that identical non-Latin characters are either discarded or kept, depending on their position relative to Latin characters.

It sounds like a PHP bug.

Since "basename" checks are important for security, in particular to prevent directory traversal, is there a reliable basename filter that works well with Unicode input?

Demis Palma ツ

1 Answer


I've found that changing the locale fixes everything.

While Apache runs with the "C" locale by default, CLI scripts run with a UTF-8 locale by default, such as "en_US.UTF-8" (or in my case "it_IT.UTF-8"). Under these conditions, the problem does not occur.

Therefore, the workaround on Apache is to change the locale from "C" to "C.UTF-8" before calling these functions.

setlocale(LC_ALL,'C.UTF-8');
basename("àxà"); // now returns "àxà", which is correct
pathinfo("àyà/àxà", PATHINFO_BASENAME); // now returns "àxà", which is correct
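
The "C.UTF-8" locale is not installed on every system, so it may help to try more than one name. The following is only a sketch: use_first_utf8_locale() is a hypothetical helper, and the candidate names are assumptions that should be checked against `locale -a` on the target machine. It relies on setlocale() accepting an array of names and returning false when none of them can be applied.

```php
<?php
/**
 * Hypothetical helper: apply the first UTF-8 locale from a candidate list.
 * The candidate names are assumptions; check `locale -a` on the target system.
 *
 * @return string|false The locale that was applied, or false if none is installed.
 */
function use_first_utf8_locale(array $candidates = array('C.UTF-8', 'en_US.UTF-8'))
{
    // setlocale() accepts an array and tries each entry until one succeeds.
    return setlocale(LC_ALL, $candidates);
}

$applied = use_first_utf8_locale();
if ($applied === false) {
    // No candidate is available, so basename() keeps its current behaviour.
    error_log('No UTF-8 locale available');
}
```

Checking the return value matters here, because a failed setlocale() call leaves the old locale in place without raising an error.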

Or even better, if you want to back up the current locale and restore it once done:

$lc = new LocaleManager();
$lc->doBackup();
$lc->fixLocale();
basename("àxà/àyà");
$lc->doRestore();


class LocaleManager
{
    /** @var array Saved locale settings, keyed by category name (e.g. "LC_ALL") */
    private $backup = array();


    public function doBackup()
    {
        $this->backup = array();
        $localeSettings = setlocale(LC_ALL, 0);
        if (strpos($localeSettings, ";") === false)
        {
            $this->backup["LC_ALL"] = $localeSettings;
        }
        // If any of the locales differs, then setlocale() returns all the locales separated by semicolon
        // Eg: LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=C;...
        else
        {
            $locales = explode(";", $localeSettings);
            foreach ($locales as $locale)
            {
                list ($key, $value) = explode("=", $locale);
                $this->backup[$key] = $value;
            }
        }
    }


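    public function doRestore()
    {
        foreach ($this->backup as $key => $value)
        {
            // setlocale() may report categories (e.g. LC_PAPER) that PHP does not
            // expose as constants; skip those instead of failing in constant()
            if (defined($key))
            {
                setlocale(constant($key), $value);
            }
        }
    }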


    public function fixLocale()
    {
        setlocale(LC_ALL, "C.UTF-8");
    }
}
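
If changing the process-wide locale is undesirable, another option is to avoid the locale-aware functions altogether. The following is only a sketch and not a drop-in replacement for basename(): utf8_basename() is a hypothetical helper name, and it assumes the input is UTF-8, where the bytes for "/" and "\" never occur inside a multibyte character, so plain byte-wise string functions are safe.

```php
<?php
/**
 * Hypothetical helper: return the trailing path component without consulting the locale.
 * Assumes $path is UTF-8. Unlike basename(), it treats "\" as a separator on every platform.
 */
function utf8_basename(string $path, string $suffix = ''): string
{
    // Normalise separators, then drop trailing ones, as basename() does.
    $path = rtrim(str_replace('\\', '/', $path), '/');

    // Keep everything after the last separator, or the whole string if there is none.
    $pos = strrpos($path, '/');
    $name = $pos === false ? $path : substr($path, $pos + 1);

    // Strip the suffix only when it is a proper ending of the name.
    if ($suffix !== '' && $name !== $suffix && substr($name, -strlen($suffix)) === $suffix) {
        $name = substr($name, 0, -strlen($suffix));
    }

    return $name;
}

echo utf8_basename("àyà/àxà"), "\n"; // prints "àxà"
```

Treating the backslash as a separator everywhere is a deliberate choice for traversal checks. The result can still be "." or "..", so those values need their own check before the name is used in a path.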
Demis Palma ツ
  • Doesn't work for me. Use this instead: setlocale(LC_ALL,'en_US.UTF-8'); – Amir Surnay Jul 20 '20 at 12:49
  • Keep in mind that, for the split second the locale is changed, the entire process's locale is different, and any other script running in the same process will see the changed locale. (99,(9)% of web servers run a non-thread-safe PHP, and this is not limited to Apache mod_php; FPM is affected too.) – AnrDaemon Mar 16 '21 at 06:45
  • Perfect, You deserve lots of + – ucMedia Apr 29 '21 at 15:37