I've found that the PHP functions basename() and pathinfo() behave strangely with multibyte UTF-8 names. They strip all non-Latin characters up to the first Latin character or punctuation sign. Non-Latin characters that come after that point are preserved.
basename("àxà"); // returns "xà", I would expect "àxà" or just "x" instead
pathinfo("àyà/àxà", PATHINFO_BASENAME); // returns "xà", same as above
Curiously, the dirname part of pathinfo() works fine:
pathinfo("àyà/àxà", PATHINFO_DIRNAME); // returns "àyà"
The PHP documentation warns that basename() and pathinfo() are locale-aware, but that does not justify the inconsistency between pathinfo(..., PATHINFO_BASENAME) and pathinfo(..., PATHINFO_DIRNAME), not to mention the fact that identical non-Latin characters are either discarded or kept depending on their position relative to Latin characters.
It sounds like a PHP bug.
Since "basename" checks matter for security, in particular to prevent directory traversal, is there a reliable basename filter that works properly with Unicode input?
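To make clear what kind of filter I mean, here is a minimal sketch of a byte-oriented fallback that never consults the locale. The name safe_basename() is hypothetical (it is not a PHP built-in), and treating the backslash as a separator on every platform is my own assumption, chosen to be strict. It relies on the fact that the bytes for "/" and "\" never occur inside a multibyte UTF-8 sequence, so plain byte string functions cannot split a character.

```php
<?php
// Hypothetical helper, not part of PHP: a basename that works on raw bytes
// and therefore does not depend on the current locale.
function safe_basename(string $path): string
{
    // Assumption: treat "\" as a separator too, then drop trailing separators.
    $path = rtrim(str_replace('\\', '/', $path), '/');

    // Keep everything after the last "/", or the whole string if there is none.
    $pos = strrpos($path, '/');
    return $pos === false ? $path : substr($path, $pos + 1);
}

echo safe_basename("àxà"), "\n";      // àxà
echo safe_basename("àyà/àxà"), "\n";  // àxà
```

Like the built-in, this can still return "." or "..", so a traversal check would have to reject those values separately.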