I'm trying to write a function that removes all non-English characters (except spaces, dots and hyphens) from a string. I tried using preg_replace for this, but it produces strange results.

I have a file called "example-נידדל.jpg"

Here is what I'm getting when trying to sanitize the file name:

echo preg_replace('/[^A-Za-z0-9\.]/','','example-נידדל.jpg');

The above produces: example.jpg as expected.

But when I try to pull the file name from a $_FILES array after uploading it to the server I get:

echo preg_replace('/[^A-Za-z0-9\.]/','',$_FILES['file_upload']["name"]);

The above produces example-15041497149114911500.jpg

The numbers I'm getting are in fact the HTML numeric codes of the characters that were supposed to be removed; see the following character reference: http://realdev1.realise.com/rossa/phoneme/listCharactors.asp?start=1488&stop=1785&rows=297&page=1

I can't figure out why preg_replace doesn't work with file names.

Can anyone help?

Thanks,

Roy

Roy Peleg
  • What's the character set on the page and on the form? – Mark Eirich Jun 18 '11 at 18:01
  • Just `echo $_FILES['file_upload']["name"]` and see the results. – lovesh Jun 18 '11 at 18:07
  • I've renamed an image to the name above and tried uploading it. Regardless of whether or not I specify an `accept-charset` for the form or add a `charset` meta tag, I always got it to return example.jpg. Are you sure the file you're uploading isn't in fact named example-15041497149114911500.jpg before you upload it? – Francois Deschenes Jun 18 '11 at 18:18
  • I checked now and the numbers I'm getting correspond to the respective HTML character codes that were supposed to be replaced: http://realdev1.realise.com/rossa/phoneme/listCharactors.asp?start=1488&stop=1785&rows=297&page=1 – Roy Peleg Jun 18 '11 at 20:35
  • @Roy Peleg - Please see if my answer solves your problem. – Francois Deschenes Jun 18 '11 at 20:45

2 Answers

What about using mb_convert_encoding to convert the HTML entities back into UTF-8 before the preg_replace?

echo preg_replace('/[^A-Za-z0-9\.]/', '', mb_convert_encoding($_FILES['file_upload']["name"], 'UTF-8', 'HTML-ENTITIES'));
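
To illustrate why decoding first helps, here is a minimal sketch. It assumes the uploaded name arrives with numeric HTML entities (for example `&#1504;`) in place of the Hebrew letters, which is what the numbers in the question suggest. The helper name `sanitize_upload_name` and the sample string are made up for the example, and it uses `html_entity_decode` as an alternative to the `HTML-ENTITIES` conversion above, which, as far as I know, is deprecated in recent PHP versions.

```php
<?php
// Hypothetical helper (not from the question): decode entities, then filter.
function sanitize_upload_name(string $name): string
{
    // Turn numeric references such as "&#1504;" back into UTF-8 characters.
    $decoded = html_entity_decode($name, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Same pattern as in the question: keep ASCII letters, digits and dots only.
    return preg_replace('/[^A-Za-z0-9.]/', '', $decoded);
}

// Assumed shape of the name as received from the browser.
$raw = 'example-&#1504;&#1497;&#1491;&#1491;&#1500;.jpg';

// Without decoding, only "&", "#", ";" and "-" are stripped, so the digits stay.
echo preg_replace('/[^A-Za-z0-9.]/', '', $raw), "\n"; // example15041497149114911500.jpg

// With decoding, the Hebrew letters become real characters and are removed.
echo sanitize_upload_name($raw), "\n"; // example.jpg
```

The digits survive in the first call because, once the entity is written out as text, they are ordinary ASCII digits that the pattern allows.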
Francois Deschenes
I would use a combination of regular expressions and iconv to transliterate it.

Update: Before transliterating/filtering, the filename may need to be URL-decoded:

$path = urldecode($path); // convert triplets to bytes.
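
As a rough sketch of what that step does, assume the name arrived percent-encoded (the sample string below is invented for illustration; whether your upload is actually encoded this way depends on the browser and form):

```php
<?php
// Assumed input: the Hebrew letters sent as percent-encoded UTF-8 bytes.
$encoded = 'example-%D7%A0%D7%99%D7%93%D7%93%D7%9C.jpg';

// urldecode() converts each %XX triplet back into a raw byte.
$decoded = urldecode($encoded);
echo $decoded, "\n"; // example-נידדל.jpg

// After decoding, the filter from the question removes the non-ASCII bytes.
echo preg_replace('/[^A-Za-z0-9.]/', '', $decoded), "\n"; // example.jpg
```

If the name is filtered while still encoded, the hex digits and letters of each triplet would pass through the pattern, much like the entity digits in the question.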

Here is a code example, borrowed from another source, that does something very similar to what you are asking:

function pathauto_cleanstring($string)
{
    $url = $string;
    $url = preg_replace('~[^\\pL0-9_]+~u', '-', $url); // substitutes anything but letters, numbers and '_' with separator
    $url = trim($url, "-");
    $url = iconv("utf-8", "us-ascii//TRANSLIT", $url); // TRANSLIT does the whole job
    $url = strtolower($url);
    $url = preg_replace('~[^-a-z0-9_]+~', '', $url); // keep only letters, numbers, '_' and separator
    return $url;
}

It expects your input to be UTF-8 encoded.

Reference

hakre