
Ok, I have been really struggling with this one for a while. I have thousands of files with wrong characters in their names: the server extracted them from a zip file and converted the names in this manner:

The original file name (example) is

QQ图片20160314173435.jpg

The files now on the server look like this:

QQ#U56fe#U724720160314173435.jpg

where

图 = #U56fe

and

片 = #U7247

All the files have the same 2 characters; only the numbering differs.

I have tried every function I can think of, including the iconv family, the mb_ family, str_replace and even html_entity_decode / htmlentities, etc.

Each would either not work, or would produce other strange characters.

My code for now is:

// iconv_set_encoding('input_encoding','GB18030');
// print_r($enc);
if ($handle = opendir('./')) {
    while (false !== ($fileName = readdir($handle))) {
        $ext = pathinfo($fileName, PATHINFO_EXTENSION);
        echo $ext .PHP_EOL;
        if ( $ext == 'jpg' ){
            echo "========" . mb_detect_encoding($fileName).PHP_EOL . "\r\n";
            $newName = mb_convert_encoding($fileName, "UTF-8",mb_detect_encoding($fileName));

        // $newName = str_replace("#","\\",$fileName);
        // $newName = str_replace("#U56fe",iconv("UTF-8","GB2312","图"),$newName);
        // $newName = html_entity_decode($newName,ENT_NOQUOTES,"GB2312");

        // $newName = urlencode($newName);
        // $newName = urldecode($newName);
        //
        // Tried //GB2312 // GB18030
        // $newName = iconv(mb_detect_encoding($newName, mb_detect_order(), true), "GB18030", $newName);
        // echo $newName .PHP_EOL;

        // $newName = iconv("UTF-8", "GB18030", $fileName);
        // $newName = iconv("GB18030", "UTF-8", $fileName);
        // $newName = iconv("ISO-8859-9//TRANSLIT", "UTF-8", $fileName);
        // echo $newName .PHP_EOL;
        // $newName = mb_convert_encoding($fileName, 'UTF-8', 'HTML-ENTITIES');


        // tried both  copy and rename+unlink
        //rename($fileName, $newName);
        copy ($fileName,$newName);
        }
    }
    closedir($handle);
}

I left some of the failed attempts in just to show what was already tried, but I actually tried even more (including iconv_set_encoding at the beginning).

I have tried the script both locally (Win7 / XAMPP) and on the live server (CentOS / cPanel).

After so many failures I am not even sure whether the names are ASCII, UTF-8 or some Unicode substitution represented in UTF-8.

Note that the problem is not with creating new files or folders; that I can do without a problem. The problem is renaming existing files with PHP only. Any other method of renaming actually works.

The strange thing is that I have tested the same script on another local machine (Ubuntu), where it worked well. That of course suggests that OS / PHP settings are somehow responsible, but how?

And also, there must be some way to tell a script which codepage / encoding to use, and to change that dynamically.

Obmerk Kronen
  • Possible duplicate of [How do I use filesystem functions in PHP, using UTF-8 strings?](http://stackoverflow.com/questions/1525830/how-do-i-use-filesystem-functions-in-php-using-utf-8-strings) – roeland Sep 01 '16 at 03:10
  • The gist is: PHP assumes the bytes you pass on to the file system functions have a certain encoding (probably your local ANSI code page), and if your file name cannot be encoded in that code page, you're out of luck. – roeland Sep 01 '16 at 03:12
  • @roeland Hehe, I am drowning and you are describing the water :-) Just kidding. The problem is understood, but the solution? PHP has some functions that are supposed to change encoding, and they DO work (but not as expected: they produce different characters). Furthermore, the live server is a Hong Kong server (CentOS), which fully supports Chinese in my experience with other scripts. – Obmerk Kronen Sep 01 '16 at 03:19
  • Yes, PHP strings are 8-bit strings, and you can encode Unicode text to bytes with any possible encoding and put those bytes in a PHP string. Usually this works as long as you know at all times what encoding is being used. *However*, the file system functions are one of those places where PHP *assumes* the bytes in that string use a certain encoding. – roeland Sep 01 '16 at 03:38

1 Answer


On a GNU/Linux system, using an sh-compatible shell (like bash), you can get a preview of the renaming like this:

find . -type f | while IFS= read -r f; do
  g=`echo "$f" | sed -e 's/#U/\\\\u/g'`
  h=`/usr/bin/printf "$g"`
  if test "$h" != "$f"; then
    echo mv "$f" "$h"
  fi
done

If you are satisfied with the proposed renamings, actually perform them by removing the word 'echo' from the loop above:

find . -type f | while IFS= read -r f; do
  g=`echo "$f" | sed -e 's/#U/\\\\u/g'`
  h=`/usr/bin/printf "$g"`
  if test "$h" != "$f"; then
    mv "$f" "$h"
  fi
done
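
Since the question asks for a PHP-only route, the same substitution can be sketched in PHP. This is a minimal sketch rather than a fix verified on every setup. `decodeEscapedName` and `planRenames` are hypothetical helper names. It assumes every escape is exactly `#U` followed by four hex digits, as in the example name, and that the target file system accepts UTF-8 names. That is typical on Linux, but as the comments under the question point out, it is not guaranteed wherever PHP assumes another code page.

```php
<?php
// Hypothetical helper: turn "#Uxxxx" escapes (exactly 4 hex digits, an assumption
// based on the example name) back into UTF-8 characters.
function decodeEscapedName(string $name): string
{
    return preg_replace_callback(
        '/#U([0-9a-fA-F]{4})/',
        function (array $m): string {
            // Treat the 4 hex digits as one UTF-16BE code unit and convert it to UTF-8.
            return mb_convert_encoding(pack('n', hexdec($m[1])), 'UTF-8', 'UTF-16BE');
        },
        $name
    );
}

// Hypothetical helper: build an [old => new] map for one directory without renaming anything.
function planRenames(string $dir): array
{
    $plan = [];
    foreach (scandir($dir) ?: [] as $old) {
        $new = decodeEscapedName($old);
        if ($new !== $old && is_file("$dir/$old")) {
            $plan[$old] = $new;
        }
    }
    return $plan;
}

// Dry run over the current directory: only prints what would be renamed.
foreach (planRenames('.') as $old => $new) {
    echo "mv $old $new", PHP_EOL;
    // rename("./$old", "./$new"); // uncomment once the preview looks right
}
```

As with the shell loop, this only prints the proposed renames until the `rename()` call is uncommented. Matching exactly four hex digits matters here, because the escape `#U7247` is immediately followed by the digits `20160314`, which are also valid hex.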
Bruno Haible