
I created a .zip file on Windows with this structure:

myfile.zip
    - trénsfèst
        - file1.png
        - file2.png
        - file3.png

From PHP I use shell_exec to put myfile.zip on my server, and in my shell script I need to unzip this file to recreate the structure in a specific folder. When I run unzip myfile.zip, the accented characters are not interpreted correctly:

Archive:  myfile.zip
creating: tr?n'sf?rt/
inflating: tr?n'sf?rt/file1.png
inflating: tr?n'sf?rt/file2.png
inflating: tr?n'sf?rt/file3.png

When I try to remove the folder, squares appear in place of the accented characters. Is there a way to unzip my archive and keep all the accents?

Thanks

Ruslan Osmanov
Cracs
  • Can you share the file somehow? I guess I know how to fix it, but I would like to check the solution before posting an answer. – Ruslan Osmanov Dec 12 '16 at 10:00
  • Just create a folder with an accented name on Windows (with or without files in it) and zip it with WinRAR or 7-Zip. The file isn't anything specific. – Cracs Dec 12 '16 at 10:03
  • The problem is that the filename encodings within Zip depend on the system locale. The results may be different on different Windows setups. If you want your problem fixed quickly, please share the file. – Ruslan Osmanov Dec 12 '16 at 10:08
  • That would fix the problem for my Windows setup, but any user of my app can upload a .zip, so I can't just send you the file. And yes, I think the problem comes from the locale. – Cracs Dec 12 '16 at 10:14

2 Answers

Windows usually encodes filenames according to the system locale. For example, on a Russian setup filenames are usually encoded in CP866. Filenames are stored in the Zip archive in that same encoding, i.e. the one used by the system on which the archive was created.

Detecting Encoding

I tried to solve this problem some years ago, and I came to the conclusion that, in general, there is no way to detect the encoding reliably. In PHP you can try with ZipArchive and mb_detect_encoding:

$zip = new ZipArchive;
$filename = $argv[1];

// open() returns true on success and an integer error code on failure
if ($zip->open($filename) !== true)
  die("failed to open $filename\n");

for ($i = 0; $i < $zip->numFiles; ++$i) {
  $encoding = mb_detect_encoding($zip->getNameIndex($i), 'auto');
  if (! $encoding) {
    trigger_error("Failed to detect encoding for " . $zip->getNameIndex($i), E_USER_ERROR);
    exit(1);
  }
  $zip->renameIndex($i, iconv($encoding, 'UTF-8', $zip->getNameIndex($i)));
}
$zip->extractTo('/home/ruslan/tmp/unzippped/');
$zip->close();

But in my experience, mb_detect_encoding is not very accurate.

You can try to detect the encoding with the enca tool as follows:

ls -1 folder | enca -L ru

where ru is the language code (all language codes are available through enca --list languages). But that requires you to guess the language. To convert the listed names from their original encoding to UTF-8 you can use enconv, e.g.:

ls -1 folder | enconv -L russian -x UTF-8

But, again, you need to guess the language.

So I would recommend trying to detect the encoding with one of the methods above, and asking the user to pick the encoding from a list of all available encodings. The auto-detected encoding might be selected in the list by default. Personally, I have opted to let the user pick the encoding without the smart auto-detection.
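
As a rough sketch of the "let the user pick" idea, the function below decodes one raw filename under each candidate encoding and prints the results side by side, so that a person can spot the readable one. The name list_candidates and the candidate list are assumptions made for this example; it relies only on iconv.

```shell
#!/bin/sh
# list_candidates RAW_NAME ENCODING...  (hypothetical helper)
# Prints "ENCODING<TAB>decoded name" for every encoding in which the raw
# bytes are valid, and silently skips the others.
list_candidates() {
    raw=$1
    shift
    for enc in "$@"; do
        # iconv exits non-zero when the bytes are not valid in $enc
        if decoded=$(printf '%s' "$raw" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%s\t%s\n' "$enc" "$decoded"
        fi
    done
}

# Example: raw bytes for "trénsfèst" as Windows-1252 would store them
list_candidates "$(printf 'tr\351nsf\350st')" UTF-8 CP1252 CP866
```

UTF-8 drops out of the list here because the bytes are not a valid UTF-8 sequence, while the single-byte code pages all produce some output; only a human (or a language guess, as with enca) can tell which line is the right one.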

When you know the source encoding

unzip supports streaming to a pipe with the -p option, but only as bulk data. That is, it does not split the stream into files; it passes all the uncompressed content to the program as one stream:

unzip -p foo | more   # send contents of foo.zip via pipe into program more

Parsing the raw stream is obviously a difficult task. One way is to extract files into a directory, and then convert filenames with a script like this:

$path = $argv[1];
$from_encoding = isset($argv[2]) ? $argv[2] : 'CP866';

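// scandir() takes a snapshot of the listing, so renamed entries are not visited twice
$files = scandir($path);
if ($files === false)
  die("failed to read $path\n");

foreach (array_diff($files, ['.', '..']) as $file) {
  // scandir() returns bare names, so prefix them with the directory
  rename("$path/$file", "$path/" . iconv($from_encoding, 'UTF-8', $file));
}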

Sample usage:

php script.php directory Windows-1252
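
Since the question runs the extraction from a shell script, the same extract-then-rename step can also be sketched in plain shell with find and iconv. This is not taken from the answers here; rename_to_utf8 is a hypothetical helper, and it assumes a find that supports -mindepth (GNU, BSD and BusyBox do) and a filesystem that accepts the non-UTF-8 names in the first place.

```shell
#!/bin/sh
# rename_to_utf8 DIR FROM_ENCODING  (hypothetical helper)
# Recursively renames entries under DIR from FROM_ENCODING to UTF-8.
rename_to_utf8() {
    # -depth lists children before their parent directory, so a directory
    # is renamed only after everything inside it has been handled.
    find "$1" -mindepth 1 -depth -exec sh -c '
        from=$1
        shift
        for path in "$@"; do
            base=${path##*/}
            parent=${path%/*}
            # Leave names that are already valid UTF-8 (including ASCII) alone
            printf "%s" "$base" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 && continue
            new=$(printf "%s" "$base" | iconv -f "$from" -t UTF-8) || continue
            [ -n "$new" ] && mv -- "$path" "$parent/$new"
        done
    ' sh "$2" {} +
}
```

A call such as rename_to_utf8 /target/directory CP1252 would then follow the unzip step. Skipping names that are already valid UTF-8 makes the function safe to run twice, at the cost of missing the rare legacy name that happens to be valid UTF-8 as well.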

Alternatively, use ZipArchive as follows.

$zip = new ZipArchive;

$filename = $argv[1];
$from_encoding = isset($argv[2]) ? $argv[2] : 'CP866';

// open() returns true on success and an integer error code on failure
if ($zip->open($filename) !== true)
  die("failed to open $filename\n");

for ($i = 0; $i < $zip->numFiles; ++$i) {
  $zip->renameIndex($i, iconv($from_encoding,'UTF-8', $zip->getNameIndex($i)));
}
$zip->extractTo('/target/directory/');

$zip->close();

Sample usage:

php script.php file.zip Windows-1252
Ruslan Osmanov

Thanks Ruslan Osmanov, but I found a solution. After unzipping my zip file I run convmv on the extracted files, so here is my process:

unzip myfile.zip
convmv --notest -r -f WINDOWS-1252 -t utf8 .

Thanks to this post: Windows-1252 to UTF-8 encoding

Cracs
  • My answer contains convmv btw. Also, you can't assert that it will always be Windows-1252, as it depends on the source locale. Finally, as your question is tagged php, my solution with ZipArchive and iconv is more appropriate. – Ruslan Osmanov Dec 13 '16 at 01:48