On Windows, filenames are usually encoded according to the system locale. For example, a Russian setup usually encodes filenames in CP866. The filenames are stored in the Zip archive in that same encoding, i.e. the encoding of the system on which the archive was created.
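To see why the archive's origin matters, here is a minimal shell sketch. It assumes a POSIX shell and an iconv command that knows the CP866 and CP1251 code pages. The byte sequence is a made-up filename (the Russian word for "hello") as a CP866 system would store it, and decode_name is a hypothetical helper written for this illustration only.

```shell
#!/bin/sh
# decode_name BYTES ENCODING: interpret raw filename bytes in ENCODING
# and print the result as UTF-8. (Hypothetical helper for illustration.)
decode_name() {
    printf '%s' "$1" | iconv -f "$2" -t UTF-8
}

# Six raw bytes (octal escapes) as a CP866 system would store the name.
raw=$(printf '\217\340\250\242\245\342')

decode_name "$raw" CP866; echo    # read with the encoding it was written in
decode_name "$raw" CP1251; echo   # same bytes, read with another Cyrillic code page
```

Both calls succeed, but only the first one yields the intended word. The bytes themselves carry no hint about which code page produced them.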
Detecting Encoding
I tried to solve this problem some years ago, and I came to the conclusion that, in general, there is no way to detect the encoding reliably. In PHP you can try with ZipArchive and mb_detect_encoding:
<?php
$zip = new ZipArchive;
$filename = $argv[1];
// open() returns true on success and an integer error code on failure
if ($zip->open($filename) !== true)
    die("failed to open $filename\n");
for ($i = 0; $i < $zip->numFiles; ++$i) {
    $name = $zip->getNameIndex($i);
    $encoding = mb_detect_encoding($name, 'auto');
    if (! $encoding) {
        fwrite(STDERR, "Failed to detect encoding for $name\n");
        exit(1);
    }
    // Rename the entry to its UTF-8 equivalent before extraction
    $zip->renameIndex($i, iconv($encoding, 'UTF-8', $name));
}
$zip->extractTo('/home/ruslan/tmp/unzippped/');
$zip->close();
In my experience, however, mb_detect_encoding is not very accurate.
You can also try to detect the encoding with the enca tool as follows:
ls -1 folder | enca -L ru
where ru is the language code (all language codes are available through enca --list languages). But that requires you to guess the language. To actually convert the filenames from one encoding to UTF-8 you can use enconv, e.g.:
ls -1 folder | enconv -L russian -x UTF-8
But, again, you need to guess the language.
So I would recommend trying to detect the encoding with one of the methods above, and then asking the user to pick the encoding from a list of all available encodings. The auto-detected encoding might be selected in the list by default. Personally, I have opted to let the user pick the encoding without the smart auto-detection.
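The selection approach can be sketched in shell as follows. This is an illustration, not a complete tool: list_candidates is a hypothetical helper, the three candidate encodings are placeholders, and iconv is assumed to support them. It prints one filename decoded under each candidate, so the user can see which choice yields readable text.

```shell
#!/bin/sh
# list_candidates BYTES ENCODING...: print a numbered menu showing the raw
# filename bytes decoded under each candidate encoding.
# (Hypothetical helper; the candidate list is up to the caller.)
list_candidates() {
    bytes=$1
    shift
    n=1
    for enc in "$@"; do
        # Skip candidates that iconv cannot decode the bytes with.
        if decoded=$(printf '%s' "$bytes" | iconv -f "$enc" -t UTF-8 2>/dev/null); then
            printf '%d) %s: %s\n' "$n" "$enc" "$decoded"
            n=$((n + 1))
        fi
    done
}

# A sample name as stored by a CP866 system (octal escapes for the raw bytes).
sample=$(printf '\217\340\250\242\245\342')
list_candidates "$sample" CP866 CP1251 KOI8-R
```

If one of the detection methods produced a guess, that entry could be marked as the default in such a menu.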
When you know the source encoding
Unzip supports pipe streaming with the -p option, but it works only for bulk data. That is, it doesn't separate the stream into files; it passes all the uncompressed content to the program as one stream:
unzip -p foo | more => send contents of foo.zip via pipe into program more
Parsing the raw stream is obviously a difficult task. One way is to extract files into a directory, and then convert filenames with a script like this:
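<?php
$path = $argv[1];
$from_encoding = isset($argv[2]) ? $argv[2] : 'CP866';
if ($handle = opendir($path)) {
    // Compare against false explicitly: a file named "0" is falsy
    while (($file = readdir($handle)) !== false) {
        if ($file === '.' || $file === '..')
            continue;
        // readdir() returns bare names, so prefix them with the directory
        rename("$path/$file", "$path/" . iconv($from_encoding, 'UTF-8', $file));
    }
    closedir($handle);
}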
Sample usage:
php script.php directory Windows-1252
Alternatively, use ZipArchive as follows:
<?php
$zip = new ZipArchive;
$filename = $argv[1];
$from_encoding = isset($argv[2]) ? $argv[2] : 'CP866';
// open() returns true on success and an integer error code on failure
if ($zip->open($filename) !== true)
    die("failed to open $filename\n");
for ($i = 0; $i < $zip->numFiles; ++$i) {
    $zip->renameIndex($i, iconv($from_encoding, 'UTF-8', $zip->getNameIndex($i)));
}
$zip->extractTo('/target/directory/');
$zip->close();
Sample usage:
php script.php file.zip Windows-1252