I extracted a zip file with SSZipArchive (in a Swift app), and some of the extracted file names contain "invalid" characters.
I think the reason is that I zipped the files under Windows, so the names are encoded in an ANSI code page rather than in Unicode.

Is there a way to convert all the "corrupted" folder and file names during the unzip process?
Or afterwards? It would be no problem to iterate over the folder tree and rename the files.
But I have no idea how to find out which names are encoded in ANSI, and I also don't know how to convert them to the correct charset.

Cœur
altralaser

2 Answers

The official ZIP specification says that a path should be encoded either in Code Page 437 (MS-DOS Latin US) or, if bit 11 of the general purpose bit flag is set, in UTF-8:

D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437. This limits storing file name characters to only those within the original MS-DOS range of values and does not properly support file names in other character encodings, or languages. To address this limitation, this specification will support the following change.

D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification. The Unicode Standard is published by the The Unicode Consortium (www.unicode.org). UTF-8 encoded data stored within ZIP files is expected to not include a byte order mark (BOM).

I recently released a Swift open source implementation of the ZIP file format called ZIPFoundation. It conforms to the standard and should be able to detect Windows path names and decode them properly.
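
To illustrate the rule quoted in D.2, here is a minimal sketch of how a reader could pick the encoding for one entry name. It is not taken from ZIPFoundation or SSZipArchive; `decodeEntryName` and `utf8NameFlag` are hypothetical names, and the Code Page 437 lookup goes through CoreFoundation, so the sketch assumes an Apple platform.

```swift
import Foundation

/// Bit 11 (0x0800) of the general purpose bit flag marks UTF-8 names (spec D.2).
let utf8NameFlag: UInt16 = 1 << 11

/// Hypothetical helper: decodes the raw file name bytes of a single ZIP entry.
/// Returns nil if the bytes are not valid in the selected encoding.
func decodeEntryName(_ raw: Data, generalPurposeFlags: UInt16) -> String? {
    if generalPurposeFlags & utf8NameFlag != 0 {
        // Bit 11 set: the name must be UTF-8.
        return String(data: raw, encoding: .utf8)
    }
    // Bit 11 unset: fall back to the original ZIP encoding, Code Page 437.
    // Assumption: CoreFoundation is available (Apple platforms only).
    let cfEncoding = CFStringEncoding(CFStringEncodings.dosLatinUS.rawValue)
    let cp437 = String.Encoding(rawValue: CFStringConvertEncodingToNSStringEncoding(cfEncoding))
    return String(data: raw, encoding: cp437)
}

// 0x82 is "é" in Code Page 437; 0xC3 0xA9 is "é" in UTF-8.
let legacyName = decodeEntryName(Data([0x82]), generalPurposeFlags: 0)
let unicodeName = decodeEntryName(Data([0xC3, 0xA9]), generalPurposeFlags: utf8NameFlag)
print(legacyName ?? "nil", unicodeName ?? "nil")
```

Decoding the same two UTF-8 bytes without the flag would produce box-drawing characters instead, which is the kind of "invalid" name the question describes.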

Thomas Zoechling

This is probably fixed in the latest SSZipArchive (2.1.1 at the time of writing). I implemented support for non-Unicode filenames there in a way similar to the code below, so you can reuse it to process your filenames yourself if you want.

The code is in Objective-C, but since SSZipArchive already contains the fix, you shouldn't need it anymore. Otherwise, either add a bridging header to use the Objective-C code in your Swift app, or convert it to Swift (which should be easy).

@implementation NSString (SSZipArchive)

+ (NSString *)filenameStringWithCString:(const char *)filename size:(uint16_t)size_filename
{
    // unicode conversion attempt
    NSString *strPath = @(filename);
    if (strPath) {
        return strPath;
    }

    // if filename is non-unicode, detect and transform Encoding
    NSData *data = [NSData dataWithBytes:(const void *)filename length:sizeof(unsigned char) * size_filename];
    // supported encodings are in [NSString availableStringEncodings]
    [NSString stringEncodingForData:data encodingOptions:nil convertedString:&strPath usedLossyConversion:nil];
    if (strPath) {
        return strPath;
    }

    // if filename encoding is non-detected, we default to something based on data
    // note: hexString is more readable than base64RFC4648 for debugging unknown encodings
    strPath = [data hexString];
    return strPath;
}
@end

@implementation NSData (SSZipArchive)

// initWithBytesNoCopy from NSProgrammer, Jan 25 '12: https://stackoverflow.com/a/9009321/1033581
// hexChars from Peter, Aug 19 '14: https://stackoverflow.com/a/25378464/1033581
// not implemented as too lengthy: a potential mapping improvement from Moose, Nov 3 '15: https://stackoverflow.com/a/33501154/1033581
- (NSString *)hexString
{
    const char *hexChars = "0123456789ABCDEF";
    NSUInteger length = self.length;
    const unsigned char *bytes = self.bytes;
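    if (length == 0) {
        // nothing to encode; also avoids relying on malloc(0) behavior
        return @"";
    }
    char *chars = malloc(length * 2);
    if (chars == NULL) {
        // allocation failed
        return nil;
    }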
    char *s = chars;
    NSUInteger i = length;
    while (i--) {
        *s++ = hexChars[*bytes >> 4];
        *s++ = hexChars[*bytes & 0xF];
        bytes++;
    }
    NSString *str = [[NSString alloc] initWithBytesNoCopy:chars
                                                   length:length * 2
                                                 encoding:NSASCIIStringEncoding
                                             freeWhenDone:YES];
    return str;
}
@end
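
For readers who would rather stay in Swift, the same fallback chain (UTF-8 first, then encoding detection, then a hex dump) could be ported roughly as follows. This is a sketch, not the code shipped in SSZipArchive: `filenameString(from:)` and `hexString(_:)` are hypothetical names, the input is a `Data` value rather than a C string, and the detection call assumes Foundation on an Apple platform.

```swift
import Foundation

/// Hypothetical helper: uppercase hex dump of the raw bytes, used as a last resort.
func hexString(_ data: Data) -> String {
    return data.map { String(format: "%02X", $0) }.joined()
}

/// Hypothetical Swift port of the Objective-C category method above.
func filenameString(from raw: Data) -> String {
    // 1. Unicode conversion attempt: succeeds only for valid UTF-8.
    if let utf8Name = String(data: raw, encoding: .utf8) {
        return utf8Name
    }
    // 2. Let Foundation guess the encoding and convert in one step.
    var converted: NSString?
    let detected = NSString.stringEncoding(for: raw,
                                           encodingOptions: nil,
                                           convertedString: &converted,
                                           usedLossyConversion: nil)
    if detected != 0, let guessedName = converted {
        return guessedName as String
    }
    // 3. Encoding not detected: fall back to a readable hex dump.
    return hexString(raw)
}

print(filenameString(from: Data("readme.txt".utf8)))
print(hexString(Data([0x00, 0xAB, 0xFF])))
```

Note that step 2 is a guess: Foundation picks whichever encoding looks most plausible, so short names with only a few non-ASCII bytes may still be decoded incorrectly.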
Cœur