Cannot Read Files From ZIPs When Accented Character Is Present In File Path

Question

I am having problems uploading a ZIP file to a MarkLogic-based XQuery application and extracting its contents. The problem concerns accented characters in the filepath of one of the files in the ZIP.

Here is a sample ZIP that demonstrates the issue. It has an e-acute in the folder name:

Note that the ZIP was prepared on a Windows system, and the XQuery code is also running on Windows, within MarkLogic 10.

My website can upload the ZIP and read the manifest, but the manifest says the character is actually "‚", and when I try to extract the specific file in that folder I get a "file not found" error.

I figured out that U+201A is, of course, not an e-acute in Unicode. So I tried converting "&#x201a" to "©", which still gave the same error, and I even tried "é" with the same result.

I am now thinking this is completely broken, and there appears to be no way I can extract files from a ZIP when there is an accented character in the path to the file (or the filename too I expect).

Can anybody help? I do not mind having to FIX paths if needed, but as I show above I have not even been able to achieve that.

Neil.

https://lwn.net/Articles/729835/ "The reason is simple: stupid file formats. There are no specs for file name encoding in ZIPs. There's no file name encoding indicator either." — Mads Hansen, Sep 28 '21 at 15:34
And yet the same ZIP file can be opened without problem using other tools, including the "unzip" tool on Linux. It is only MarkLogic that seems to have a problem with this. — Neil Bradley, Sep 28 '21 at 15:45
Create a ZIP file like this: `let $zip := xdmp:zip-create(<parts xmlns="xdmp:zip"><part>test with é char.txt</part></parts>, text{"test with é char"}) return xdmp:save("C:/temp/marklogic-test.zip", $zip)` and then compare how the filename looks in WinZip (or whatever) on Windows with what the manifest reports in MarkLogic: `xdmp:document-get("C:/temp/marklogic-test.zip") => xdmp:zip-manifest()` — Mads Hansen, Sep 28 '21 at 15:49
Apparently, determining which encoding is used is a bit of a guessing game: https://stackoverflow.com/questions/13261347/correctly-decoding-zip-entry-file-names-cp437-utf-8-or and MarkLogic doesn't do any guessing. — Mads Hansen, Sep 28 '21 at 15:52
OK, I can accept that. But surely it IS a bug that the path reported by the manifest disagrees with the path I need to extract that file. Currently I have a file in the ZIP that I am completely unable to extract, because the path reported in the manifest is not found in the ZIP. I could cope with some letters in the path being corrupted, but it is a bigger problem that I cannot extract the file at all. — Neil Bradley, Oct 02 '21 at 13:30
I don't see an option that would affect how it's parsed. You could file a support case and see if they decide to create an enhancement request or bug to change the behavior in a future release. — Mads Hansen, Oct 02 '21 at 13:32
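
The "‚" (U+201A) seen in the manifest is consistent with one particular encoding mismatch, which can be reproduced outside MarkLogic. The following is a minimal Python sketch, not a description of what MarkLogic does internally. It assumes the Windows tool stored the entry name as CP437 bytes (the traditional ZIP filename encoding) and that the reader decoded those bytes as Windows-1252. The helper names `misdecode` and `repair` and the sample path are hypothetical.

```python
# Assumption: the archiver wrote the entry name as CP437 bytes and the
# reader decoded those bytes as Windows-1252. Neither is confirmed for
# MarkLogic; this only shows that the mismatch would turn "é" into U+201A.


def misdecode(name: str) -> str:
    """Simulate the suspected corruption: CP437 bytes read as Windows-1252."""
    return name.encode("cp437").decode("cp1252")


def repair(name: str) -> str:
    """Reverse it: recover the original bytes, then decode them as CP437."""
    return name.encode("cp1252").decode("cp437")


original = "dossier é/file.txt"  # hypothetical path, not the one from the question
seen = misdecode(original)

print(seen)                      # the e-acute shows up as U+201A
print(repair(seen) == original)  # the round trip restores the name
```

If that assumption holds, `repair` recovers a readable name for display. It does not by itself explain why extraction fails with the name the manifest reports, and it will not round-trip every name: Python's `cp1252` codec rejects a few bytes, such as 0x81, which is "ü" in CP437.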
