
I am running an Apache/PHP/MySQL server (XAMPP) on my local machine under Windows 7. There I have installed the MediaWiki software, together with many extensions. My aim is to download some pages from Wikipedia and show them locally. Everything runs fine, except for one big problem:

The image files in the German Wikipedia contain German Umlaute (ä, ö, ü) in their file names. This cannot be changed, because the articles link to the names with the Umlaute.

When I try to import these images (via the maintenance/importImages.php script), it does not work. I traced the code and figured out why:

When PHP scans the directory for files, it reads the file names as ANSI strings. MediaWiki internally requires all strings to be UTF-8. So the byte for the umlaut in the file name is interpreted as the start of a (nonexistent) multi-byte UTF-8 character, which breaks the script.
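
To make the mismatch concrete, here is a minimal sketch. It assumes the Windows ANSI code page is Windows-1252 and that the `mbstring` extension is loaded; the file name is a made-up example. It uses `mb_convert_encoding()` rather than `utf8_encode()`, which treats its input as ISO-8859-1 (the same bytes for ä, ö, ü) and is deprecated as of PHP 8.2.

```php
<?php
// Assumption: the Windows "ANSI" code page is Windows-1252 (Western European).
// Hypothetical file name "Mädchen.jpg" as raw ANSI bytes: ä is the single byte 0xE4.
$ansiName = "M\xE4dchen.jpg";

// 0xE4 announces a three-byte UTF-8 sequence, but the next byte ("d") is not
// a continuation byte, so the string is not valid UTF-8.
$isValidUtf8 = mb_check_encoding($ansiName, 'UTF-8');

// Converting from the assumed code page gives the UTF-8 form: ä becomes 0xC3 0xA4.
$utf8Name = mb_convert_encoding($ansiName, 'UTF-8', 'Windows-1252');

var_dump($isValidUtf8);                        // bool(false)
var_dump($utf8Name === "M\xC3\xA4dchen.jpg");  // bool(true)
```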

If I manually add a call to utf8_encode() to the script, the name is fine and is correctly added to the database. But the file actually written to the "images" directory has a broken name: two special characters instead of the umlaut. The reason is that the PHP script sends UTF-8 strings to the filesystem functions ("copy", ...), but the operating system expects ANSI strings there. I could manually add a call to utf8_decode() before each filesystem call, but there are thousands of them!
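
The "two special characters" can be reproduced without touching the disk. This is a sketch under the assumption that Windows decodes the file name bytes as Windows-1252; the file name is again a made-up example, and `mbstring` is required.

```php
<?php
// Assumption: the OS decodes file name bytes as Windows-1252.
// Hypothetical name "Mädchen.jpg" in UTF-8: ä is the two bytes 0xC3 0xA4.
$utf8Name = "M\xC3\xA4dchen.jpg";

// Simulate the OS reading those bytes as Windows-1252: every byte becomes
// its own character, so 0xC3 0xA4 turns into "Ã" followed by "¤".
$nameOnDisk = mb_convert_encoding($utf8Name, 'UTF-8', 'Windows-1252');

echo $nameOnDisk, "\n";               // MÃ¤dchen.jpg (when the output is viewed as UTF-8)
echo rawurlencode($nameOnDisk), "\n"; // M%C3%83%C2%A4dchen.jpg
```

The percent-encoded form is the same `%C3%83%C2%A4` that the rewrite rule in the comments below substitutes for ä.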

In short: the OS works in ANSI (this cannot easily be changed on Windows) and the MediaWiki software inside the PHP server works in UTF-8 (this also cannot be changed). Is there a way to automatically encode/decode file name strings every time they go into or out of the PHP server?

I have already experimented with mb_internal_encoding() and mb_http_output(), but this did not change anything: MediaWiki uses hard-coded functions which only work on UTF-8 strings.

j0k
  • Where can you make changes? Example: Can you change the utf-8 umlaute into the letter a, o, or u and still create a working solution? – Ray Paseur Dec 28 '12 at 14:11
  • Maybe the answer here might be of use -> http://stackoverflow.com/questions/1089966/utf8-filenames-in-php-and-different-unicode-encodings – Crisp Dec 28 '12 at 14:24
  • **Yeah, I solved it.** The idea with the Apache RewriteRule was great. It needs no changes in the PHP code. I added the following rewrite rules to my httpd.conf (just for the umlaut ä): `RewriteEngine On` `RewriteRule ^(/?mywiki\/images/.*/[^/]*?)\xC3\xA4([^/]*?\xC3\xA4[^/]*)$ $1\%C3\%83\%C2\%A4$2 [NE,N]` `RewriteRule ^(/?mywiki\/images/.*/[^/]*?)\xC3\xA4([^/\xC3\xA4]*)$ $1\%C3\%83\%C2\%A4$2 [NE,R=301]` What they do is replace each 'ä' with '%C3%83%C2%A4' in URLs that point to /mywiki/images/*. So MediaWiki can access the files with the corrupted names in the Windows file system. – Brehministrator Dec 28 '12 at 18:12

1 Answer

You need to rename all the files on the filesystem before you import them so they match the data that is inside the database.

Just ensure that when the UTF-8 encoded byte sequence of the file name hits the filesystem, the file is found.

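$fileANSI; // you have this: the name as read from the directory
$fileUTF8 = utf8_encode($fileANSI); // you do this already
// insert etc, when MW is ready do:
rename($fileANSI, $fileUTF8); // the file on disk now carries the UTF-8 byte sequence as its name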

So you need to rename each file from its current (ANSI) name to the UTF-8 byte sequence that will hit the filesystem later.
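
The rename step described above could be applied to a whole directory roughly as follows. This is only a sketch: `renameToUtf8()` is a hypothetical helper, not part of MediaWiki, and it assumes the `mbstring` extension, a Windows-1252 code page, and PHP filesystem functions that pass name bytes through unchanged, as described in the question.

```php
<?php
// Hypothetical helper (not a MediaWiki function). Renames every file in $dir
// whose name is not valid UTF-8 to the UTF-8 encoding of that name.
// Assumption: the existing names are in the given ANSI code page.
function renameToUtf8(string $dir, string $ansiCodePage = 'Windows-1252'): array
{
    $renamed = [];
    $names = scandir($dir);
    if ($names === false) {
        return $renamed; // directory could not be read
    }
    foreach ($names as $name) {
        $path = $dir . DIRECTORY_SEPARATOR . $name;
        // Skip directories and names that are already valid UTF-8
        // (this includes plain ASCII names).
        if (!is_file($path) || mb_check_encoding($name, 'UTF-8')) {
            continue;
        }
        $utf8Name = mb_convert_encoding($name, 'UTF-8', $ansiCodePage);
        if (rename($path, $dir . DIRECTORY_SEPARATOR . $utf8Name)) {
            $renamed[$name] = $utf8Name; // old name => new name
        }
    }
    return $renamed;
}
```

It would be called once on the import directory before running the import, for example `renameToUtf8(__DIR__ . '/import')`, where the path is a placeholder. An ANSI name whose bytes happen to form valid UTF-8 is skipped, so the returned list is worth checking by hand.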

For your webserver you might also need to introduce a rewrite rule to take care of the incoming HTTP requests, as the webserver might use different filesystem handling than PHP itself.

Also check which code page your system is configured to use for the file system. That can differ.

hakre
  • I was already thinking about it. Inside MediaWiki, everything looks fine with this solution. In Windows, the file names then contain unreadable characters, but this is fine. But the problem remains: MediaWiki creates links to the files without special characters, but with umlauts (because the browser understands UTF-8). The browser cannot find that file, because it is looking for the file with the umlaut, but on the file system there are only the files with strange symbols. Any additional hint? – Brehministrator Dec 28 '12 at 16:27
  • Ok, maybe I can solve this with an Apache rewrite rule. I am not experienced with creating rewrite rules. Can somebody give me a clue what I need inside a rewrite rule to turn Umlaute into the two special characters from utf-8? – Brehministrator Dec 28 '12 at 16:56
  • @user1934614: You catch these URLs, and they should then work with PHP's filesystem functions, so you should be able to `readfile` those. Alternatively you need to provide the binary sequence in the form of a redirect, but I would go for the `readfile` variant first for debugging purposes. – hakre Dec 28 '12 at 18:18