I am running an Apache/PHP/MySQL server (XAMPP) on my local machine under Windows 7. I have installed the MediaWiki software there, together with many extensions. My aim is to download some pages from Wikipedia and display them locally. Everything runs fine, except for one big problem:
The image files in the German Wikipedia contain German umlauts (ä, ö, ü) in their file names. This cannot be changed, because the articles link to the names with the umlauts.
When I try to import these images (via the maintenance/importImages.php script), it does not work. I traced the code and figured out why:
When PHP scans the directory for files, it reads the file names as ANSI strings. MediaWiki internally requires all strings to be UTF-8. So the single byte of the umlaut in the file name is interpreted as the start of a (nonexistent) multi-byte Unicode character, which breaks the script.
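To illustrate the mismatch, here is a minimal sketch. The file name "Bär.jpg" is a made-up example, and I am assuming the ANSI code page encodes "ä" as the single byte 0xE4 (true for ISO-8859-1 and Windows-1252) and that the mbstring extension is loaded:

```php
<?php
// Hypothetical file name as the directory scan returns it: "ä" is the single ANSI byte 0xE4.
$ansiName = "B\xE4r.jpg";

// 0xE4 looks like the lead byte of a three-byte UTF-8 sequence, but the bytes
// that follow are not continuation bytes, so the string is not valid UTF-8.
$validBefore = mb_check_encoding($ansiName, 'UTF-8');

// Converting from ISO-8859-1 turns 0xE4 into the two-byte UTF-8 sequence C3 A4.
$utf8Name = mb_convert_encoding($ansiName, 'UTF-8', 'ISO-8859-1');
$validAfter = mb_check_encoding($utf8Name, 'UTF-8');

var_dump($validBefore);        // bool(false)
var_dump($validAfter);         // bool(true)
echo bin2hex($utf8Name), "\n"; // 42c3a4722e6a7067
```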
If I manually add a call to utf8_encode() to the script, the name is then fine and is correctly added to the database. But the file actually written to the "images" directory has a broken name: two special characters instead of the umlaut. The reason is that the PHP script passes UTF-8 strings to the filesystem functions ("copy", ...), but the operating system expects ANSI strings there. I could manually add a call to utf8_decode() before each filesystem call, but there are thousands of them!
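The manual workaround I mean would look roughly like the sketch below. The helper names fsEncode(), fsDecode() and copyUtf8() are hypothetical (they are not part of PHP or MediaWiki), and I am again assuming that ISO-8859-1 matches the ANSI code page closely enough for the umlauts:

```php
<?php
// Hypothetical helper: UTF-8 name (as used inside MediaWiki) -> ANSI bytes for the OS.
function fsEncode($utf8Name) {
    return mb_convert_encoding($utf8Name, 'ISO-8859-1', 'UTF-8');
}

// Hypothetical helper: ANSI name (as returned by the OS) -> UTF-8 for MediaWiki.
function fsDecode($ansiName) {
    return mb_convert_encoding($ansiName, 'UTF-8', 'ISO-8859-1');
}

// Hypothetical wrapper around copy(); both paths are converted before they reach the OS.
function copyUtf8($source, $dest) {
    return copy(fsEncode($source), fsEncode($dest));
}
```

Every other filesystem function (file_exists(), rename(), unlink(), ...) would need the same kind of wrapper, and every call site in MediaWiki would have to be changed to use it, which is exactly what I want to avoid.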
In short: the OS works in ANSI (this cannot easily be changed on Windows) and the MediaWiki software running inside the PHP server works in UTF-8 (this also cannot be changed). Is there a way to automatically encode/decode file name strings every time they go into or out of the PHP server?
I have already experimented with mb_internal_encoding() and mb_http_output(), but this did not change anything: MediaWiki uses hard-coded functions which only work on UTF-8 strings.