I am creating a very simple file search, where the search database is a text file with one file name per line. The database is built with PHP, and matches are found by grepping the file (also with PHP).

This works great on Linux, but not on Mac when file names contain non-ASCII characters. It looks like names are encoded differently on HFS+ (MacOSX) than on, e.g., ext3 (Linux). Here's a test.php:

<?php
$mystring = "abcóüÚdefå";
file_put_contents($mystring, "");
$h = dir('.');
$h->read(); // "."
$h->read(); // ".."
$filename = $h->read();

print "string: $mystring and filename: $filename are ";

if ($mystring == $filename) print "equal\n";
else print "different\n";

When run on MacOSX:

$ php test.php
string: abcóüÚdefå and filename: abcóüÚdefå are different
$ php test.php |cat -evt
string: abcóü?M-^Zdefå$ and filename: abco?M-^Au?M-^HU?M-^Adefa?M-^J are different$

When run on Linux (or on a nfs-mounted ext3 filesystem on MacOSX):

$ php test.php
string: abcóüÚdefå and filename: abcóüÚdefå are equal
$ php test.php |cat -evt
string: abcM-CM-3M-CM-<M-CM-^ZdefM-CM-% and filename: abcM-CM-3M-CM-<M-CM-^ZdefM-CM-% are equal$

Is there a way to make this script return "equal" on both platforms?

Deduplicator
neu242
  • What encoding are you using in your PHP file? – Gumbo Apr 21 '09 at 17:06
  • Did you take a look at the decomposition table I linked to? You could implement just that mapping to solve the problem as it seems that HFS+ doesn’t convert other characters. – Gumbo Apr 21 '09 at 20:03
  • Yeah, that could be a good back-up solution. The PECL extension solution is better for me, though, since I won't have to include that table in the source code. – neu242 Apr 21 '09 at 20:53

3 Answers

HFS+ on MacOSX stores file names in Unicode normalization form D (NFD, decomposed characters), while most other systems use NFC (precomposed characters).

[Figure: NFC vs NFD (from unicode.org)]

There are several implementations of NFD to NFC conversion. Here I've used the PHP Normalizer class to detect strings that are not in NFC and convert them to NFC. It's available in PHP 5.3 or through the PECL Internationalization (intl) extension. The following amendment will make the script work:

...
$filename = $h->read();
if (!normalizer_is_normalized($filename)) {
   $filename = normalizer_normalize($filename);
}
...
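
Since the actual search matches substrings (strpos, per the comments on the other answer), it is probably safest to normalize both the search term and each file name to the same form before comparing. A minimal sketch, assuming PHP 7+ and that the intl extension is loaded; to_nfc and name_contains are hypothetical helper names, not part of PHP or of the original script:

```php
<?php
// Hypothetical helper: return $s in NFC, or $s unchanged if it cannot be normalized.
function to_nfc(string $s): string
{
    // Assumption: without the intl extension we fall back to the raw string.
    if (!class_exists('Normalizer')) {
        return $s;
    }
    // Normalizer::normalize() returns false on failure (e.g. invalid UTF-8).
    $normalized = Normalizer::normalize($s, Normalizer::FORM_C);
    return $normalized === false ? $s : $normalized;
}

// Hypothetical helper: substring match that ignores NFC/NFD differences.
function name_contains(string $needle, string $filename): bool
{
    return strpos(to_nfc($filename), to_nfc($needle)) !== false;
}
```

With intl available, name_contains("o\u{0301}", "abc\u{00F3}def") should return true even though the two arguments use different normalization forms.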
neu242

It seems that Mac OS X/HFS+ is using character combinations instead of single characters. So the ó (U+00F3) is instead encoded as o (U+006F) followed by a combining acute accent (U+0301, COMBINING ACUTE ACCENT, whose UTF-8 encoding is the two bytes 0xCC 0x81). See also Apple’s Unicode Decomposition Table.
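
The difference is easy to see at the byte level. A small sketch, assuming PHP 7+ for the \u{...} escapes (the variable names are only for illustration):

```php
<?php
$nfc = "\u{00F3}";   // precomposed: LATIN SMALL LETTER O WITH ACUTE
$nfd = "o\u{0301}";  // decomposed: "o" followed by COMBINING ACUTE ACCENT

// Print the UTF-8 bytes of each form as hex, plus the byte length.
printf("NFC: %s (%d bytes)\n", bin2hex($nfc), strlen($nfc)); // NFC: c3b3 (2 bytes)
printf("NFD: %s (%d bytes)\n", bin2hex($nfd), strlen($nfd)); // NFD: 6fcc81 (3 bytes)

// Both render as the same glyph, but the byte strings are not identical.
var_dump($nfc === $nfd); // bool(false)
```

This agrees with the cat -evt output in the question: M-CM-3 on Linux is 0xC3 0xB3, while the M-^A after each base letter on the Mac is 0x81, the second byte of the combining accent.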

Gumbo
  • 1
    Yes, it looks like MacOSX uses normalization form D (NFD) to encode UTF-8, while most other systems use NFC: http://www.j3e.de/linux/convmv/man/#hfs__on_os_x___darwin – neu242 Apr 21 '09 at 18:15

Have you checked that both systems use the same locale?

What encoding is the PHP script using on both systems?

I would also try using strcmp instead of the equals operator. I'm not sure if the equals operator uses strcmp internally, but it's a simple thing to test out in your case.
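
For what it's worth, the choice of comparison function should not change the result when one string is precomposed and the other decomposed, since they differ byte for byte. A quick check, assuming PHP 7+ for the \u{...} escapes; $typed and $onDisk are made-up names standing in for the search term and the name read back from HFS+:

```php
<?php
$typed  = "\u{00F3}";   // "ó" as a single precomposed code point
$onDisk = "o\u{0301}";  // "ó" as "o" plus a combining acute accent

var_dump($typed == $onDisk);                          // bool(false)
var_dump($typed === $onDisk);                         // bool(false)
var_dump(strcmp($typed, $onDisk) === 0);              // bool(false)
var_dump(strpos("abc" . $onDisk, $typed) !== false);  // bool(false)
```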

Ben S
  • I use strpos in the actual implementation. I've also tried strcmp and mb_strpos. I think I might need to convert the strings to a common format before I compare them. – neu242 Apr 21 '09 at 17:18
  • Also, I tried nfs mounting an ext3 file system on the Mac. The script returned "equal" when working on this file system, so I guess I can't blame the Mac locale. – neu242 Apr 21 '09 at 17:21