
I have a file browser and I'm trying to find which file names contain a given query. The code goes like this:

$query = isset($_POST['s']) ? mb_strtolower($_POST['s'], 'UTF-8') : '';
$res = opendir($dir);
while (false !== ($file = readdir($res))) {
    if (mb_strpos(mb_strtolower($file, 'UTF-8'), mb_strtolower($query, 'UTF-8'), 0, 'UTF-8') !== false) {
        echo $file;
    }
}

For English words this works fine, but when the text is in Greek the results are not as expected: it works for some Greek words but not all. Could anyone help me solve this?

Vegeta
  • *"but when **query** is in Greek"* - This tells me you're using a database. Try passing UTF-8 to your connection before querying, **if** that is the case. See this page http://stackoverflow.com/questions/279170/utf-8-all-the-way-through – Funk Forty Niner Mar 24 '15 at 12:53
  • I'm not using a database; the query is the text that the user types for searching. – Vegeta Mar 24 '15 at 12:58
  • Did you look at http://php.net/manual/en/function.mb-internal-encoding.php - There's a reference about it in http://php.net/manual/en/function.mb-strpos.php – Funk Forty Niner Mar 24 '15 at 13:01
  • Yes, I'm using that in my code. – Vegeta Mar 24 '15 at 13:05
  • Is your file saved as UTF-8, all your files including the one that contains the words you're looking for? – Funk Forty Niner Mar 24 '15 at 13:06
  • Yes, all files are saved as utf-8. – Vegeta Mar 24 '15 at 13:09
  • Why don't you check what's different about the query and the filename? That might give you a clue. Simply echo them in an HTML comment and then look at the source code of your HTML page. – KIKO Software Mar 24 '15 at 13:09
  • I have already done that :) The query and the file name are exactly the same. – Vegeta Mar 24 '15 at 13:11
  • So your problem is solved then? – KIKO Software Mar 24 '15 at 13:11
  • No, the problem is not solved. – Vegeta Mar 24 '15 at 13:13
  • Well then, show us what the two strings look like when it doesn't work. Please use this: `echo '';` exactly, and copy from the source code of the HTML page. And make sure the page encoding is UTF-8. – KIKO Software Mar 24 '15 at 13:16
  • What exactly is `$file`? What exactly is `$query` for that matter? – deceze Mar 24 '15 at 13:40
  • I echo the query and the file names and it looks like this: query ->"παράρτημα" filenames->"παράρτημα α.doc" "παράρτημα β.pdf". – Vegeta Mar 24 '15 at 13:49
  • I edited the post so you can see what $file is. – Vegeta Mar 24 '15 at 14:23
  • Well those are clearly the same, but regretably it's not a copy of what I gave you, nor is it clear that you copied it from from the source code. These things matter! For instance `παράρτημα` and `παράρτημα` are very different, but look the same in a browser. This is just an example, your strings might be very different. – KIKO Software Mar 24 '15 at 14:25
  • File names are very often *not* stored as UTF-8. It very much depends on the underlying file system. Working with non-ASCII file names is quite a pain. – deceze Mar 24 '15 at 14:26
  • Please see this: http://stackoverflow.com/questions/2887909/working-with-japanese-filenames-in-php-5-3-and-windows-vista – Sharky Mar 24 '15 at 14:38

1 Answer


The graphemes may render the same or similar but they are not represented the same way. For example:

These were copied directly from your comment.
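To illustrate how two strings that render identically can still differ, here is a minimal sketch. It does not use the strings from the comment; instead it builds the letter "ά" from explicit code point escapes (an assumption made for the example, and it needs PHP 7 or later for the `\u{...}` syntax), once as a single precomposed character and once as a base letter plus a combining accent.

```php
<?php
// "ά" written two ways; both render the same in a browser.
$composed   = "\u{03AC}";          // GREEK SMALL LETTER ALPHA WITH TONOS
$decomposed = "\u{03B1}\u{0301}";  // GREEK SMALL LETTER ALPHA + COMBINING ACUTE ACCENT

// The underlying UTF-8 bytes are different.
echo bin2hex($composed), "\n";    // ceac
echo bin2hex($decomposed), "\n";  // ceb1cc81

// So a plain comparison fails, and so does a substring search.
var_dump($composed === $decomposed); // bool(false)
```

A file name written by one program and a query typed into a browser can easily end up on opposite sides of this difference, depending on the file system and input method involved.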


In order to compare them you should first use normalizer_normalize() on both strings to obtain them in their normalized forms. Which type of normalization form to use is ultimately up to you. There are four:

  1. NFD (Canonical Decomposition)
  2. NFC (Canonical Decomposition, followed by Canonical Composition)
  3. NFKD (Compatibility Decomposition)
  4. NFKC (Compatibility Decomposition, followed by Canonical Composition)

Because this normalization is used purely internally for comparison, you can ignore NFC and NFKC; there is no need to recompose. That leaves NFD or NFKD, that is, canonical or compatibility decomposition. The names hint at how strict each is about equivalence.
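As a rough check of the four forms, the sketch below runs a precomposed and a decomposed "ά" through each one. It assumes the intl extension is loaded (that is where `normalizer_normalize()` and the `Normalizer` constants come from); the test strings are made up for the example.

```php
<?php
// Requires the intl extension.
$composed   = "\u{03AC}";          // precomposed "ά"
$decomposed = "\u{03B1}\u{0301}";  // "α" + combining acute accent

$forms = [
    'NFD'  => Normalizer::FORM_D,
    'NFC'  => Normalizer::FORM_C,
    'NFKD' => Normalizer::FORM_KD,
    'NFKC' => Normalizer::FORM_KC,
];

$results = [];
foreach ($forms as $name => $form) {
    $a = normalizer_normalize($composed, $form);
    $b = normalizer_normalize($decomposed, $form);
    $results[$name] = [$a === $b, bin2hex($a)];
    printf("%s: equal=%s hex=%s\n", $name, var_export($a === $b, true), bin2hex($a));
}
// NFD:  equal=true hex=ceb1cc81
// NFC:  equal=true hex=ceac
// NFKD: equal=true hex=ceb1cc81
// NFKC: equal=true hex=ceac
```

Every form makes the two inputs equal. The composed forms only add a recomposition step on top, which buys nothing when the result is never shown to the user.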


1.1 Canonical and Compatibility Equivalence:

Canonical equivalence is a fundamental equivalency between characters or sequences of characters that represent the same abstract character, and when correctly displayed should always have the same visual appearance and behavior.

Compatibility equivalence is a weaker equivalence between characters or sequences of characters that represent the same abstract character, but may have a different visual appearance or behavior.


For searching I would go with the latter, compatibility equivalence (NFKD).
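To show what the weaker equivalence buys you in a search, here is a small sketch using two characters that only have compatibility decompositions. The characters are picked for illustration and the intl extension is assumed.

```php
<?php
// Requires the intl extension.
$ligature = "\u{FB01}";  // LATIN SMALL LIGATURE FI
$micro    = "\u{00B5}";  // MICRO SIGN, which looks like Greek small mu
$mu       = "\u{03BC}";  // GREEK SMALL LETTER MU

// Canonical decomposition leaves both characters untouched.
var_dump(normalizer_normalize($ligature, Normalizer::FORM_D) === "fi"); // bool(false)
var_dump(normalizer_normalize($micro, Normalizer::FORM_D) === $mu);     // bool(false)

// Compatibility decomposition folds them into their plain equivalents.
var_dump(normalizer_normalize($ligature, Normalizer::FORM_KD) === "fi"); // bool(true)
var_dump(normalizer_normalize($micro, Normalizer::FORM_KD) === $mu);     // bool(true)
```

With NFKD a user who types a plain "μ" would also find a file name containing the micro sign, which is usually what you want from a search box. If you need to tell such characters apart, NFD is the stricter choice.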

Example:

$foo = "παράρτημα";
$bar = "παράρτημα";
var_dump($foo === $bar);
var_dump(
    normalizer_normalize($foo, Normalizer::FORM_KD) ===
    normalizer_normalize($bar, Normalizer::FORM_KD)
);

Output:

bool(false)
bool(true)
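
Applied to the search from the question, this might look like the sketch below. The helper names `foldForSearch()` and `filenameMatches()` are hypothetical, the intl extension is assumed, and treating an empty query as matching everything is a choice made here, not something the question specifies. The sample strings are written with explicit escapes so the two forms are unambiguous.

```php
<?php
// Sketch only; requires the intl and mbstring extensions.

// Hypothetical helper: bring a string into a comparable form.
function foldForSearch(string $text): string
{
    // NFKD splits accented letters into base letter + combining mark.
    $normalized = normalizer_normalize($text, Normalizer::FORM_KD);
    if ($normalized === false) {
        // Normalization failed (e.g. a file name that is not valid UTF-8);
        // fall back to the original string.
        $normalized = $text;
    }
    return mb_strtolower($normalized, 'UTF-8');
}

// Hypothetical helper: does $file contain $query after folding both?
function filenameMatches(string $file, string $query): bool
{
    $needle = foldForSearch($query);
    if ($needle === '') {
        return true; // assumption: an empty query matches every file
    }
    return mb_strpos(foldForSearch($file), $needle, 0, 'UTF-8') !== false;
}

// File name with a decomposed "ά" (U+03B1 U+0301), query with a precomposed one (U+03AC).
$file  = "\u{03C0}\u{03B1}\u{03C1}\u{03B1}\u{0301}\u{03C1}\u{03C4}\u{03B7}\u{03BC}\u{03B1} \u{03B1}.doc";
$query = "\u{03C0}\u{03B1}\u{03C1}\u{03AC}\u{03C1}\u{03C4}\u{03B7}\u{03BC}\u{03B1}";

var_dump(mb_strpos($file, $query, 0, 'UTF-8') !== false); // bool(false)
var_dump(filenameMatches($file, $query));                  // bool(true)
```

Inside the `readdir()` loop the condition would then become `if (filenameMatches($file, $query))`. This only helps if the file names reach PHP as UTF-8 in the first place; as the comments point out, that depends on the file system.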
user3942918