I'm building a search engine that finds words in a database. The main MySQL query uses full-text search in BOOLEAN MODE. Let's say we have the Polish word for snake, wąż. How can I generate the other variations of this word, written with only some of the diacritics or with none at all? From wąż I need to get waz, wąz and waż. I need to do the same with, for example, the Polish word for turtle, żółw. The variations of this word will be zółw, zołw, zolw, zólw and żólw. IMHO those are all the variations. How can I do this for these two examples?
I also want to do this for other Latin-alphabet languages. For example, we may have the Romanian word for jig, gîgă, or the Spanish word for pencil, lápiz. How can I write a universal procedure in PHP that converts a single word of any Latin-alphabet language into all the variations of that word that can be typed without pressing ALT on a QWERTY keyboard?
This example works OK.
The part of the SQL query that uses the $keywords variable is the most important:
$keywords ="+Terrarium +wąż";
"SELECT * FROM item WHERE lower (adv_status) REGEXP '{$advert_status}' AND ( (MATCH(name) AGAINST('$keywords' IN BOOLEAN MODE) ) OR (MATCH(description) AGAINST('$keywords' IN BOOLEAN MODE)) OR name LIKE '$keywords%')";
I want to add the other variations of the word snake (wąż) to the $keywords variable:
$keywords ="+Terrarium +wąż +waz +wąz +waż";
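A note on the string above, as far as I understand boolean mode: a leading + marks a word as required, so +wąż +waz +wąz +waż would only match rows that contain all four spellings. Wrapping the variants in parentheses after a single + should require at least one of them instead. A minimal sketch, where boolean_group() is a hypothetical helper name and not part of PHP or MySQL:

```php
<?php
// Hypothetical helper: joins the spellings of one word into a single
// required group, e.g. +(wąż waz wąz waż), so that any one of them matches.
function boolean_group(array $variants): string
{
    return '+(' . implode(' ', $variants) . ')';
}

$keywords = '+Terrarium ' . boolean_group(['wąż', 'waz', 'wąz', 'waż']);
echo $keywords; // +Terrarium +(wąż waz wąz waż)
```

Whether the accented and unaccented spellings are treated as different words at all also depends on the collation of the indexed columns, so this is only a sketch under that assumption.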
So, how can I write a universal procedure in PHP that converts a single word of any Latin-alphabet language into all the variations of that word that can be typed without pressing ALT on a QWERTY keyboard?
Is it possible to do this with regular expressions, using functions such as preg_replace() or preg_match()? Thanks for any help. Sorry, English is my second language.
Edit:
I have written a short procedure in PHP.
<?php
$string = "żółw wąż żółtodziób";
$words = explode(' ', $string);
$number = count($words);
$regex = '/[^a-zA-Z]/';
$chars_oryginal = array();
$chars_ascii = array();
$new_words = array();
$k=0; // index of the word in the $new_words array
for ($i=0;$i<$number;$i++)
{
    if (!preg_match($regex,$words[$i]))
    {
        // the word contains only a-z/A-Z letters: keep it unchanged
        $new_words[$k]=$words[$i];
        $k++;
    }
    else
    {
        // the word contains other characters: split it into single characters
        $chars_oryginal[$i] = mb_str_split($words[$i]);
        $length = count($chars_oryginal[$i]);
        ////$length = strlen($words[$i]);
        for ($j=0;$j<$length;$j++)
        {
            // detect whether the character is a-z/A-Z
            if (!preg_match($regex,$chars_oryginal[$i][$j]))
            {
                $char_unicode_value = mb_ord($chars_oryginal[$i][$j]); // Unicode code point
                echo " + $j) " . $char_unicode_value . " = " . $chars_oryginal[$i][$j] . ", ";
                $new_words[$k]=$words[$i];
                $k++;
            }
            else // otherwise treat the character as a diacritic
            {
                $char_unicode_value = mb_ord($chars_oryginal[$i][$j]); // Unicode code point
                echo " - $j) " . $char_unicode_value . " = " . $chars_oryginal[$i][$j] . " " ;
                $chars_ascii[$j] = preg_replace($regex.'i', '',iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $chars_oryginal[$i][$j]));
                echo $chars_ascii[$j] . ", ";
                if (isset($chars_ascii[$j]))
                {
                    $pos[$i][$j] = mb_strpos($chars_oryginal[$i][$j],$words[$i]);
                    $new_words[$k] = str_replace($chars_oryginal[$i][$j],$chars_ascii[$j],$words[$i]);
                    $k++;
                }
            }
            if ($j>= $length-1) $k++;
        }
    }
}
$new_words = array_unique($new_words); // remove duplicate elements
print_r($new_words);
?>
As a result, from the 3 words in the $string variable we get some variations of these words, but the list of variations is not complete.
- 0) 380 = ż z, - 1) 243 = ó o, - 2) 322 = ł l, + 3) 119 = w, + 0) 119 = w, - 1) 261 = ą a, - 2) 380 = ż z, - 0) 380 = ż z, - 1) 243 = ó o, - 2) 322 = ł l, + 3) 116 = t, + 4) 111 = o, + 5) 100 = d, + 6) 122 = z, + 7) 105 = i, - 8) 243 = ó o, + 9) 98 = b,
Array ( [0] => zółw [1] => żołw [2] => żólw [3] => żółw [5] => wąż [6] => waż [7] => wąz [9] => zółtodziób [10] => żołtodziob [11] => żóltodziób [12] => żółtodziób )
The first word, żółw, should have 6 variations but we get 4. The second word, wąż, comes out in 3 variations, although waz is missing. The third word, żółtodziób, should IMHO have 8 variations but we get only 4.
This is a pretty good result for me, considering that @Shadow said this must be very difficult to achieve. What is wrong with this procedure? How can I fix it to get all the results? Remember: I don't want to create UTF-8/Unicode characters of any Latin-alphabet language, only to create ASCII characters from UTF-8/Unicode ones and add these variations of a single word to the $new_words array. Any help appreciated. Thanks.
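One likely cause of the missing entries: every str_replace() call starts again from the original word and swaps a single letter, so spellings with two or more replaced letters (such as waz or zolw) are never built. A possible direction is to build the word position by position and, at each accented letter, branch into "keep it" and "replace it". The sketch below assumes a hand-written lookup table instead of iconv(), because the //TRANSLIT output can differ between platforms. The names diacritic_variants() and $map are hypothetical, and the table is only an illustrative subset that would have to be extended for each language.

```php
<?php
// Hypothetical helper: returns every spelling of $word in which each
// mapped letter is either kept or replaced by its ASCII counterpart.
function diacritic_variants(string $word, array $map): array
{
    $variants = [''];
    // Split the UTF-8 string into single characters.
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        // Options for this position: the original character,
        // plus its ASCII form when the table has one.
        $options = [$char];
        if (isset($map[$char])) {
            $options[] = $map[$char];
        }
        // Extend every prefix built so far with every option.
        $next = [];
        foreach ($variants as $prefix) {
            foreach ($options as $option) {
                $next[] = $prefix . $option;
            }
        }
        $variants = $next;
    }
    return $variants;
}

// Illustrative subset only (assumption): some Polish, Romanian and Spanish letters.
$map = [
    'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
    'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
    'î' => 'i', 'ă' => 'a', 'á' => 'a',
];

print_r(diacritic_variants('wąż', $map)); // wąż, wąz, waż, waz
```

With this approach a word with n mapped letters yields 2^n spellings including the original, so żółw gives 8 and żółtodziób gives 16, which grows quickly for long words.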