How to generate all partially ASCII-folded variations of a single word that is properly written with diacritics?

Question

I'm building a search engine to find words in a database. The main MySQL query uses FULLTEXT search in BOOLEAN MODE. Let's say we have the Polish word for snake, wąż. How can I generate the other variations of this word that are written only partially with diacritics? From wąż I need to get waz, wąz and waż. I need the same for the Polish word for turtle, żółw; its variations would be zółw, zołw, zolw, zólw and żólw. IMHO those are all the variations. How can I do this for these two examples? I also want to do it for other Latin-alphabet languages. For example, we may have the Romanian word jig - gîgă, or the Spanish word pencil - lápiz. How can I write a universal PHP procedure that converts a single word of any Latin-alphabet language into all the variations of that word that can be typed without pressing ALT on a QWERTY keyboard?

This example works OK.

The part of the SQL query that uses the $keywords variable is the most important:

$keywords ="+Terrarium +wąż";

"SELECT * FROM item WHERE lower (adv_status) REGEXP '{$advert_status}' AND ( (MATCH(name) AGAINST('$keywords' IN BOOLEAN MODE) ) OR (MATCH(description) AGAINST('$keywords' IN BOOLEAN MODE)) OR name LIKE '$keywords%')";

I want to add the other variations of the word for snake (wąż) to the $keywords variable:

$keywords ="+Terrarium +wąż +waz +wąz +waż";
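One caveat about the string above: in MySQL boolean mode a leading + means the word must be present in every returned row, so +wąż +waz +wąz +waż only matches rows that contain all four spellings at once. Parentheses group words into a subexpression, so +(wąż waz wąz waż) requires at least one of them instead. A minimal sketch of building such a term follows; the helper name requiredAnyOf is hypothetical and not part of the original code.

```php
<?php
/**
 * Build a boolean-mode term that requires at least one of the given spellings.
 * Hypothetical helper for illustration only.
 */
function requiredAnyOf(array $spellings): string
{
    // Strip boolean-mode operator characters so a spelling cannot alter the query logic.
    $clean = array_map(
        fn (string $s): string => preg_replace('/[+\-<>()~*"@]+/', '', $s),
        $spellings
    );
    // Drop duplicates and empty strings, then reindex.
    $clean = array_values(array_filter(array_unique($clean), 'strlen'));

    if (count($clean) === 0) {
        return '';
    }
    if (count($clean) === 1) {
        return '+' . $clean[0];
    }
    return '+(' . implode(' ', $clean) . ')';
}

$keywords = '+Terrarium ' . requiredAnyOf(['wąż', 'waz', 'wąz', 'waż']);
echo $keywords, PHP_EOL; // +Terrarium +(wąż waz wąz waż)
```

The resulting string should still be passed to the query as a bound parameter rather than interpolated into the SQL text.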

How can I write a universal PHP procedure that converts a single word of any Latin-alphabet language into all the variations of that word that can be typed without pressing ALT on a QWERTY keyboard?

Is it possible to use regular expressions with functions such as preg_replace() or preg_match()? Thanks for any help. Sorry, but English is my second language.
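On the regex part, a hedged observation: a single preg_replace() call returns one string, so on its own it cannot enumerate several alternative spellings. Regular expressions are still handy for splitting a UTF-8 word into characters and for detecting which letters are non-ASCII. The sketch below does only that and reports how many spellings to expect if every such letter can independently be kept or folded; the function name nonAsciiLetterPositions is hypothetical.

```php
<?php
/**
 * Return the character positions of non-ASCII letters in a UTF-8 word.
 * Hypothetical helper for illustration only.
 */
function nonAsciiLetterPositions(string $word): array
{
    // The /u modifier makes PCRE split on UTF-8 characters instead of bytes.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    $positions = [];
    foreach ($chars as $i => $ch) {
        $isNonAscii = preg_match('/^[^\x00-\x7F]$/u', $ch) === 1;
        $isLetter   = preg_match('/^\p{L}$/u', $ch) === 1;
        if ($isNonAscii && $isLetter) {
            $positions[] = $i;
        }
    }
    return $positions;
}

$pos = nonAsciiLetterPositions('żółw');
// Each such letter is either kept or folded, hence 2^n spellings including the original.
echo count($pos), ' letters, ', 2 ** count($pos), " spellings\n"; // 3 letters, 8 spellings
```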

Edit:

I have written a short procedure in PHP.

<?php
$string = "żółw wąż żółtodziób";
$words = explode(' ', $string);
$number = count($words);
$regex = '/[^a-zA-Z]/';

$chars_oryginal = array();
$chars_ascii = array();
$new_words = array();
$k = 0;    // index of the next word in the $new_words array

for ($i = 0; $i < $number; $i++) {
    if (!preg_match($regex, $words[$i])) {
        // The word contains only a-z / A-Z letters: keep it unchanged
        $new_words[$k] = $words[$i];
        $k++;
    } else {
        $chars_oryginal[$i] = mb_str_split($words[$i]);
        $length = count($chars_oryginal[$i]);

        // $length = strlen($words[$i]);
        for ($j = 0; $j < $length; $j++) {
            // detect whether the character is a-z / A-Z
            if (!preg_match($regex, $chars_oryginal[$i][$j])) {
                $char_unicode_value = mb_ord($chars_oryginal[$i][$j]);   // Unicode code point

                echo " + $j) " . $char_unicode_value . " = " . $chars_oryginal[$i][$j] . ", ";

                $new_words[$k] = $words[$i];
                $k++;
            } else {
                // the character is a diacritic letter
                $char_unicode_value = mb_ord($chars_oryginal[$i][$j]);   // Unicode code point

                echo " - $j) " . $char_unicode_value . " = " . $chars_oryginal[$i][$j] . " ";

                $chars_ascii[$j] = preg_replace($regex.'i', '', iconv('UTF-8', 'ASCII//TRANSLIT//IGNORE', $chars_oryginal[$i][$j]));
                echo $chars_ascii[$j] . ", ";

                if (isset($chars_ascii[$j])) {
                    $pos[$i][$j] = mb_strpos($chars_oryginal[$i][$j], $words[$i]);
                    $new_words[$k] = str_replace($chars_oryginal[$i][$j], $chars_ascii[$j], $words[$i]);
                    $k++;
                }
            }

            if ($j >= $length - 1) $k++;
        }
    }
}

$new_words = array_unique($new_words);  // remove duplicate elements
print_r($new_words);
?>

As a result, from the 3 words in the $string variable we get some variations of these words, but the list of variations is not complete.

- 0) 380 = ż z, - 1) 243 = ó o, - 2) 322 = ł l, + 3) 119 = w, + 0) 119 = w, - 1) 261 = ą a, - 2) 380 = ż z, - 0) 380 = ż z, - 1) 243 = ó o, - 2) 322 = ł l, + 3) 116 = t, + 4) 111 = o, + 5) 100 = d, + 6) 122 = z, + 7) 105 = i, - 8) 243 = ó o, + 9) 98 = b,
Array ( [0] => zółw [1] => żołw [2] => żólw [3] => żółw [5] => wąż [6] => waż [7] => wąz [9] => zółtodziób [10] => żołtodziob [11] => żóltodziób [12] => żółtodziób )

The first word, żółw, should have 6 variations but we get 4. The second word, wąż, correctly has 3 variations. The third word, żółtodziób, should IMHO have 8 variations but we get only 4.

This is a pretty good result for me, considering that @Shadow said this must be very difficult to achieve. What is wrong with this procedure? How can I fix it to get all the results? Remember: I don't want to create UTF-8/Unicode characters of any Latin-alphabet language; I just want to create ASCII characters from UTF-8/Unicode ones and add these variations of a single word to the $new_words array. Any help appreciated. Thanks.
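A likely reason the list is incomplete: the procedure folds one character at a time, always starting again from the original word, so it never combines two replacements (and str_replace() folds every occurrence of that character at once, which is why żołtodziob appears). Getting every spelling means letting each diacritic letter independently stay or fold, which gives 2^n spellings for n such letters. Under that rule żółw has 8 spellings including the original (żołw and żolw also count) and żółtodziób has 16. The sketch below is one way to do it, not a definitive implementation. The ASCII_FOLD table and the function name diacriticVariations are assumptions: the table lists only lowercase Polish letters and would need extending per language, or replacing with the iconv() call used in the procedure above.

```php
<?php
// Hypothetical fold table: lowercase Polish letters only; extend it for other languages.
const ASCII_FOLD = [
    'ą' => 'a', 'ć' => 'c', 'ę' => 'e', 'ł' => 'l', 'ń' => 'n',
    'ó' => 'o', 'ś' => 's', 'ź' => 'z', 'ż' => 'z',
];

/**
 * Return every spelling of $word in which each diacritic letter is either
 * kept or replaced by its ASCII counterpart (the original spelling comes first).
 */
function diacriticVariations(string $word, array $fold = ASCII_FOLD): array
{
    $variants = [''];
    foreach (mb_str_split($word) as $ch) {   // mb_str_split() needs PHP >= 7.4
        // A foldable letter contributes two options, any other character one.
        $options = isset($fold[$ch]) ? [$ch, $fold[$ch]] : [$ch];
        $next = [];
        foreach ($variants as $prefix) {
            foreach ($options as $option) {
                $next[] = $prefix . $option;
            }
        }
        $variants = $next;
    }
    return array_values(array_unique($variants));
}

print_r(diacriticVariations('wąż'));   // wąż, wąz, waż, waz
echo count(diacriticVariations('żółw')), PHP_EOL;        // 8
echo count(diacriticVariations('żółtodziób')), PHP_EOL;  // 16
```

Because the count doubles with every diacritic letter, it may be worth capping the number of foldable letters per word before calling this. The result can be joined into one boolean-mode group, for example '+(' . implode(' ', diacriticVariations('wąż')) . ')'.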

Unless you can come up with logic to universally recognise which words need variations and how to produce those variations, it will not be possible. You may have to ask people familiar with linguistics first. Alternatively, you can build a dictionary of variations, which you will have to maintain and expand over time. — Shadow, Nov 27 '22 at 21:32
@Shadow Really? Could it be that complex? I know that it is possible to detect non-ASCII characters with the regex /[^a-zA-Z]/. Also, the simple call $nonascii = iconv('UTF-8', 'US-ASCII//TRANSLIT', $words) creates words with only ASCII characters. So I can have the variable $keywords="+wąż +waz" in a very simple way. I hope some guru can give a 10-20 line solution to this problem for Latin-alphabet languages. Greetings from Poland. — Sylwester Bogusiak, Nov 28 '22 at 07:10
I don't think you have thought this through properly! 1) Number of variations: if you have a long word with many letters that can have an accent or diacritic, then the number of variations that your code has to produce can explode very quickly, particularly if a letter can take multiple accents / diacritics in a language. 2) Accents / diacritics can change the meaning of a word, thus resulting in irrelevant search results. 3) Foreign words with accents / diacritics can complicate generating variations. English does not have accents, but fiancé is spelled with an accent. — Shadow, Nov 28 '22 at 12:12
Will you generate accented versions of all English words containing the letter e just because fiancé is spelled with é? — Shadow, Nov 28 '22 at 12:15
As per the Unicode specification, you can basically accumulate as many diacritics as you want into a single character. This problem is typically solved at database level by providing an appropriate collation for the FULLTEXT index. — Álvaro González, Nov 28 '22 at 14:37
@Shadow If the user types an ASCII character rather than a UTF-8 character in a word, I don't want to replace that character with a diacritic from any language. My idea is simple: when the user types a correctly spelled word, a procedure should create variations of this word that mix ASCII and UTF-8 characters. I'm really close to solving the problem, but my procedure is not ideal and does not create the full set of variations of a single word. — Sylwester Bogusiak, Nov 28 '22 at 21:59
@ÁlvaroGonzález Could you give some more info on how to set up a FULLTEXT index for that purpose? I did it in my project's MySQL DB, but I'm not sure the settings are correct. — Sylwester Bogusiak, Nov 28 '22 at 22:08
@ÁlvaroGonzález I'm searching for words in the DESCRIPTION column. This column is of type TEXT and has a FULLTEXT index with some name. I just picked the name randomly; I'm not sure whether the name must be something specific. There are more advanced settings for this index, but I don't know which settings are best for searching words in a DB with the utf8mb4_unicode_ci collation. Thanks for your hint. — Sylwester Bogusiak, Nov 28 '22 at 22:15
@SylwesterBogusiak And how can you tell if a word is spelled correctly? Using a collation instead could be a good idea, as long as you can detect the language from the search terms, apply the correct collation and also filter out rows that are not written in the same language. Still a better idea than trying to generate different versions of the word. — Shadow, Nov 29 '22 at 04:07
@Shadow I understand your point, but I don't know whether I should apply changes to the database or to the MySQL query. Where should I put the information about the collation to use? In the query? If it is possible to detect the language and filter rows that are written in the same language, that would be a fantastic solution. Thanks for your hint, but please give me some more info on how to do it. Perhaps you are right about that solution. I don't know how to tell whether a word is spelled correctly at the moment. Do you know how to do it? — Sylwester Bogusiak, Nov 29 '22 at 11:11
For example, the Google search engine finds the Polish word żółw if we type it without diacritics as zolw, so I see this problem is solved in a different way than the one I am trying to apply. The same goes for the Spanish word lápiz when you type lapiz: Google finds the correct result. Another example is the Firefox browser: when you search for words on a page with CTRL-F, the application finds all words whether or not you typed them with diacritics. I need a similar solution and don't know whether I should change settings in the DB or just try another SQL query. — Sylwester Bogusiak, Nov 29 '22 at 11:18
Language detection and spell checking are questions for Natural Language Processing (NLP), which is a very hot topic in data science. You probably need to use APIs created by big tech companies or implement your own models for these. However, neither PHP nor SQL is suitable for NLP; you need to use Python or R for these purposes. For collation, start with this SO question: https://stackoverflow.com/questions/68390035/use-different-collation-with-each-query Please note that you need to research which collation(s) are best for you, and the answer may be different for each additional language. — Shadow, Nov 29 '22 at 11:34
For example, GCP language detection API: https://cloud.google.com/translate/docs/basic/detecting-language, same for Azure: https://learn.microsoft.com/en-us/azure/cognitive-services/language-service/language-detection/how-to/call-api, and there are tons of others available. — Shadow, Nov 29 '22 at 11:38
@Shadow Thanks, but I don't want to rebuild the whole project and use some APIs. IMHO it is possible to search the database with diacritic-insensitive settings, and this option should give much better results than a diacritic-sensitive search. But I don't know how to write the query and what the settings in the DB should be. Here is a related topic, but there is no clear answer on how it was solved: https://stackoverflow.com/questions/26705362/mysql-diacritic-insensitive-fulltext-search — Sylwester Bogusiak, Nov 29 '22 at 11:51
