
I have some code that compares the text with the values of an array and only completes the operation using words that are in the array:

First code (just an example):

$myVar = 'essa pizza é muito gostosa, que prato de bom sabor';
$myWords=array(
    array('sabor','gosto','delicia'),
    array('saborosa','gostosa','deliciosa'),
);

foreach($myWords as $words){
    shuffle($words); // randomize the subarray
    // pipe-together the words and return just one match
    if(preg_match('/\K\b(?:'.implode('|',$words).')\b/',$myVar,$out)){
        // generate "replace_pair" from matched word and a random remaining subarray word
        // replace and preserve the new sentence
        $myVar=strtr($myVar,[$out[0]=>current(array_diff($words,$out))]);
    }
}
echo $myVar;

My Question:

I have a second piece of code, which does not use rand/shuffle (I do not want randomness; I want precise substitutions, always replacing the column 0 value with the column 1 value). It should always exchange the values:

// wrong output: $myVar = "minha irmã alanné é not aquela blnode, elere é a bom plperito";
$myVar = "my sister alannis is not that blonde, here is a good place";
$myWords=array(array("is","é"),
    array("on","no"),
    array("that","aquela"),
    //array("blonde","loira"),
    //array("not","não"),
    array("sister","irmã"), 
    array("my","minha"),
    //array("nothing","nada"),
    array("myth","mito"),
    array("he","ele"),
    array("good","bom"),
    array("ace","perito"),
   // array("here","aqui"), // "here" is not in the array, so it must stay untouched; he=>ele must not turn it into the non-existent word "elere"
); 
$replacements = array_combine(array_column($myWords,0),array_column($myWords,1));
$myVar = strtr($myVar,$replacements);
echo $myVar;
// expected output:  minha irmã alannis é not aquela blonde, here é a bom place
// avoid replacing slices of words!

Expected output: minha irmã alannis é not aquela blonde, here é a bom place

Avoid replacing slices of words! Always check that the whole word exists in the array before making the substitution.

The replacement should only ever produce real words that exist in the $myWords array. This avoids errors like:

alanné, blnode, elere, plperito

Those four words do not exist; they are writing errors. How can I do that in the second code?

In short, the exchange must be made on a complete word/key, an existing word, and must not create something strange out of slices of keywords!

2 Answers


Unfortunately, strtr() is the wrong tool for this job because it is "word boundary ignorant". To target whole words, there is no simpler way than using a regex pattern with word boundaries.

Furthermore, to ensure that longer strings are matched before shorter strings (strings that may exist inside other strings), you must sort $myWords by string length (descending / longest to shortest; using the multi-byte version only if necessary).

Once the array of words is sorted and converted to individual regex patterns, you can pass the arrays as the pattern and replacement parameters of preg_replace().
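
The "word boundary ignorant" point above can be sketched in a few lines. This is only an illustration: the single he => ele pair comes from the question's word list, while the test sentence and the variable names are made up for the example.

```php
<?php
// One pair from the question's word list (the sample sentence is hypothetical).
$pairs = ['he' => 'ele'];

// strtr() replaces the substring wherever it occurs, even inside "here".
$withStrtr = strtr('here he is', $pairs);

// \b anchors the match to word boundaries, so only the standalone "he" is replaced.
$withRegex = preg_replace('/\bhe\b/u', 'ele', 'here he is');

echo $withStrtr, "\n";  // elere ele is
echo $withRegex, "\n";  // here ele is
```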

Code (Demo)

$myVar = "my sister alannis is not that blonde, here is a good place";
$myWords=array(
    array("is","é"),
    array("on","no"),
    array("that","aquela"),
    array("sister","irmã"), 
    array("my","minha"),
    array("myth","mito"),
    array("he","ele"),
    array("good","bom"),
    array("ace","perito")
); 
usort($myWords,function($a,$b){return mb_strlen($b[0])<=>mb_strlen($a[0]);});  // sort subarrays by first column multibyte length
// remove mb_ if first column holds no multi-byte characters.  strlen() is much faster.

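foreach($myWords as &$words){
    // generate patterns using the escaped search word, word boundaries, case-insensitivity, and the unicode (u) flag
    $words[0]='/\b'.preg_quote($words[0],'/').'\b/iu';
}
unset($words);  // break the reference so that later loops over $myWords cannot overwrite the last element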

//var_export($myWords);
//var_export(array_column($myWords,0));
//var_export(array_column($myWords,1));

$myVar=preg_replace(array_column($myWords,0),array_column($myWords,1),$myVar);
echo $myVar;

Output:

minha irmã alannis é not aquela blonde, here é a bom place

What this doesn't do is respect the case of the matched substrings. I mean, my and My will both be replaced by minha.

To accommodate different casing, you will need to use preg_replace_callback().

Here is that consideration (which handles uppercase first letter words, not ALL CAPS words):

Code (Demo) <-- run this to see the original casing preserved after the replacement.

foreach($myWords as $words){
    $myVar=preg_replace_callback(
        $words[0],
        function($m)use($words){
            return ctype_upper(mb_substr($m[0],0,1))?
                mb_strtoupper(mb_substr($words[1],0,1)).mb_strtolower(mb_substr($words[1],1)):
                $words[1];
        },
        $myVar);
}
echo $myVar;
mickmackusa
  • Hi @mickmackusa, I'm sorry I did not ask the question more fully, I even used a previous case to make it clearer, but I see that I even caught it. see the update if you can help me again. please. – Ariane Martins Gomes Do Rego Nov 13 '17 at 00:25
  • Thank you very much, it helped me a lot, it helped a lot again, it worked perfectly in all possible tests !!! Thanksss! :-D – Ariane Martins Gomes Do Rego Nov 13 '17 at 01:31
  • Hi @mickmackusa, thank you by your approach! Sorry for the delay, I've been off the PC all these hours! yes, it really was a good insight on your part, I always use strtolower in the variables before doing anything with the strings stored in them, because I know there may be problems with uppercase and lowercase! and then, if need be: ucfirst, and then if need be, it's a way to hide the dirt from under the rug, with your idea above gets better code. and your second code gave me idea to apply it in another situation here. Thank you very much one more time :-) – Ariane Martins Gomes Do Rego Nov 14 '17 at 23:32
  • Hi @mickmackusa, :-) http://sandbox.onlinephpfunctions.com/code/004d3e62f58280043db37f2c404853f9b6da51f8 (thank you, I did not know about this website functions online)the code does not recognize syllables with accents as words, so it replaces the word already replaced by another word! but it is rare after a syllable with an accent, there is the formation of another word in English. ( I think you can even inform the system, put the word back with white space before and after it. )Thanks – Ariane Martins Gomes Do Rego Nov 15 '17 at 22:20
  • Good news, the solution is the addition of a single character! All patterns will need to have a `unicode modifier` represented by a `u` flag (next to the case-insensitive flag) at the end of the pattern. Here is a demo: http://sandbox.onlinephpfunctions.com/code/99da4b2d141e4635099a399bd0bce83ba89c1808 – mickmackusa Nov 15 '17 at 22:27
  • Wow! so fast! Thanks you so much!! Awesome! – Ariane Martins Gomes Do Rego Nov 15 '17 at 22:31
  • Hi @mickmackusa, I thought you would like to know for some reason, as per experience, that the first code here as a solution, failed to make the exchange for the correct string, when my array got very large, up to 4000 lines the code worked, then which was greater than 4000 lines, the replace was random !! so I used your second code `preg_replace_callback` and it `mb_substr` was 100% :-). good week :-) – Ariane Martins Gomes Do Rego Dec 26 '17 at 04:40
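
The effect of the `u` flag discussed in the comments can be sketched as follows. The sample text is an assumption chosen for illustration: "anéis" (Portuguese for "rings") happens to end in the letters "is" right after an accented character. The result without the flag can depend on the locale in use, so only the `u` result is stated.

```php
<?php
// Hypothetical sample: "anéis" contains "is" directly after the accented "é".
$text = 'anéis is';

// Without the u flag the subject is scanned byte by byte, so the bytes of "é"
// are usually not treated as word characters and the "is" inside "anéis" can match.
$byteMode = preg_replace('/\bis\b/i', 'é', $text);

// With the u flag "é" counts as a letter, so only the standalone "is" is replaced.
$unicodeMode = preg_replace('/\bis\b/iu', 'é', $text);

echo $unicodeMode, "\n";  // anéis é
var_dump($byteMode === $unicodeMode);  // typically false: byte mode also alters "anéis"
```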

My previous method was incredibly inefficient. I didn't realize how much data you were processing, but if we are upwards of 4000 lines, then efficiency is vital (I think my brain was stuck on strtr()-related processing based on your previous question(s)). This is my new/improved solution, which I expect to leave my previous solution in the dust.

Code: (Demo)

$myVar = "My sister alannis Is not That blonde, here is a good place. I know Ariane is not MY SISTER!";
echo "$myVar\n";

$myWords = [
    ["is", "é"],
    ["on", "no"],
    ["that", "aquela"],
    ["sister", "irmã"], 
    ["my", "minha"],
    ["myth", "mito"],
    ["he", "ele"],
    ["good", "bom"],
    ["ace", "perito"],
    ["i", "eu"]  // note: the key "i" must be lowercase, because lookups use the lowercased match
];
$translations = array_column($myWords, 1, 0);  // or skip this step and just declare $myWords as key-value pairs

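// length sorting is not necessary with a non-capturing group (the regex engine backtracks to try "myth" after "my")
// preg_quote() and \Q\E are not used because dealing with words only (no danger of misinterpretation by regex)

$pattern = '/\b(?:' . implode('|', array_keys($translations)) . ')\b/iu';
// (?:...) rather than an atomic (?>...): an atomic group would commit to "my" and never match "myth"
// the u flag makes \b treat accented letters as word characters
/* echo $pattern;
   makes: /\b(?:is|on|that|sister|my|myth|he|good|ace|i)\b/iu
   demo: https://regex101.com/r/DXTtDf/1
*/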
$translated = preg_replace_callback(
    $pattern,
    function($m) use($translations) {  // bring $translations (lookup) array to function
        $encoding = 'UTF-8';  // default setting
        $key = mb_strtolower($m[0], $encoding);  // standardize keys' case for lookup accessibility
        if (ctype_lower($m[0])) { // treat as all lower
            return $translations[$m[0]];
        } elseif (mb_strlen($m[0], $encoding) > 1 && ctype_upper($m[0])) {  // treat as all uppercase
            return mb_strtoupper($translations[$key], $encoding);
        } else {  // treat as only first character uppercase
            return mb_strtoupper(mb_substr($translations[$key], 0, 1, $encoding), $encoding)  // uppercase first
                   . mb_substr($translations[$key], 1, mb_strlen($translations[$key], $encoding) - 1, $encoding);  // append remaining lowercase
        }
    },
    $myVar
);
    
echo $translated;

Output:

My sister alannis Is not That blonde, here is a good place. I know Ariane is not MY SISTER!
Minha irmã alannis É not Aquela blonde, here é a bom place. Eu know Ariane é not MINHA IRMÃ!
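
One practical difference between replacing pattern by pattern and replacing in a single pass can be sketched like this. The two-entry dictionary is hypothetical, chosen so that one translation ("no") is also a key. It is meant as an illustration of the idea, not as a benchmark or a drop-in replacement.

```php
<?php
// Hypothetical dictionary in which one translation ("no") is also a key.
$translations = ['on' => 'no', 'no' => 'não'];

// Sequential replacement: the second pattern rescans the output of the first.
$patterns = [];
foreach (array_keys($translations) as $word) {
    $patterns[] = '/\b' . preg_quote($word, '/') . '\b/u';
}
$sequential = preg_replace($patterns, array_values($translations), 'on');

// Single pass: each word is looked up once and replaced text is never rescanned.
$singlePass = preg_replace_callback(
    '/\b(?:on|no)\b/u',
    function ($m) use ($translations) {
        return $translations[$m[0]];
    },
    'on'
);

echo $sequential, "\n";  // não
echo $singlePass, "\n";  // no
```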

This method:

  • does only 1 pass through $myVar, not 1 pass for every subarray of $myWords.
  • does not bother with sorting the lookup array ($myWords/$translations).
  • does not bother with regex escaping (preg_quote()) or making pattern components literal (\Q..\E) because only words are being translated.
  • uses word boundaries so that only complete word matches are replaced.
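  • groups the alternatives between word boundaries; an atomic group (?>...) is a micro-optimization that denies backtracking, but it stays accurate only when no alternative is a prefix of a later one ("my" listed before "myth" would stop "myth" from ever matching), so a non-capturing group (?:...) or longest-first ordering is the safer choice.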
  • declares an $encoding value for stability / maintainability / re-usability.
  • matches with case-insensitivity but replaces with case-sensitivity ...if the English match is:
    1. All lowercase, so is the replacement
    2. All uppercase (and larger than a single character), so is the replacement
    3. Capitalized (only first character of multi-character string), so is the replacement
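
A caveat on atomic groups combined with alternation, sketched with the "my"/"myth" pair from the word list. The variable names are made up for the example. Once the atomic group has matched "my", it will not go back and try "myth" when the following \b fails.

```php
<?php
// "my" is listed before "myth", as in the word list above.
$atomic       = preg_match('/\b(?>my|myth)\b/i', 'myth');  // atomic group
$nonAtomic    = preg_match('/\b(?:my|myth)\b/i', 'myth');  // non-capturing group
$atomicSorted = preg_match('/\b(?>myth|my)\b/i', 'myth');  // atomic, longer alternative first

var_dump($atomic);        // int(0)
var_dump($nonAtomic);     // int(1)
var_dump($atomicSorted);  // int(1)
```

So an atomic group is only a safe optimization when the alternatives are ordered longest first, or when no alternative is a prefix of another.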
mickmackusa