How can I separate glued-together strings with a space, using the keys of an array to detect that they are glued?

Glued words: sisteralannis, goodplace (they should become: sister alannis, good place)

Note: both start with existing keys in the array (sister, good), but they are not exact matches for those keys, so no replacement occurs. I need to split them apart so that the replacement becomes possible in the next step of the script. Another solution would be to remove everything that does not exactly match a key in $myWords.

This code replaces strings. I would like to improve it so that it checks whether words are glued together and adds a space between them:

$myVar = "my sisteralannis is not that blonde, here is a goodplace";
$myWords=array(
    array("is","é"),
    array("on","no"),
    array("that","aquela"),
    array("sister","irmã"), 
    array("my","minha"),
    array("myth","mito"),
    array("he","ele"),
    array("good","bom"),
    array("ace","perito")
); 
usort($myWords,function($a,$b){return mb_strlen($b[0])<=>mb_strlen($a[0]);});  // sort subarrays by first column multibyte length
// remove mb_ if first column holds no multi-byte characters.  strlen() is much faster.

foreach($myWords as &$words){
    $words[0]='/\b'.$words[0].'\b/ui';  // generate patterns using search word, word boundaries, and case-insensitivity
}

$myVar=preg_replace(array_column($myWords,0),array_column($myWords,1),$myVar);
 //APPLY SECOND SOLUTION HERE

echo $myVar;

Expected Output: minha irmã alannis é not aquela blonde, here é a bom place.

=================

Second, simpler solution: match $myVar against $myWords and delete anything that does not exist in $myWords.

In other words, delete every word in the variable that is not found in the array.

output: minha é aquela, é

  • How do you know to convert `goodplace` to `bom place`, but not convert `here` to `ele re`? In other words, what makes `goodplace` a "glued" word but `here` not, when both have their starting portion in the `$myWords` list? – salathe Dec 15 '17 at 00:04
  • `strtr()` will mangle the string. https://stackoverflow.com/a/47255104/2943403 @salathe is correct, this cannot be done logically. Ariane, please make this task solvable. – mickmackusa Dec 15 '17 at 00:10
  • I think I did not quite understand. If it is not possible to do this, something else would solve it: check the array and delete from the variable everything that does not exist in the array. **output:** `minha é aquela, é`. That is, search for matches between $myVar and $myWords and delete whatever does not exist in $myWords. – Ariane Martins Gomes Do Rego Dec 15 '17 at 00:16
  • This code works perfectly (thanks mickmackusa), but I did not count on so many words being glued together by the typos of the thousands of users of the system, so I need to separate those words before doing the replacement, or else delete everything that does not match $myWords. – Ariane Martins Gomes Do Rego Dec 15 '17 at 00:29
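
The ambiguity raised in the comments can be reproduced with a naive prefix splitter. The sketch below is only an illustration: the function name `splitOnKnownPrefix` is hypothetical, and it assumes a word counts as "glued" whenever it starts with a key and continues with more letters. Under that assumption `here` gets split just like `goodplace`.

```php
<?php
// Hypothetical helper (not from the question): insert a space after a key
// whenever a word starts with that key and continues with more letters.
function splitOnKnownPrefix(string $text, array $keys): string
{
    // Longest keys first, so "myth" is tried before "my".
    usort($keys, function ($a, $b) { return strlen($b) <=> strlen($a); });
    $alternation = implode('|', array_map(function ($k) { return preg_quote($k, '/'); }, $keys));
    // \b(key)(?=[a-z]) : a key at the start of a word, followed by at least one more letter.
    return preg_replace('/\b(' . $alternation . ')(?=[a-z])/i', '$1 ', $text);
}

$keys = ["is", "on", "that", "sister", "my", "myth", "he", "good", "ace"];
echo splitOnKnownPrefix("my sisteralannis is not that blonde, here is a goodplace", $keys);
// prints: my sister alannis is not that blonde, he re is a good place
```

The unwanted `he re` shows why the key list alone cannot decide which words are glued; some extra rule or a dictionary of valid words would be needed.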

1 Answer

I wouldn't say that I am 100% confident that this will handle all possible scenarios, but it does work for your input string and I did build it to accommodate words with the first letter capitalized. Beyond that, there are probably some fringe cases that will call for some tweaking.

There are some inline explanations to help with code comprehension.

Code:

$myVar = "My sisteralannis is not that blonde, here is a goodplace";
$myWords=[["is","é"],["on","no"],["that","aquela"],["sister","irmã"],["my","minha"],
          ["myth","mito"],["he","ele"],["good","bom"],["ace","perito"]];
usort($myWords,function($a,$b){return strlen($b[0])<=>strlen($a[0]);});  // longer English words before shorter
$search=array_column($myWords,0);  // cache for multiple future uses

//input: "My sisteralannis is not that blonde, here is a goodplace";
//filter: ++ ------------- ++ --- ++++ ------  ---- ++ - ---------
//output: Minha            é      aquela     ,      é

$disqualifying_pattern='/ ?\b(?>'.implode('|',$search).')\b(*SKIP)(*FAIL)| ?[a-z]+/i';  // this handles the spaces for the sample input, might not work for all cases
//echo $disqualifying_pattern,"\n";
$filtered=preg_replace($disqualifying_pattern,'',$myVar);
//echo $filtered,"\n";

$patterns=array_map(function($v){return '/\b'.$v.'\b/i';},$search);
$replace=array_column($myWords,1);
echo preg_replace_callback(
        $patterns,
        function($m)use($patterns,$replace){
            $new=preg_replace($patterns,$replace,$m[0],1); // tell it to stop after replacing once
            if(ctype_upper($m[0][0])){  // if first letter of English word is uppercase
                $mb_ucfirst=mb_strtoupper(mb_substr($new,0,1));  // target and make upper, first letter of Portuguese word
                return $mb_ucfirst.mb_substr($new, 1); // apply new uppercase letter to the rest of the Portuguese word
            }
            return $new;
        },
        $filtered
    );

Output:

Minha é aquela, é
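
The filtering step relies on the PCRE backtracking verbs `(*SKIP)(*FAIL)`: a whole word from the list is matched first and then deliberately discarded, so only the remaining words reach the second alternative and get removed. Here is a reduced sketch of that idea, using a hypothetical two-word list `$keep` and the same pattern shape as above:

```php
<?php
$keep = ["that", "is"];  // hypothetical whitelist, longer word first
// Left branch: a whitelisted whole word is consumed, then skipped (never replaced).
// Right branch: any other run of letters, with an optional leading space, is matched.
$pattern = '/ ?\b(?>' . implode('|', $keep) . ')\b(*SKIP)(*FAIL)| ?[a-z]+/i';
$filtered = preg_replace($pattern, '', "this is not that blonde");
// trim() drops the space left behind when the first word of the input is removed.
echo trim($filtered);  // prints: is that
```

Note that `this` is removed even though it contains `is`, because the word boundaries only let whole words qualify for the skip.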
mickmackusa