1

I've assembled this little algorithm that is able to create a list of all possible arrangments of letters and underscores (tea -> te_ -> t_a -> _ea -> t__ ->...) of a set of words. It then compares the arrangments and finds the unique ones (tee & tea => [tea, tee, t_a, t_e, _ea, _ee, __a, __e]). My problem is that for larger words the program is using way too much memory (I'm creating a tree of all the possible combinations after all). I was thinking of creating each words' tree simultaneously and then unsetting the repetitions as it goes along, but I'm not sure how that would be done.

Here's the part of my code that has very poor memory-allocation with the words "tea" and "tee" as test cases:

$word1 = "tea tee";
$word1 = preg_split('/ +/', $word1);
$word1 = array2d($word1);
$word1 = get_multipleCombinations($word1);
$word1 = compare_combinations($word1);

foreach($word1 as $key1=>$level1){
    foreach($level1 as $key2=>$level2){
        foreach($level2 as $key3=>$level3){
            echo $word1[$key1][$key2][$key3];       
    }

    echo " ";
}

}

function array2d($words){
    $count = count($words);

    for ($i = 0; $i <= $count-1; $i++) {

        $words[$i] = str_split($words[$i]);
    }
    for ($i = 0; $i <= $count-1; $i++) {

        for ($j = 0; $j <= count($words[$i])-1; $j++){

            $words[$i][$j] = array("_", $words[$i][$j]);
        }
    }
    return $words;
}

function get_combinations($arrays) {
    $result = array(array());
    foreach ($arrays as $key => $values) {
        $temp = array();
        foreach ($result as $results) {
            foreach ($values as $value) {
                $temp[] = array_merge($results, array($key => $value));
            }
        }
        $result = $temp;
    }
    return $result;
}

function get_multipleCombinations($array){
    $count0 = count($array)-1;

    for ($i = 0; $i <= $count0; $i++){
        $result[$i] = get_combinations($array[$i]);
    }
    return($result);

}

function compare_combinations($array){
    $count = count($array)-1;

    for($j = 0; $j <= $count; $j++){
        for($z = 0; $z <= $count; $z++){
            if($j !== $z){
                for($i = 0; $i <= count($array[$j])-1; $i++){
                    if(count($array[$j]) === count($array[$z]) && $array[$j][$i] === $array[$z][$i]){
                        $array[$j][$i] = array("");
                        $array[$z][$i] = array("");
                    }
                }
            }
        }
    }
    return($array);

}

Feel free to criticize any dumb parts of the code/question, I'm new to this and it would be very helpful.

Thank you for your time.

1 Answers1

0

The first topic I would like to bring up is the idea of what is called Big O Notation. I won't bother butchering an explanation about it, so here is a link: What is a plain English explanation of "Big O" notation?

What I saw right away were all the nested for loops, which heavily impacts performance. If possible, it is recommended to write recursive functions as opposed to the nested for statements (and less loops in general).

The following is some code that may better guide you to using recursiveness as opposed to the nested for loops.

function getWordCombos(String $word) {
    $chars = str_split($word);
    return insertHyphens($chars);
}

function insertHyphens(Array $chars, $vals=[], $index=0) {
    //make an untarnished copy of original chars
    $original = $chars;

    for($i=$index; $i<count($original); $i++) {
        $chars = $original;

        //this alters $chars itself, which is why we set $chars = $original above
        array_splice($chars, $i, 1, '_');
        $tempWord = implode('', $chars);

        //setting and checking keys in array faster than in_array
        if (!isset($vals[$tempWord])) {
            $vals[$tempWord]=1;
        }

        //recursive bit
        $vals = insertHyphens($chars, $vals, $index + 1);
    }
    return $vals;
}
$words = "tea tee tail team";

$words = preg_split('/ +/', $words);
$unique = array();
foreach($words as $word) {
    $unique = array_merge($unique, getWordCombos($word));
}

var_dump(implode(' ', array_keys($unique)));

I ended up putting each combination of the word as an array key, instead of sticking it in the array, for performance reasons what is faster: in_array or isset?

It surprised me to find out that using if(!isset($vals[...])) runs quicker than if(!$vals[...])

Doubling the amount of words doubles the amount of time it takes, 2n. Increasing char length also impacts amount of time. When I put in some test 4 char and 5 char words:

4 char words 12 seconds for 10000 iterations && 20 words

5 char words 56 seconds for 10000 iterations && 20 words

Which is a little more than 4.5x as long increasing increasing the char count of the 20 words. Granted, that is over 10000 iterations, I'm not sure how many words you will be dealing with, or the variable lengths of the words themselves.

Hope this helps you along!

Community
  • 1
  • 1
WNP
  • 144
  • 1
  • 9
  • Oh wow, I really butchered that. However, this solution doesn't take into account combinations where underscores aren't joined together (like t_i_). The thing is that each word has 2^n combinations so, by nature, a lot of combinations end up being stored. What I was thinking of doing was generating the trees of all the words in the same loop and adding the verification inside the loop. So it would delete the excess combinations before generating them all. Also, there might be a way to find the combinations without comparing trees since we already know the letters with the same position – TryTryAgain May 29 '18 at 16:19
  • Also, I'm thinking it might not even be necessary to generate them all, it should stop as soon as it has found a unique combination of every word. (Since it's generating from most underscores to least.) – TryTryAgain May 29 '18 at 16:39
  • Ahh yes, I did not try 4 character words in the test. I will update my answer in a moment with a more robust solution. Also altered the way variables are stored in the array.. and did some performance checks.. – WNP May 29 '18 at 18:36