1

I have a text in which I would like to calculate occurences of the phrase "lorem ipsum dolor".

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.

The algorithm should be counting occurrences even if the searching phrase is written in different order. I've highlighted expected results. Is there any better way to achieve that than using regular expression with every possible combination?

In this case the result should be equal to 3

  • Lorem ipsum dolor
  • Ipsum lorem dolor
  • Dolor ipsum lorem

The phrase will have about 3-4 words and string will be a content of web page.

Mateusz Nowak
  • 4,021
  • 2
  • 25
  • 37

4 Answers4

2

You could try the regex:

/(?:(?:(?:lorem|ipsum|dolor)\s?)+)/gi

with preg_match_all and then count the number of matches. From your sample, you should get 3 matches.


I'm not too good at algorithms nor at PHP, but an attempt...

<?php

$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

$lower_string = strtolower($string);

$text = array('lorem', 'ipsum', 'dolor');

$perms = AllPermutations($text);
$result = 0;
foreach ($perms as $piece) {
    $phrase = join(' ', $piece);
    $result += substr_count($lower_string, $phrase);
}

# From http://stackoverflow.com/a/12749950/1578604
function AllPermutations($InArray, $InProcessedArray = array())
{
    $ReturnArray = array();
    foreach($InArray as $Key=>$value)
    {
        $CopyArray = $InProcessedArray;
        $CopyArray[$Key] = $value;
        $TempArray = array_diff_key($InArray, $CopyArray);
        if (count($TempArray) == 0)
        {
            $ReturnArray[] = $CopyArray;
        }
        else
        {
            $ReturnArray = array_merge($ReturnArray, AllPermutations($TempArray, $CopyArray));
        }
    }
    return $ReturnArray;
}

echo $result;
?>

ideone demo

Jerry
  • 70,495
  • 13
  • 100
  • 144
  • 2
    OP mentioned `Is there any better way to achieve that than using regular expression with every possible combination?` – Shankar Narayana Damodaran Jan 06 '14 at 18:38
  • 1
    @ShankarDamodaran Huh, seems I didn't pay enough attention. I find it strange that they tagged the question with regex though. – Jerry Jan 06 '14 at 18:39
  • Look at my answer for a php build-in solution. – Anyone Jan 06 '14 at 18:46
  • @Anyone Your answer is not a proper answer. It's a link-only answer. – Jerry Jan 06 '14 at 18:47
  • It's pretty much all the OP needs, but I've updated it with examples (which only take a click). This was the first result I found on google btw. – Anyone Jan 06 '14 at 18:48
  • I will update this, now that the OP has formulated the question better, this case yields the proper results. Not sure if it's faster than matching against all combinations, but sure is easier to write! – Anyone Jan 06 '14 at 18:56
  • 1
    @ShankarDamodaran Updated answer to include non-regex solution. – Jerry Jan 06 '14 at 19:20
  • Guys, check out my answer with the memory consumption chart too. – Ali Gajani Jan 06 '14 at 22:21
  • @Jerry, +1 for the answer. – Shankar Narayana Damodaran Jan 07 '14 at 05:29
  • @AliGajani I don't get how that `getrusage` works exactly (my code gives the memory usage roughly [in between yours and Mark](http://ideone.com/dHkBJ8)), but if you time your function, I had to make [250 iterations](http://ideone.com/G717JK) (the IDE couldn't support taking 1000) and it gives an average of 0.00011 sec/run, while with my code, I can do [1000 iterations](http://ideone.com/OD3TPg) and it gives an average of 0.00005 sec/run (twice as fast as yours). – Jerry Jan 07 '14 at 06:23
2
$haystack = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';
$needle = 'Lorem ipsum dolor';

$hayWords = str_word_count(
    strtolower($haystack), 
    1
);
$needleWords = str_word_count(
    strtolower($needle), 
    1
);
$needleWordsCount = count($needleWords);

$foundWords = array_intersect(
    $hayWords, 
    $needleWords
);

$count = array_reduce(
    array_keys($foundWords),
    function($counter, $item) use ($foundWords, $needleWordsCount) {
        for($i = $item; $i < $item + $needleWordsCount; ++$i) {
            if (!isset($foundWords[$i]))
                return $counter;
        }
        return ++$counter;
    },
    0
);

var_dump($count);
Mark Baker
  • 209,507
  • 32
  • 346
  • 385
  • 1
    Works great until you get lorem ipsum dolor dolor - but then we are just getting silly! +1 this answer is as elegant as this is going to get – GrahamTheDev Jan 06 '14 at 19:20
  • @GrahamRitchie: Mine works with lorem ipsum dolor dolor as well :) – Ali Gajani Jan 06 '14 at 22:26
  • 2
    This code will return a count of 4 with "lorem ipsum dolor dolor" which is technically wrong: it counts both the "lorem ipsum dolor" and the "ipsum dolor dolor", the latter of which isn't valid (although "lorem ipsum dolor lorem" would be correctly counted as 2)... but the likelihood of those combination are fairly slim – Mark Baker Jan 06 '14 at 22:33
1

I think you are looking for this: http://nl1.php.net/substr_count

$text = 'This is a test';
echo strlen($text); // 14

echo substr_count($text, 'is'); // 2

// the string is reduced to 's is a test', so it prints 1
echo substr_count($text, 'is', 3);

// the text is reduced to 's i', so it prints 0
echo substr_count($text, 'is', 3, 3);

// generates a warning because 5+10 > 14
echo substr_count($text, 'is', 5, 10);


// prints only 1, because it doesn't count overlapped substrings
$text2 = 'gcdgcdgcd';
echo substr_count($text2, 'gcdgcd');
Anyone
  • 2,814
  • 1
  • 22
  • 27
  • This will return only 1 result either. "The algorithm should be counting occurrences even if the searching phrase is written in different order." – Mateusz Nowak Jan 06 '14 at 18:52
  • Additionally you could put both the needle and haystack through strtolower (or upper if you please) and compare "case-insensitive" – Anyone Jan 06 '14 at 18:52
  • @estshy, if I understand correctly you want to match the 3 words differently in a row? – Anyone Jan 06 '14 at 18:54
  • @Anyone Please take a look at my question, I've added some information that may help you better undestand the problem. – Mateusz Nowak Jan 06 '14 at 18:56
  • 1
    Yes I see. In that case I would recommend Jerry's answer, good luck! – Anyone Jan 06 '14 at 18:57
1

Note: Works with "Lorem ipsum dolor dolor" as well.

Good evening everyone. I have figured out another technique. This one uses a varying approach to what Mark Baker did, which I appreciate very much. Also, go down to see memory usage.

In a nutshell, it takes basic string (lorem ipsum dolor) that needs to be matched, which is then shuffled into all possible combinations (in this case 3! = 6).

Further,all those 6 combination of strings are then added into array which is used to do the matching substr_count. I am also using shuffle(), in_array and array_push.

The code is self explanatory and if you are curious, here's my IDEONE. This is Mark Baker's solution on IDEONE. They both take the same amount of time and memory, and my solution is 4 lines shorter, if not more elegant :)

<?php

    $string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

//convert main string to lowercase to have an even playing field
    $string2 = strtolower($string);
    $substring = 'lorem ipsum dolor';

//add the first lorem ipsum dolor to launch the array 
    $arr = array($substring);

//run until the array is full with all possible combinations i.e. 6 (factorial of 3)
    for ($i=0; $i<=20; $i++) {
        $wordArray = explode(" ",$substring);
        shuffle($wordArray);
        $randString= implode(" ",$wordArray);

//if random string isn't in the array, then only you push the new value 
        while (! (in_array($randString,$arr)) ) {
            array_push($arr,$randString);
        }

    }

//var_dump($arr);

//here, we do the matching, and this is pretty self explanatory
    $n = sizeof($arr);
    for ($q=0; $q<=$n; $q++) {
        $sum += substr_count($string2,$arr[$q]);
    }

    echo "Total occurances: ".$sum;

?>

Memory Usage

As you can already see, Mark's code ups me on +2 occasions, but the difference is very negligible due to the nature of this programme, and the data complexity associated. Obviously, the difference could be big given the program's complexity, but this is what it is.

enter image description here

Ali Gajani
  • 14,762
  • 12
  • 59
  • 100
  • However, your code does give notices and warnings, and your line count approach is rather disingenuous – Mark Baker Jan 06 '14 at 22:35
  • Yes, those are minor warnings which can be fixed. I just wanted to demonstrate that it could be done in an alternative way and I spent a considerable amount of time getting my head around to it, hence thought of sharing. Also, I am not counting lines? – Ali Gajani Jan 06 '14 at 22:38
  • 1
    Since when does less lines mean better? – Jerry Jan 07 '14 at 06:25