I am trying to get a count of common phrases from a body of text. I don't just want single words, but rather every series of words that falls between stop words. For example, for https://en.wikipedia.org/wiki/Wuthering_Heights I would like the phrase "wuthering heights" to be counted rather than "wuthering" and "heights" separately.

if (in_array($word, $this->stopwords))
{
    $cleanPhrase = preg_replace("/[^A-Za-z ]/", '', $currentPhrase);
    $cleanPhrase = trim($cleanPhrase);
    if($cleanPhrase != "" && strlen($cleanPhrase) > 2)
    {
        $this->Phrases[$cleanPhrase] = substr_count($normalisedText, $cleanPhrase);
        $currentPhrase = "";
    }
    continue;
}
else
{
    $currentPhrase = $currentPhrase . $word . " ";
}

The problem I have with this is that "age" is counted when the word "stage" appears. One solution is to add whitespace to either side of the $cleanPhrase variable, but that fails when the phrase is not surrounded by whitespace: there could be a comma, full stop or some other punctuation character next to it. I want to count all of these. Is there a way to do this without having to do something like this?

$terminate = array(".", " ", ",", "!", "?");
$count = 0;
foreach($terminate as $tpun)
{
    $count += substr_count($normalisedText, $tpun . $cleanPhrase . $tpun);
}
Dan Hastings
  • Wow, a frequency counter in PHP. I have a very similar project. Although mine is written in C++, the idea of a solution may be the same. When parsing the text, I use spaces, new lines and tabs as delimiters, and then for each single word I determine whether it carries some other punctuation (commas, etc.), remember it if the word does, and then form phrases based on the remembered punctuation. It is actually a bit more complex, but the main idea was to remember the punctuation for each word separately. Although yes, it leads to a lot of `if ... else if ...` statements. – Eugene Anisiutkin Mar 19 '20 at 10:57

1 Answer

By using this answer with a slight modification, you can do this:

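$sentence = "Age: In this day and age, people of all age are on the stage.";
$word = 'age';
// preg_quote() escapes any regex metacharacters in $word ('/' is the delimiter)
preg_match_all('/\b' . preg_quote($word, '/') . '\b/i', $sentence, $matches);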

\b matches a word boundary, so this pattern gives a count of 3 when searching for age in the sentence above: "Age" and the two standalone occurrences of "age", but not the "age" inside "stage". The i flag makes the pattern case-insensitive; you can remove it if you want to match case as well.

If you're only going to match on one phrase at a time, you'll find your count in count($matches[0]).
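
As a rough sketch of how this could be wrapped up for reuse, the function below counts whole-word occurrences of a phrase. The name countPhrase is hypothetical and not part of the question's class, and the sketch assumes phrases consist only of letters and spaces (as $cleanPhrase does after the preg_replace in the question), so that \b behaves as expected at both ends. It relies on preg_match_all returning the number of full matches, which avoids the need for $matches when only the count is wanted.

```php
<?php
// Hypothetical helper (not from the original posts): counts whole-word,
// case-insensitive occurrences of $phrase in $text.
function countPhrase(string $text, string $phrase): int
{
    // preg_quote() escapes regex metacharacters in the phrase;
    // '/' is passed because it is the pattern delimiter used here.
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/i';

    // preg_match_all() returns the number of full pattern matches,
    // or false if an error occurred.
    $count = preg_match_all($pattern, $text);

    return $count === false ? 0 : $count;
}

$sentence = "Age: In this day and age, people of all age are on the stage.";
echo countPhrase($sentence, 'age'), "\n"; // prints 3
```

In the loop from the question, the substr_count($normalisedText, $cleanPhrase) call could then presumably be swapped for countPhrase($normalisedText, $cleanPhrase), with no need to list terminating punctuation by hand.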

El_Vanja