I am trying to get a count of common phrases from a body of text. I don't just want single words, but rather every run of words that falls between two stop words. For example, for https://en.wikipedia.org/wiki/Wuthering_Heights I would like the phrase "wuthering heights" to be counted, rather than "wuthering" and "heights" separately. This is what I have so far:
    if (in_array($word, $this->stopwords))
    {
        // A stop word ends the current phrase: keep letters and spaces only
        $cleanPhrase = preg_replace("/[^A-Za-z ]/", '', $currentPhrase);
        $cleanPhrase = trim($cleanPhrase);
        if (strlen($cleanPhrase) > 2)
        {
            $this->Phrases[$cleanPhrase] = substr_count($normalisedText, $cleanPhrase);
        }
        // Always start a fresh phrase after a stop word, even if the last one was too short to keep
        $currentPhrase = "";
        continue;
    }
    else
    {
        $currentPhrase = $currentPhrase . $word . " ";
    }
The problem I have with this is that "age" is counted whenever the word "stage" appears, because substr_count also matches substrings inside other words. One solution is to add whitespace to either side of the $cleanPhrase variable, but that then fails when the phrase is not surrounded by whitespace: there could be a comma, full stop or some other punctuation character next to it, and I want to count all of these occurrences too. Is there a way I can do this without having to do something like the following?
    $terminate = array(".", " ", ",", "!", "?");
    $count = 0;
    foreach ($terminate as $tpun)
    {
        $count += substr_count($normalisedText, $tpun . $cleanPhrase . $tpun);
    }
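For comparison, one direction that might avoid listing every punctuation character is a regular expression with word boundaries (\b). The following is only a minimal sketch under some assumptions: countWholePhrase is a hypothetical helper name that is not part of the class above, the text is assumed to be already normalised to the same case as the phrase, and phrases are assumed to contain only letters and spaces (as $cleanPhrase does after the preg_replace).

```php
<?php
// Hypothetical helper (not from the original class): counts occurrences of a
// phrase only where it starts and ends on a word boundary.
function countWholePhrase(string $text, string $phrase): int
{
    // preg_quote() escapes any regex metacharacters in the phrase.
    // \b matches between a word character and a non-word character
    // (space, comma, full stop, ...) or at the start/end of the text.
    $pattern = '/\b' . preg_quote($phrase, '/') . '\b/';

    // preg_match_all() returns the number of full matches, or false on error.
    $result = preg_match_all($pattern, $text);
    return $result === false ? 0 : $result;
}

// Assumed sample text, already lower-cased.
$normalisedText = "the stage was set. age, they say, is just a number! what age?";

echo countWholePhrase($normalisedText, "age"), "\n";   // prints 2
echo countWholePhrase($normalisedText, "stage"), "\n"; // prints 1
```

With this approach "age" inside "stage" is not counted, while "age," and "age?" are. Note that \b treats digits and underscores as word characters, so a phrase directly followed by a digit would not match. If the text is not case-normalised, adding the i modifier to the pattern is one option.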