I'm working on a search engine. I found a well-written PHP function on the web that extracts a list of keywords from a text. The function works perfectly in English. However, when I tried to adapt it for French, I noticed that "é", "è", "à" and all other accented letters are missing from the output.

For example, if the text contains "Hello Héllo", the output is "Hello Hllo".

I guess the issue is somewhere in the following code line:

$text = preg_replace('/[^a-zA-Z0-9 -.]/', '', $text); // only take alphanumerical characters, but keep the spaces and dashes too…

Any ideas? Thanks a lot from France!

The full code is as follows:

function generateKeywordsFromText($text){


// List of words NOT to be included in keywords
  $stopWords = array('à','à demi','à peine','à peu près','absolument','actuellement','ainsi');
  
  $text = preg_replace('/\s\s+/i', '', $text); // replace multiple spaces etc. in the text
  $text = trim($text); // trim any extra spaces at start or end of the text
  $text = preg_replace('/[^a-zA-Z0-9 -.]/', '', $text); // only take alphanumerical characters, but keep the spaces and dashes too…
  $text = strtolower($text); // Make the text lowercase so that the output is lowercase and the whole operation is case-insensitive.

  // Find all words
  preg_match_all('/\b.*?\b/i', $text, $allTheWords);
  $allTheWords = $allTheWords[0];
  
  //Now loop through the whole list and remove smaller or empty words
  foreach ( $allTheWords as $key=>$item ) 
  {
      if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
          unset($allTheWords[$key]);
      }
  }   
  
  // Create array that will later have its index as keyword and value as keyword count.
  $wordCountArr = array();
  
  // Now populate this array with keywords and their occurrence counts
  if ( is_array($allTheWords) ) {
      foreach ( $allTheWords as $key => $val ) {
          $val = strtolower($val);
          if ( isset($wordCountArr[$val]) ) {
              $wordCountArr[$val]++;
          } else {
              $wordCountArr[$val] = 1;
          }
      }
  }
  
  // Sort array by the number of repetitions
  arsort($wordCountArr);
  
  // Keep the first 50 keywords, discard the rest
  $wordCountArr = array_slice($wordCountArr, 0, 50);
  
  // Now generate a space-separated list from the array
  $words="";
  foreach  ($wordCountArr as $key=>$value)
  $words .= " " . $key ;
  
  // Trim the space-separated keyword list and return it
  return trim($words," ");
}
echo $contentkeywords = generateKeywordsFromText("Hello, Héllo");
Wiktor Stribiżew
Hymed Ghenai
  • You probably want `preg_replace('/[^\p{L}0-9 .-]+/u', '', $text)` – Wiktor Stribiżew Jul 11 '20 at 14:33
  • @WiktorStribiżew this function call, `preg_replace('/[^\p{L}0-9 .-]+/u', '', $text)`, only erases the word containing "é". Output = "hello". The problem is still unsolved. Any other ideas? – Hymed Ghenai Jul 11 '20 at 14:55
  • [**The code works**](https://3v4l.org/9LsM9). – Wiktor Stribiżew Jul 11 '20 at 14:57
  • Well, it seems that something elsewhere in the function undoes this preg_replace, then... See the code: [link](https://3v4l.org/YYHVB) – Hymed Ghenai Jul 11 '20 at 15:13
  • Finding all words with `/\b.*?\b/i` is something very exotic, I have never seen this before :) You just need `preg_match_all('/\w+/u', $text, $allTheWords);`. You can also fix the whitespace shrinking with `$text = preg_replace('/\s{2,}/ui', '', $text);`. See [this PHP demo](https://3v4l.org/jdhDi). – Wiktor Stribiżew Jul 11 '20 at 15:29

1 Answer

You need to fix all three regex calls (two preg_replace calls and one preg_match_all call):

$text = preg_replace('/\s{2,}/ui', '', $text); // replace multiple spaces etc. in the text
$text = preg_replace('/[^\p{L}0-9 .-]+/u', '', $text); // only take alphanumerical characters, but keep the spaces and dashes too…
// Find all words
preg_match_all('/\w+/u', $text, $allTheWords);

See the PHP demo
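For reference, here is a minimal sketch of how the whole function might look with the three patterns above applied. It is not the asker's exact code: the function name, the `$limit` parameter, the shortened stop-word list, the use of `mb_strtolower`/`mb_strlen` (which need the mbstring extension) and the single-space replacement for whitespace runs are all my own assumptions, and the input is assumed to be UTF-8.

```php
<?php
// Sketch only: the asker's function with the three patterns from this answer applied.
// Assumptions not in the original: UTF-8 input, the mbstring extension, a single
// space as the whitespace replacement, and the hypothetical $limit parameter.
function generateKeywordsFromTextUtf8(string $text, int $limit = 50): string
{
    // Shortened stop-word list, for illustration
    $stopWords = ['à', 'absolument', 'actuellement', 'ainsi'];

    $text = preg_replace('/\s{2,}/u', ' ', $text);          // collapse runs of whitespace
    $text = trim($text);
    $text = preg_replace('/[^\p{L}0-9 .-]+/u', '', $text);  // keep letters, digits, space, dot, hyphen
    $text = mb_strtolower($text, 'UTF-8');                  // lowercase accented letters as well

    preg_match_all('/\w+/u', $text, $matches);              // Unicode-aware word matching

    $counts = [];
    foreach ($matches[0] as $word) {
        // mb_strlen counts characters rather than bytes
        if (mb_strlen($word, 'UTF-8') <= 3 || in_array($word, $stopWords, true)) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts); // most frequent first
    return implode(' ', array_keys(array_slice($counts, 0, $limit, true)));
}

echo generateKeywordsFromTextUtf8("Hello, Héllo"), "\n";
```

With the sample input both "hello" and "héllo" should survive. The relative order of words that occur equally often depends on the PHP version, since `arsort` is only guaranteed to be stable from PHP 8.0.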

Details

  • '/\s{2,}/ui' - matches any two or more consecutive Unicode whitespace chars
  • '/[^\p{L}0-9 .-]+/u' - matches one or more chars other than any Unicode letter (\p{L}), any ASCII digit (0-9), space, dot or hyphen (note that the - must be placed at the end of the character class, or escaped, to be taken literally; in the original [^a-zA-Z0-9 -.], the " -." part is a range from space to dot)
  • '/\w+/u' - matches all Unicode words, i.e. sequences of one or more letter/digit/underscore chars.
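The points above can be checked with a short script. This is only an illustration: it assumes the file is saved as UTF-8 and that `setlocale()` has not been called, so PCRE uses its built-in character tables. The variable names are mine.

```php
<?php
// Illustration only; assumes UTF-8 source and PCRE's default character tables.
$text = "Hello, Héllo";

// Original class, no /u: " -." is a range (0x20 to 0x2E), so the comma is kept,
// and both bytes of "é" fall outside the class and are removed.
$original = preg_replace('/[^a-zA-Z0-9 -.]/', '', $text);

// Class from this answer: the trailing hyphen is literal, \p{L} with /u keeps "é".
$fixed = preg_replace('/[^\p{L}0-9 .-]+/u', '', $text);

// Word matching: the byte-wise pattern splits "héllo" at the bytes of "é",
// while \w+ with /u returns it as a single word.
preg_match_all('/\b.*?\b/i', 'héllo', $bytewise);
preg_match_all('/\w+/u', 'héllo', $unicode);

var_dump($original);                              // "Hello, Hllo"
var_dump($fixed);                                 // "Hello Héllo"
var_dump(in_array('héllo', $bytewise[0], true));  // false
var_dump($unicode[0]);                            // one element: "héllo"
```

The third check suggests why fixing only the preg_replace call still produced "hello" in the comments: the byte-wise word pattern breaks "héllo" into fragments of at most three bytes, which the `strlen($item) <= 3` filter then discards.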
Wiktor Stribiżew