How can I get the correct position of a word in a UTF-8 text?

Question

I have some simple PHP code that extracts a sentence from a text and bolds a specific word.

First, I build an array of the words I want along with their positions in the text.

$all_words = str_word_count($text, 2, 'åæéø');

// $words is an array of the words that I want to find.
$words_found = array();
foreach ($all_words as $pos => $word_found) {
  foreach ($words as $word) {
    if ($word == strtolower($word_found)) {
      $words_found[$pos] = $word_found;
      break;
    }
  }
}

Then, for every word in $words_found I get a portion of the text with the word in the middle.

$length = 90;
foreach ($words_found as $offset => $word) {
  $word_length = strlen($word);

  $start = $offset - $length;
  $last_start = $start + $length + $word_length;

  $first_part = substr($text, $start, $length);
  $last_part = substr($text, $last_start, $length);

  $sentence = $first_part . '<b>' . $word . '</b>' . $last_part;
}

It works fine, except that the text is UTF-8 with Danish characters (åæéø). When $first_part or $last_part starts with one of these multi-byte characters, the substr() result is empty.

I know about the mb_substr function, so I changed my code to use it.

$word_length = mb_strlen($word, 'UTF-8');
$first_part = mb_substr($text, $start, $length, 'UTF-8');
$last_part = mb_substr($text, $last_start, $length, 'UTF-8');

But with mb_substr the position of the word ($offset) is wrong, and the new substring ($sentence) doesn't line up as it should.

Is there something like mb_str_word_count? How can I get the correct position of the words?

Have you tried stripos()? http://www.php.net/manual/en/function.stripos.php — Mario Radomanana, Feb 04 '14 at 12:42
Possible duplicate: [Is PHP str_word_count() multibyte safe?](http://stackoverflow.com/questions/8290537/is-php-str-word-count-multibyte-safe) — Victor Bocharsky, Feb 04 '14 at 12:42
@MarioJohnathan `stripos()` doesn't work because it matches even when the search word is a substring of another word. — ilazgo, Feb 04 '14 at 13:27

Mario Radomanana · Answer 1 · 2014-02-05T12:25:34.870

2

Try using a regex with word boundaries:

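$string = 'That this notpink a or pink blue red dark.';
// The u modifier makes PCRE treat the pattern and the subject as UTF-8.
$regex = '/\bpink\b/u';
// PREG_OFFSET_CAPTURE returns each match as [matched text, byte offset].
if (preg_match($regex, $string, $match, PREG_OFFSET_CAPTURE)) {
    $pos = $match[0][1];
    echo $pos;
}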
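
For text with Danish characters, here is a minimal sketch of the same idea. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the sample sentence and the variable names are made up for illustration. Note that the offset from PREG_OFFSET_CAPTURE is counted in bytes, so it has to be converted before it is passed to mb_substr.

```php
<?php
// Hypothetical sample text (UTF-8) and search word.
$text = 'Jeg købte en rød bil og en blå cykel.';
$word = 'rød';

// Lookarounds on \p{L} serve as word boundaries that know about letters
// such as å, æ, é and ø; the u modifier enables UTF-8 handling.
$regex = '/(?<!\p{L})' . preg_quote($word, '/') . '(?!\p{L})/iu';

$byte_pos = null;
$char_pos = null;
if (preg_match($regex, $text, $match, PREG_OFFSET_CAPTURE)) {
    // Offset in bytes, usable with substr().
    $byte_pos = $match[0][1];
    // Offset in characters, usable with mb_substr().
    $char_pos = mb_strlen(substr($text, 0, $byte_pos), 'UTF-8');
}
```

Here $byte_pos and $char_pos differ by one because of the two-byte "ø" in "købte".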

Edit:

If you don't want to use regex, you can match the word with stripos by including the surrounding spaces:

if(stripos($string, 'pink ') === 0)
    $pos = 0;
else if(stripos($string, ' pink') !== false)
    $pos = stripos($string, ' pink') + 1;
else
    $pos = stripos($string, ' pink ') + 1;

edited Feb 05 '14 at 12:25

answered Feb 04 '14 at 13:45


But I think regex won't perform well in this case if the text is really big and there are a lot of words to search for. – ilazgo Feb 04 '14 at 13:48
I edited my answer to use stripos with spaces to match the word – Mario Radomanana Feb 04 '14 at 13:55
Your non-regex code doesn't work: if the text is, for example, "Your pink is so pinky.", it will match both "pink" and "pinky". – ilazgo Feb 05 '14 at 12:18

score 1 · Accepted Answer · answered Feb 05 '14 at 12:25

I tried the solution by @Mario Johnathan, but it didn't work properly for me.

In the end I found a solution on my own: I keep the non-multi-byte functions such as substr and the position given by str_word_count, and adjust the first substring if its first character is a Danish character.

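// Split the trimmed first part into single bytes.
$first_part_aux = str_split(trim($first_part));

// If the first byte is not a letter for ctype_alpha (e.g. it belongs to a
// multi-byte character), move the start forward to the next such letter.
if (!ctype_alpha($first_part_aux[0])) {
  for ($i = 1; $i < count($first_part_aux); $i++) {
    if (ctype_alpha($first_part_aux[$i])) {
      $start = $start + $i;
      $length = $length - $i;

      // Cut the first part again from the adjusted position.
      $first_part = substr($text, $start, $length);

      break;
    }
  }
}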
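
A different approach, as a rough sketch rather than the code above: convert the byte offset into a character offset and then use mb_substr for both parts. This assumes the keys returned by str_word_count are byte offsets, that the text is valid UTF-8, and that mbstring is available. excerpt_around() is a hypothetical helper name, and strpos() only stands in for the offset lookup.

```php
<?php
// Hypothetical helper: builds an excerpt with the word in bold,
// given the word's byte offset in a UTF-8 text.
function excerpt_around(string $text, int $byte_offset, string $word, int $length = 90): string
{
    // Count the characters that precede the word to get a character offset.
    $char_offset = mb_strlen(substr($text, 0, $byte_offset), 'UTF-8');
    $word_length = mb_strlen($word, 'UTF-8');

    // Clamp at 0 so a word near the beginning does not yield a negative start.
    $start = max(0, $char_offset - $length);

    $first_part = mb_substr($text, $start, $char_offset - $start, 'UTF-8');
    $last_part  = mb_substr($text, $char_offset + $word_length, $length, 'UTF-8');

    return $first_part . '<b>' . $word . '</b>' . $last_part;
}

// Made-up sample text; strpos() returns a byte offset.
$text = 'Jeg købte en rød bil og en blå cykel.';
$sentence = excerpt_around($text, strpos($text, 'rød'), 'rød', 5);
```

Because every cut is made on character boundaries, neither part can start or end in the middle of a multi-byte character.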
