0

In written Hebrew, vowels are indicated by marks called niqqud rather than by full letters. In English "a e i o u" are letters; in Hebrew the vowels are marks, mostly placed under the letters. For example, in נִקּוּד there is a dot for "i" under the first letter (נִ) (Hebrew is read right-to-left). Each mark is a character but not a letter.

I am trying to get the last 2 letters (not characters) of any Hebrew word. The problem is that substr() and mb_substr() count each vowel mark as a full character, so they don't give me the last 2 letters. What can I do?

Here is my code:

<?php
    $array = array('סָאוּנְדּמֶן','לֵיְמֶן','דֹּמֶן','דּוֹרְמֶן','אחמד','בןהמלך');
    $dynamicstring = 'שֶׁמֶן';
    $word_strlen = strlen($dynamicstring);
    $newstring = substr($dynamicstring, -4);

    echo strlen($dynamicstring);
    echo '<br>';
    echo htmlspecialchars($newstring);
?>
CJ Dennis
Amanda

3 Answers

2

You should use mb_substr(). Make sure you also check the following:

  • HTML document set to same charset
  • Database connection to insert data set to the same charset
  • Database table set to the same charset
  • Database connection to fetch data set to the same charset

For Hebrew you should use UTF-8 as the charset.

This should be the correct code:

<?php
    $array = array('סָאוּנְדּמֶן','לֵיְמֶן','דֹּמֶן','דּוֹרְמֶן','אחמד','בןהמלך');
    $dynamicstring = 'שֶׁמֶן';
    $word_strlen = mb_strlen($dynamicstring, 'UTF-8');
    $newstring = mb_substr($dynamicstring, ($word_strlen-2), $word_strlen, 'UTF-8');

    echo mb_strlen($dynamicstring);
    echo '<br>';
    echo htmlspecialchars($newstring);
?>
John T
  • Again, as I said in another comment, I am already using this. I'm not using SQL, so that's not the problem. I tried what you suggest with mb_substr and it didn't work. I need the **last** 2 characters – Amanda Jul 24 '18 at 07:25
0

Use mb_substr($string, 0, 3, 'UTF-8'), and instead of UTF-8 specify the correct encoding for Hebrew.

  • I have tried this and it doesn't work (additionally, I need the last 2 characters, so it would need to be: $string, -4, 0, 'UTF-8') – Amanda Jul 24 '18 at 07:08
0

Neither substr() nor mb_substr() knows the difference between marks and letters. substr() simply counts bytes and mb_substr() counts code points. Since a mark and a letter each occupy a single code point, there's no way for mb_substr() to distinguish between them.
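To make the difference concrete, here is a small sketch that counts the same word three ways. It assumes PHP 7 or later (for the "\u{...}" escapes), the mbstring extension, and a PCRE build with Unicode support. The word is spelled out code point by code point so the counts do not depend on how the source file was saved; the variable names are only illustrative.

```php
<?php
// The word from the question, built explicitly:
// shin, segol, shin dot, mem, segol, final nun.
$word = "\u{05E9}\u{05B6}\u{05C1}\u{05DE}\u{05B6}\u{05DF}";

$bytes      = strlen($word);                  // 12: each of these code points is 2 bytes in UTF-8
$codepoints = mb_strlen($word, 'UTF-8');      // 6: letters and marks are counted alike
$clusters   = preg_match_all('/\X/u', $word); // 3: each letter together with its marks

// mb_substr() returns the last two code points:
// the segol mark followed by the final nun, not two letters.
$tail = mb_substr($word, -2, null, 'UTF-8');

printf("%d bytes, %d code points, %d letters\n", $bytes, $codepoints, $clusters);
```

So asking mb_substr() for the last 2 "characters" returns one vowel mark and one letter, which matches what the question describes.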

What you can do instead is use a regular expression:

if (preg_match('/\X\X$/u', $dynamicstring, $match)) {
    $newstring = $match[0];
}

or

$newstring = preg_replace('/^.*?(\X\X)$/us', '$1', $dynamicstring);

Using either of these in your program outputs the last two letters of שֶׁמֶן:

מֶן

Each \X matches one extended grapheme cluster, which here means a letter plus all immediately following marks. The /u option at the end of the expression switches on Unicode mode (UTF-8); without it the string is processed byte by byte and Hebrew code points won't be recognised.

If you want to use the single-line preg_replace() version, you must add ^.*? to the start of the regex pattern to match all characters from the start of the input string up to the next pattern. The *? instead of * makes it non-greedy; a greedy .* would consume as much as possible and leave \X\X with only the final code points, splitting a letter from its marks. The /s option is only needed if the input has line breaks in it, to allow . to match all characters including line breaks; otherwise it can be left out.
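If you need this for more than one word, or for a different number of letters, the same \X idea can be wrapped in a function. This is only a sketch: lastLetters() is a hypothetical helper name, not something built into PHP. Instead of anchoring the pattern at the end, it splits the whole word into clusters and keeps the last ones. It assumes PHP 7 or later and UTF-8 input.

```php
<?php
/**
 * Hypothetical helper (name and signature are illustrative):
 * return the last $count letters of $text, keeping each letter's marks.
 */
function lastLetters(string $text, int $count): string
{
    if ($count < 1) {
        return '';
    }
    // \X matches one extended grapheme cluster: a base letter plus any
    // combining marks that follow it. /u makes PCRE read the input as UTF-8.
    if (!preg_match_all('/\X/u', $text, $matches)) {
        return ''; // empty input or invalid UTF-8
    }
    // Keep the last $count clusters and join them back into a string.
    return implode('', array_slice($matches[0], -$count));
}

// Words taken from the question's array.
$array = array('סָאוּנְדּמֶן', 'לֵיְמֶן', 'דֹּמֶן', 'דּוֹרְמֶן', 'אחמד', 'בןהמלך');
foreach ($array as $word) {
    echo htmlspecialchars(lastLetters($word, 2)), '<br>';
}
```

A word with fewer letters than requested is returned whole, because array_slice() simply returns every cluster it has.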

I recommend looking at Regular-Expressions.info, especially the section on PHP and the section on Unicode.

CJ Dennis