Can't use substr to specify number of Hebrew letters

Question

In written Hebrew, vowels are indicated by marks called niqqud rather than by full letters. In English "a e i o u" are letters; in Hebrew they are marks placed under the letters. For example, in נִקּוּד there is a dot for "i" under the first letter (נִ) (Hebrew is read right-to-left). Each mark is a character but not a letter.

I am trying to get the last 2 letters (not characters) of any Hebrew word. The problem is that substr() and mb_substr() both count each vowel mark as a full character, so neither gives me the last 2 letters. What can I do?

Here is my code:

<?php
    $array = array('סָאוּנְדּמֶן','לֵיְמֶן','דֹּמֶן','דּוֹרְמֶן','אחמד','בןהמלך');
    $dynamicstring = 'שֶׁמֶן';
    $word_strlen = strlen($dynamicstring);
    $newstring = substr($dynamicstring, -4);

    echo strlen($dynamicstring);
    echo '<br>';
    echo htmlspecialchars($newstring);
?>

Possible duplicate of [substr doesn't work fine with utf8](https://stackoverflow.com/questions/14785682/substr-doesnt-work-fine-with-utf8) — Nigel Ren, Jul 24 '18 at 07:10
@NigelRen no dude its not, my problem here is about the hebrew letters scoring. not same — Amanda, Jul 24 '18 at 07:12
@din Other than using `mb_substr`, you also need to use [`mb_strlen`](http://php.net/manual/en/function.mb-strlen.php) as well of course. — DarkBee, Jul 24 '18 at 07:22
@DarkBee I am using strlen only for private check, after all, I do not have to use the length of the string — Amanda, Jul 24 '18 at 07:27
A bit confused about your question. In your code, what would be the correct answer? (the two last characters) — John T, Jul 24 '18 at 09:30
I've posted code that should be correct but it don't give the answer you wish. I don't see any reason it should not work for Hebrew but have to trust you that it don't. — John T, Jul 24 '18 at 10:43
@JohnT your code its worked absolutely in hebrew, but not with scored hebrew letters. — Amanda, Jul 24 '18 at 11:36
Possible duplicate of [strpos return wrong position at hebrew](https://stackoverflow.com/questions/22976410/strpos-return-wrong-position-at-hebrew) — Jorge Fuentes González, Aug 10 '18 at 14:20

John T · Answer 1 · 2018-07-24T10:41:27.793

You should use mb_substr(). Make sure you also check the following:

HTML document set to the same charset
Database connection to insert data set to the same charset
Database table set to the same charset
Database connection to fetch data set to the same charset

For Hebrew you should use UTF-8 as the charset.
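
As a rough sketch of the PHP side of that checklist, the string can be checked for valid UTF-8 before any multibyte function is applied. This assumes PHP 7 or later (for the \u{...} escapes) and the mbstring extension; the variable names are illustrative only.

```php
<?php
// Make UTF-8 the default for mb_* functions called without an encoding argument.
mb_internal_encoding('UTF-8');

// שֶׁמֶן written as explicit code points so the bytes are unambiguous.
$word = "\u{05E9}\u{05B6}\u{05C1}\u{05DE}\u{05B6}\u{05DF}";

// True only when the bytes form a valid UTF-8 sequence.
$isUtf8 = mb_check_encoding($word, 'UTF-8');

// Pass the charset explicitly when escaping for a page served as UTF-8.
$escaped = htmlspecialchars($word, ENT_QUOTES, 'UTF-8');

var_dump($isUtf8); // bool(true)
```

A check like this only rules out encoding problems; it does not change how vowel marks are counted.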

This should be the correct code:

<?php
    $array = array('סָאוּנְדּמֶן','לֵיְמֶן','דֹּמֶן','דּוֹרְמֶן','אחמד','בןהמלך');
    $dynamicstring = 'שֶׁמֶן';
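    // Length in code points (not bytes); niqqud marks are counted as well.
    $word_strlen = mb_strlen($dynamicstring, 'UTF-8');
    // A negative offset counts from the end; a null length means "to the end".
    $newstring = mb_substr($dynamicstring, -2, null, 'UTF-8');

    echo $word_strlen;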
    echo '<br>';
    echo htmlspecialchars($newstring);
?>

Again, as I say in another comment, I have already using this. I'm not using SQL so its not the problem. I have trying to do what that you say with mb_substr and its not worked.. I need the **last** 2 characters — Amanda, Jul 24 '18 at 07:25

score 0 · Answer 2 · answered Jul 24 '18 at 07:07

Use mb_substr($string, 0, 3, 'UTF-8'), replacing UTF-8 with the correct encoding for the Hebrew text.

Rimvydas Tamošiūnas

I have tried this and it's not working. Also, I need the last 2 characters, so it would need to be: $string, -4, 0, 'UTF-8'. – Amanda Jul 24 '18 at 07:08

score 0 · Answer 3 · answered Aug 11 '18 at 01:02

Neither substr() nor mb_substr() knows about character marks or letters. substr() simply looks at the number of bytes and mb_substr() looks at the number of codepoints. Since both character marks and letters use a single codepoint each, there's no way for mb_substr() to distinguish between them.
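
To make the difference concrete, here is a small sketch that counts the same word three ways. It assumes PHP 7 or later (for the \u{...} escapes, used so the exact code points are unambiguous) and the mbstring extension; the variable names are illustrative only.

```php
<?php
// שֶׁמֶן spelled out code point by code point
// (assumed order: base letter first, then its marks).
$word = "\u{05E9}\u{05B6}\u{05C1}"   // shin + segol + shin dot
      . "\u{05DE}\u{05B6}"           // mem + segol
      . "\u{05DF}";                  // final nun

$bytes      = strlen($word);                   // UTF-8 bytes
$codepoints = mb_strlen($word, 'UTF-8');       // letters and marks alike
$letters    = preg_match_all('/\X/u', $word);  // letters with their marks attached

echo "bytes: $bytes, code points: $codepoints, letters: $letters\n";
// prints "bytes: 12, code points: 6, letters: 3"
```

Only the third count matches what a reader would call the number of letters.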

What you can do instead is use a regular expression:

if (preg_match('/\X\X$/u', $dynamicstring, $match)) {
    $newstring = $match[0];
}

or

$newstring = preg_replace('/^.*?(\X\X)$/us', '$1', $dynamicstring);

Using either of these in your program outputs the last two letters of שֶׁמֶן:

מֶן

Each \X will match a letter plus all immediately following marks. The /u option at the end of the expression is to switch on Unicode mode (UTF-8), otherwise it won't be able to recognise Hebrew codepoints.

If you want to use the single line preg_replace() version, you must add ^.*? to the start of the regex pattern to match all characters from the start of the input string up to the next pattern. The *? instead of * is to make it non-greedy, otherwise it will take part of the next sequence as well. The /s option is only needed if the input has line breaks in it, to allow . to match all characters including line breaks, otherwise it can be left out.
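
If the number of letters varies, the same \X idea can be wrapped in a helper. The following is only a sketch: lastLetters() and stripMarks() are hypothetical names, not built-in functions, and the code assumes PHP 7 or later. It splits the word into \X clusters and keeps the last few, rather than anchoring a pattern at the end of the string.

```php
<?php
/**
 * Hypothetical helper: return the last $count letters of $word,
 * keeping any marks attached to them.
 */
function lastLetters(string $word, int $count): string
{
    // Each \X match is one base letter plus its following marks.
    if ($count < 1 || !preg_match_all('/\X/u', $word, $matches)) {
        return '';
    }
    return implode('', array_slice($matches[0], -$count));
}

/**
 * Hypothetical helper: drop nonspacing marks (Unicode category Mn,
 * which includes niqqud) to leave bare letters.
 */
function stripMarks(string $word): string
{
    return preg_replace('/\p{Mn}/u', '', $word) ?? '';
}

$word = "\u{05E9}\u{05B6}\u{05C1}\u{05DE}\u{05B6}\u{05DF}"; // שֶׁמֶן

$lastTwo = lastLetters($word, 2);
echo $lastTwo, "\n";              // מֶן
echo stripMarks($lastTwo), "\n";  // מן
```

Asking for more letters than the word contains returns the whole word, since array_slice() simply stops at the start of the array.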

I recommend looking at Regular-Expressions.info, especially the section on PHP and the section on Unicode.
