
I am trying to echo this string reversed:

$mother_name = "סבטלנה ואסילבנה"; 
echo strrev($mother_name);

result is: �נבליסאו� �נלטבס�

What is wrong? Where does the � symbol come from? I also get it when I echo a single character from the string, for example $mother_name[0].

Paebbels
L. Vadim
  • I think PHP is having difficulties with the charset of the Hebrew characters, I might be wrong though.. – Rick J Mar 29 '18 at 20:19
  • You shouldn't be reversing the string, you should be using an RTL mark for embedding Hebrew in otherwise LTR documents. https://en.wikipedia.org/wiki/Right-to-left_mark – Sammitch Mar 29 '18 at 20:19
  • Sadly, pretty much all the str* functions are unsuitable for multibyte encodings. They operate on single bytes, not characters. – Peter Mar 29 '18 at 20:20
  • Not sure whether it's helpful, but I've just tested it with Scala (uses same JVM strings as Java), it prints סבטלנה ואסילבנה in forward direction and הנבליסאו הנלטבס if the string is reversed. Afair php's strings are rather byte arrays that do not cope with unicode very well... – Andrey Tyukin Mar 29 '18 at 20:22
  • Actually, that string is totally fine. What are you trying to do to it and why? – Sammitch Mar 29 '18 at 20:24
  • @bobblebubble you also cannot simply reverse a UTF string like that: if it contains combining marks, diacritics, or other similar codepoints, it's going to corrupt the content of the string. – Sammitch Mar 29 '18 at 20:30
  • I have to reverse the string. I am writing code that relies on reversing strings. – L. Vadim Mar 29 '18 at 20:37
  • @bobblebubble It works. Write an answer and I will mark it as accepted, thanks. – L. Vadim Mar 29 '18 at 20:46
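
The byte-versus-character point raised in the comments can be sketched as follows. This is a minimal illustration, assuming the source file is saved as UTF-8 and the mbstring extension is loaded; the variable names other than `$mother_name` are made up for the example.

```php
<?php
// Hebrew letters take two bytes each in UTF-8.
$mother_name = "סבטלנה ואסילבנה";

$byte_count = strlen($mother_name);             // counts bytes
$char_count = mb_strlen($mother_name, 'UTF-8'); // counts code points

// Offset access returns a single byte, i.e. half of the first letter,
// which is why echoing it shows a replacement symbol.
$first_byte = $mother_name[0];

// mb_substr() returns the whole first letter (assumes mbstring is available).
$first_char = mb_substr($mother_name, 0, 1, 'UTF-8');

var_dump($byte_count, $char_count, bin2hex($first_byte), $first_char);
```

`strrev()` reverses those bytes one by one, so the lead and continuation bytes of each letter end up in the wrong order and no longer form valid UTF-8.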

1 Answer


The marked duplicate answer is also wrong, for the same reason that @bobblebubble's comment is wrong. You cannot simply reverse the order of individual code points and expect a sane string to come out the other side.

The intl library provides a sane method around this via IntlBreakIterator::createCharacterInstance() which interprets coherent sequences of code points:

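function utf8_strrev($input) {
    // Iterate over character (grapheme cluster) boundaries, not bytes or code points.
    $it = IntlBreakIterator::createCharacterInstance('he_IL.utf8');
    $it->setText($input);

    $ret = '';
    $prev = 0;
    foreach ($it as $pos) {
        // Prepend each cluster so the clusters end up in reverse order.
        $ret = substr($input, $prev, $pos - $prev) . $ret;
        $prev = $pos;
    }
    return $ret;
}

// Reverses code point by code point, which detaches combining marks.
function naieve_utf8_strrev($input) {
    return implode("", array_reverse(preg_split('//u', $input)));
}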

$tests = [
    "test",
    "סבטלנה ואסילבנה",
    "nai\xcc\x88ve fail"
];

foreach($tests as $test) {
    var_dump(
        $test,
        naieve_utf8_strrev($test),
        utf8_strrev($test)
    );
    echo PHP_EOL;
}

Output:

string(4) "test"
string(4) "tset"
string(4) "tset"

string(29) "סבטלנה ואסילבנה"
string(29) "הנבליסאו הנלטבס"
string(29) "הנבליסאו הנלטבס"

string(12) "naïve fail"
string(12) "liaf ev̈ian"
string(12) "liaf evïan"

I still think that reversing a Hebrew string like this is the wrong way to go if all you want is a left-to-right display of Hebrew text. You should be using the Unicode LRO/RLO and PDF control characters to switch the direction.

Edit: Finally tracked down the correct codepoints.

function utf8_force_ltr($input) {
    $LRO = "\xe2\x80\xad"; // U+202D left-to-right override
    $PDF = "\xe2\x80\xac"; // U+202C pop directional formatting
    return $LRO . $input . $PDF;
}

$test = "סבטלנה ואסילבנה"; // the Hebrew test string from above
var_dump($test, utf8_force_ltr($test));

Output:

string(29) "סבטלנה ואסילבנה"
string(35) "‭סבטלנה ואסילבנה‬"
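
As a cross-check on the byte sequences used for the two control characters, here is a small sketch assuming PHP 7.0 or later, where the `\u{...}` escape is available in double-quoted strings:

```php
<?php
// Code point escapes for the two directional formatting characters.
$LRO = "\u{202D}"; // LEFT-TO-RIGHT OVERRIDE
$PDF = "\u{202C}"; // POP DIRECTIONAL FORMATTING

// Both comparisons hold: the escapes encode to the same UTF-8 bytes.
var_dump($LRO === "\xe2\x80\xad", $PDF === "\xe2\x80\xac");
```

Each mark is three bytes long, which accounts for the length growing from 29 to 35.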
Sammitch
  • Thank you for showing. Do you think this won't work either: [`preg_match_all('/\X/u', $input, $out)`...](https://eval.in/981083)? Actually I thought that the empty regex would match between each `\X`, which matches any number of Unicode characters that form an extended Unicode sequence. – bobble bubble Mar 29 '18 at 21:25
  • @bobblebubble the `\X` does seem to work. And it's not that your original solution doesn't properly match multi-byte UTF8 sequences [it does], it's that a single on-screen glyph may be composed of several sequences whose order must be preserved even when reversing or otherwise modifying the string. That's why the diacritic is applied to the wrong character when reversed using the naive method. – Sammitch Mar 29 '18 at 22:27
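
The `\X` approach discussed in these comments could look roughly like the sketch below. The function name `grapheme_strrev_pcre` is hypothetical, and the sketch assumes a PCRE build with Unicode support; it is an alternative to the IntlBreakIterator version, not a replacement for it.

```php
<?php
// Hypothetical helper: reverse a UTF-8 string one grapheme cluster at a time.
function grapheme_strrev_pcre(string $input): string {
    // \X matches one extended grapheme cluster; the u flag enables UTF-8 mode.
    if (preg_match_all('/\X/u', $input, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    // Reverse the clusters, keeping the bytes inside each cluster in order.
    return implode('', array_reverse($matches[0]));
}

// The combining diaeresis (\xcc\x88) stays attached to the "i".
$reversed = grapheme_strrev_pcre("nai\xcc\x88ve fail");
var_dump($reversed);
```

Because each match is a whole cluster, the base letter and its combining mark move together, which is the property the code-point-based reversal lacks.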