35

I'm trying to concatenate several strings that each contain both Arabic and Western characters (mixed in the same string). The problem is that the result is a String that is most likely semantically correct, but it differs from what I want, because the order of the characters is altered by the Unicode Bidirectional Algorithm. Basically, I just want to concatenate the strings as if they were all LTR, ignoring the fact that some are RTL: a sort of "agnostic" concatenation.

I'm not sure if I was clear in my explanation, but I don't think I can do it any better.

Hope someone can help me.

Kind regards,

Carlos Ferreira

BTW, the strings are being obtained from the database.

EDIT

[Image: the two strings to concatenate and the resulting string]

The first 2 Strings are the strings I want to concatenate and the third is the result.

EDIT 2

Actually, the concatenated String is a little different from the one in the image. It got altered during the copy and paste: the 1 comes after the first A, not immediately before the second A.

Carlos Ferreira

3 Answers

64

You can embed bidi regions using Unicode format control code points:

  • Left-to-right embedding (U+202A)
  • Right-to-left embedding (U+202B)
  • Pop directional formatting (U+202C)

So in Java, to embed an RTL language like Arabic in an LTR language like English, you would do

myEnglishString + "\u202B" + myArabicString + "\u202C" + moreEnglish

and to do the reverse

myArabicString + "\u202A" + myEnglishString + "\u202C" + moreArabic

See Bidirectional General Formatting for more details, or the Unicode specification chapter on "Directional Formatting Codes" for the source material.
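
As a minimal sketch of the two concatenations above, the wrapping can be put into small helper methods. The class name, the helper names, and the sample strings are hypothetical and only serve the illustration. The program prints code points rather than glyphs, so the output does not depend on how the console renders bidirectional text.

```java
public class BidiEmbedding {
    // Directional formatting characters listed in the answer above.
    static final char LRE = '\u202A'; // left-to-right embedding
    static final char RLE = '\u202B'; // right-to-left embedding
    static final char PDF = '\u202C'; // pop directional formatting

    /** Wraps text so that it is laid out as an embedded right-to-left run. */
    static String embedRtl(String text) {
        return RLE + text + PDF;
    }

    /** Wraps text so that it is laid out as an embedded left-to-right run. */
    static String embedLtr(String text) {
        return LRE + text + PDF;
    }

    public static void main(String[] args) {
        String english = "Name: ";                  // hypothetical sample
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // hypothetical Arabic sample
        String moreEnglish = " (1)";

        // Arabic embedded in an otherwise left-to-right string.
        String result = english + embedRtl(arabic) + moreEnglish;

        // Print the stored code points in logical order.
        result.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```

Every embedding opened with U+202A or U+202B should be closed with U+202C, which is why the helpers always add both ends.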

Mike Samuel
  • Is there a way in Java/Android to remove all such characters, and any others that won't really be shown when printing them? I need it to sort a list of strings, but some of them have the special character "\u202B", which ruins the order of the items in the list. Using the trim() function doesn't remove them. – android developer Mar 19 '17 at 14:35
  • @androiddeveloper, your best bet may be to strip all except graphical or spacing characters by doing something like `myString.replaceAll("[^ \t\r\n\\p{Graph}]+", "")`. I don't remember off the top of my head, but doing category Z minus the zero-width spaces is probably the best approximation of "printable" space characters. – Mike Samuel Mar 19 '17 at 15:30
  • How about going over each character of the input string, and only if "Character.isIdentifierIgnorable(c)" returns false, add it to the new string ? Will this suffice? – android developer Mar 20 '17 at 07:57
  • @androiddeveloper, if that's what you want then `.replaceAll("[\\p{javaIdentifierIgnorable}]+", "")` should do it. – Mike Samuel Mar 20 '17 at 11:05
  • It does the same thing? Question is if this function is even the correct one to use in order to clean such special characters and maybe others. By testing it seems it is, but I want to be sure. – android developer Mar 20 '17 at 17:27
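
For the sorting problem raised in these comments, one narrower option is to strip only the bidi control characters by listing them explicitly. This is a sketch under that assumption, not a general "printable characters" filter; the class and method names are hypothetical. Note that in Java's default regex mode `\p{Graph}` matches ASCII characters only, so a pattern built on it would also remove the Arabic letters themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class StripBidiControls {
    // LRM, RLM, ALM, the embedding and override codes (U+202A to U+202E),
    // and the isolate codes (U+2066 to U+2069).
    private static final String BIDI_CONTROLS =
            "[\\u200E\\u200F\\u061C\\u202A-\\u202E\\u2066-\\u2069]";

    /** Returns s with all bidi control characters removed. */
    static String strip(String s) {
        return s.replaceAll(BIDI_CONTROLS, "");
    }

    /** Returns a sorted copy that ignores bidi controls when comparing. */
    static List<String> sortIgnoringBidiControls(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.comparing(StripBidiControls::strip));
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: the second entry is wrapped in RLE ... PDF.
        List<String> names = Arrays.asList("gamma", "\u202Bbeta\u202C", "alpha");
        for (String name : sortIgnoringBidiControls(names)) {
            System.out.println(strip(name));
        }
    }
}
```

Sorting on the stripped key keeps the original strings intact, so the controls are still there when the text is displayed.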
2

It's very likely that you need to insert Unicode directional formatting codes into your string to get it to display correctly. For details, see the Directional Formatting Codes section of the Unicode Bidirectional Algorithm specification.

Maybe the Bidi class can help you in determining the correct sequence, as it implements the Unicode Bidirectional Algorithm.
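
A short sketch of what `java.text.Bidi` reports may help here. Passing `Bidi.DIRECTION_LEFT_TO_RIGHT` makes the analysis use a left-to-right base direction regardless of the first character. The class name, method name, and sample text below are hypothetical. `Bidi` only analyses the text; it does not change the string.

```java
import java.text.Bidi;

public class BidiRuns {
    /** Lists each directional run as "start-limit direction", one per line. */
    static String describeRuns(String text) {
        // Force a left-to-right base direction for the analysis.
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_LEFT_TO_RIGHT);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bidi.getRunCount(); i++) {
            // Even embedding levels are left-to-right, odd levels right-to-left.
            String direction = (bidi.getRunLevel(i) % 2 == 0) ? "LTR" : "RTL";
            sb.append(bidi.getRunStart(i))
              .append('-')
              .append(bidi.getRunLimit(i))
              .append(' ')
              .append(direction)
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: Latin letters followed by three Arabic letters.
        String mixed = "abc \u0645\u0631\u062D";
        System.out.print(describeRuns(mixed));
    }
}
```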

MicSim
  • The Bidi class helps to determine the sequence, but I don't know how I can force it to treat the String as LTR instead of RTL. But I'll have a look at the link you've mentioned, maybe I can figure it out. Thanks. – Carlos Ferreira May 31 '11 at 07:58
  • I don't have any experience with this, but it seems you have to use a combination of the implicit directional marks LRM (U+200E) and RLM (U+200F), which don't display, and the directional code terminator PDF (U+202C). There is also an online demo at http://unicode.org/cldr/utility/bidi.jsp, where you can test around. – MicSim May 31 '11 at 10:15
  • @MicSim this worked for me and thanks for pointing me in the right direction. It was not intuitive. I used class `Bidi.requiresBidi(...)` in an if/else, then did this: `StringBuilder stbr = new StringBuilder(); stbr.append("\u200e"); //"LRM"<--@start string stbr.append( cl.get(c).getPosition() + " " ); stbr.append("\u202b"); //"RLE" open tag, RTL text is next stbr.append( cl.get(c).getName() ); //<--Arabic name stbr.append("\u202c"); //"PDF" close tag, RTL text was inserted cont LtoR...` The LRM at the beginning was the difference. – spencemw May 10 '21 at 20:25
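
The approach described in the last comment can be sketched as a self-contained method. The `position` and `name` parameters stand in for the commenter's `cl.get(c).getPosition()` and `cl.get(c).getName()`, whose types are not shown in the thread, so treat them as assumptions; the class and method names are hypothetical as well.

```java
import java.text.Bidi;

public class LabelBuilder {
    /** Builds "position name", adding bidi controls only when the name needs them. */
    static String label(int position, String name) {
        char[] chars = name.toCharArray();
        if (!Bidi.requiresBidi(chars, 0, chars.length)) {
            // Purely left-to-right text needs no extra formatting.
            return position + " " + name;
        }
        StringBuilder sb = new StringBuilder();
        sb.append('\u200E');              // LRM at the start of the string
        sb.append(position).append(' ');
        sb.append('\u202B');              // RLE: right-to-left text follows
        sb.append(name);                  // the Arabic name
        sb.append('\u202C');              // PDF: close the embedding
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample values.
        System.out.println(label(1, "Alice"));
        System.out.println(label(2, "\u0645\u0631\u062D\u0628\u0627"));
    }
}
```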
1

It's not changing the order of the code points. What's happening is that when the string is displayed, the renderer sees that it starts with a right-to-left script, so it lays the text out right-to-left.
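
A small check along these lines can illustrate the point. It is only a sketch with hypothetical names and sample strings: it confirms that the concatenated string stores its characters in the original order, and uses `java.text.Bidi` with `DIRECTION_DEFAULT_LEFT_TO_RIGHT` to show the base direction derived from the first strongly directional character. An actual renderer may determine the base direction differently.

```java
import java.text.Bidi;

public class LogicalOrder {
    /** True if the first strongly directional character makes the base direction RTL. */
    static boolean baseIsRtl(String s) {
        Bidi bidi = new Bidi(s, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return !bidi.baseIsLeftToRight();
    }

    public static void main(String[] args) {
        // Hypothetical samples: the first string starts with Arabic letters.
        String first = "\u0645\u0631\u062D A1";
        String second = "B2";
        String joined = first + second;

        // The stored (logical) order is simply first followed by second.
        System.out.println(joined.startsWith(first) && joined.endsWith(second));

        // The base direction depends on the first strong character.
        System.out.println(baseIsRtl(joined));
        System.out.println(baseIsRtl(second + first));
    }
}
```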

MRAB