24

How can I programmatically check if the browser treats some character as RTL in JavaScript?

Maybe creating some transparent DIV and looking at where text is placed?

A bit of context. Unicode 5.2 added Avestan alphabet support. So, if the browser has Unicode 5.2 support, it treats characters like U+10B00 as RTL (currently only Firefox does). Otherwise, it treats these characters as LTR, because this is the default.

How do I programmatically check this? I'm writing an Avestan input script and I want to override the bidi direction if the browser is too dumb. But if the browser does support Unicode 5.2, the bidi settings shouldn't be overridden (since leaving them alone allows mixing Avestan and Cyrillic).

I currently do this:

var ua = navigator.userAgent.toLowerCase();

if (ua.match('webkit') || ua.match('presto') || ua.match('trident')) {
    var input = document.getElementById('orig');
    if (input) {
        input.style.direction = 'rtl';
        input.style.unicodeBidi = 'bidi-override';
    }
}

But, obviously, this would make the script less usable once Chrome and Opera start supporting Unicode 5.2.

j0k
Kryzhovnik
  • 1
    You can't programmatically check how the browser renders a certain character. It could be down to the underlying operating system or the browser could have its own rendering code (I think Safari on Windows doesn't use the Windows OS text renderer for instance). If you are lucky you might find a resource that tells you which version of each browser supports which version of Unicode. You can check whether a given character is RTL or not, but you'll have to find a JavaScript Unicode library or get the data from [`UnicodeData.txt`](http://unicode.org/Public/UNIDATA/UnicodeData.txt) and `bsearch()`. – hippietrail Aug 17 '12 at 13:45
  • well, there are 17 languages that are RTL, so you could check the `keyCode` of a `keydown` event and match it with the ranges of the keycodes of these 17 languages... http://en.wikipedia.org/wiki/Right-to-left – vsync Feb 12 '13 at 02:06
  • possible duplicate of [change text direction of textbox automatically](http://stackoverflow.com/questions/7770235/change-text-direction-of-textbox-automatically) – Iman Mahmoudinasab Sep 26 '15 at 05:08

6 Answers

36
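function isRTL(s) {
    // Both character lists only cover the Basic Multilingual Plane (BMP).
    var ltrChars    = 'A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8\u0300-\u0590\u0800-\u1FFF' +
                      '\u2C00-\uFB1C\uFDFE-\uFE6F\uFEFD-\uFFFF',
        rtlChars    = '\u0591-\u07FF\uFB1D-\uFDFD\uFE70-\uFEFC',
        // Skip everything that is not an LTR character, then require an RTL character.
        rtlDirCheck = new RegExp('^[^' + ltrChars + ']*[' + rtlChars + ']');

    return rtlDirCheck.test(s);
}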

playground page

vsync
  • Measuring divs seems insane to me... regex seems to be the only way and google's search page agrees. Go to google.com and paste in some RTL and the mic icon flips. Looking at the source and they use (a much more complicated/complete?) regex. – Cory Mawhorter Jul 08 '14 at 01:21
  • @Javid - I don't remember. why? – vsync Apr 12 '17 at 09:39
  • @vsync I'm so much curious about it. Did you dig into Unicode documentation or copy it from somewhere else or something? – Javid Apr 12 '17 at 12:13
  • 1
    @Javid - I probably found those codes in another place and constructed the code around it. It was like 5 years ago so I honestly don't remember much.. are you asking because you think some codes might be missing? – vsync Apr 12 '17 at 15:20
  • @vsync No not at all. I'm not much of a SO copy/paste person. So everytime I use SO to find answers I try to understand the solution. In this case, I have to know where these numbers have come from. BTW I checked your code with few languages and it works flawlessly. – Javid Apr 12 '17 at 19:36
  • 1
    I found the docs here: http://www.unicode.org/Public/UNIDATA/extracted/DerivedBidiClass.txt – tanghao Aug 23 '18 at 08:27
  • This doesn't include the RTL characters in the supplementary multilingual plane, which contains the Avestan block. Don't some regex implementations have named classes for all unicode character attributes? – Matthew Morrone Jan 31 '20 at 15:18
  • https://en.wikipedia.org/w/index.php?title=Comparison_of_regular-expression_engines#Part_2, Unicode Property Support. doesn't list JS though. https://www.regular-expressions.info/unicode.html has a pretty long list of unicode regex classes, but directionality isn't there. – Matthew Morrone Jan 31 '20 at 15:24
  • last one, I promise. this is C#, but maybe a Javascript library exists out there somewhere? https://stackoverflow.com/questions/4330951/how-to-detect-whether-a-character-belongs-to-a-right-to-left-language – Matthew Morrone Jan 31 '20 at 15:34
  • this does not work when you have a combination of rtl and ltr words in the same string. that's ok , but at least it would be good to know if it happens instead of getting false for isRtl when we have this type of combination. – boaz levinson Dec 13 '20 at 12:17
  • @boazlevinson - All it does is provide a function that takes an input (a character) and returns `true` for `rtl`. What you do with this test and when you test is up to you. You can apply this function on a whole string or on a typed-character basis. – vsync Dec 13 '20 at 18:01
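
On the supplementary-plane point raised in the comments above: engines that support ES2018 Unicode property escapes can match by script name with the `u` flag, which also covers code points outside the BMP such as Avestan. As far as I know JavaScript does not expose `Bidi_Class` this way, so the sketch below substitutes a hand-picked list of scripts for a real bidi-class test. The list `RTL_SCRIPTS` and the name `startsWithRtlScript` are my own assumptions and the list is deliberately incomplete.

```javascript
// Illustrative, incomplete selection of right-to-left scripts (an assumption,
// not a full list derived from the Unicode bidi data).
const RTL_SCRIPTS = ['Hebrew', 'Arabic', 'Syriac', 'Thaana', 'Avestan'];

// Builds "\p{Script=Hebrew}\p{Script=Arabic}..." for use inside a character class.
const rtlScriptClass = RTL_SCRIPTS.map((name) => `\\p{Script=${name}}`).join('');

// Skip leading non-letters, then require the first letter to belong to one of
// the listed scripts. The "u" flag makes the regex work on whole code points.
const firstLetterIsRtl = new RegExp(`^\\P{L}*(?=\\p{L})[${rtlScriptClass}]`, 'u');

function startsWithRtlScript(s) {
    return firstLetterIsRtl.test(s);
}
```

For example, `startsWithRtlScript('\u{10B00}')` tests the Avestan letter U+10B00 from the question. Note that this says what the text is, not what the browser does with it, so it does not replace a rendering check.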
9

I realize this is quite a while after the original question was asked and answered but I found vsync's update to be rather useful and just wanted to add some observations. I would add this in comment to his answer but my reputation is not high enough yet.

Instead of a regular expression that searches from the start of the line zero or more non-LTR characters and then one RTL character, wouldn't it make more sense to search from the start of the line zero or more weak/neutral characters and then one RTL character? Otherwise you have the potential for matching many RTL characters unnecessarily. I would welcome a more thorough examination of my weak/neutral character group as I merely used the negation of the combined LTR and RTL character groups.

Additionally, shouldn't characters such as LTR/RTL marks, embeds, overrides be included in the appropriate character groupings?

I would think then that the final code should look something like:

function isRTL(s){           
    var weakChars       = '\u0000-\u0040\u005B-\u0060\u007B-\u00BF\u00D7\u00F7\u02B9-\u02FF\u2000-\u2BFF\u2010-\u2029\u202C\u202F-\u2BFF',
        rtlChars        = '\u0591-\u07FF\u200F\u202B\u202E\uFB1D-\uFDFD\uFE70-\uFEFC',
        rtlDirCheck     = new RegExp('^['+weakChars+']*['+rtlChars+']');

    return rtlDirCheck.test(s);
};

Update

There may be some ways to speed up the above regular expression. Using a negated character class with a lazy quantifier seems to help improve speed (tested on http://regexhero.net/tester/?id=6dab761c-2517-4d20-9652-6d801623eeec, site requires Silverlight 5)

Additionally, if the directionality of the string is unknown, my guess is that in most cases the string will be LTR rather than RTL, so an isLTR function would return results faster. But since the OP is asking for isRTL, here is an isRTL function:

function isRTL(s){           
    var rtlChars        = '\u0591-\u07FF\u200F\u202B\u202E\uFB1D-\uFDFD\uFE70-\uFEFC',
        rtlDirCheck     = new RegExp('^[^'+rtlChars+']*?['+rtlChars+']');

    return rtlDirCheck.test(s);
};
mcarthurart
  • you can test it on jsPerf. btw I've tested your functions and they do not work... you can test them on the playground page in my answer. – vsync Aug 03 '14 at 10:41
3

Testing for both Hebrew and Arabic (the only modern right-to-left scripts I know of, apart from any Persian-related ones, which I have not researched):

/[\u0590-\u06FF]/.test(textarea.value)

More research suggests something along the lines of:

/[\u0590-\u07FF\u200F\u202B\u202E\uFB1D-\uFDFD\uFE70-\uFEFC]/.test(textarea.value)
jimmont
2

First addressing the question in the heading:

There are no tools in JavaScript as such for accessing Unicode properties of characters. You would need to find a library or service for the purpose (I’m afraid that might be difficult, if you need something reliable) or to extract the relevant information from the Unicode character “database” (a collection of text files in specific formats) and to write your own code to use it.
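
A minimal sketch of the "write your own code" route, assuming you have turned the Unicode data files into a sorted table of code point ranges. The table below is hand-written by block (Hebrew through NKo, plus Avestan) purely for illustration; a real table would be generated from the bidi class data and would be both longer and more precise, since blocks also contain digits and punctuation. The names `RTL_BLOCKS`, `isInRtlBlock` and `hasRtlCodePoint` are hypothetical.

```javascript
// Sorted, non-overlapping [start, end] code point ranges (inclusive).
// Hand-picked Unicode blocks for illustration only, not a complete bidi table.
const RTL_BLOCKS = [
    [0x0590, 0x05FF],   // Hebrew
    [0x0600, 0x06FF],   // Arabic
    [0x0700, 0x074F],   // Syriac
    [0x0750, 0x077F],   // Arabic Supplement
    [0x0780, 0x07BF],   // Thaana
    [0x07C0, 0x07FF],   // NKo
    [0x10B00, 0x10B3F], // Avestan
];

// Binary search for the range containing the given code point.
function isInRtlBlock(codePoint) {
    let lo = 0;
    let hi = RTL_BLOCKS.length - 1;
    while (lo <= hi) {
        const mid = (lo + hi) >> 1;
        const [start, end] = RTL_BLOCKS[mid];
        if (codePoint < start) hi = mid - 1;
        else if (codePoint > end) lo = mid + 1;
        else return true;
    }
    return false;
}

// Iterating with for...of walks code points, so surrogate pairs stay intact.
function hasRtlCodePoint(s) {
    for (const ch of s) {
        if (isInRtlBlock(ch.codePointAt(0))) return true;
    }
    return false;
}
```

Working on code points rather than UTF-16 units is what lets this handle U+10B00, which a plain `\uXXXX` character class cannot express.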

Then the question in message body:

This seems even more desperate. But as this would probably be something for a limited number of users who are knowledgeable and know Avestan, maybe it would not be too bad to display a string of Avestan characters along with an image of them in proper directionality and ask the user to click on a button if the order is wrong. And you could save this selection in a cookie, so that the user needs to do this only once (per browser; though it should be a relatively short-lived cookie, as the browser may get updated).
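
A rough sketch of the cookie part, assuming the user's answer is reduced to a single yes/no flag. The cookie name `avestanForceRtl` and both function names are made up for this example; the helpers only build and parse strings so they can be tried outside a browser.

```javascript
// Hypothetical cookie name for the stored user choice.
const COOKIE_NAME = 'avestanForceRtl';

// Builds a cookie string with a limited lifetime, since a browser update
// may change how Avestan is rendered.
function buildOverrideCookie(forceRtl, maxAgeDays) {
    const maxAgeSeconds = Math.round(maxAgeDays * 24 * 60 * 60);
    return `${COOKIE_NAME}=${forceRtl ? '1' : '0'}; max-age=${maxAgeSeconds}; path=/`;
}

// Reads the stored choice back from a "name=value; name=value" string.
// Returns true or false, or null when the user has not answered yet.
function readOverrideCookie(cookieString) {
    for (const part of cookieString.split(';')) {
        const [name, value] = part.trim().split('=');
        if (name === COOKIE_NAME) return value === '1';
    }
    return null;
}
```

In a page this would be used as `document.cookie = buildOverrideCookie(true, 30);` after the button click and `readOverrideCookie(document.cookie)` on load; the 30-day lifetime is an arbitrary choice.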

Jukka K. Korpela
  • I do understand that it’s not a straightforward thing to do. However, I hope it can be accomplished somehow. I'm currently checking if I can create a hidden div with two spans, get their bounding rects and compare the X coordinate. If this works, I'll write about it here. – Kryzhovnik Aug 17 '12 at 14:58
2

Thanks for your comments, but it seems I've done this myself:

function is_script_rtl(t) {
    var d, s1, s2, body, r1, r2;

    // If the browser doesn’t support this, it probably doesn’t support Unicode 5.2
    if (!("getBoundingClientRect" in document.documentElement))
        return false;

    // Set up a hidden testing DIV
    d = document.createElement('div');
    d.style.position = 'absolute';
    d.style.visibility = 'hidden';
    d.style.width = 'auto';
    d.style.height = 'auto';
    d.style.fontSize = '10px';
    d.style.fontFamily = "'Ahuramzda'";
    d.appendChild(document.createTextNode(t));

    // Two spans with the same text, surrounded by plain text nodes
    s1 = document.createElement("span");
    s1.appendChild(document.createTextNode(t));
    d.appendChild(s1);

    s2 = document.createElement("span");
    s2.appendChild(document.createTextNode(t));
    d.appendChild(s2);

    d.appendChild(document.createTextNode(t));

    body = document.body;
    if (body) {
        body.appendChild(d);
        r1 = s1.getBoundingClientRect();
        r2 = s2.getBoundingClientRect();
        body.removeChild(d);

        // If the text is laid out RTL, the first span ends up to the right of the second
        return r1.left > r2.left;
    }

    return false;
}

Example usage:

Avestan is <script>document.write(is_script_rtl('') ? "RTL" : "LTR")</script>,
Arabic is <script>document.write(is_script_rtl('العربية') ? "RTL" : "LTR")</script>,
English is <script>document.write(is_script_rtl('English') ? "RTL" : "LTR")</script>.

It seems to work. :)

Kryzhovnik
  • 1
    Yep, measuring on-page element layout is the only way I can think of to detect support. I'd suggest using `offsetLeft` rather than `getBoundingClientRect` as browser support is better. – bobince Aug 19 '12 at 08:46
  • Thanks, I'm going to use that. But I've found another problem: Opera lays out Avestan as RTL on the page, but as LTR in a textarea! :( – Kryzhovnik Aug 20 '12 at 12:01
0

Here's another solution that is robust against minor amounts of RTL text in a primarily LTR string, or minor amounts of LTR text in an RTL string.

It works by counting the LTR and RTL characters, then classifying the string based on whether there are more RTL than LTR characters.

function isRTL(text) {
  // Count characters from the RTL and LTR ranges (BMP only)
  let rtl_count = (text.match(/[\u0591-\u07FF\uFB1D-\uFDFD\uFE70-\uFEFC]/g) || []).length;
  let ltr_count = (text.match(/[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02B8\u0300-\u0590\u0800-\u1FFF\u2C00-\uFB1C\uFDFE-\uFE6F\uFEFD-\uFFFF]/g) || []).length;

  // The string counts as RTL only when RTL characters are the majority
  return (rtl_count > ltr_count);
}
phayes