0

In Python, how can I detect whether a piece of text consists mostly of ltr (left-to-right) or rtl (right-to-left) Unicode characters?

For example, something like this:

>>> guesstextorientation("abطcdαδ")
"ltr"
>>> guesstextorientation("עִבְרִיתa")
"rtl"

It could also ignore writing systems that allow both directions, such as CJK.

fauve
  • You could create a map of Unicode characters to their corresponding language. For Hebrew, it is right to left. – Tim Biegeleisen Sep 18 '22 at 07:30
  • https://stackoverflow.com/questions/48058402/unicode-table-information-about-a-character-in-python discusses how to examine the Unicode properties of a character. Voting to close as lacking focus / any research effort. – tripleee Sep 18 '22 at 08:31

2 Answers

2

You can do this with a regex built from the Unicode ranges of rtl scripts (here I only cover Arabic and Persian):

Code:

import re

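# Arabic block (also covers the extra Persian letters): \u0600-\u06FF
# The basic Arabic letters \u0627-\u064a are a subset of that block

def guesstextorientation(text):
    lang = ['ltr', 'rtl']
    # add the ranges of other rtl scripts (e.g. Hebrew \u0590-\u05FF) here
    pattern = re.compile('[\u0600-\u06FF]')
    # True indexes 'rtl': more than half of all characters matched
    return lang[len(pattern.findall(text)) > (len(text) / 2)]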

print(guesstextorientation("abطcdαδ"))
print(guesstextorientation("سلام ایران"))

Output:

ltr
rtl
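
Note that the function above counts spaces and punctuation in the total, and it does not match Hebrew. A possible variant is sketched below, assuming only the Hebrew and Arabic blocks matter; the name guess_orientation_ignoring_spaces and the RTL_PATTERN constant are made up for illustration and are not part of the code above.

```python
import re

# Assumption: only the Hebrew (U+0590-U+05FF) and Arabic (U+0600-U+06FF)
# blocks are treated as rtl; other rtl scripts would need their own ranges.
RTL_PATTERN = re.compile('[\u0590-\u05FF\u0600-\u06FF]')

def guess_orientation_ignoring_spaces(text):
    # Hypothetical helper: compare rtl matches against the
    # non-whitespace characters only, not the full string length.
    visible = [c for c in text if not c.isspace()]
    rtl = sum(1 for c in visible if RTL_PATTERN.match(c))
    return 'rtl' if rtl > len(visible) / 2 else 'ltr'

print(guess_orientation_ignoring_spaces("abطcdαδ"))
print(guess_orientation_ignoring_spaces("עִבְרִיתa"))
```

With this variant the spaces in a string such as "a ب ج" no longer dilute the rtl share.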
Shahab Rahnama
0

This is a late response to the question, and I am including it for future reference.

Determining the direction of a string is complex. If a simple approximation is enough, you can look at the bidirectional property values of the characters in the string. Below, I focus on characters with a strong direction and ignore characters with a weak direction.

The bidirectional property is available via unicodedata.bidirectional(), which returns the bidirectional class of a character as a string.
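
As a quick illustration, the snippet below prints the class for a handful of characters. The sample characters are my own choice for the demonstration; "L", "R" and "AL" are the strong classes used in the rest of this answer.

```python
import unicodedata as ud

# Sample characters chosen for illustration only.
samples = ["a", "α", "ש", "ط", "1", " ", "!"]

# Map each character to its bidirectional class.
classes = {c: ud.bidirectional(c) for c in samples}

for c, cls in classes.items():
    print(repr(c), cls)
```

Latin and Greek letters report "L", the Hebrew letter reports "R", and the Arabic letter reports "AL"; the digit, the space and the exclamation mark report classes that are not strong ("EN", "WS" and "ON").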

A common way to set the direction of text when it is not known is the first-strong heuristic: select the direction of the first strong character encountered when iterating through the text. This can give the wrong direction, but it is a common fallback.

The second approach is to count the strong LTR and strong RTL characters in the string and select the direction with the higher count.

For first strong, something like:

import unicodedata as ud

def first_strong(s):
    # Return the direction of the first strong character, or None if there is none
    for c in s:
        bidi = ud.bidirectional(c)
        if bidi == "L":
            return "ltr"
        if bidi in ("R", "AL"):
            return "rtl"
    return None

For dominant direction:

from collections import Counter
import unicodedata as ud

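def dominant_strong_direction(s):
    # Tally the bidi class of every character in the string
    count = Counter(ud.bidirectional(c) for c in s)
    # Strong RTL classes plus the explicit RTL embedding/isolate controls
    rtl_count = count['R'] + count['AL'] + count['RLE'] + count['RLI']
    # Strong LTR class plus the explicit LTR embedding/isolate controls
    ltr_count = count['L'] + count['LRE'] + count['LRI']
    # A tie falls back to "ltr"
    return "rtl" if rtl_count > ltr_count else "ltr"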

For the following test strings, each yields the following results:

s1 = "HTML : دليل تصميم وإنشاء المواقع على الإنترنت"
first_strong(s1)
# rtl
dominant_strong_direction(s1)
# rtl

s2 = "تبسيط إنشاء صفحات الويب باستخدام لغة HTML : أبسط طريقة لتعلم لغة HTML"
first_strong(s2)
# rtl
dominant_strong_direction(s2)
# rtl

s3 = "one שתיים three"
first_strong(s3)
# ltr
dominant_strong_direction(s3)
# ltr

s4 = ">one שתיים three<!"
first_strong(s4)
# rtl
dominant_strong_direction(s4)
# ltr

Keep in mind that any attempt to estimate the direction can give the wrong result.
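
To tie this back to the question, here is a minimal sketch of a guesstextorientation function based on the majority count. It is one possible reading of the requirement, not a definitive implementation: it counts only the classes "L", "R" and "AL", and returning None on a tie or when no strong character is present is my own assumption.

```python
import unicodedata as ud

def guesstextorientation(text):
    """Return 'ltr' or 'rtl' by majority of strong characters.

    Assumption: None is returned on a tie or when the text
    contains no strong characters at all.
    """
    ltr = rtl = 0
    for ch in text:
        bidi = ud.bidirectional(ch)
        if bidi == "L":
            ltr += 1
        elif bidi in ("R", "AL"):
            rtl += 1
    if ltr == rtl:
        return None
    return "ltr" if ltr > rtl else "rtl"

print(guesstextorientation("abطcdαδ"))
print(guesstextorientation("עִבְרִיתa"))
```

The Hebrew vowel points in the second string have a non-strong class, so only the letters are counted. CJK ideographs have class "L", so this sketch counts them as ltr; skipping them would need an extra check.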

Andj