This is a late response to the question, and I am including it for future reference.
Determining the direction of a string is complex. If a simple approximation is enough, you can look at the bidirectional property values of the characters in the string. Below, I focus on characters with a strong direction (bidirectional classes L, R and AL) and ignore characters with a weak direction.
The bidirectional property is available via unicodedata.bidirectional(), which returns the bidirectional class of a character as a string.
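As a quick illustration of what that function returns, here is a minimal sketch. The sample characters and the `samples` name are my own choices for illustration, not part of the original question.

```python
import unicodedata

# Illustrative characters: Latin letter, Hebrew letter, Arabic letter,
# European digit, space, punctuation.
chars = ["a", "ש", "د", "1", " ", "!"]

# Map each character to its bidirectional class.
samples = {ch: unicodedata.bidirectional(ch) for ch in chars}

for ch, bidi in samples.items():
    print(repr(ch), bidi)
```

Only L, R and AL are strong classes. Digits (EN), whitespace (WS) and most punctuation (ON) are weak or neutral, which is why the functions below skip them.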
A common way to set the direction of text when it is not known in advance is the first-strong heuristic: select the direction of the first strong character encountered when iterating through the text. This can pick the wrong direction, but it is a common fallback.
The second approach is to count how many characters in the string are strong LTR and how many are strong RTL, and select the direction with the higher count.
For first strong, something like:
import unicodedata as ud

def first_strong(s):
    # Return the direction of the first strong character, or None if there is none.
    for c in s:
        bidi = ud.bidirectional(c)
        if bidi == "L":
            return "ltr"
        elif bidi in ("R", "AL"):
            return "rtl"
    return None
For dominant direction:
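from collections import Counter
import unicodedata as ud

def dominant_strong_direction(s):
    # Tally the bidirectional class of every character in the string.
    count = Counter(ud.bidirectional(c) for c in s)
    # Explicit embedding/isolate initiators (RLE, RLI, LRE, LRI) are counted with the matching direction.
    rtl_count = count['R'] + count['AL'] + count['RLE'] + count["RLI"]
    ltr_count = count['L'] + count['LRE'] + count["LRI"]
    # A tie, or a string with no directional characters, falls back to "ltr".
    return "rtl" if rtl_count > ltr_count else "ltr"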
The following test strings yield these results:
s1 = "HTML : دليل تصميم وإنشاء المواقع على الإنترنت"
first_strong(s1)
# ltr
dominant_strong_direction(s1)
# rtl
s2 = "تبسيط إنشاء صفحات الويب باستخدام لغة HTML : أبسط طريقة لتعلم لغة HTML"
first_strong(s2)
# rtl
dominant_strong_direction(s2)
# rtl
s3 = "one שתיים three"
first_strong(s3)
# ltr
dominant_strong_direction(s3)
# ltr
s4 = ">one שתיים three<!"
first_strong(s4)
# ltr
dominant_strong_direction(s4)
# ltr
Either heuristic is only an estimate of the direction and can give the wrong result.
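One way to make that uncertainty explicit is to count only the strong classes and decline to answer when there is no clear majority. The following is a minimal sketch under that assumption. The name `strong_direction_or_none` is hypothetical, and returning None on a tie is my own design choice rather than an established convention.

```python
from collections import Counter
import unicodedata as ud

def strong_direction_or_none(s):
    # Hypothetical helper: count only the strong classes L, R and AL.
    count = Counter(ud.bidirectional(c) for c in s)
    ltr_count = count["L"]
    rtl_count = count["R"] + count["AL"]
    if ltr_count == rtl_count:
        # No strong characters at all, or an exact tie: no clear answer.
        return None
    return "ltr" if ltr_count > rtl_count else "rtl"
```

The caller can then decide what to do with None, for example falling back to a default direction for the surrounding context.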