19

Is there a way to detect the language of the data being entered via the input field?

philfreo
HyderA

10 Answers

36

hmm i may offer an improved version of DimaKrasun's function:

function is_arabic($string) {
    if($string === 'arabic') {
         return true;
    }
    return false;
}

okay, enough joking!

Pekka's suggestion to use the google translate api is a good one! but you are relying on an external service, which always makes things more complicated.

i think Rushyo's approach is good! it's just not that easy. i wrote the following function for you; it's not tested, but it should work...

<?php
function uniord($u) {
    // i just copied this function from the php.net comments, but it should work fine!
    $k = mb_convert_encoding($u, 'UCS-2LE', 'UTF-8');
    $k1 = ord(substr($k, 0, 1));
    $k2 = ord(substr($k, 1, 1));
    return $k2 * 256 + $k1;
}
function is_arabic($str) {
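    $encoding = mb_detect_encoding($str);
    if($encoding !== false && $encoding !== 'UTF-8') {
        // mb_convert_encoding takes the target encoding first, then the source encoding
        $str = mb_convert_encoding($str, 'UTF-8', $encoding);
    }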

    /*
    $str = str_split($str); <- this function is not mb safe, it splits by bytes, not characters. we cannot use it
    $str = preg_split('//u',$str); <- this function would probably work fine but there was a bug reported in some php version so it splits by bytes and not chars as well
    */
    preg_match_all('/.|\n/u', $str, $matches);
    $chars = $matches[0];
    $arabic_count = 0;
    $latin_count = 0;
    $total_count = 0;
    foreach($chars as $char) {
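        // $pos = ord($char); we can't use that, it only looks at the first byte of a multi-byte character
        $pos = uniord($char);

        if($pos >= 1536 && $pos <= 1791) {
            // U+0600 to U+06FF, the Arabic block
            $arabic_count++;
        } else if(($pos >= 65 && $pos <= 90) || ($pos >= 97 && $pos <= 122)) {
            // basic latin letters A-Z and a-z; adjust or extend this range as needed
            $latin_count++;
        }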
        $total_count++;
    }
    if($total_count === 0) {
        // empty string: avoid a division by zero
        return false;
    }
    if(($arabic_count/$total_count) > 0.6) {
        // 60% arabic chars, its probably arabic
        return true;
    }
    return false;
}
$arabic = is_arabic('عربية إخبارية تعمل على مدار اليوم. يمكنك مشاهدة بث القناة من خلال الموقع'); 
var_dump($arabic);
?>

final thoughts: as you can see, i added a latin counter as an example; treat its range as a starting point and adjust it. this way you could detect other scripts as well (hebrew, latin, arabic, hindi, chinese, etc...)

you may also want to eliminate some chars first... maybe @, space, line breaks, slashes etc... the PREG_SPLIT_NO_EMPTY flag for the preg_split function would be useful, but because of the bug i didn't use it here.

you can as well have a counter for each character set and see which one occurs the most...

and finally you should consider chopping your string off after 200 chars or something. this should be enough to tell which character set is used.

and you have to do some error handling! like division by zero, empty string etc etc! don't forget that please... any questions? comment!
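
a rough sketch of the "one counter per character set" idea from the notes above, including the 200 char cutoff and the empty string case. it leans on PCRE's unicode script properties instead of hand-written code point ranges. the function name `guess_script`, the list of scripts and the default cutoff are assumptions made for illustration, so adjust them to your data:

```php
<?php
// hypothetical helper: returns the name of the most frequent script, or null if none was found
function guess_script(string $str, int $max_chars = 200): ?string {
    // only look at the first $max_chars characters
    $str = mb_substr($str, 0, $max_chars, 'UTF-8');
    // PCRE unicode script names; extend this list as needed
    $scripts = ['Arabic', 'Hebrew', 'Latin', 'Cyrillic', 'Han', 'Devanagari'];
    $best = null;
    $best_count = 0;
    foreach ($scripts as $script) {
        // preg_match_all returns the number of matches, or false on error (e.g. invalid UTF-8)
        $count = preg_match_all('/\p{' . $script . '}/u', $str);
        if ($count !== false && $count > $best_count) {
            $best = $script;
            $best_count = $count;
        }
    }
    return $best;
}

var_dump(guess_script('بسم الله'));     // string(6) "Arabic"
var_dump(guess_script('hello world'));  // string(5) "Latin"
var_dump(guess_script(''));             // NULL
```

spaces, digits and punctuation belong to no script in that list, so they simply don't count.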

if you want to detect the LANGUAGE of a string, you should split into words and check for the words in some pre-defined tables. you don't need a complete dictionary, just the most common words and it should work fine. tokenization/normalization is a must as well! there are libraries for that anyway and this is not what you asked for :) just wanted to mention it
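
a minimal sketch of that word table idea, just to show the shape of it. the two word lists are tiny and made up, and the helper name `count_language_hits` is hypothetical; real tables would hold a few hundred common words per language and you would want proper tokenization/normalization on top:

```php
<?php
// tiny, made-up word tables; real ones would be much larger
$common_words = [
    'english' => ['the', 'and', 'is', 'you', 'i', 'am', 'your'],
    'german'  => ['der', 'die', 'und', 'ist', 'ich', 'bin', 'dein'],
];

// hypothetical helper: returns the number of table hits per language, highest first
function count_language_hits(string $text, array $tables): array {
    // normalize: lower-case, then split on anything that is not a letter
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        // preg_split failed, e.g. because the input was not valid UTF-8
        return [];
    }
    $hits = [];
    foreach ($tables as $language => $table) {
        // count how many words of the text appear in this language's table
        $hits[$language] = count(array_intersect($words, $table));
    }
    arsort($hits);
    return $hits;
}

$hits = count_language_hits('I am your grandfathers computer', $common_words);
print_r($hits); // english => 3, german => 0
```

the language with the best hit ratio is your guess; if all counts are zero or close together, better not guess at all.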

Josh Crozier
The Surrican
  • Your function is making my head go all fuzzy. I'll try to implement it when i'm in a better mood, and let you know if it worked on it. But from what I read, it looks promising. – HyderA Aug 23 '10 at 11:02
  • roger that, don't forget to include the external uniord function on the top! lemme know if ya need any halp – The Surrican Aug 23 '10 at 16:39
  • The dictionary is a very good idea, only problem is that outside Latin script you quickly encounter circumstances where external context changes characters - such as multi-glyph characters. You'd have to be careful to avoid context-sensitive characters in your dictionary. – Rushyo Aug 24 '10 at 11:29
  • @Rushyo ... what? ... if you split the text into words by whitespace, tokenize, lower-case it and see what you hit in your database. if you hit it, see what relations there are. one word can be in more than one language. from the hit ratio you should pretty easily be able to tell. example: "i am your grandfathers computer xcT4" -> tokenized "i am your grandfather computer xcT4" assume i, am, your, grandfather are english words and computer is english as well as german. xcT4 is unknown. you will get 4 vs. 1, good ratio to guess it's english – The Surrican Aug 24 '10 at 11:41
  • the original question was only to detect the character set. the provided solution works very well with multi byte characters. language detection is a whole different thing where multi byte characters don't really matter... – The Surrican Aug 24 '10 at 11:43
  • A multi-byte character can be made up of multiple glyphs. It is similar to the problem a != á, except that outside of Latin you have situations where characters alter based on the context they are used in. So you have 'abc', yet when you type 'd' your word suddenly changes to 'abXd' or similar. This occurs regularly in Arabic script (for example, with the prefix al-) so simply searching for 'al-' will bring up zip. It's just a gotcha worth looking out for. – Rushyo Aug 24 '10 at 11:47
  • To reiterate: A code point is not a character is not a glyph. It makes a seemingly simple problem non-trivial. al- !== al- – Rushyo Aug 24 '10 at 11:52
  • Quote: "The Arabic alphabet, historically a cursive derived from the Nabataean alphabet, most letters take a variant shape depending on which they are followed (word-initial), preceded (word-final) or both (medial) by other letters." – Rushyo Aug 24 '10 at 11:55
  • Not to mention digraphs.. although since we're not talking Croatian we're safe from those =] – Rushyo Aug 24 '10 at 12:14
  • To put it in a Latin context: Encyclopædia !== Encyclopaedia... yet we'd want those to be the same. In Latin these edge cases are so rare as to be no issue in 99.9999% of circumstances, in Arabic it's a much bigger problem. That said, Unicode neatly side-steps some of the problems by dumping the problem on the renderer. – Rushyo Aug 24 '10 at 12:17
  • 1
    Unicode character properties are your friend... See DimaKrasun's solution. You could still use the 60% rule with a one-liner (which I leave as an exercise to the reader). This answer is unnecessarily complicated (and has poor performance). – Artefacto Nov 03 '10 at 00:14
  • very good. just make sure you remove this line : echo $char ." --> ".$pos.PHP_EOL; if you want to return true/false only for testing if it's arabic or not – DoubleM Sep 27 '15 at 02:03
14

Use a regular expression for a shorter and easier answer:

 $is_arabic = preg_match('/\p{Arabic}/u', $text);

This returns 1 if the string contains at least one Arabic character and 0 if it contains none.
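
A small sketch of how that return value behaves in practice. `preg_match()` returns 1 on a match, 0 on no match, and `false` on failure (for instance when the input is not valid UTF-8), so a strict comparison keeps the three cases apart. The wrapper name `has_arabic` is hypothetical:

```php
<?php
// Hypothetical wrapper: true if $text contains at least one Arabic character.
function has_arabic(string $text): bool {
    // Compare strictly with 1 so that an error (false) is not mistaken for a match.
    return preg_match('/\p{Arabic}/u', $text) === 1;
}

var_dump(has_arabic('بسم الله'));   // bool(true)
var_dump(has_arabic('hello'));      // bool(false)
var_dump(has_arabic('hello بسم'));  // bool(true): one Arabic character is enough
var_dump(has_arabic("\xFF"));       // bool(false): invalid UTF-8, preg_match() failed
```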

Affan
  • This works for me clean and easy. To clarify it more it checks if any part of the string contains Arabic characters. So if its part Arabic part other language it will still return `true`. – Kash May 05 '23 at 10:45
13

This checks whether the string is Arabic or contains Arabic text.

The text must be Unicode, e.g. UTF-8.

$str = "بسم الله";
if (preg_match('/[اأإء-ي]/ui', $str)) {
    echo "A match was found.";
} else {
    echo "A match was not found.";
}
Mohammed Ahmed
3

I assume that in 99% of cases it is enough to check that the string contains Arabic letters, rather than that it consists entirely of them.

My core assumption is that if it contains at least two or three Arabic letters, the reader should know how to read it.

You can use a simple function:

<?php
/**
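 * Returns true if the string contains at least one Arabic letter.
 *
 * @param string $string
 * @return bool
 */
function contains_arabic($string)
{
    // The /u modifier is needed for \p{...} on UTF-8 input; without a ^ anchor a match anywhere counts.
    return (preg_match('/\p{Arabic}/u', $string) > 0);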
}

Or, if the Unicode property classes do not work:

function contains_arabic($subject)
{
    // Code points above FF need the \x{...} form together with the /u modifier.
    return (preg_match('/[\x{0600}-\x{06FF}]/u', $subject) > 0);
}
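
The "at least two or three Arabic letters" assumption above could also be checked directly by counting matches. This is only a sketch; the function name `has_min_arabic` and the default threshold of 3 are assumptions chosen for illustration:

```php
<?php
// Hypothetical helper: true if $string holds at least $min Arabic characters.
function has_min_arabic(string $string, int $min = 3): bool {
    // preg_match_all() returns the number of matches, or false on error.
    $count = preg_match_all('/\p{Arabic}/u', $string);
    return $count !== false && $count >= $min;
}

var_dump(has_min_arabic('بسم الله'));  // bool(true): seven Arabic letters
var_dump(has_min_arabic('abc ب'));     // bool(false): only one
```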
E_net4
Dmytro Krasun
2

I would use regular expressions to count the Arabic characters and compare that count to the total length of the string. If, for instance, more than 60% of the characters are Arabic, I would consider the text mainly Arabic and apply RTL formatting.

/**
 * Is the given text mainly Arabic language? 
 *
 * @param string $text string to be tested if it is arabic. :-)
 * @return bool 
 */
function ct_is_arabic_text($text) {
    $text = preg_replace('/[ 0-9\(\)\.\,\-\:\n\r_]/', '', $text); // Remove spaces, numbers, punctuation.
    $total_count = mb_strlen($text, 'UTF-8'); // Length of text in characters, not bytes
    if ($total_count==0)
        return false;
    $arabic_count = preg_match_all("/[اأإء-ي]/ui", $text, $matches); // Number of Arabic characters
    if(($arabic_count/$total_count) > 0.6) { // >60% Arabic chars, its probably Arabic languages
        return true;
    }
    return false;
}

For inline RTL formatting, use CSS. Example class:

.embed-rtl {
 direction: rtl;
 unicode-bidi: embed; /* with 'normal', direction has no effect on inline elements */
 text-align: right;
}
HaWei
1
public static function isArabic($string){
    if(preg_match('/\p{Arabic}/u', $string))
        return true;
    return false;
}
Mohammad Anini
1

I'm not aware of a PHP solution for this, no.

The Google Translate Ajax APIs may be for you, though.

Check out this Javascript snippet from the API docs: Example: Language Detection

Pekka
  • Script detection is a very different thing from language detection. – Rushyo Aug 22 '10 at 12:29
  • 1
    @Rushyo well, at the moment, he is asking for *language* detection rather than script. – Pekka Aug 22 '10 at 12:35
  • Taken literally, yes, but I doubt that's the intent. – Rushyo Aug 22 '10 at 12:37
  • @Rushyo you don't really know that. I can think of a number of legitimate reasons to try and detect *language* – Pekka Aug 22 '10 at 12:51
  • In which case, we'd need to know the dialect as well - info not provided. – Rushyo Aug 23 '10 at 08:18
  • @Pekka: you're right, it's language-detection. @Rushyo: The reason is so I can decide whether to display it RTL or LTR. Also, most arab speakers don't know what dialect they speak. It's irrelevant in most cases. – HyderA Aug 23 '10 at 10:56
  • @gAMBOOKa in that case, you can take your pick. I like the character range detection approach outlined in the other answers as well, as it doesn't rely on an external service. If this is going to be extended to other languages, though, or if it's likely you encounter difficult (mixed) data, Google's processing algorithms may be superior. – Pekka Aug 23 '10 at 11:00
  • @gAMBOOKa That's the script, not the language... hate to labour the point. – Rushyo Aug 24 '10 at 06:23
  • @Rushyo: They're interchangeable depending on context. When was the last time you heard someone say English is a script. @Pekka: We were initially using Google's Language Detection API but now our app needs to function without internet availability as well. – HyderA Aug 24 '10 at 08:53
  • "When was the last time you heard someone say English is a script." Never as Latin is the script. English is the language. And the term Latin script is used all the time - esp. in computing! Localisation is basically impossible without understand that distinction. – Rushyo Aug 24 '10 at 11:19
  • @gAMBOOKa Also, the idea that 'most arab speakers don't know what dialect they speak' is nonsense. Different dialects of Arabic can make mutual conversation impossible (to quote Wikipedia: Arabic has many different, geographically distributed spoken varieties, some of which are mutually unintelligible). That's like confusing Breton and French because they're both Latin and based in France! – Rushyo Aug 24 '10 at 11:25
  • To reiterate: Arabic script includes Kurdish, Urdu, Sindhi and Kashmiri, Tajik, Kazakh, etc. - in the same way Latin might include English, French, Breton, Cymraeg, German, etc. One man's perfectly sensible Arabic is another man's gibberish. – Rushyo Aug 24 '10 at 11:27
  • In other words, if you just want to detect script (which is all you need to decide whether to use RTL or LTR) the problem is trivial and doesn't require anything nearly so complex as language detection - which needs you to teach the system how to detect Kurdish, Urdu, Sindhi, Kashmiri, etc. – Rushyo Aug 24 '10 at 11:49
  • I think the lack of distinction between script + language is making your job a helluva lot more complex than it needs to be - I am trying to be helpful, honest :) – Rushyo Aug 24 '10 at 11:53
1

I assume you're referring to a Unicode string... in which case, just look for the presence of any character with a code between U+0600–U+06FF (1536–1791) in the string.
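
A minimal sketch of that range check in PHP, assuming UTF-8 input and PHP 7 or later (for the `"\u{...}"` string escape). The helper name `in_arabic_block` is hypothetical. The last two lines illustrate that the block range and the `\p{Arabic}` script property are close but not identical:

```php
<?php
// Hypothetical helper: true if any character lies in U+0600 to U+06FF.
function in_arabic_block(string $text): bool {
    return preg_match('/[\x{0600}-\x{06FF}]/u', $text) === 1;
}

var_dump(in_arabic_block('بسم الله'));  // bool(true)
var_dump(in_arabic_block('hello'));     // bool(false)

// U+0750 is an Arabic-script letter from the Arabic Supplement block,
// so it falls outside the range while \p{Arabic} still matches it.
$beh = "\u{0750}";
var_dump(in_arabic_block($beh));                    // bool(false)
var_dump(preg_match('/\p{Arabic}/u', $beh) === 1);  // bool(true)
```

For an RTL/LTR decision on everyday Arabic text the plain range is usually enough; the script property is the broader test.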

Rushyo
  • the first thing I thought of regex with U+0600–U+06FF, but next was to use \p{Arabic} - in regex, i think \p{Arabic} is the same with U+0600–U+06FF, but i haven`t tried it – Dmytro Krasun Aug 22 '10 at 12:34
  • I'm pretty sure it's the same, but this method's far quicker. – Rushyo Aug 23 '10 at 08:16
1

The PHP Text_LanguageDetect library is able to detect 52 languages. It's unit-tested and installable via composer and PEAR.

cweiske
0

This function checks whether the entered line or sentence is Arabic. It cleans and trims the string first, then checks it word by word, counting the Arabic and the non-Arabic words.

function isArabic($string){
        // Initializing count variables with zero
        $arabicCount = 0;
        $englishCount = 0;
        // Getting the cleanest String without any number or Brackets or Hyphen
        $noNumbers = preg_replace('/[0-9]+/', '', $string);
        $noBracketsHyphen = array('(', ')', '-');
        $clean = trim(str_replace($noBracketsHyphen , '', $noNumbers));
        // After Getting the clean string, splitting it by space to get the total entered words 
        $array = explode(" ", $clean); // $array contain the words that was entered by the user
        for ($i=0; $i < count($array) ; $i++) { // "<" rather than "<=": the last valid index is count - 1
            // Checking whether the word contains Arabic characters
            $checkLang = preg_match('/\p{Arabic}/u', $array[$i]);
            if($checkLang == 1){
                ++$arabicCount;
            } else{
                ++$englishCount;
            }
        }
        if($arabicCount >= $englishCount){
            // Return 1 means TRUE i-e Arabic
            return 1;
        } else{
            // Return 0 means FALSE i-e English
            return 0;
        }
    }
Yasir Tahir