
I have a long string (5000+ chars) and I need to check whether it is in English.

After a brief web search I found several possible solutions:

  • using PEAR's Text_LanguageDetect (it looks attractive, but I still avoid solutions whose inner workings I don't understand)
  • checking letter frequencies (I made a function below, with some comments)
  • checking the string for national characters (like č, ß and so on)
  • checking the string for marker words like 'is' or 'the'
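For the third option, a minimal sketch could simply test whether the string contains any letter outside the unaccented a–z range. This is illustrative only: English text can still contain loanwords like "café", so it is a weak signal on its own.

```php
<?php
// Rough check for the third approach: does the string contain any
// letter that is not a plain ASCII letter? \p{L} matches any Unicode
// letter; the lookahead excludes a-z / A-Z.
function has_non_english_letters($str) {
    return preg_match('/(?![a-zA-Z])\p{L}/u', $str) === 1;
}
```

A string like "Straße" or "číslo" triggers the check, while pure ASCII English text does not.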

So the function is the following:

function is_english($str){
    // Most frequent English letters and their expected frequencies (%)
    $chars = array(
        array('e',12.702),
        array('t', 9.056),
        array('a', 8.167),
        array('o', 7.507),
        array('i', 6.966),
        array('n', 6.749),
        array('s', 6.327),
        array('h', 6.094),
        array('r', 5.987),
    );

    $str = strtolower($str);
    $sum = 0;
    foreach($chars as $key=>$char){
        $i = substr_count($str,$char[0]);
        $i = 100*$i/strlen($str);   // Normalize to a percentage of the string
        $i = $i/$char[1];           // Ratio of observed to expected frequency
        $chars[$key][2] = $i;       // Store the ratio for the variance pass below
        $sum += $i;
    }
    $avg = $sum/count($chars);

    // Mean squared deviation of the ratios from their average
    $value = 0;
    foreach($chars as $char)
        $value += pow($char[2]-$avg,2);

    // Average value
    $value = $value / count($chars);
    return $value;
}

Generally this function estimates the character frequencies and compares them with the given pattern. The closer the frequencies are to the pattern, the closer the result should be to 0.

Unfortunately it does not work very well: roughly, I can treat results of 0.05 and lower as English and anything higher as not English. But many English strings get high values, and many foreign ones (in my case mostly German) get low values.

I can't implement the third solution yet, as I wasn't able to find any comprehensive character set of foreign-language markers.

The fourth looks attractive, but I cannot figure out which markers are best to use.

Any thoughts?

PS: After some discussion, zod proposed that this question is a duplicate of Regular expression to match non-English characters?, which answers it only in part. So I'd like to keep this question independent.

Vlada Katlinskaya
  • I think what you want to do in this case is try to guess what language the string is in, and assume it's English if English gets the highest score. Being 80% sure it's English is no good if you're 90% sure it's German. Understand? – alzee Jan 19 '17 at 21:17
  • why regular expression check for a-z cannot be used ? – zod Jan 19 '17 at 21:32
  • @user3137702, sorry, I don't :( – Vlada Katlinskaya Jan 19 '17 at 21:32
  • @zod Many languages have a-z in their alphabet – Vlada Katlinskaya Jan 19 '17 at 21:33
  • Possible duplicate of [Regular expression to match non-English characters?](http://stackoverflow.com/questions/150033/regular-expression-to-match-non-english-characters) – zod Jan 19 '17 at 21:36
  • @VladaKatlinskaya what you need to do is try to detect what language the string is in -- not if the string is English or not. Once you've guessed at the language, if the guess was "English", then proceed. I hope that is clearer. Simply asking "is this English Y/N?" is a recipe for failure. – alzee Jan 19 '17 at 21:36
  • @zod, you've flagged my question as a possible duplicate, but the answer you've linked addresses only one fourth of the question. I think that approach has its drawbacks, so could you please remove the *duplicate* flag while keeping the link to that answer? – Vlada Katlinskaya Jan 20 '17 at 08:10
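The suggestion in the comments, guessing which language the string is in rather than asking "is this English, yes or no?", could be sketched as follows. The letter frequencies below are approximate published values for English and German, included purely for illustration:

```php
<?php
// Score a string against several language frequency profiles and
// pick the closest one. Profiles are approximate letter frequencies
// (%), illustrative only -- real detectors use much richer models.
$profiles = array(
    'english' => array('e' => 12.7, 't' => 9.1, 'a' => 8.2, 'o' => 7.5,
                       'i' => 7.0,  'n' => 6.7, 's' => 6.3, 'h' => 6.1, 'r' => 6.0),
    'german'  => array('e' => 16.4, 'n' => 9.8, 's' => 7.3, 'r' => 7.0,
                       'i' => 6.6,  'a' => 6.5, 't' => 6.2, 'd' => 5.1, 'h' => 4.6),
);

// Mean squared deviation between observed and expected frequencies.
function profile_distance($str, array $profile) {
    $str = strtolower($str);
    $len = max(1, strlen($str));
    $dist = 0.0;
    foreach ($profile as $letter => $expected) {
        $observed = 100 * substr_count($str, $letter) / $len;
        $dist += pow($observed - $expected, 2);
    }
    return $dist / count($profile);
}

// The language whose profile is closest wins.
function guess_language($str, array $profiles) {
    $best = null;
    $bestDist = INF;
    foreach ($profiles as $lang => $profile) {
        $d = profile_distance($str, $profile);
        if ($d < $bestDist) { $bestDist = $d; $best = $lang; }
    }
    return $best;
}
```

On short strings the frequencies are noisy, so this only becomes reliable at lengths like the 5000+ characters mentioned in the question.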

3 Answers


I think the fourth solution might be your best bet, but I would expand it to include a wider dictionary.

You can find some comprehensive lists at: https://en.wikipedia.org/wiki/Most_common_words_in_English

With your current implementation, you will suffer some setbacks because many languages use the standard Latin alphabet. Even languages that go beyond the standard Latin alphabet typically use primarily "English-compliant characters," so to speak. For example, the sentence "Ich bin lustig" is German, but uses only Latin alphabetic characters. Likewise, "Jeg er glad" is Danish, but uses only Latin alphabetic characters. Of course, in a string of 5000+ characters, you will probably see some non-Latin characters, but that is not guaranteed. Additionally, by focusing solely on character frequency, you might find that foreign languages which utilize the Latin alphabet have similar character occurrence frequencies, thus rendering your existing solution ineffective.

By using an English dictionary to find occurrences of English words, you can look over a string and determine exactly how many of its words are English, and from there calculate the fraction of the words that are English (with a higher percentage indicating the string is probably English).

The following is a potential solution:

<?php
$testString = "Some long string of text that you would like to test.";

// Words from: https://en.wikipedia.org/wiki/Most_common_words_in_English
$common_english_words = array('time', 'person', 'year', 'way', 'day', 'thing', 'man', 'world', 'life', 'hand', 'part', 'child', 'eye', 'woman', 'place', 'work', 'week', 'case', 'point', 'government', 'company', 'number', 'group', 'problem', 'fact', 'be', 'have', 'do', 'say', 'get', 'make', 'go', 'know', 'take', 'see', 'come', 'think', 'look', 'want', 'give', 'use', 'find', 'tell', 'ask', 'seem', 'feel', 'try', 'leave', 'call', 'good', 'new', 'first', 'last', 'long', 'great', 'little', 'own', 'other', 'old', 'right', 'big', 'high', 'different', 'small', 'large', 'next', 'early', 'young', 'important', 'few', 'public', 'bad', 'same', 'able', 'to', 'of', 'in', 'for', 'on', 'with', 'at', 'by', 'from', 'up', 'about', 'into', 'over', 'after', 'beneath', 'under', 'above', 'the', 'and', 'a', 'that', 'i', 'it', 'not', 'he', 'as', 'you', 'this', 'but', 'his', 'they', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all', 'would', 'there', 'their', 'I', 'we', 'what', 'so', 'out', 'if', 'who', 'which', 'me', 'when', 'can', 'like', 'no', 'just', 'him', 'people', 'your', 'some', 'could', 'them', 'than', 'then', 'now', 'only', 'its', 'also', 'back', 'two', 'how', 'our', 'well', 'even', 'because', 'any', 'these', 'most', 'us');

/* you might also consider replacing "'s" with ' ', because 's is common in English
   as a contraction and simply removing the single quote could throw off the frequency. */
$transformedTest = preg_replace('@\s+@', ' ', preg_replace("@[^a-zA-Z'\s]@", ' ', strtolower($testString)));

$splitTest = explode(' ', $transformedTest);

$matchCount = 0;
for($i=0;$i<count($splitTest);$i++){
    if(in_array($splitTest[$i], $common_english_words))
        $matchCount++;
}

echo "raw count: $matchCount\n<br>\nPercent: " . ($matchCount/count($splitTest))*100 . "%\n<br>\n";
// Fraction of the string's words that are common English words
if(($matchCount/count($splitTest)) > 0.5){
    echo "More than half of the words in the test string are common English words. Text is likely English.";
}else{
    echo "Text is likely a foreign language.";
}
?>

You can see an example here which includes two sample strings to test (one which is German, and one which is English): https://ideone.com/lfYcs2

In the IDEOne code, running it on the English string gives a far higher match rate against the common English words, while the German string matches only a few percent of them.

Spencer D
  • Thanks for the ideas. Regarding the frequencies of non-English Latin-based strings: if you check the Wikipedia page I linked in my question, the frequencies are quite different even for Latin characters. So I think this approach should work for long strings. But I could be making mistakes in the calculations, or the strings may need to be sanitized. I don't know, actually. – Vlada Katlinskaya Jan 19 '17 at 21:38
  • @VladaKatlinskaya, interesting. I had not read the wikipedia article you linked to, but I assumed the frequencies might be similar for some languages. Thanks for pointing that out. I will be adding a solution to this momentarily. – Spencer D Jan 19 '17 at 22:10

This problem is called language detection and is not trivial to solve with a single function. I suggest you use LanguageDetector from GitHub.

Jeff

I would go with the fourth solution, and also search for markers of languages other than English. For example, if you find "the", there is a high probability the text is English. If you find "el" or "la", the probability is high for Spanish. And if you find "der", "die" and "das", it is very probable that it is German.
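That idea could be sketched like this; the marker lists are tiny and purely illustrative, and a real implementation would use much longer function-word lists per language:

```php
<?php
// Count hits of very common function words per language and return
// the language with the most matches. Marker lists are illustrative.
function guess_by_markers($str) {
    $markers = array(
        'english' => array('the', 'and', 'is', 'of'),
        'german'  => array('der', 'die', 'das', 'und'),
        'spanish' => array('el', 'la', 'los', 'que'),
    );
    // Split on anything that is not a letter; keep duplicates so
    // repeated markers count multiple times.
    $words = preg_split('/[^\p{L}]+/u', mb_strtolower($str), -1, PREG_SPLIT_NO_EMPTY);
    $scores = array();
    foreach ($markers as $lang => $list) {
        $scores[$lang] = count(array_intersect($words, $list));
    }
    arsort($scores);          // sort descending, keep language keys
    return key($scores);      // language with the most marker hits
}
```

Short function words like these are frequent in any real text, so even a few hundred characters usually produce a clear winner.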

Carlos R