I have a fairly long string (5000+ characters), and I need to check whether it is in English.
After a brief web search I found several approaches:
- using PEAR Text_LanguageDetect (it looks attractive, but I prefer to avoid solutions whose inner workings I don't understand)
- checking letter frequencies (I wrote the function below, with some comments)
- checking the string for national characters (such as č, ß and so on)
- checking the string for marker words such as 'is', 'the' and the like
Here is the function:
function is_english($str){
    // Frequencies (in percent) of the most used English letters
    $chars = array(
        array('e',12.702),
        array('t', 9.056),
        array('a', 8.167),
        array('o', 7.507),
        array('i', 6.966),
        array('n', 6.749),
        array('s', 6.327),
        array('h', 6.094),
        array('r', 5.987),
    );
    $str = strtolower($str);
    $sum = 0;
    foreach($chars as $key=>$char){
        $i = substr_count($str,$char[0]);
        $i = 100*$i/strlen($str); // Normalization: observed frequency in percent
        $i = $i/$char[1];         // Ratio of observed to expected frequency
        $sum += $i;
        $chars[$key][2] = $i;     // Keep the ratio for the second pass below
    }
    $avg = $sum/count($chars);
    // Mean squared deviation of the ratios from their average
    $value = 0;
    foreach($chars as $char)
        $value += pow($char[2]-$avg,2);
    // Average value
    $value = $value / count($chars);
    return $value;
}
In general, this function estimates the character frequencies and compares them with the given pattern. The closer the frequencies are to the pattern, the closer the result should be to 0.
Unfortunately it does not work very well: for the most part I can treat results of 0.05 and lower as English and higher results as not English. But many English strings get high values, and many foreign ones (in my case mostly German) get low values.
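To make the scoring concrete, here is a stripped-down sketch of the same calculation, using a made-up two-letter table rather than the real English frequencies. The helper names `frequency_ratios` and `ratio_spread` are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical helper: ratio of observed to expected frequency per letter.
// $table maps a letter to its expected frequency in percent.
function frequency_ratios(string $str, array $table): array {
    $str = strtolower($str);
    $len = strlen($str);
    $ratios = array();
    foreach ($table as $letter => $expected) {
        // Observed frequency in percent (0 for an empty string)
        $observed = $len > 0 ? 100 * substr_count($str, (string)$letter) / $len : 0.0;
        $ratios[$letter] = $observed / $expected;
    }
    return $ratios;
}

// Hypothetical helper: mean squared deviation of the ratios from their own average.
function ratio_spread(array $ratios): float {
    if (count($ratios) === 0) {
        return 0.0;
    }
    $avg = array_sum($ratios) / count($ratios);
    $sum = 0.0;
    foreach ($ratios as $r) {
        $sum += ($r - $avg) ** 2;
    }
    return $sum / count($ratios);
}

// Toy table (made-up numbers): 'e' and 't' are each expected to be 50% of the text.
$toy = array('e' => 50.0, 't' => 50.0);
$balanced = ratio_spread(frequency_ratios('eett', $toy));
$skewed   = ratio_spread(frequency_ratios('eeee', $toy));
```

With this toy table, 'eett' gives a ratio of 1.0 for both letters and a spread of 0, while 'eeee' gives ratios of 2.0 and 0.0 and a spread of 1. Note that the spread is measured around the average ratio, not around 1.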
I can't implement the third solution yet, as I haven't been able to find a comprehensive set of characters that act as foreign-language markers.
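For what it's worth, the rough shape I have in mind for the third approach is something like the sketch below. It assumes UTF-8 input and, instead of a curated character set, simply measures the share of letters that fall outside plain ASCII a-z. The function name `non_ascii_letter_ratio` is hypothetical, and any cut-off applied to its result would be a guess on my part.

```php
<?php
// Hypothetical helper: share of letters in $str that are not plain ASCII a-z.
// Assumes $str is valid UTF-8.
function non_ascii_letter_ratio(string $str): float {
    // Count every Unicode letter; returns false on invalid UTF-8
    $letters = preg_match_all('/\p{L}/u', $str);
    if (!$letters) {
        return 0.0; // no letters at all, or the input could not be parsed
    }
    // Count ASCII letters only (no /u flag, so only single-byte a-z and A-Z match)
    $ascii = preg_match_all('/[a-z]/i', $str);
    return ($letters - $ascii) / $letters;
}
```

English text can still score above zero because of words like "café", and a German text with few umlauts would score low, so this could only ever be one signal among several.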
The fourth looks attractive, but I can't figure out which markers would be best to use.
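A minimal sketch of the fourth approach, as I currently picture it, is below. The marker list is only a handful of common English function words that I picked for illustration, not a vetted set, and the function name `marker_word_ratio` is hypothetical. UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: fraction of the words in $str that appear in $markers.
// Assumes $str is valid UTF-8 and $markers contains lowercase words.
function marker_word_ratio(string $str, array $markers): float {
    // Split into runs of Unicode letters; returns 0 or false if nothing matches
    if (!preg_match_all('/\p{L}+/u', strtolower($str), $m)) {
        return 0.0;
    }
    $words = $m[0];
    $lookup = array_flip($markers); // word => index, for fast isset() checks
    $hits = 0;
    foreach ($words as $w) {
        if (isset($lookup[$w])) {
            $hits++;
        }
    }
    return $hits / count($words);
}

// Illustrative marker list (an assumption, chosen by hand)
$english_markers = array('the', 'of', 'and', 'to', 'is', 'in', 'that', 'it');

$en = marker_word_ratio('The cat is in the house', $english_markers);  // 4 of 6 words
$de = marker_word_ratio('Der Hund ist in dem Haus', $english_markers); // 1 of 6 words
```

Even in this tiny example 'in' counts as a hit for the German sentence, which is exactly the kind of overlap that makes choosing markers hard.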
Any thoughts?
PS After some discussion, Zod suggested that this question is a duplicate of the question "Regular expression to match non-English characters?", which answers mine only in part. So I'd like to keep this question independent.