
Edit: I've added a new take on this, please see Cheating PHP integers. Any help will be much appreciated. I've had an idea to try and hack the storage of the arrays by packing the integers into unsigned bytes (I only need 8- or 16-bit integers), which should reduce the memory considerably.
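
A minimal sketch of the packing idea, to make it concrete (the class and method names are placeholders of my own, not an existing library):

<?php
// Hypothetical sketch: store unsigned 16-bit integers in one binary string
// instead of a native PHP array, trading access speed for memory.
class PackedUint16Array implements ArrayAccess
{
    private $data = '';

    public function __construct(array $ints = array()) {
        foreach ($ints as $i) {
            $this->data .= pack('n', $i); // 'n' = unsigned 16-bit, big-endian
        }
    }

    public function offsetGet($offset) {
        $unpacked = unpack('n', substr($this->data, $offset * 2, 2));
        return $unpacked[1];
    }

    public function offsetSet($offset, $value) {
        $this->data = substr_replace($this->data, pack('n', $value), $offset * 2, 2);
    }

    public function offsetExists($offset) {
        return ($offset * 2) < strlen($this->data);
    }

    public function offsetUnset($offset) {
        $this->offsetSet($offset, 0); // fixed-width storage, so just zero the slot
    }
}
?>

Each entry then costs two bytes in one shared string instead of a full PHP value, which is where the big saving would come from.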

Hi

I'm currently working on a custom charset detection library: a port of Mozilla's charset detection algorithm, using chardet (the Python port) as a helping hand. However, it is extremely memory intensive in PHP (around 30 MB of memory if I load just the Western language detection). I've optimised all I can without rewriting it from scratch to lazy-load each piece (that would reduce memory but make it a lot slower).

My question is: do you know of any LGPL PHP libraries that do charset detection? This would be purely for research, to give me a slight guiding hand in the right direction.

I already know of mb_detect_encoding, but it's far too limited and brings up far too many false positives with the text files I have (yet Python's chardet detects them perfectly).

Jase
  • @Jase You should upload your code and we could suggest ideas! – Yehonatan Mar 31 '11 at 18:33
  • Are you reinventing something that already exists in the mbstring extension? PHP's own mb_detect_encoding works well enough for me. I use it for this very task. – Dmitri Mar 31 '11 at 18:33
  • 1
    mb_detect_encoding works well if you provide the correct order or guesses. No library can detect encoding 100% accurate. You should probably learn everything you can about unicode, get to know it inside and out before you start writing any code. In short - by the nature of the unicode there is no way to accurately detect charset just be examining the string, especially short strings without providing some hints. – Dmitri Mar 31 '11 at 18:38
  • @Yehonatan - In its current form, it's almost identical to chardet for Python (it's just in the technical-exercise phase at the moment). The only difference is the ability to load and unload the checks for specific continents' languages. So a good indicator of what I've done is to look at chardet, because the variable names, class structure, etc. are the same. chardet also has the issue of using quite a bit of memory, but not nearly as much as PHP. – Jase Mar 31 '11 at 18:41
  • @Dmitri I've tried everything with mb_detect_encoding. But when there are about 20 character sets to detect, it can get confused very quickly. It confuses UTF-16 and UTF-32 quite easily in the tests I've run. – Jase Mar 31 '11 at 18:43
  • @Dmitri Is there a specific order you use that I could try out? I've written a quick script that will run each of my 300 text files through it. I get a 98% success rate with chardet but only an 82% rate the last time I tried mb_detect_encoding. – Jase Mar 31 '11 at 18:51
  • @Jase this is my detect order (a usage sketch follows these comments): $aDetectOrder = array('ASCII', 'UTF-8', 'ISO-8859-1', 'JIS', 'ISO-8859-15', 'EUC-JP', 'SJIS'); – Dmitri Mar 31 '11 at 21:43
  • Unfortunately this is far too limited for my needs. I edited my post with a new Stack Overflow question about hacking PHP integers to force 16/8-bit integers, and we've worked out a solution to get the array size down. For an array of 30,000 integers, we've got the memory down from 3.09 MB to 0.05 MB using a custom array-access class which packs the integers into one long string (using substr for access). A huge hack, but the memory savings speak for themselves :D – Jase Mar 31 '11 at 22:19
  • Take a look at this article – Lukas Gottschall Feb 24 '12 at 07:28
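
For reference, here is a minimal sketch of how Dmitri's order above plugs into mb_detect_encoding ($sText stands in for your input; the third argument enables strict checking):

<?php
$aDetectOrder = array('ASCII', 'UTF-8', 'ISO-8859-1', 'JIS', 'ISO-8859-15', 'EUC-JP', 'SJIS');
$sEncoding = mb_detect_encoding($sText, $aDetectOrder, true); // strict mode
if ($sEncoding !== false) {
    $sText = mb_convert_encoding($sText, 'UTF-8', $sEncoding); // normalise to UTF-8
}
?>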

2 Answers


I created a method which converts content to UTF-8 correctly. The hard part was figuring out what the current encoding is, so I came to this solution:

<?php
function _convert($content) {
    // Treat the content as suspect if it is not valid UTF-8, or if a
    // UTF-8 -> UTF-32 -> UTF-8 round trip does not reproduce it byte for byte.
    if (!mb_check_encoding($content, 'UTF-8')
        || $content !== mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32')) {

        // No source encoding is given, so mb_internal_encoding() is assumed.
        $content = mb_convert_encoding($content, 'UTF-8');

        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not convert to UTF-8');
        }
    }
    return $content;
}
?>

As you can see, I do a round-trip conversion to check whether the content is still the same (i.e. already valid UTF-8), and if not, convert it. Maybe you can use some of this code.
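
You could run a file through it like this, for example (the file name is just a placeholder):

<?php
// Hypothetical usage: normalise a text file to UTF-8 before processing.
$content = _convert(file_get_contents('uploaded.txt'));
?>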

powtac
  • Hi, thank you for the contribution, but my aim is to create a PHP character set detection library which covers SJIS, Big5, UTF-8/16/32, the ISO character sets and many, many more, based on context analysis. This generally means that if an HTTP response doesn't supply an encoding header, or a text file has been uploaded from a foreign country, I'll be able to make quite a good guess at the charset it was encoded with. Then I can use iconv or mb_convert_encoding to convert that character set to UTF-8, ready for easier manipulation. It will be modular so people can plug their own detectors in too. – Jase Mar 31 '11 at 23:11
  • The code you submitted could be turned into a module, but the problem is that it won't support streams. That is vital when working with fairly large files. – Jase Mar 31 '11 at 23:12

First of all, interesting project you are working on! I'm curious to see how the end product turns out.

Have you taken a look at the ICU project already?

pderaaij
  • I have taken a glimpse at the ICU project (the Java version; I'm not a Java programmer, but I can read Java), but opted for chardet because it was really easy to port. What would be the difference between the ICU project and Mozilla's/Python's chardet? Is it generally more efficient? The reason chardet is so memory-heavy is that it loads a large number of very large arrays for context analysis. – Jase Mar 31 '11 at 18:48