PHP: removing invalid utf-8 characters in XML using filter

Question

I have a large file, so I have created a stream filter for removing invalid UTF-8 characters from XML.

class ValidUTF8XMLFilter extends php_user_filter {

    protected static $pattern = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

    function filter($in, $out, &$consumed, $closing)
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = preg_replace(self::$pattern, '$1', $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

This filter removes not only characters that are invalid in XML, but also byte sequences that are invalid in UTF-8. The regex is taken from "Multilingual form encoding". The class is adapted from this answer: "How to skip invalid characters in XML file using PHP". The pattern in that answer won't work for invalid UTF-8 characters, e.g. 0x1D.

Will this filter work when an invalid byte sequence starts at the end of one buffer and ends at the beginning of the next one passed to the filter? Is this situation possible?

What are you trying to do? Are you trying to strip ill-formed UTF-8 subsequences (in general it's a bad idea, you should replace them with substitution characters, but that's another topic) or do you want to operate on a valid UTF-8 sequence but remove the characters that are illegal in XML (e.g. most C0 control codes)? — Artefacto, Nov 19 '10 at 10:46
I want to strip ill-formed UTF-8 subsequences _and_ remove the characters that are illegal in XML. — prostynick, Nov 19 '10 at 10:59
How did you get a UTF-8 file that is not a UTF-8 file? Stop right there and reconsider your givens. They do not make any sense. — tchrist, Nov 19 '10 at 23:29

score 2 · Accepted Answer · answered Nov 19 '10 at 11:21

No, I don't think it will work. It will strip valid sequences of code units that happen to be split between several buckets.
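To make the failure concrete, here is a minimal sketch that applies the question's pattern to a string once as a whole and once split into two chunks. The sample string and the split point are assumptions chosen for illustration; real bucket boundaries depend on the stream.

```php
<?php
// Pattern copied from the question.
$pattern = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

// "café" in UTF-8; the final character is the two bytes C3 A9.
$whole  = "caf\xC3\xA9";
// Hypothetical bucket boundary that falls between those two bytes.
$chunks = ["caf\xC3", "\xA9"];

// Filtering the whole string keeps the valid two-byte sequence.
$atOnce = preg_replace($pattern, '$1', $whole);

// Filtering chunk by chunk sees a lone lead byte, then a lone
// continuation byte, and removes both.
$perChunk = '';
foreach ($chunks as $chunk) {
    $perChunk .= preg_replace($pattern, '$1', $chunk);
}

var_dump($atOnce === $whole);   // bool(true)
var_dump($perChunk === 'caf');  // bool(true): the valid "é" was lost
```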

The filter should not consume a potentially incomplete sequence at the end of the data it receives (and, if necessary, it should pass nothing on and return PSFS_FEED_ME).
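One way this could look is sketched below. It is not a tested production filter: the class name ChunkSafeUTF8XMLFilter, the helper incompleteTailLength(), the filter name and the private php://memory stream used to create buckets are all assumptions made for this illustration. The idea is to hold back up to three trailing bytes that look like the start of a multi-byte sequence and prepend them to the next call's data.

```php
<?php
class ChunkSafeUTF8XMLFilter extends php_user_filter
{
    // Pattern copied from the question.
    const PATTERN = '/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./x';

    private $carry = '';  // bytes held back from the previous call
    private $scratch;     // assumption: helper stream used only to create buckets

    public function onCreate(): bool
    {
        $this->scratch = fopen('php://memory', 'r+');
        return $this->scratch !== false;
    }

    public function onClose(): void
    {
        if (is_resource($this->scratch)) {
            fclose($this->scratch);
        }
    }

    public function filter($in, $out, &$consumed, $closing): int
    {
        $data = $this->carry;
        $this->carry = '';
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $data .= $bucket->data;
        }
        if (!$closing) {
            // Keep a possibly incomplete trailing sequence for the next call.
            $keep = strlen($data) - self::incompleteTailLength($data);
            $this->carry = (string) substr($data, $keep);
            $data = (string) substr($data, 0, $keep);
        }
        $clean = (string) preg_replace(self::PATTERN, '$1', $data);
        if ($clean === '') {
            return PSFS_FEED_ME; // nothing to pass on yet
        }
        stream_bucket_append($out, stream_bucket_new($this->scratch, $clean));
        return PSFS_PASS_ON;
    }

    // Number of trailing bytes that form the start of a multi-byte sequence
    // whose remaining bytes have not arrived yet (0 if there are none).
    private static function incompleteTailLength(string $data): int
    {
        $len = strlen($data);
        for ($k = 1; $k <= 3 && $k <= $len; $k++) {
            $byte = ord($data[$len - $k]);
            if ($byte < 0x80) {
                return 0;                       // ASCII: nothing pending
            }
            if ($byte >= 0xC0) {                // lead byte, or a byte never valid
                if ($byte >= 0xC2 && $byte <= 0xDF) { $need = 2; }
                elseif ($byte >= 0xE0 && $byte <= 0xEF) { $need = 3; }
                elseif ($byte >= 0xF0 && $byte <= 0xF4) { $need = 4; }
                else { return 0; }              // C0, C1, F5..FF: let the regex drop it
                return $k < $need ? $k : 0;
            }
            // 0x80..0xBF: continuation byte, keep scanning backwards
        }
        return 0;
    }
}

// Usage sketch: two writes that split "é" (C3 A9) across calls.
stream_filter_register('valid_utf8_xml', 'ChunkSafeUTF8XMLFilter');
$fp = fopen('php://memory', 'w+');
$filter = stream_filter_append($fp, 'valid_utf8_xml', STREAM_FILTER_WRITE);
fwrite($fp, "caf\xC3");
fwrite($fp, "\xA9 ok\x1D!");
stream_filter_remove($filter);
rewind($fp);
$result = stream_get_contents($fp); // "café ok!": the split "é" survives, 0x1D is removed
```

Bytes still held back when the stream closes are by construction a truncated sequence, so the pattern removes them on the closing call.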

Artefacto

96,375
17
202
225

The problem is that it's hard to come up with a proper regex to detect that situation. Second, you said that it will strip valid sequences of code units. Is it possible that it won't strip illegal sequences? – prostynick Nov 19 '10 at 12:07

@pro No, it's not possible that it won't strip illegal sequences, because an illegal sequence, when separated, never becomes a legal sequence. The reason for that is that the Unicode specification requires that valid lead bytes (or bytes in the ASCII range) never be considered part of an illegal sequence. – Artefacto Nov 19 '10 at 14:11
@pro I wouldn't recommend you use a regex. Table 3-7 of the Unicode spec will help you here: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf – Artefacto Nov 19 '10 at 14:13
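
A regex-free check along the lines of Table 3-7 could look roughly like the sketch below. The function names wellFormedUtf8Length() and stripIllFormedUtf8() are hypothetical, and the sketch only covers UTF-8 well-formedness; characters that are well-formed UTF-8 but illegal in XML (such as 0x1D) would still need a separate check.

```php
<?php
// Length of the well-formed UTF-8 sequence starting at offset $i,
// or 0 if the bytes there are ill-formed or truncated.
function wellFormedUtf8Length(string $s, int $i): int
{
    // Rows of Table 3-7 (well-formed UTF-8 byte sequences):
    // [lead min, lead max, second byte min, second byte max, total length]
    static $rows = [
        [0x00, 0x7F, 0x00, 0x00, 1],
        [0xC2, 0xDF, 0x80, 0xBF, 2],
        [0xE0, 0xE0, 0xA0, 0xBF, 3],
        [0xE1, 0xEC, 0x80, 0xBF, 3],
        [0xED, 0xED, 0x80, 0x9F, 3],
        [0xEE, 0xEF, 0x80, 0xBF, 3],
        [0xF0, 0xF0, 0x90, 0xBF, 4],
        [0xF1, 0xF3, 0x80, 0xBF, 4],
        [0xF4, 0xF4, 0x80, 0x8F, 4],
    ];
    $n = strlen($s);
    if ($i >= $n) {
        return 0;
    }
    $lead = ord($s[$i]);
    foreach ($rows as [$lo, $hi, $lo2, $hi2, $len]) {
        if ($lead < $lo || $lead > $hi) {
            continue;
        }
        if ($i + $len > $n) {
            return 0; // sequence is cut off at the end of the string
        }
        for ($k = 1; $k < $len; $k++) {
            $b = ord($s[$i + $k]);
            // Only the second byte has a restricted range; later
            // trailing bytes are always 80..BF.
            $min = $k === 1 ? $lo2 : 0x80;
            $max = $k === 1 ? $hi2 : 0xBF;
            if ($b < $min || $b > $max) {
                return 0;
            }
        }
        return $len;
    }
    return 0; // 80..C1 and F5..FF never start a well-formed sequence
}

// Drops every byte that is not part of a well-formed sequence.
function stripIllFormedUtf8(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n;) {
        $len = wellFormedUtf8Length($s, $i);
        if ($len === 0) {
            $i++; // skip one offending byte and resynchronise
            continue;
        }
        $out .= substr($s, $i, $len);
        $i += $len;
    }
    return $out;
}
```

Like the regex, this has to be run on data that does not end in the middle of a sequence, so a stream filter would still need to hold back an incomplete tail between calls.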
