How can I convert a complex binary Perl regular expression to C# or PowerShell?

Question

This Perl binary regex found at http://www.w3.org/International/questions/qa-forms-utf-8.en.php matches UTF-8 documents without the UTF-8 BOM header:

$field =~
m/\A(
 [\x09\x0A\x0D\x20-\x7E]            # ASCII
 | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
 |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
 | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
 |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
 |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
 | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
 |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*\z/x;

I need this because I am working on a PowerShell equivalent to 'grep -I', and part of this involves detecting text encoding.

But how do I rewrite this in C# or PowerShell? Or in other words, in ".Net Regex" syntax?

EDIT: Found this http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40 question about the same Regex of all things. The short answer seems like this can not be done with .Net since .Net does not support binary regular expressions.

This is a very simple regex. Could you explain what specific problem you have converting this? — Sinan Ünür, Jul 08 '09 at 20:24

SLaks · Answer 1 · 2009-07-08T23:58:15.157

1

Try this: (I haven't checked that it matches correctly; you can easily try it in LINQPad).

new Regex(@"
    ^(
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$", RegexOptions.IgnorePatternWhitespace)

EDIT:

Try reading your file using an ASCII StreamReader; that should do what you're looking for. (Note that I didn't actually try it)

edited Jul 08 '09 at 23:58

answered Jul 08 '09 at 21:00

SLaks

868,454
176
1,908
1,964

The Perl regex is a binary regex. So this will not work. After more research it doesn's seem that .Net supports binary regular expressions. – kervin Jul 08 '09 at 23:09
You can fake "binary" regex matching by decoding the byte stream in such a way that each byte is converted to a character with the same numeric value. Just use ISO-8859-1. – Alan Moore Jul 09 '09 at 05:57

score 1 · Answer 2 · answered Jul 08 '09 at 23:24

1

This post at http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40 describes several workarounds.

answered Jul 08 '09 at 23:24

kervin

11,672
5
42
59

score 1 · Answer 3 · answered Jul 08 '09 at 23:48

The odds are pretty good that if a sequence has no invalid UTF-8 characters, it can be treated as UTF-8. Since RegExps are for text in .Net, not byte arrays, here's a non-regexp solution that should work. Personally, I'd rather use this as a fallback mechanism (e.g. mycommand -autodetect) and offer pipeline parameters that allow user-specified encodings.

       string result=String.Empty;
        Encoding ae = Encoding.GetEncoding(
              Encoding.UTF8.EncodingName,
              new EncoderExceptionFallback(), 
              new DecoderExceptionFallback());
        try {
            result=ae.GetString(mybytes);
        }
        catch (DecoderFallbackException e)
        {
            //revert to some sensible default. Maybe the Ansi Code page for this environment?
            // This will use the substitution fallback mechanism, which usually replaces unknown characters with question marks.
            result=Encoding.Default.GetString(mybytes);
        }

If you can interact with unmanaged code, research the MLANG dll that ships with IE. It has alternate encoding autodetection methods that may be more useful.

score 0 · Answer 4 · answered Jul 08 '09 at 20:25

0

What specifically are you trying to do?

You should be able to use the System.Text.Encoding class.

answered Jul 08 '09 at 20:25

SLaks

868,454
176
1,908
1,964

I don't see how to *detect* the encoding of a binary stream using this class. The regular expression in the question matches true if the binary stream is UTF-8 encoded. – kervin Jul 08 '09 at 20:38
kervin: You can try parsing the stream as UTF-8. If it fails, then it wasn't UTF-8, otherwise it was. – Joey Jul 10 '09 at 16:01

How can I convert a complex binary Perl regular expression to C# or PowerShell?

4 Answers4