3

This Perl binary regex found at http://www.w3.org/International/questions/qa-forms-utf-8.en.php matches UTF-8 documents without the UTF-8 BOM header:

$field =~
m/\A(
 [\x09\x0A\x0D\x20-\x7E]            # ASCII
 | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
 |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
 | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
 |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
 |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
 | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
 |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*\z/x;

I need this because I am working on a PowerShell equivalent to 'grep -I', and part of this involves detecting text encoding.

But how do I rewrite this in C# or PowerShell? Or in other words, in ".Net Regex" syntax?

EDIT: I found a question about this very same regex at http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40. The short answer seems to be that this cannot be done in .Net, since .Net does not support binary regular expressions.

kervin

4 Answers

1

Try this: (I haven't checked that it matches correctly; you can easily try it in LINQPad).

// \A and \z anchor to the absolute start and end of the input, as in the Perl original.
new Regex(@"
    \A(
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\z", RegexOptions.IgnorePatternWhitespace)

EDIT:

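Try reading your file using a StreamReader with a single-byte encoding, so that each byte becomes one character for the regex to match against. Plain ASCII will not do, because the ASCII decoder replaces every byte above 0x7F; ISO-8859-1 maps each byte to the character with the same value. (Note that I didn't actually try it.)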

SLaks
  • The Perl regex is a binary regex, so this will not work. After more research, it doesn't seem that .Net supports binary regular expressions. – kervin Jul 08 '09 at 23:09
  • You can fake "binary" regex matching by decoding the byte stream in such a way that each byte is converted to a character with the same numeric value. Just use ISO-8859-1. – Alan Moore Jul 09 '09 at 05:57
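The ISO-8859-1 trick from the comment above can be sketched roughly as follows. This is only an illustration, not something taken from the thread: the class name `Utf8RegexCheck` and the method `LooksLikeUtf8` are made up, and it assumes the whole input fits in memory as a byte array. The bytes are decoded as ISO-8859-1, so every byte becomes the character with the same numeric value, and the W3C pattern is then applied to that string.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class Utf8RegexCheck
{
    // ISO-8859-1 maps each byte 0x00-0xFF to the char with the same code point,
    // so \xNN in the pattern effectively matches the byte value NN.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // The W3C pattern from the question, anchored with \A and \z.
    private static readonly Regex Utf8Pattern = new Regex(@"
        \A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )*\z",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);

    // Returns true when every byte sequence in the input is accepted by the pattern.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        string oneCharPerByte = Latin1.GetString(bytes);
        return Utf8Pattern.IsMatch(oneCharPerByte);
    }
}
```

Calling `Utf8RegexCheck.LooksLikeUtf8(File.ReadAllBytes(path))` would then give a yes/no answer for a file. Note that the pattern, like the Perl original, also rejects ASCII control characters other than tab, LF and CR, which happens to be convenient for a 'grep -I' style text/binary test.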
1

This post at http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40 describes several workarounds.

kervin
1

The odds are pretty good that if a sequence has no invalid UTF-8 characters, it can be treated as UTF-8. Since RegExps are for text in .Net, not byte arrays, here's a non-regexp solution that should work. Personally, I'd rather use this as a fallback mechanism (e.g. mycommand -autodetect) and offer pipeline parameters that allow user-specified encodings.

        string result = String.Empty;
        // GetEncoding expects the IANA name ("utf-8"), which is WebName;
        // EncodingName is the human-readable display name and is not accepted here.
        Encoding ae = Encoding.GetEncoding(
              Encoding.UTF8.WebName,
              new EncoderExceptionFallback(),
              new DecoderExceptionFallback());
        try {
            // Throws DecoderFallbackException on the first invalid UTF-8 sequence.
            result = ae.GetString(mybytes);
        }
        catch (DecoderFallbackException)
        {
            // Revert to some sensible default. Maybe the ANSI code page for this environment?
            // This will use the substitution fallback mechanism, which usually replaces unknown characters with question marks.
            result = Encoding.Default.GetString(mybytes);
        }

If you can interact with unmanaged code, research the MLANG dll that ships with IE. It has alternate encoding autodetection methods that may be more useful.

JasonTrue
0

What specifically are you trying to do?

You should be able to use the System.Text.Encoding class.

SLaks
  • I don't see how to *detect* the encoding of a binary stream using this class. The regular expression in the question matches only if the binary stream is UTF-8 encoded. – kervin Jul 08 '09 at 20:38
  • kervin: You can try parsing the stream as UTF-8. If it fails, then it wasn't UTF-8, otherwise it was. – Joey Jul 10 '09 at 16:01
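The "parse it as UTF-8 and see whether it fails" idea from the comment above might look something like this as a yes/no test. It is a rough sketch under assumptions: the names `Utf8DecodeCheck` and `IsValidUtf8` are invented for illustration, and the input is assumed to be available as a byte array. It relies on the `UTF8Encoding` constructor that takes `throwOnInvalidBytes`, which makes the decoder throw instead of substituting replacement characters.

```csharp
using System;
using System.Text;

// Hypothetical helper; the names are illustrative only.
public static class Utf8DecodeCheck
{
    // A strict decoder: no BOM on output, and invalid byte sequences throw
    // a DecoderFallbackException rather than being replaced.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true when the whole buffer decodes as UTF-8 without errors.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            // GetCharCount walks the input and validates it without building a string.
            StrictUtf8.GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

One difference from the regex in the question is worth keeping in mind: strict decoding accepts any well-formed UTF-8, including control characters such as NUL, whereas the regex only allows tab, LF and CR below 0x20. For a 'grep -I' style check, a separate test for control bytes may therefore still be needed.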