4

I have made a straightforward implementation conforming to the W3C specification. Here I simply hold the different sets of legal characters (legal start chars differ from following chars) and use string.Contains. But the sets of legal characters are surprisingly (to me anyway) large, and checking the candidate string one character at a time becomes a tad expensive.

This isn't really an issue at the moment, as I need to validate a few strings once (taking milliseconds) per execution of a batch (taking seconds, minutes or even hours), but I'm curious to know what others will suggest.

Here's my straightforward implementation:

using System;
using System.Text;
using Project.Common; // Guard

namespace Project.Common.XmlUtilities
{
    static public class XmlUtil
    {
        static public bool IsLegalElementName(string localName)
        {
            Guard.ArgumentNotNull(localName, "localName");
            if (localName == "") 
                return false;

            if (NameStartChars.IndexOf(localName[0]) == -1)
                return false;

            for (int i = 1; i < localName.Length; i++)
                if (NameChars.IndexOf(localName[i]) == -1)
                    return false;

            return true;
        }


        // See W3 spec at http://www.w3.org/TR/REC-xml/#NT-NameStartChar.
        static public readonly string NameStartChars = AZ.ToLower() + AZ + ":_" + GetStringFromCharRanges(0xC0, 0xD6, 0xD8, 0xF6, 0xF8, 0x2FF, 0x370, 0x37D, 0x37F, 0x1FFF, 0x200C, 0x200D, 0x2070, 0x218F, 0x2C00, 0x2FEF, 0x3001, 0xD7FF, 0xF900, 0xFDCF, 0xFDF0, 0xFFFD, 0x10000, 0xEFFFF);

        // See W3 spec at http://www.w3.org/TR/REC-xml/#NT-NameChar.
        static public readonly string NameChars = NameStartChars + "-.0123456789" + char.ConvertFromUtf32(0xB7) + GetStringFromCharRanges(0x0300, 0x036F, 0x203F, 0x2040);

        public const string AZ = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

        // Hacky but convenient: alternating low-high unicode points specifies multiple ranges, e.g. 0-5 and 10-12 would be 0, 5, 10, 12.
        static string GetStringFromCharRanges(params int[] lowHigh)
        {
            var sb = new StringBuilder();
            for (int i = 0; i < lowHigh.Length; i += 2)
            {
                int low = lowHigh[i];
                int high = lowHigh[i + 1];
                for (int ci = low; ci <= high; ci++) // the spec's ranges are inclusive, so include 'high'
                    sb.Append(char.ConvertFromUtf32(ci));
            }
            return sb.ToString();
        }
    }
}

Although I haven't bothered to build it, I reckon that creating a sorted list once, in a type initializer, and binary searching it (instead of searching linearly with string.Contains) to check each character would strike a good balance of space, time and complexity. But perhaps you have other (better!) ideas?
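The binary-search idea could be sketched roughly as follows. This is only an illustration, not a tested replacement: the class name XmlNameRanges and the method names are made up for the example, it stores the inclusive range boundaries from the spec rather than every individual character, and (unlike the implementation above) it returns false for null instead of throwing.

```csharp
using System;

// Hypothetical helper, not part of the original post: validates a name
// against the XML 1.0 NameStartChar/NameChar productions by binary
// searching a small table of inclusive code point ranges.
public static class XmlNameRanges
{
    // Inclusive [low, high] pairs in ascending order (NameStartChar).
    static readonly int[] StartRanges =
    {
        ':', ':', 'A', 'Z', '_', '_', 'a', 'z',
        0xC0, 0xD6, 0xD8, 0xF6, 0xF8, 0x2FF, 0x370, 0x37D, 0x37F, 0x1FFF,
        0x200C, 0x200D, 0x2070, 0x218F, 0x2C00, 0x2FEF, 0x3001, 0xD7FF,
        0xF900, 0xFDCF, 0xFDF0, 0xFFFD, 0x10000, 0xEFFFF
    };

    // Extra ranges that NameChar allows after the first character.
    static readonly int[] ExtraRanges =
    {
        '-', '.', '0', '9', 0xB7, 0xB7, 0x300, 0x36F, 0x203F, 0x2040
    };

    // Binary search over the range pairs: O(log n) per code point.
    static bool InRanges(int[] ranges, int codePoint)
    {
        int lo = 0, hi = ranges.Length / 2 - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (codePoint < ranges[2 * mid]) hi = mid - 1;
            else if (codePoint > ranges[2 * mid + 1]) lo = mid + 1;
            else return true;
        }
        return false;
    }

    public static bool IsLegalName(string name)
    {
        // Assumption: null and empty are simply treated as illegal.
        if (string.IsNullOrEmpty(name)) return false;

        int i = 0;
        while (i < name.Length)
        {
            // Combine surrogate pairs so code points above 0xFFFF are
            // checked as a whole; a lone surrogate matches no range.
            bool pair = char.IsSurrogatePair(name, i);
            int cp = pair ? char.ConvertToUtf32(name, i) : name[i];

            bool ok = InRanges(StartRanges, cp)
                      || (i > 0 && InRanges(ExtraRanges, cp));
            if (!ok) return false;

            i += pair ? 2 : 1;
        }
        return true;
    }
}
```

With this layout the tables hold a few dozen integers instead of a string containing every legal character, and each character costs at most a handful of comparisons. For example, XmlNameRanges.IsLegalName("_a-1.b") is accepted, while "1abc" and "a b" are rejected.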

The Dag
  • I lied: I don't use string.Contains, but string.IndexOf. As I discovered when cleaning up my using list, string.Contains(char) doesn't exist - and I was using LINQ without intending to! Extension methods are stupid. :) – The Dag Jun 05 '13 at 09:52
  • try using XmlWriter : http://www.dotnetperls.com/xmlwriter – Mzf Jun 05 '13 at 09:57

2 Answers

5

XmlConvert has a static method, string VerifyName(string name), but it throws an exception for invalid names.

I would still prefer to use this:

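using System.Xml;

static bool IsLegalElementName(string name)
{
    try
    {
        XmlConvert.VerifyName(name); // throws if the name is not a valid XML name
        return true;
    }
    catch
    {
        return false;
    }
}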
H H
  • Thanks, I wasn't aware of this one. I'll compare the speed of this one (so long as names are valid; I don't really care if it takes long when names are not valid as this should be exceptional). But I was also interested in the problem itself and a good algorithm/data structure for solving it. I wrote "sorted list" but should have said sorted char[], not to waste space. – The Dag Jun 05 '13 at 12:26
  • 3
    Marked this as answer, although I do wish Microsoft had exposed the logic as a method simply returning a bool (XmlElement.IsLegalName perhaps). A void exception-throwing variant like what we got would be a handy supplement, though I'm not sure XmlConvert is the most logical class to host the logic. :) – The Dag Jun 21 '13 at 07:45
0

I would go for a regex, or simply try to create an XElement with the name in question (if there's an exception, the name is invalid...)
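A minimal sketch of the XElement approach might look like this. The class name XElementNameCheck is made up for the example, and catching both XmlException and ArgumentException is my assumption about which exceptions matter here (invalid characters versus null or empty input).

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical wrapper, for illustration only.
public static class XElementNameCheck
{
    public static bool IsLegalElementName(string name)
    {
        try
        {
            // The string is implicitly converted to an XName, which
            // validates it; the element itself is discarded.
            new XElement(name);
            return true;
        }
        catch (XmlException)
        {
            return false; // name contains characters that are not allowed
        }
        catch (ArgumentException)
        {
            return false; // null or empty name
        }
    }
}
```

Note that the string is parsed as an expanded name, so input of the form "{namespace}localName" is accepted as well, which may or may not be what you want.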

JeffRSon
  • 1
    I think regex would be the worst of all worlds - the expression would be awful, performance would be terrible, and so on really. I like the idea of reusing the knowledge in the FW. For scenarios where performance doesn't matter when the test fails, the cost of catching the exception doesn't matter. – The Dag Jun 05 '13 at 12:17
  • Not quite - you may find some discussion e.g. http://stackoverflow.com/questions/2519845/how-to-check-if-string-is-a-valid-xml-element-name - however, as I already wrote, XElement will check the name for you as well. – JeffRSon Jun 05 '13 at 12:44
  • That RegEx is simply not *correct*. It seems to have a strong western bias, since it excludes the 40,000 or so legal characters from non-western languages. In fact, the vast majority of legal names of any given length would not be matched by that RegEx. Instantiating an XElement should work but surely includes considerable overhead beyond verifying the legality of the name. – The Dag Jun 21 '13 at 07:40