I need a method in C#/.NET that can take any string, possibly containing lots of unusual characters, and produce a valid subdomain that is as close to the input as possible.

Example:

  • Input: Øyvind & René's Company Ltd.
  • Output: oyvindrenescompanyltd.example.com

Does anyone know of a .NET library that can help me do this conversion?

It's easy to just remove all characters not valid in a subdomain, but if I have to replace a lot of characters (ø -> o, é -> e) then it's not trivial to capture all the variations.

  • Have you considered internationalized domain names, or simply asking the user? Once you want to support languages that do not use Latin-based scripts, automated conversion becomes really difficult. – dtb Sep 02 '13 at 15:55
  • Related: [How do I remove diacritics (accents) from a string in .NET?](http://stackoverflow.com/questions/249087) – dtb Sep 02 '13 at 15:58
  • Have a look at [Unidecode](http://www.developerfusion.com/project/98702/unidecode-sharp/), which is a port of the Perl Unidecode module that converts non-Latin characters to Latin ones. It might be a good idea to strip any remaining invalid characters afterwards anyway, just in case. – Cameron Sep 02 '13 at 15:59
  • Why is it less trivial to replace than to remove? If you know the set of "not valid" characters, you must also know which valid values you'd like to replace them with... Am I missing something? – David Tansey Sep 02 '13 at 15:59
  • @David: Yes, just because you know a character is not valid doesn't mean you know what it should be replaced with. For example, 'é' should become 'e', but there are thousands of possible diacritics... building a lookup table is error-prone and time-consuming, and would likely miss characters -- even if it didn't, it would be out of date at some point. Not to mention that there are several ways of representing a diacritic character in Unicode. – Cameron Sep 02 '13 at 16:04

2 Answers

> but if I have to replace a lot of characters (ø -> o, é -> e) then it's not trivial to capture all the variations.

Actually it's pretty easy to remove diacritic characters (accents, etc), by taking advantage of Unicode normalization:

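    using System;
    using System.Globalization;
    using System.Text;

    // Extension methods must live in a static class; the class name is arbitrary
    public static class StringExtensions
    {
        public static string RemoveDiacritics(this string s)
        {
            if (s == null) throw new ArgumentNullException("s");
            // FormD splits each accented letter into a base letter plus combining marks
            string formD = s.Normalize(NormalizationForm.FormD);
            char[] chars = new char[formD.Length];
            int count = 0;
            foreach (char c in formD)
            {
                // Keep everything except the combining marks
                if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                {
                    chars[count++] = c;
                }
            }
            string noDiacriticsFormD = new string(chars, 0, count);
            // Recompose whatever is left
            return noDiacriticsFormD.Normalize(NormalizationForm.FormC);
        }
    }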

(Note that this works only on the full .NET Framework, not on Windows Phone, WinRT or Silverlight.)
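
To make the normalization trick easier to try out, here is a minimal, self-contained sketch of the same idea. The class name `DiacriticsDemo` and the method name `Strip` are made up for this illustration; only the BCL calls (`String.Normalize`, `CharUnicodeInfo.GetUnicodeCategory`) come from the answer above.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical demo class, not part of any library
public static class DiacriticsDemo
{
    // Same technique as RemoveDiacritics: decompose, drop combining marks, recompose
    public static string Strip(string s)
    {
        if (s == null) throw new ArgumentNullException("s");
        var sb = new StringBuilder(s.Length);
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                sb.Append(c);
            }
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

Both `Strip("Ren\u00E9")` (precomposed é) and `Strip("Rene\u0301")` (e followed by a combining acute accent) return `"Rene"`, so it does not matter which of the two Unicode representations the input uses. `Strip("\u00D8yvind")` returns the string unchanged, because Ø has no Unicode decomposition into a base letter plus a mark.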

Thomas Levesque
  • Thanks, this is a good solution for removing diacritics (é -> e), but it doesn't cover the other required conversions like ø -> o. I have set Cameron's solution using Unidecode as the accepted answer, as this covers both diacritics and other character conversions, and so I believe that is a more complete solution. – Jon Anders Amundsen Sep 03 '13 at 11:05

You can use Unidecode, a port of the Perl module of the same name (or you can use the RemoveDiacritics method posted by Thomas Levesque):

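    using System;
    using System.Text.RegularExpressions;
    using BinaryAnalysis.UnidecodeSharp;

    public static string MakeSubdomain(string rawSubdomain, string baseDomain)
    {
        if (rawSubdomain == null) {
            throw new ArgumentNullException("rawSubdomain");
        }
        if (string.IsNullOrEmpty(baseDomain)) {
            throw new ArgumentException("Invalid base domain");
        }
        // Leave room for at least a one-character label plus the separating dot
        if (baseDomain.Length + 2 > 253) {
            throw new ArgumentException("Base domain is already too long for a subdomain");
        }

        // Transliterate to ASCII, then keep only letters, digits and hyphens
        var sub = rawSubdomain.Unidecode();
        sub = Regex.Replace(sub, @"[^a-zA-Z0-9-]+", "");
        sub = sub.ToLowerInvariant();

        // A single label may be at most 63 characters, the full name at most 253
        int maxLength = Math.Min(63, 252 - baseDomain.Length);
        if (sub.Length > maxLength) {
            sub = sub.Substring(0, maxLength);
        }
        // Trim hyphens after truncating, so the label never starts or ends with one
        sub = sub.Trim('-');
        if (sub.Length == 0) {
            throw new ArgumentException("Input contains no characters usable in a subdomain");
        }
        return sub + "." + baseDomain;
    }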
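
If taking a dependency on Unidecode is not an option, a rough alternative is to combine the diacritic-stripping approach with a small hand-written table for letters that have no Unicode decomposition. The sketch below assumes exactly that; the class `SubdomainSketch`, the method `ToLabel` and the `Fallbacks` table are all invented for this example, and the table is deliberately incomplete, which is the maintenance problem the comments above warn about.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; uses only the BCL, no Unidecode
public static class SubdomainSketch
{
    // Illustrative, incomplete table for letters that normalization leaves untouched
    private static readonly Dictionary<char, string> Fallbacks = new Dictionary<char, string>
    {
        { '\u00F8', "o" },  { '\u00D8', "O" },   // ø, Ø
        { '\u00E6', "ae" }, { '\u00C6', "AE" },  // æ, Æ
        { '\u00DF', "ss" }                       // ß
    };

    public static string ToLabel(string input)
    {
        if (input == null) throw new ArgumentNullException("input");
        var sb = new StringBuilder(input.Length);
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            // Drop the combining marks produced by the decomposition
            if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark) continue;
            string mapped;
            if (Fallbacks.TryGetValue(c, out mapped)) sb.Append(mapped);
            else sb.Append(c);
        }
        // Keep only ASCII letters, digits and hyphens, then lower-case
        string label = Regex.Replace(sb.ToString(), "[^a-zA-Z0-9-]+", "").ToLowerInvariant();
        // A DNS label is limited to 63 characters
        if (label.Length > 63) label = label.Substring(0, 63);
        // A label must not start or end with a hyphen
        return label.Trim('-');
    }
}
```

With the input from the question, `SubdomainSketch.ToLabel("Øyvind & René's Company Ltd.") + ".example.com"` gives `oyvindrenescompanyltd.example.com`. Characters from non-Latin scripts are simply dropped here, so Unidecode remains the more complete choice when such input is expected.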
Cameron