2

Is anyone aware of a simple way to anglicize a string? Currently, in our system, we're doing replacements on "invalid" characters, as shown below:

        ret = ret.Replace("ä", "ae");
        ret = ret.Replace("Ä", "Ae");
        ret = ret.Replace("ß", "ss");
        ret = ret.Replace("ç", "c");
        ret = ret.Replace("Ç", "C");
        ret = ret.Replace("Ž", "Z");

The issue here is that as we're opening the business up in additional countries (Turkey, Russia, Hungary...), we're finding that there are a whole slew of characters that this process does not convert.

Is anyone aware of any sort of solution that would allow us to not depend on a table of "invalid" characters?

Also, if it helps, we're using C# to code. :)

Thanks!


edit:

In response to some comments: our system does support the full set of Unicode characters... however, other systems that we integrate with (such as card processors) do not. :(

Tyllyn
  • 7
    It's pretty much guaranteed that there will always be some weird language with some weird characters that will fall through the cracks; why not change your application to support unicode? – Carl Norum Dec 08 '09 at 20:09
  • 2
    A weird language, like... Any language in the world except English? – Amnon Dec 08 '09 at 20:11
  • 1
    @Carl: As the system seems to be in C#, it could be assumed that it already supports Unicode. There might be text processing scenarios where you don't want diacritical characters (indexing, stemming, or some other form of text "normalization") – Dirk Vollmar Dec 08 '09 at 20:27
  • 1
    If you're opening it up in Russia, what do you mean by "anglicization" in that context, even? For Cyrillic, your examples don't really make sense, since many letters look the same but don't _mean_ the same (e.g. Russian "Н" corresponds to English "N"). You can go for full transliteration, but that wouldn't be very user-friendly, would it... – Pavel Minaev Dec 08 '09 at 20:41
  • 1
    I'm pretty sure you mean AngliciSation *grin* – blowdart Dec 08 '09 at 20:44

4 Answers

2

Check out this question and its answers and take a look at this blog entry on converting diacritical characters to their ASCII equivalents.
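The linked posts are not reproduced here, so the following is not necessarily their method. It is a minimal sketch of the usual way to strip diacritics in .NET: decompose the string with `String.Normalize(NormalizationForm.FormD)` and drop the combining marks. The `Anglicizer` class and its `Overrides` table are hypothetical names, and the German-style "ae/oe/ue" spellings in the table are an assumption; the table only covers characters that have no canonical decomposition (such as "ß" or "æ") or that need a language-specific spelling.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class Anglicizer
{
    // Hypothetical override table (an assumption, not a standard):
    // letters that normalization cannot split into base + mark, plus
    // German-style spellings for the umlauts.
    private static readonly Dictionary<char, string> Overrides =
        new Dictionary<char, string>
        {
            { 'ä', "ae" }, { 'Ä', "Ae" },
            { 'ö', "oe" }, { 'Ö', "Oe" },
            { 'ü', "ue" }, { 'Ü', "Ue" },
            { 'ß', "ss" },
            { 'æ', "ae" }, { 'Æ', "Ae" },
            { 'œ', "oe" }, { 'Œ', "Oe" },
            { 'ø', "o" },  { 'Ø', "O" },
            { 'ł', "l" },  { 'Ł', "L" },
        };

    public static string Anglicize(string input)
    {
        // Compose first, so "a" + combining diaeresis matches the 'ä' key.
        string composed = input.Normalize(NormalizationForm.FormC);

        // Pass 1: apply the explicit overrides.
        var replaced = new StringBuilder(composed.Length);
        foreach (char c in composed)
        {
            string mapped;
            if (Overrides.TryGetValue(c, out mapped))
                replaced.Append(mapped);
            else
                replaced.Append(c);
        }

        // Pass 2: decompose into base letters + combining marks,
        // then drop the marks (category NonSpacingMark).
        string decomposed = replaced.ToString().Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                result.Append(c);
        }

        return result.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this table, `Anglicizer.Anglicize("ä Ä ß ç Ç Ž")` returns `"ae Ae ss c C Z"`. Characters that have no decomposition and are not in the table (Turkish "ı", the "°" sign, all Cyrillic letters) pass through unchanged, so the output is not guaranteed to be ASCII and still needs a final check against whatever set the receiving system accepts.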

luvieere
  • I've actually just tried that method, and it doesn't seem to catch every character. æœÄŒæßüÿt° is converted to æœAŒæßuyt°. Also, ö, which I would expect to be anglicized to oe, changes to simply o – Tyllyn Dec 08 '09 at 20:33
  • 2
    @Tyllyn: In fact the translation can also be language dependent. In Swedish "ö" is mapped to "o", whereas in German you would represent it as "oe". – Dirk Vollmar Dec 08 '09 at 20:37
  • @divo: Good lord, that makes everything even more confusing. : – Tyllyn Dec 08 '09 at 20:47
1

As an answer to the modified problem (mail server supports only alphanumeric characters in usernames):

Let the users choose their own usernames, allowing only alphanumeric characters. They probably know best how to "anglicize" it.
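A minimal sketch of that check, assuming the rule is exactly "one or more characters from [a-zA-Z0-9]". The `UsernameRules` class and `IsValid` method are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UsernameRules
{
    // \z anchors at the absolute end of the string; "$" would also
    // accept a trailing newline.
    private static readonly Regex Allowed = new Regex(@"^[a-zA-Z0-9]+\z");

    public static bool IsValid(string username)
    {
        return username != null && Allowed.IsMatch(username);
    }
}
```

The explicit ranges are deliberate: `\w` or `char.IsLetterOrDigit` would also accept letters such as "ç" or "Ž", which is exactly what the downstream system rejects.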

Amnon
  • We are going that route, to ensure that usernames stay within the [a-zA-Z0-9] set, but that, at least for the email part, does not let us handle pre-existing usernames. Also, with one of the card processors, we send them a file that needs to be "anglicized" beforehand. Fields that need to be converted this way include address and name. We could let the user enter a properly anglicized version, but this would most definitely slow down these operations from the rate we have now, affecting the business as a whole. We want to have as little user involvement as possible. – Tyllyn Dec 08 '09 at 20:59
1

I apologize for a shameless plug, but I couldn't resist. I once wrote a Python module that does exactly what the author of the post needed:

https://github.com/revl/anglicize

Because Python is almost as readable as pseudocode and the module is only about 125 lines long, it's relatively easy to rewrite it in C#.

Here's what the module produces given the input from the original post:

$ echo 'ä Ä ß ç Ç Ž' | anglicize
a A ss s S S

As you can see, "ß" was replaced with "ss" as requested, while "ç", "Ç", and "Ž" were replaced with "s", "S", and "S" respectively, likely because those were the phonetic equivalents in English.

As for "ä" and "Ä", the transliterations "ae" and "Ae" would probably work better than "a" and "A". I will gladly change the transliteration table if the linguists out there confirm that that's the right thing to do.

The module can transliterate the whole input text at once, or it can process input data in chunks. The documentation is in the README file that comes with the module.

revl
0

Just because a letter looks similar to a traditional English letter does not make it equivalent. What is the business case for not just supporting Unicode and any characters your audience chooses to use?

richardtallent
  • Our mail server (which we are changing soon) doesn't support characters outside of the [a-zA-Z0-9] set for usernames. And the card processors we're using don't support them at some point either. In our business practice, we have not restricted input to this limited character set... and, well, it has caused problems when going to other systems. :( – Tyllyn Dec 08 '09 at 20:49