23

I need to do the following:

    static string[] pats = { "å", "Å", "æ", "Æ", "ä", "Ä", "ö", "Ö", "ø", "Ø" ,"è", "È", "à", "À", "ì", "Ì", "õ", "Õ", "ï", "Ï" };
    static string[] repl = { "a", "A", "a", "A", "a", "A", "o", "O", "o", "O", "e", "E", "a", "A", "i", "I", "o", "O", "i", "I" };
    static int i = pats.Length;
    int j;

     // function for the replacement(s)
     public string DoRepl(string Inp) {
      string tmp = Inp;
        for( j = 0; j < i; j++ ) {
            tmp = Regex.Replace(tmp,pats[j],repl[j]);
        }
        return tmp.ToString();            
    }
    /* Main flow processes about 45000 lines of input */

Each line has 6 elements that go through DoRepl. Approximately 300,000 function calls. Each does 20 Regex.Replace, totalling ~6 million replaces.

Is there any more elegant way to do this in fewer passes?

cairnz
  • 3,917
  • 1
  • 18
  • 21
  • Your questions title says "faster" but your question says "elegant", which are you after? The two can be, but certainly not always, mutually exclusive. – Rob Nov 11 '10 at 14:38
  • Nicely pointed out, in fact the code is running fast enough really, i was looking for a more efficient way in terms of computation cycles. It's 45,000 lines of input now, but it is plausible it will scale to millions of lines and dozens of elements per line, meaning quite a lot of substitutions. – cairnz Nov 11 '10 at 14:43
  • Actually, if there was any way to easily change the "special" letters into the most basic one (by changing codepage or any other way, it'd be just as helpful) - all these äåâãàá-variations into an a, etc. – cairnz Nov 11 '10 at 14:45
  • The fastest (IMHO) - is by using a map[65536] (see my answer below), the more elegant on the other hand... I guess it is also using that map :), or, depending on how many language and BCL features you want to use, could make it fairly elaborate. A dictionary approach saves memory, but looses on speed as it needs to perform a more expensive bucket lookup for each character. – Yuri Astrakhan Nov 11 '10 at 16:12
  • Don't forget, in your case, you're only looking to replace accented characters. So, all of your replacement characters are greater than lower case 'z' in the ascii table. Before you even run your lookup, make sure the chars value is greater than 'z' – Bengie Nov 11 '10 at 18:57

8 Answers8

21
static Dictionary<char, char> repl = new Dictionary<char, char>() { { 'å', 'a' }, { 'ø', 'o' } }; // etc...
public string DoRepl(string Inp)
{
    var tmp = Inp.Select(c =>
    {
        char r;
        if (repl.TryGetValue(c, out r))
            return r;
        return c;
    });
    return new string(tmp.ToArray());
}

Each char is checked only once against a dictionary and replaced if found in the dictionary.

Jesper Larsen-Ledet
  • 6,625
  • 3
  • 30
  • 42
  • I think this would be the accepted answer, it answers my specific question, even though people further down understood the underlying nature of what i was trying to accomplish and gave better solutions to do it. I will keep this one in mind when i need to do different kind of replacements (on single characters!) (As i don't think yours will take multicharacter patterns. – cairnz Nov 11 '10 at 14:58
  • I think the solution with direct 64KB map is significantly faster, yet consume a bit more memory. (see my answer below) – Yuri Astrakhan Nov 11 '10 at 17:27
  • @JesperLarsen-Ledet why you use dictionary ? – onmyway133 Nov 07 '12 at 10:11
  • @JesperLarsen-Ledet does it has effective performance over 2 array as the questioner uses ? – onmyway133 Nov 12 '12 at 11:25
  • 1
    @Yamamoto Yes. Dictionary lookup is O(1). DoRepl is O(N) on the length of the input string. It doesn't get faster than that. – Jesper Larsen-Ledet Nov 13 '12 at 09:00
12

How about this "trick"?

string conv = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(input));
Jonas Elfström
  • 30,834
  • 6
  • 70
  • 106
  • This was a very nice trick! It took everything in the list, and even replaced some weird symbols with ? (i like this way) – cairnz Nov 11 '10 at 14:55
  • I should mention that this was the implementation i selected in the end. It did what we needed without any maintenance. I'd doublevote you up if i could! :) – cairnz Apr 02 '11 at 21:36
10

Without regex it might be way faster.

    for( j = 0; j < i; j++ ) 
    {
        tmp = tmp.Replace(pats[j], repl[j]);
    }

Edit

Another way using Zip and a StringBuilder:

StringBuilder result = new StringBuilder(input);
foreach (var zipped = patterns.Zip(replacements, (p, r) => new {p, r}))
{
  result = result.Replace(zipped.p, zipped.r);
}
return result.ToString();
Stefan Steinegger
  • 63,782
  • 15
  • 129
  • 193
3

First, I would use a StringBuilder to perform the translation inside a buffer and avoid creating new strings all over the place.

Next, ideally we'd like something akin to XPath's translate(), so we can work with strings instead of arrays or mappings. Let's do that in an extension method:

public static StringBuilder Translate(this StringBuilder builder,
    string inChars, string outChars)
{
    int length = Math.Min(inChars.Length, outChars.Length);
    for (int i = 0; i < length; ++i) {
        builder.Replace(inChars[i], outChars[i]);
    }
    return builder;
}

Then use it:

StringBuilder builder = new StringBuilder(yourString);
yourString = builder.Translate("åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ",
    "aAaAaAoOoOeEaAiIoOiI").ToString();
Frédéric Hamidi
  • 258,201
  • 41
  • 486
  • 479
  • This is probably not as efficient as the dictionary method, as you need to search and replace over the string for each character. – Johan Dahlin Nov 11 '10 at 16:28
  • @Johan, you're probably right. I'd confess I was only aiming for a literal solution to the questioner's problem, and I'm quite biased towards extension methods in that context :) – Frédéric Hamidi Nov 11 '10 at 16:41
2

The problem with your original regex is that you're not using it to its fullest potential. Remember, a regex pattern can have alternations. You will still need a dictionary, but you can do it in one pass without looping through each character.

This would be achieved as follows:

string[] pats = { "å", "Å", "æ", "Æ", "ä", "Ä", "ö", "Ö", "ø", "Ø" ,"è", "È", "à", "À", "ì", "Ì", "õ", "Õ", "ï", "Ï" };
string[] repl = { "a", "A", "a", "A", "a", "A", "o", "O", "o", "O", "e", "E", "a", "A", "i", "I", "o", "O", "i", "I" };
// using Zip as a shortcut, otherwise setup dictionary differently as others have shown
var dict = pats.Zip(repl, (k,v) => new { Key = k, Value = v }).ToDictionary(o => o.Key, o => o.Value);

string input = "åÅæÆäÄöÖøØèÈàÀìÌõÕïÏ";
string pattern = String.Join("|", dict.Keys.Select(k => k)); // use ToArray() for .NET 3.5
string result = Regex.Replace(input, pattern, m => dict[m.Value]);

Console.WriteLine("Pattern: " + pattern);
Console.WriteLine("Input: " + input);
Console.WriteLine("Result: " + result);

Of course, you should always escape your pattern using Regex.Escape. In this case this is not needed since we know the finite set of characters and they don't need to be escaped.

Ahmad Mageed
  • 94,561
  • 19
  • 163
  • 174
1

If you want to remove accents then perhaps this solution would be helpful How do I remove diacritics (accents) from a string in .NET?

Otherwise I would to this in single pass:

Dictionary<char, char> replacements = new Dictionary<char, char>();
...
StringBuilder result = new StringBuilder();
foreach(char c in str)
{
  char rc;
  if (!_replacements.TryGetValue(c, out rc)
  {
    rc = c;
  }
  result.Append(rc);
}
Community
  • 1
  • 1
Arunas
  • 918
  • 1
  • 8
  • 20
1

The fastest (IMHO) way (compared even with the dictionary) in the special case of one-to-one character replacement would be a full character map:

public class Converter
{
    private readonly char[] _map;

    public Converter()
    {
        // This code assumes char to be a short unsigned integer
        _map = new char[char.MaxValue];

        for (int i = 0; i < _map.Length; i++)
            _map[i] = (char)i;

        _map['å'] = 'a';  // Note that 'å' is used as an integer index into the array.
        _map['Å'] = 'A';
        _map['æ'] = 'a';
        // ... the rest of overriding map
    }

    public string Convert(string source)
    {
        if (string.IsNullOrEmpty(source))
            return source;

        var result = new char[source.Length];

        for (int i = 0; i < source.Length; i++)
            result[i] = _map[source[i]]; // convert using the map

        return new string(result);
    }
}

To further speed up this code, you might want to use the "unsafe" keyword and use pointers. This way, traversing the string array could be done faster and without bound-checks (which in theory would be optimized away by the VM, but might not).

Yuri Astrakhan
  • 8,808
  • 6
  • 63
  • 97
  • 1
    **Don't use pointers**. Pinning the array will make it much slower. – SLaks Nov 11 '10 at 17:59
  • Perhaps - depends on the size of the string i guess. But more importantly, the map, not a dictionary should be used for the most optimal performance. The biggest hack could be to create a string, fix it in place, and modify it. Totally against the .NET rules, but could work... – Yuri Astrakhan Nov 11 '10 at 18:58
  • The performance benefit for an array vs a dictionary is going to be small. It might not even benefit at all, since this array has a lot more cache pressure. You'd have to measure to find out. In any case, using a map at all is going to be most of the speedup – whether you implement the map with an array or a dictionary is likely to be a comparatively insignificant difference. – Joren Nov 11 '10 at 19:55
  • That is a good point - guess the only way to find out is to run it. Thanks. – Yuri Astrakhan Nov 11 '10 at 21:39
0

I'm not familiar with the Regex class, but most regular expression engines have a transliterate operation that would work well here. Then you would only need one call per line.

caveman
  • 1,755
  • 1
  • 14
  • 19
  • Actually one "line" is one InputBuffer row, and there's six columns (for now) - so it'd be hard to do it per line, but you are absolutely correct in theory that if it had been a "single line of text" it would only be necessary with one call per line :) – cairnz Nov 11 '10 at 15:31