8

In C#, what is the fastest way to detect duplicate characters in a String and remove them (removal including 1st instance of the duplicated character)?

Example Input: nbHHkRvrXbvkn

Example Output: RrX

Welbog
  • 59,154
  • 9
  • 110
  • 123
Alex
  • 75,813
  • 86
  • 255
  • 348

5 Answers5

21

Fastest as in fewest-lines-of-code:

var s = "nbHHkRvrXbvkn";
var duplicates = s.Where(ch => s.Count(c => c == ch) > 1);
var result = new string(s.Except(duplicates).ToArray()); // = "RrX"

Fastest as in fastest-performance would probably be something like this (does not preserve order):

var h1 = new HashSet<char>();
var h2 = new HashSet<char>();

foreach (var ch in "nbHHkRvrXbvkn")
{
    if (!h1.Add(ch))
    {
        h2.Add(ch);
    }
}

h1.ExceptWith(h2); // remove duplicates

var chars = new char[h1.Count];
h1.CopyTo(chars);
var result = new string(chars); // = "RrX"

Performance test

When in doubt -- test it :)

Yuriy Faktorovich's answer        00:00:00.2360900
Luke's answer                     00:00:00.2225683
My 'few lines' answer             00:00:00.5318395
My 'fast' answer                  00:00:00.1842144
dtb
  • 213,145
  • 36
  • 401
  • 431
  • 1
    Very nice. Great performance comparison too. Performance variation is probably even more visible with very large strings. – Alex Aug 27 '09 at 22:39
  • 1
    I've repeated the performance test in Release build with detached debugger (but same input string). I'm surpised by the performance of Yuriy's answer; it's pretty fast! – dtb Aug 27 '09 at 22:44
  • 1
    @dtb: The thing that slows down my answer compared to yours is that I'm preserving the original order in the output string, which necessitates an additional loop through the input string. The technique that you and I use for actually finding the dupes is *exactly* the same. – LukeH Aug 27 '09 at 22:55
  • see my solution below ... you had the right idea but using an array for known data is 4 times faster than HashSet in my tests – gabe Aug 28 '09 at 16:04
  • c# has var? ... i've never seen the use of var in c# before... BTW great piece of code good job! +1 – jay_t55 Aug 30 '09 at 06:11
  • stands for variant, usually used with LINQ for anonymous types, discouraged otherwise. – Yuriy Faktorovich Sep 02 '09 at 01:43
9

Here is a pretty fast one preserving order. But I'd be a little worried about how LINQ does Group and Where:

string s = "nbHHkRvrXbvkn";
Console.WriteLine( 
    s.ToCharArray()
        .GroupBy(c => c)
        .Where(g => g.Count() == 1)
        .Aggregate(new StringBuilder(), (b, g) => b.Append(g.Key)));

Edit: This one beats Luke's in some cases still slower than dtb's, but it preserves the order

private static string MyMethod(string s)
{
    StringBuilder sb = new StringBuilder(s.Length);
    foreach (var g in s.ToCharArray().GroupBy(c => c))
        if (g.Count() == 1) sb.Append(g.Key);

    return sb.ToString();
}
Yuriy Faktorovich
  • 67,283
  • 14
  • 105
  • 142
4

This one should be pretty quick (and it preserves the original order):

public static string RemoveDuplicates(string source)
{
    HashSet<char> found = new HashSet<char>();
    HashSet<char> dupes = new HashSet<char>();

    foreach (char c in source)
    {
        if (!found.Add(c))
        {
            dupes.Add(c);
        }
    }

    StringBuilder sb = new StringBuilder(source.Length);
    foreach (char c in source)
    {
        if (!dupes.Contains(c))
        {
            sb.Append(c);
        }
    }
    return sb.ToString();
}
LukeH
  • 263,068
  • 57
  • 365
  • 409
  • Why do you think creating a StringBuilder that is probably too large will take less time than allowing it to get space on the fly? – Yuriy Faktorovich Aug 28 '09 at 03:19
  • @Yuri: I benchmarked! I tested with millions of random strings and pre-sizing the `StringBuilder` was faster in most cases. Of course, in the real-world the strings probably wouldn't be purely random. In that situation the performance difference would depend on the ratio of dupes to non-dupes in the source string. – LukeH Aug 28 '09 at 08:59
  • @Yuriy: I just benchmarked on a different machine (Vista64 vs XP32) and the results were much closer. On the 64-bit machine it appears to make no real difference whether the `StringBuilder` is pre-allocated or not. (In which case it probably makes sense to not bother pre-allocating and save ourselves some RAM.) – LukeH Aug 29 '09 at 09:13
2

This preserves order and, based on my tests, is 4 times faster than using a HashSet. This assumes your character range is 0-255 but you can extend that easily. If you're planning on using this in a loop, move the int[] c = new int[255]; out and in the function do an Array.Clear(c,0,255).


        private static string RemoveDuplicates(string s)
        {
            int[] c = new int[255];
            for (int i = 0; i < s.Length; i++)
            {
                c[s[i]]++;
            }
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < s.Length; i++)
            {
                if (c[s[i]] == 1) sb.Append(s[i]);
            }
            return sb.ToString();
        }
gabe
  • 127
  • 3
  • also, i don't know if the compiler will unroll those loops for you, but you can try that too http://en.wikipedia.org/wiki/Loop_unwinding – gabe Aug 28 '09 at 16:08
  • What is your test timing/stopwatch result with the sample string? – Alex Aug 28 '09 at 20:04
  • With the sample string, this method would be 1/4 or less compared to others. – Yuriy Faktorovich Aug 29 '09 at 00:52
  • @Yuriy: Is that with the array size correctly set to 65536? In my tests there's not really much difference between this method and using a `HashSet` once the array is sized correctly. – LukeH Aug 29 '09 at 09:06
  • 1
    No, once the Array is set correctly on average working on a 100 character string it was 60% worse. – Yuriy Faktorovich Aug 29 '09 at 14:10
0

This algorithm is general, can be applied to any language

  1. create a map(HashTable) char->int that holds the count of each character found, initially empty
  2. scan the string once to populate the map.
  3. create a new empty string that'll hold the output, may be you'll need to use a StringBuilder.
  4. scan the string(orthe map, whichever is shorter) copying only characters which an occurrence of 1 to the output string/StringBuilder
Diaa Sami
  • 3,249
  • 24
  • 31