Check string on duplicate char

Question

first time I write on the SO, because he could not find solution myself. At the interview, I was given the task to write a method that checks the characters in the string to a unique.

Requirements: not using LINQ. Desirable: do not use additional data types (Dictionary, HashSet...etc. Arrays and lists Allowed)

Example:

"Hello" - return false; "Helo" - return true

My implementation:

static HashSet<char> charSet = new HashSet<char>();

static bool IsUniqueChar(string str)
{       
    foreach (char c in str)
    {            
           charSet.Add(c);                         
    }
    return charSet.Count() == str.Length;
}

But it does not meet the requirements of data types, and is not the best performance... I also tried the approach with a dictionary:

static Dictionary<char,bool> charSetDictionary = new Dictionary<char,bool>();

static bool IsUniqueChar(string str)
{
    try
    {
        foreach (char c in str)
        {
            charSetDictionary.Add(c,true);
        }
    }
    catch 
    {
        return false;
    }

But he is no better than the previous. I will welcome any idea how to solve this task better? p.s

static void Main(string[] args)
{
    Stopwatch sw = Stopwatch.StartNew();

    IsUniqueChar("Hello");

    sw.Stop();

    Console.WriteLine("Elapsed={0}", sw.Elapsed); //~005044
}

have you check this http://stackoverflow.com/a/1343953/1880431 — meda, Feb 15 '14 at 06:49
what are the characters ? Are they only alphabets ? is A=='a'..? — Vikram Bhat, Feb 15 '14 at 06:57

score 8 · Answer 1 · answered Feb 15 '14 at 06:52

The fastest way uses HashSet<char>:

var set = new HashSet<char>();

foreach(var c in input)
{
    if(!set.Add(c))
        return false;
}

return true;

It's O(n) solution in worst case (input is unique). Returns false as soon as first duplicate is found.

Without HashSet<char> you can easily transform string to char[], sort it and check if you have two consecutive items with the same value.

var chars = input.ToCharArray();
chars.Sort();

for(int i = 1; i < chars.Length; i++)
{
    if(chars[i-1] == chars[i])
        return false;
}

return true;

Sort is O(n log(n)) and so is the entire function.

U are right that c# char has 2 bytes hence it is better to HashSet or sorting instead of boolean array +1 . — Vikram Bhat, Feb 15 '14 at 12:24

score 1 · Answer 2 · answered Feb 15 '14 at 18:11

All answers so far are based on the assumption that one .NET char corresponds to one Unicode character. This is only true for characters in the Basic Multilingual Plane. Characters outside the BMP are encoded using two char objects (surrogate pair).

The following code handles this special case:

HashSet<string> set = new HashSet<string>();

for (int i = 0; i < str.Length; i++)
{
    string s;
    if (char.IsHighSurrogate(str[i]))
    {
        s = str.Substring(i, 2);
        i++;
    }
    else
    {
        s = str.Substring(i, 1);
    }

    if (!set.Add(s))
    {
        return false;
    }
}
return true;

Lavneet · Answer 3 · 2014-02-15T13:37:55.287

0

Suppose the test string is passed through textBox1, then it follows:

string tst;
int i,j, stat =0;
tst = textBox1.Text;
for (i = 0; i < tst.Length; i++)
{
    for (j = 0; j < tst.Length; j++)
    {
        if ((tst[i] == tst[j]) && (i != j))
        {
            stat = 1;
            break;
        }
        else continue;
    }
    if (stat == 1) break;
    else continue;
}
if (stat == 1) MessageBox.Show("False");
else MessageBox.Show("True");

Every string is an array of characters.

edited Feb 15 '14 at 13:37

answered Feb 15 '14 at 06:55

Lavneet

516
5
19

inefficient O(n^2) can do better than that. – Vikram Bhat Feb 15 '14 at 06:59
@Vikram: Yes, you're right. I didn't mention efficiency in my code. No doubt it's best and worst cases are way too far, but for the purpose, it's ok. – Lavneet Feb 15 '14 at 07:01
Isn't this always producing `false` guys, for any string? If you allow i to be equal to j, `arr[i] == arr[j]` is always true at this point... Which is the case in the first iteration of the loop. – t_over Feb 15 '14 at 13:35
@t_over : Yes. You're right. I've made the edit. Thanx. – Lavneet Feb 15 '14 at 13:38

Ilya Sulimanov · Accepted Answer · 2014-02-15T07:31:58.363

-1

Most likely, your interviewer would like to see an approach using knowledge about Unicode:

static bool[] charsHash = new bool[512];

static bool IsUniqueChar(string str)
{
    if (str.Length > 512) return false;

    foreach (char c in str)
    {
        bool alreadyExist = charsHash[(int)c];
        if (alreadyExist) return false;
        else charsHash[(int)c] = !alreadyExist;
    }
    return true;
}

static void Main(string[] args)
{
    Stopwatch sw = Stopwatch.StartNew();

    IsUniqueChar("Hello");

    sw.Stop();
    Console.WriteLine("Elapsed={0}", sw.Elapsed);//~000283
}

edited Feb 15 '14 at 07:31

answered Feb 15 '14 at 06:51

Ilya Sulimanov

7,636
6
47
68

This is the best way O(1) extra space and O(n) time complexity – Vikram Bhat Feb 15 '14 at 07:00
From wiki: *the latest version of Unicode contains a repertoire of more than 110,000 characters covering 100 scripts.* It would require really big `charsHash` array to make it work with Unicode :) – MarcinJuraszek Feb 15 '14 at 07:04
@vikram I really dont understand why that is the case. If a char is 2 bytes, there are 65536 values, not 512. Could you or the OP explain? – Rotem Feb 15 '14 at 08:57
Actually char represents a unicode character, using a variable length UTF16 encoding. It can be either 2 or 4 bytes, otherwise it wiuld not be able to represent more than 65536 characters. See wikipedia. – Rotem Feb 15 '14 at 09:15
@Rotem silly me sorry u are right this is bit inefficient in c# (would have been sufficient in c as char has 1 byte in it) – Vikram Bhat Feb 15 '14 at 12:22
-1. How is this solution "using knowledge about Unicode"? The algorithm crashes for strings containing any character with a codepoint larger than U+01FF (511). – Sebastian Negraszus Feb 15 '14 at 14:30
@Rotem Actually, `char` represents a UTF-16 code unit which is always 2 bytes large. For characters outside of the Basic Multilingual Plane (> U+10FFFF), 2 chars are used (see surrogate pairs) – Sebastian Negraszus Feb 15 '14 at 14:36
@Rotem Correction (too late for editing): The BMP of course ends at U+FFFF, not U+10FFFF – Sebastian Negraszus Feb 15 '14 at 14:44

Check string on duplicate char

4 Answers4