0

I have a file with allot of sentences. I need to make a dictionary with the words from that file. Until now I've separated the words and sort them using Split() and Sort() methods. My problem is to make a list without duplicate words. How can I do that?

static int n = 0;

public static string[] NoDuplicate(string[] array)
{
    int i;
    string[] res = (string[])array.Clone();
    for (i = 0; i < array.Length-1; i++)
    {
       if (array[i + 1] != array[i])
            res[n++] = (string)array[i];
    }
    return res;
}
  • how can I do it more neat?
  • I don't like that method because is initialized using Clone() and the length is too big.
  • 6
    you want to do `.Distinct()` on list here. but make sure.. you have number of occurrence of each words in your list before doing `.Distinct()`. – Amit May 10 '18 at 04:30
  • 6
    You can use a `HashSet` for that. – Ofir Winegarten May 10 '18 at 04:32
  • Is "ABC" a duplicate of "abc"? How are you planning to _use_ the results of `NoDuplicate` (i.e. why are you doing this)? Did you try https://msdn.microsoft.com/en-us/library/bb301504(v=vs.110).aspx ? – mjwills May 10 '18 at 04:33
  • I second Ofir's suggestion. You use a Dictionary to hold pairs of keys and values. Seems to me like you only need keys in this situation. – Zohar Peled May 10 '18 at 04:34
  • Possible duplicate : [How do I remove duplicates from a C# array?](https://stackoverflow.com/questions/9673/how-do-i-remove-duplicates-from-a-c-sharp-array) – Slaven Tojić May 10 '18 at 05:34

4 Answers4

6

You can also use HashSet beside the .Distinct() feature of LINQ:

HashSet: This is an optimized set collection. It helps eliminates duplicate strings or elements in an array. It is a set that hashes its contents.

public static string[] NoDuplicate(string[] array)
{
    string[] result = new HashSet<string>(array).ToArray();
    return result;
}

If you want to eliminate the duplicate with case-insensitive, you can pass an IEqualityComparer argument like this:

Using HashSet:

public static string[] NoDuplicate(string[] array)
{
    string[] result = new HashSet<string>(array, StringComparer.OrdinalIgnoreCase)
                         .ToArray();
    return result;
}

Using LINQ's Distict feature:

public static string[] NoDuplicate(string[] array)
{
    string[] result = array.Distinct(StringComparer.OrdinalIgnoreCase)
                     .ToArray();
    return result;
}
cdhowie
  • 158,093
  • 24
  • 286
  • 300
Duc Filan
  • 6,769
  • 3
  • 21
  • 26
5

Try this:

private static string[] NoDuplicate(string[] inputArray)
{
    var result = inputArray.Distinct().ToArray();
    return result;
}
Juan M. Vergara
  • 186
  • 1
  • 4
0

Instead of dictionary create a trie of words. At each level keep a count of sameWord if it repeats. This way you can avoid using too much space and it will be faster to search any word O(log(n)) where n is number of distinct words

public  class WordList {

    private int sameWord        = 0;

    String name                 = "";

    WordList [] child = new WordList[26];

    public  void add( String s, WordList c, int index )
    {
        sameWord++;

        if(index > 0)
        {
            name += ""+s.charAt(index-1);         
        }

        if(index == s.length())
        {
            return;
        }

        if(c.child[s.charAt(index)-'a'] ==null)
        {
            c.child[s.charAt(index)-'a'] = new WordList();
        }
        add(s,c.child[s.charAt(index)-'a'],index+1);
    }

    public static WordList findChar(char c)
    {
        return child[(int)(c-'a')];
    }
}
Akshay Anand
  • 434
  • 5
  • 12
  • what are you trying to achieve here? – Amit May 10 '18 at 05:10
  • Hi @Amit, I am trying to suggest a better way to save the words. The user wants to create a dictionary where as if he/she decides to use tries it will be much easier and efficient to store all words in a smaller data structure. It will be faster to search and find how many times the word was used can be stored in sameWord when it repeats in the file. Also, sort is automatically taken care as trie of character a-z are used. For eg. to put 2 words in trie { START, STAR } the space take in only 5 characters to save two words. Please let me know if you have any doubt. – Akshay Anand May 10 '18 at 14:39
  • in add method, you are doing `c.child` , but child is `static` so you cannot access it with object. you should be doing `WordList.child` and after doing this, in `Add` method, you will not required to pass `WordList` object as parameter. And also, in your code, you have Declare class `WordList` but everywhere used `wordList` (have you tried it to run in your computer). One thing I m having difficulty here, what should I pass as `index` in `Add` method ? – Amit May 11 '18 at 02:31
  • @Amit thank you for pointing out. I ran this code 2 years back and this was from archive. Index will start from zero index and if first character is found it will be incremented and then keep on going on the path till it exist. If a character doesn't exist the word doesn't exist and then it adds that word. – Akshay Anand May 14 '18 at 05:01
-2

You can try below solution:

private static string[] NoDuplicate(string[] inputArray)
{
     List<string> stringList = new List<string>();
     foreach (string s in inputArray)
     {
           if (!stringList.Contains(s))
           {
                stringList.Add(s);
           }
     }
     return stringList.ToArray();
}
Ravi Solanki
  • 17
  • 1
  • 4
  • `List.Contains` is `O(n)`, while `HashSet.Contains` is basically `O(1)`, so while this solution will work, it is probably much slower than the `HashSet` and `Distinct` (which uses a kind of `HashSet` behind the scenes) solutions. – Corak May 10 '18 at 05:05
  • @Corak I think `List.Contains` should be O(n) in worst case, but what I don't understand how `HashSet` achieves this in O(1) ? – Amit May 10 '18 at 05:09
  • @Amit - To be honest, I don't fully understand the exact math behind it, but basically imagine the hashset to be an array, and when you put an item in, you first get the hashcode of that item (an `int` value) and use that basically as the index into that array. Array access by index is `O(1)` (`positionInMemory = startOfArray + (index * sizeOf(itemType))`). It's a bit more involved with bucket logic and if by chance several items have the same hashcode, you still need to check equality for each of them separately. So it's not *exactly* `O(1)`, but very *very* close. – Corak May 10 '18 at 05:21
  • You can check out the exact implementation here: [Reference Source HashSet](https://referencesource.microsoft.com/System.Core/System/Collections/Generic/HashSet.cs.html) – Corak May 10 '18 at 05:21
  • @Corak very imformative article and comment. thank you. in fact, after reading this, I even tried comparing them. `Hashset` is incredibly faster than `List`. see the [Project](https://github.com/anutural/ListVsHashset) – Amit May 10 '18 at 06:29