
What approach can I use to avoid duplicates in a list?

One way is to check whether the element already exists before adding a new item, but that requires more code and iterating the whole list on every insert.

Another way is to use a HashSet: when I try to add a new item, it checks by itself whether the item exists; if not, it adds the new item, and if it already exists, it does nothing.

But I know that a HashSet is less efficient and needs more resources than a list, so I don't know whether using a HashSet to avoid duplicates is a good use of it.

Are there any other alternatives?

Thanks.

Álvaro García
  • Use `HashSet`; it doesn't degrade your performance. Check this: http://stackoverflow.com/questions/4558754/define-what-is-a-hashset – Praburaj Jan 05 '17 at 10:05
  • There are other alternatives but depending on the size of your items they might not be suitable. A `HashSet` is better because it removes duplicates and guarantees `O(1)` `Add` and `Contains`. But you could also add elements to the list and then use `Distinct().ToList()` using LINQ. It depends on your use case. – jorgonor Jan 05 '17 at 10:05
  • _"But I know that a HashSet is less efficient"_ – that's wrong. It is just not a list, so it doesn't provide access via index. Apart from that it's very efficient. – Tim Schmelter Jan 05 '17 at 10:07
  • `But I know that a HashSet is less efficient and needs more resources than a list` – can you provide more details for that (an example)? – Tân Jan 05 '17 at 10:07
  • Generally you use more memory to get more speed, or less memory with more processing. So yes, it's good to use a HashSet if you want to avoid duplicates, but you are going to use more memory; that's the trade-off, and you can't avoid it. – M.kazem Akhgary Jan 05 '17 at 10:07
  • How big is this collection going to be? Unless you expect it to be very large, it sounds like you might be worrying about something that is not really a problem. – Equalsk Jan 05 '17 at 10:08
  • If you want an implementation of `ICollection` that does not allow duplicates, whilst still retaining an ordering, consider using `SortedSet` rather than `List` (see the quick sketch after these comments). – Rahul Hendawe Jan 05 '17 at 10:10
  • Using a HashSet will certainly be more efficient than using a list and manually removing duplicates... This is the point of creating data structures that are suited to certain scenarios, so you should use them. – Milney Jan 05 '17 at 10:22
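
As a quick illustration of the `SortedSet` suggestion above (a minimal sketch; note that a `SortedSet<T>` keeps items in sort order, not insertion order):

var sorted = new SortedSet<int> { 5, 1, 3, 1, 5 };

// duplicates are ignored; enumeration yields 1, 3, 5 (sorted order)
Console.WriteLine(string.Join(", ", sorted));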

5 Answers


You can achieve this in a single line of code:

List<long> longs = new List<long> { 1, 2, 3, 4, 3, 2, 5 };

List<long> unique = longs.Distinct().ToList();

unique will contain only 1, 2, 3, 4, 5.

Anand Systematix
  • So you want to do this every time an item is added? Very inefficient. – Tim Schmelter Jan 05 '17 at 10:10
  • @AnandSystematix: while this might be sufficient if you have all items beforehand, it's not viable if the list will be filled on demand and this is performance critical. The OP wants to avoid duplicates during add; he doesn't want to remove duplicates. – Tim Schmelter Jan 05 '17 at 10:12
  • @M.kazemAkhgary: well, what period, every 5 minutes? However, you have an inconsistent state for some time. That's not good at all. If you're lucky the customer doesn't notice the bug. – Tim Schmelter Jan 05 '17 at 10:14
  • I don't think this is an answer to the question "how to avoid duplicates in a list". Avoiding duplicates in a collection and selecting unique items from a collection with duplicates are different things. – Sergey Berezovskiy Jan 05 '17 at 10:14
  • Yes, thanks for correcting; it will be useful if we have all the items. – Anand Systematix Jan 05 '17 at 10:14
  • The only reason the OP gave for not using `HashSet` was performance, and this is much less efficient than just using a `HashSet`. It also doesn't "avoid" duplicates, as others mentioned. The implication was to have a continually unique collection, like `HashSet`. – Dave Cousineau Jan 05 '17 at 10:16

You cannot avoid duplicates in a List; there is no built-in verification of items.

If you don't care about the order of items, use a HashSet.

If you want to preserve the order of items (there is actually a little ambiguity here: should an item appear at the index of its first addition or of its last addition?) while making sure that all items are unique, then you should write your own list class, i.e. something which implements the IList<T> interface:

using System;
using System.Collections;
using System.Collections.Generic;

public class ListWithoutDuplicates<T> : IList<T>

You have different options here. E.g. you should decide what is more important for you: fast addition or memory consumption. For fast Add and Contains operations you should use some hash-based data structure, which is unordered. Here is a sample implementation which uses a HashSet to store the hash codes of all items kept in the internal list. You will need the following fields:

private readonly HashSet<int> hashes = new HashSet<int>();
private readonly List<T> items = new List<T>();

Adding items is simple (warning: no null-checks here and below): use the item's hash code to quickly, in O(1), check whether it has already been added. Use the same approach for removing items:

public void Add(T item)
{
    var hash = item.GetHashCode();
    if (hashes.Contains(hash))
        return;

    hashes.Add(hash);
    items.Add(item);
}

public bool Remove(T item)
{
    var hash = item.GetHashCode();
    if (!hashes.Contains(hash))
        return false;

    hashes.Remove(hash);
    return items.Remove(item);
}

Some index-based operations:

public int IndexOf(T item)
{
    var hash = item.GetHashCode();
    if (!hashes.Contains(hash))
        return -1;

    return items.IndexOf(item);
}

public void Insert(int index, T item)
{
    // a true insert: reject duplicates, then shift items at and after index to the right
    var hash = item.GetHashCode();
    if (hashes.Contains(hash))
        throw new ArgumentException("Cannot add duplicate item");

    hashes.Add(hash);
    items.Insert(index, item);
}

public void RemoveAt(int index)
{
    var item = items[index];
    hashes.Remove(item.GetHashCode());
    items.RemoveAt(index);
}

And left-overs:

public T this[int index]
{
    get { return items[index]; }
    set
    {
        // replace: drop the old item at this index, then insert the new one
        // (Insert throws if value duplicates another existing item)
        RemoveAt(index);
        Insert(index, value);
    }
}

public int Count => items.Count;
public bool IsReadOnly => false;
public bool Contains(T item) => hashes.Contains(item.GetHashCode());
public void Clear() { hashes.Clear(); items.Clear(); }
public void CopyTo(T[] array, int arrayIndex) => items.CopyTo(array, arrayIndex);
public IEnumerator<T> GetEnumerator() => items.GetEnumerator();
IEnumerator IEnumerable.GetEnumerator() => items.GetEnumerator();

That's it. Now you have a list implementation which will add each item only once (the first time). E.g.

var list = new ListWithoutDuplicates<int> { 1, 2, 1, 3, 5, 2, 5, 3, 4 };

This will create a list with the items 1, 2, 3, 5, 4. Note: if memory consumption is more important than performance, then instead of using hashes use the items.Contains operation, which is O(n). Also be aware that two distinct items can share a hash code, so relying on hash codes alone can falsely treat such items as duplicates; a fully robust implementation should confirm actual equality as well.

BTW, what we just did is actually an IList decorator.

Sergey Berezovskiy

A List is a data structure that may contain duplicates. Duplicate elements are disambiguated by their index.

One way is to check whether the element already exists before adding a new item, but that requires more code and iterating the whole list on every insert.

This is possible, but it is error-prone and slow. You will need to iterate through the entire list every time you want to add an element. It is also possible that you will forget to check somewhere in your code.

Another way is to use a HashSet: when I try to add a new item, it checks by itself whether the item exists; if not, it adds the new item, and if it already exists, it does nothing.

This is the preferred way. It is best to use the standard library to enforce the constraints that you want.
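
For instance, a minimal sketch of that approach (HashSet<T>.Add performs the existence check itself and returns whether the item was actually added):

var set = new HashSet<string>();

bool first = set.Add("a");   // true: "a" was added
bool second = set.Add("b");  // true: "b" was added
bool again = set.Add("a");   // false: duplicate, the set is unchanged

// set now contains "a" and "b" only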

But I know that a HashSet is less efficient and needs more resources than a list, so I don't know whether using a HashSet to avoid duplicates is a good use of it.

The efficiency depends on what you are trying to do; see https://stackoverflow.com/a/23949528/1256041.

Are there any other alternatives?

You could implement your own ISet using a List. This would make insertion much slower (you would need to iterate the whole collection), but you would gain O(1) random access.
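
A minimal sketch of that idea, covering only the core members rather than the full ISet<T> surface (the class name ListSet is illustrative, not a standard type):

// A list-backed "set": O(n) insertion due to the linear duplicate scan,
// but O(1) random access by index, which HashSet<T> does not offer.
public class ListSet<T>
{
    private readonly List<T> items = new List<T>();

    public bool Add(T item)
    {
        // linear scan for duplicates - this is the O(n) cost
        if (items.Contains(item))
            return false;

        items.Add(item);
        return true;
    }

    public bool Contains(T item) => items.Contains(item);

    // O(1) random access by index
    public T this[int index] => items[index];

    public int Count => items.Count;
}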

sdgfsdh

A HashSet is the best way to check whether an item exists, because the lookup is O(1).

So you can insert the items into both a list and a HashSet, and before inserting a new item, check whether it already exists in the HashSet.
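
For example, a minimal sketch of that pattern (the variable names are illustrative):

var seen = new HashSet<int>();
var items = new List<int>();

foreach (var n in new[] { 1, 2, 3, 2, 1, 4 })
{
    // HashSet<int>.Add returns false when the value is already present,
    // so the list receives each value only once
    if (seen.Add(n))
        items.Add(n);
}

// items now contains 1, 2, 3, 4 in insertion order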

Moshe D

If you have a DataTable ldtbResult and it contains values like this:

EMPLOYER_PAYMENT_PLAN_HEADER_ID | EMPLOYER_ACCOUNT_ID | PLAN_TYPE_VALUE | STATUS_VALUE
538                             | 360915              | STND            | PAID
538                             | 360915              | STND            | PAID
538                             | 360915              | STND            | PEND

and you want a collection holding the rows of this DataTable such that no two rows have the same primary key value, i.e. employer_payment_plan_header_id (the PK column) in this case, then here is the code:

// Assign the value of the DataTable ldtbResult to the collection rpEmployerPaymentPlanList:

lprmPaymentPlanUtility.rpEmployerPaymentPlanList = GetCollection<busEmployerPaymentPlanHeader>(ldtbResult, "icdoEmployerPaymentPlanHeader");


// Ensure that no two rows have the same employer_payment_plan_header_id (PK field) value:

List<busEmployerPaymentPlanHeader> distinctPaymentPlans = lprmPaymentPlanUtility.rpEmployerPaymentPlanList
  .GroupBy(x => x.icdoEmployerPaymentPlanHeader.employer_payment_plan_header_id)
  .Select(group => group.First())
  .ToList();

// Assign the de-duplicated list back to the original collection rpEmployerPaymentPlanList:

lprmPaymentPlanUtility.rpEmployerPaymentPlanList = new Collection<busEmployerPaymentPlanHeader>(distinctPaymentPlans);
Jeremy Caney