
Let's say I have a list of some column values coming from a table. How do I remove empty strings and duplicate values? Please see the following code:

List<string> dtList = dtReportsList.AsEnumerable().Select(dr => dr.Field<string>("column1")).ToList();

This is what I had coded just now, but Amiram's code is way more elegant, so I will choose that answer. Here is how I did it:

DataTable dtReportsList = someclass.GetReportsList();

if (dtReportsList.Rows.Count > 0)
{
    List<string> dtList = dtReportsList.AsEnumerable().Select(dr => dr.Field<string>("column1")).ToList();
    dtList.RemoveAll(x => x == "");
    dtList = dtList.Distinct().ToList();

    rcboModule.DataSource = dtList;
    rcboModule.DataBind();
    rcboModule.Items.Insert(0, new RadComboBoxItem("All", "All"));
}
Developer
  • Understand that RemoveAll() mutates dtList; each element that is removed forces the List to rearrange elements in higher indices in the underlying array it uses. It would be faster simply to skip them like Amiram does with his Where method. – KeithS Aug 08 '12 at 15:09
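
For illustration, a minimal sketch of the difference described in the comment above, using a small hypothetical list in place of dtList:

var values = new List<string> { "A", "", "B", "A", "" };

// RemoveAll mutates the list in place; each removal shifts the later elements down.
var mutated = new List<string>(values);
mutated.RemoveAll(x => x == "");                    // mutated is now { "A", "B", "A" }

// Where never touches the source list; it just skips elements that fail the predicate.
var skipped = values.Where(x => x != "").ToList();  // values itself is unchanged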

4 Answers

241
dtList = dtList.Where(s => !string.IsNullOrWhiteSpace(s)).Distinct().ToList();

I assumed that an empty string and whitespace should be treated like null. If not, you can use IsNullOrEmpty (which allows whitespace), or s != null.
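
For example, assuming the same dtList from the question, those alternatives would look like this:

// Keep whitespace-only entries; drop only null and "":
dtList = dtList.Where(s => !string.IsNullOrEmpty(s)).Distinct().ToList();

// Drop only nulls; keep "" and whitespace-only entries:
dtList = dtList.Where(s => s != null).Distinct().ToList();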

Amiram Korach
  • Just one thing; deduping with Distinct() is relatively inefficient because the method must assume the worst case. – KeithS Aug 08 '12 at 14:56
  • @KeithS What assertions do we know about this data that `Distinct` doesn't which allows it to be optimized? – Servy Aug 08 '12 at 14:59
  • We can sort the list and then assert that it's sorted, making the deduping algorithm linear; see my answer. – KeithS Aug 08 '12 at 15:01
12

To simplify Amiram Korach's solution:

dtList.RemoveAll(s => string.IsNullOrWhiteSpace(s));

No need to use Distinct() or ToList()
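
For example, applied to the question's dtList (List<T>.RemoveAll works in place and returns the number of entries it removed):

List<string> dtList = dtReportsList.AsEnumerable().Select(dr => dr.Field<string>("column1")).ToList();
int removed = dtList.RemoveAll(s => string.IsNullOrWhiteSpace(s));  // dtList no longer contains empty or whitespace entries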

Bojan
10

Amiram's answer is correct, but Distinct() as implemented is an O(N²) operation; for each item in the list, the algorithm compares it to all the already processed elements, and returns it if it's unique or ignores it if not. We can do better.

A sorted list can be deduped in linear time; if the current element equals the previous element, ignore it, otherwise return it. Sorting is O(N log N), so even having to sort the collection, we get some benefit:

public static IEnumerable<T> SortAndDedupe<T>(this IEnumerable<T> input)
{
    var toDedupe = input.OrderBy(x => x);

    // Track the previously returned element; EqualityComparer handles types that
    // don't define ==, and the flag avoids comparing against an unassigned prev.
    var comparer = EqualityComparer<T>.Default;
    var havePrev = false;
    T prev = default(T);

    foreach (var element in toDedupe)
    {
        if (havePrev && comparer.Equals(element, prev)) continue;

        yield return element;
        prev = element;
        havePrev = true;
    }
}

//Usage
dtList = dtList.Where(s => !string.IsNullOrWhiteSpace(s)).SortAndDedupe().ToList();

This returns the same elements; they're just sorted.
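
For example, with a small hypothetical input:

var sample = new List<string> { "b", "a", "b", "" };
var result = sample.Where(s => !string.IsNullOrWhiteSpace(s)).SortAndDedupe().ToList();
// result is { "a", "b" }: the same distinct elements, but in sorted order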

KeithS
  • Great. If I'm not wrong, by iterating the elements you're actually performing the ordering. Can you think of a way to make your method "lazy"? – Amiram Korach Aug 08 '12 at 15:08
  • Unfortunately, most sorts require knowledge of the entire collection to be sorted; the very last element could be the first one that needs to be returned. So, all elements of the input must be evaluated to produce the first element of the output. The only sort I can think of that could be interrupted after finding the next element of its output is a SelectionSort variant, and in that case we're back where we started. – KeithS Aug 08 '12 at 15:14
  • Besides, in our case, the result of the entire operation is a list, requiring "eager" execution to begin with. If we wanted to work with it as an IEnumerable and defer execution of it, you could take the meat of the function and put it in a hidden Iterator class that implements IEnumerable. – KeithS Aug 08 '12 at 15:19
  • `Distinct` uses hashing and should be closer to O(N) than O(N^2). [source](http://stackoverflow.com/a/2799543/480799) – Risky Martin Aug 08 '12 at 15:46
  • ... Well I'll be darned, it does indeed; System.Linq.Set is an internal hashtable implementation used by Distinct, which would be near O(1) access time *assuming* the GetHashCode() implementation of your items is efficient and produces an evenly-distributed hash (the default implementation would do so). However a hashtable does have memory issues; .NET's basic implementation uses two arrays, one of ints and another of linked items, each one at best equal to the number of items in the set and at worst double that. – KeithS Aug 08 '12 at 16:28
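
For illustration, a rough sketch of the hash-based idea described above: a hypothetical extension method that keeps the first occurrence of each element in encounter order, similar in spirit to Distinct, though not the actual System.Linq.Set internals.

public static IEnumerable<T> DedupePreservingOrder<T>(this IEnumerable<T> input)
{
    // HashSet<T>.Add returns false when the item is already present,
    // so each element is yielded only the first time it is seen.
    var seen = new HashSet<T>();
    foreach (var element in input)
    {
        if (seen.Add(element))
            yield return element;
    }
}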
3

Amiram Korach's solution is indeed tidy. Here's an alternative for the sake of versatility.

var count = dtList.Count;
// Iterate in reverse so removals don't shift the indices that are still to be checked.
for (var i = count - 1; i > -1; i--)
{
    if (dtList[i] == string.Empty) dtList.RemoveAt(i);
}
// Keep only the unique list items.
dtList = dtList.Distinct().ToList();
IneedHelp
  • While this would work, the Where clause is faster because it doesn't have to mutate the input collection. You're minimizing the number of "shifts" that must be performed when removing elements from the list, but Where doesn't remove anything from the input; it just skips over elements that don't match. – KeithS Aug 08 '12 at 15:05