1

I'm sure something like this exists, but I don't know what it would be called (or how to find more info on it). If I have an alphabetically sorted list of words, and I'm checking to see if and where the word "test" is in that list, it doesn't make sense to start at the beginning, but to start in the T's, right? And the same for numbers, of course. Is there a way to implement something like this and tailor the the start of the search? Or do hash sets and methods like Contain already do this by themselves?

EDIT:

For example, if I have a list of integers like {1,2,3,5,7,8,9,23..}, is there any automatic way to sort it so that when I check the list for the element "9", it doesn't begin from the beginning...?

Sorry, this is a simple example, but I do intend to search thousands of times through a list that potentially contains thousands of elements

EDIT 2:

From the replies, I learned about Binary search, but since that apparently starts in the middle of your list, is it possible to implement something manually, along the lines of, for example, splitting a list of words into 26 bins such that when you search for a particular word, it can immediately start searching in the best place (or maybe 52 bins if each bin starts to become overpopulated...)

thatandrey
  • 287
  • 5
  • 16
  • To edit 2: so what about http://www.wikiwand.com/en/Bucket_sort? –  Sep 13 '14 at 07:09
  • @pwas Thanks, but the page has this weird ad that I can't get through? – thatandrey Sep 13 '14 at 07:39
  • Sorry, I've forgotten that I use special plugin fgor wiki pages. Here is original link: http://en.wikipedia.org/wiki/Bucket_sort –  Sep 13 '14 at 07:40

2 Answers2

3

When you say you have a sorted list and you want to search it, the algorithm that immediately jumps to my mind is a binary search. Fortunately List<T> already has that implemented.

The example on that link actually looks to do exactly what you want (it's dealing with finding a word in a list of sorted words too).

In essence, you want something like this:

List<string> words = ...;

words.Sort(); // or not depending on the source

var index = words.BinarySearch("word");

if(index > -1)
{
    // word was found, and its index is stored in index
}
else // you may or may not want this part
{    // this will insert the word into the list, so that you don't have to re-sort it.
    words.Insert(~index, "word");
}

This, of course, also works with ints. Simply replace List<string> with List<int> and your BinarySearch argument with an int.

Most Contains-type functions simply loop through the collection until coming across the item you're looking for. That works great in that you don't have to sort the collection first, but it's not so nice when you start off with it sorted. So in most cases, if you're searching the same list a lot, sort it and BinarySearch it, but if you're modifying the list a lot and only searching once or twice, a regular IndexOf or Contains will likely be your best bet.


If you're looking to group words by their first letter, I would probably use a Dictionary<char, List<string>> to store them. I chose List over an array for the purposes of mutability, so make that call on your own--there's also Array.BinarySearch if you choose to use an array. You could get into a proprietary tree model, but that may or may not be overkill. To do the dictionary keyed by first character, you'll want something like this:

Dictionary<char, List<string>> GetDict(IEnumerable<string> args)
{
    return args.GroupBy(c => c[0]).ToDictionary(c => c.Key, c => c.OrderBy(x => x).ToList());
}

Then you can use it pretty simply, similarly to before. The only change would reside in the fetching of your list.

Dictionary<char, List<string>> wordsByKey = GetDict(words);
List<string> keyed;
string word = "word";

if (wordsByKey.TryGetValue(word[0], out keyed))
{
    // same as before
}
else
{
    wordsByKey.Add(word[0], new List<string>() { word }); // or not, again
                                                          // depending on whether you
                                                          // want the list to update.
}
Matthew Haugen
  • 12,916
  • 5
  • 38
  • 54
  • Oooh, that's interesting! I just looked into Binary search. Does it always start in middle of the list? Is it possible to enhance the performance? In the case of strings, could I manually split up the strings into a list of 26 "bins" according to their first letter, so I can hit the "jackpot" quicker? – thatandrey Sep 13 '14 at 06:53
  • @user3408097 interesting. You'd probably want something more custom for that. I'll edit my answer in a couple minutes and give my idea. – Matthew Haugen Sep 13 '14 at 07:04
  • 1
    +1 If not modifying much but searching a lot, you should use `SortedSet` or `SortedList` or one of those sorted stuff - I bet one of them implements `BinarySearch`. Also, splitting into lists by letters would mean you have bazillion lists. You could implement a sorted list for strings which also keep track of the start-indices of a-z in itself, to be able to instantly tell that `[x to `y] are words starting with `h`. – SimpleVar Sep 13 '14 at 07:09
  • @user3408097 I've updated my answer. I also included the line to insert words post-facto without re-sorting, as I saw you requested in another comment. – Matthew Haugen Sep 13 '14 at 07:13
  • @MatthewHaugen Thanks, that makes a lot of sense, that's definitely a great place to start studying. But I was curious about hashsets, do they have something similar I can take advantage of, and do I need to specify it explicitly like `list.BinarySearch`? Because as far as I know, sorting a list and then running `list.Contains` is useless, right? – thatandrey Sep 13 '14 at 07:24
  • @user3408097 Since `HashSet` is theoretically unordered, "sorting it" is meaningless. As Yorye said above, a `SortedList` might be a nice place to look, but I don't think that implements any sort of `BinarySearch` (unless you `ToArray` it and use the `Array` method). As for sorting then doing `Contains`, "useless" is kind of a harsh word, but yes, it's certainly probably not going to do what you want. It would make it faster for words early in the alphabet, and slower for words later in it. So if your words were evenly distributed (which I doubt they are, but is practically true), then don't. – Matthew Haugen Sep 13 '14 at 07:27
  • 1
    Implementing your own binary search is super easy, by the way. So using `SortedList` is totally an option. – SimpleVar Sep 13 '14 at 07:53
1

When list is sorted, then you are looking for BinarySearch: http://msdn.microsoft.com/pl-pl/library/3f90y839%28v=vs.110%29.aspx. The complexity is O(log n) against O(n) in simple Contains.

List<string> myList = GetList();
string elementToSearch = "test";

if (myList.Contains(elementToSearch)) 
{
    // found, O(n), works on unsorted list
}

if (myList.BinarySearch(elementToSearch)) >= 0)
{
    // found, O(log n), works only on sorted list
}

To claryify: What is the difference between Linear search and Binary search?

To your edit:

If your input collection is not sorted, you should use Contains or IndexOf due to mentioned O(n) time. It will loop your collection once. Sorting collection is less efficient - it takes O(n log n). S it's not efficient to sort it in order to search one element.

Some sample to realize the pefromance:

var r = new Random();
var list = new List<int>();

for (var i = 1; i < 10000000; i++)
{
    list.Add(r.Next());
}

// O (log n) - we assume that list is sorted, so sorting is pefromed outside watch
var sortedList = new List<int>(list);
sortedList.Sort();

var elementToSearch = sortedList.Last();

var watcher = new Stopwatch();
watcher.Start();
sortedList.BinarySearch(elementToSearch);
watcher.Stop();
Console.WriteLine("BinarySearch on already sorted: {0} ms",
                           watcher.Elapsed.TotalMilliseconds);

// O(n) - simple search
elementToSearch = list.Last();
watcher.Reset();
watcher.Start();
list.IndexOf(elementToSearch);
watcher.Stop();
Console.WriteLine("IndexOf on  unsorted: {0} ms",
                     watcher.Elapsed.TotalMilliseconds);

// O(n log n) + O (log n)
watcher.Reset();
watcher.Start();
list.Sort();
elementToSearch = list.Last();
list.BinarySearch(elementToSearch);
watcher.Stop();
Console.WriteLine("Sort + binary search on unsorted: {0} ms"
                    , watcher.Elapsed.TotalMilliseconds);

Console.ReadKey();

Result:

BinarySearch on already sorted: 0.0248 ms

IndexOf on unsorted: 6.144 ms

Sort + binary search on unsorted: 1157.3298 ms

Edit to edit 2: I think you are looking for BucketSort rather: You can implement it by own, but I think that solution with Dictionary of Matthew Haugen is simpler and faster to implement :)

Community
  • 1
  • 1
  • If I take a list and do `list.Sort()` and then add an element to it, does it become unsorted or will it append the element in the correct location to keep the sorting intact? – thatandrey Sep 13 '14 at 07:00
  • Ufortunately, element wil be added at the end of list, so list will be no more sorted. You can find index where you should insert your element in `O(log n)` time, but inserting element into middle of the list costs `O (n)` (via `List.Insert(index, element)`). –  Sep 13 '14 at 07:07
  • @user3408097 You would have to insert them in the proper index in order to maintain the order. See Matthew's answer. – SimpleVar Sep 13 '14 at 07:16