
I have an array of strings from which I want to find the top 10 most frequently occurring strings.

One primitive way of doing this is, of course, to loop through the array once, collecting the distinct strings into a new array, then scan the original array again for each of those distinct strings to count its occurrences, and finally store the counts in n integers, where n is the number of distinct strings.

Obviously this is terrible in terms of time efficiency, so I was wondering if there is a better way of doing this.

alestanis

3 Answers


If you don't care about memory, you can build a hash map holding the count of each string: loop through all your strings, and for each one do

myhash[mystring] += 1

if the string is already present in the hash, or

myhash[mystring] = 1

otherwise.

If you assume that looking up a value in a hash map takes constant time (which may not always hold), then this algorithm is "only" O(n), at the cost of extra memory.
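A minimal Java sketch of this idea (class and method names are illustrative, not from the post): count with a `HashMap`, then keep the 10 most frequent entries with a size-bounded min-heap, so the top-10 extraction stays O(n log 10) rather than sorting the whole map.

```java
import java.util.*;

public class TopStrings {
    static List<String> top10(String[] words) {
        // Count occurrences: merge() increments, or inserts 1 if absent.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        // Min-heap ordered by count; popping the smallest keeps it at size 10.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > 10) heap.poll();
        }
        // Drain the heap; reverse so the most frequent string comes first.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        String[] data = {"a", "b", "a", "c", "b", "a"};
        System.out.println(top10(data)); // most frequent first: [a, b, c]
    }
}
```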

alestanis
  • This is quite definitely the best O(N) solution to this problem. And I'm not sure you *can* get better than O(N) for this problem. – Colleen Mar 07 '13 at 22:39
  • @Colleen If your initial array is sorted maybe you can do better than `O(n)` but with a random array definitely you can't, you have to go through every value anyway – alestanis Mar 07 '13 at 22:40
  • You can't, indeed, as you must go through all strings at least once. – Ingo Mar 07 '13 at 22:40
  • Was assuming unsorted. But do tell, how could you make it faster for sorted? – Colleen Mar 07 '13 at 22:41
  • @alestanis - Even if it is sorted, you still need to go through it. – Ingo Mar 07 '13 at 22:41
  • 2
    It's not really O(n), because the entries in the map will have to be sorted to find the 10 largest ones. – JB Nizet Mar 07 '13 at 22:42
  • 1
    @Colleen When you get to the first element of a kind, you do binary search to look for the index of the last element. The average is not faster than O(n) but the best case is :) – alestanis Mar 07 '13 at 22:43
  • @Ingo Not necessarily (the binary search thing) – alestanis Mar 07 '13 at 22:43
  • There is no guarantee you find the last entry with binary search in a set of non-unique keys in fewer steps. The worst case will still be O(n) or more. Consider the case where many strings have just 1 or 2 occurrences. The binary search will then be more expensive. – Ingo Mar 07 '13 at 22:49
  • @Ingo That's what I said, the average will be more expensive than `O(n)` – alestanis Mar 07 '13 at 22:50
  • Actually thinking about it... You can do something like a reverse binary search: starting from i, check i+1, i+2, i+4, i+8, i+16, and if you go past the last you go back: i+12, i+10, i+9. Might be a pretty good algorithm on average *and* in the best case. Worst case would be having a lot of strings repeated 2^n+1 times (because you would hit the maximum amount of backtracking). – alestanis Mar 07 '13 at 22:53
  • @JBNizet you're right, it's n log(n). Though I feel like there is a way to make it O(N). Some kind of heap like structure (splay tree?) – Colleen Mar 07 '13 at 23:28

If you care about memory, you can sort the array and then count how many times each string appears easily: after sorting, each distinct string occupies a contiguous run of positions i, i+1, i+2, ..., i+k and appears nowhere else.

Sorting takes O(n log n), then O(n) for counting occurrences of strings.
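A sketch of this approach in Java (names are illustrative): sort a copy of the array, then count each contiguous run in a single pass. The top 10 can then be picked from the resulting counts.

```java
import java.util.*;

public class SortAndCount {
    static Map<String, Integer> countSorted(String[] words) {
        String[] sorted = words.clone();
        Arrays.sort(sorted);                     // O(n log n)
        Map<String, Integer> counts = new LinkedHashMap<>();
        int i = 0;
        while (i < sorted.length) {              // single O(n) scan
            int j = i;
            // Advance j to the end of the run of strings equal to sorted[i].
            while (j < sorted.length && sorted[j].equals(sorted[i])) j++;
            counts.put(sorted[i], j - i);        // run [i, j) is one distinct string
            i = j;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] data = {"pear", "apple", "pear", "apple", "apple"};
        System.out.println(countSorted(data)); // {apple=3, pear=2}
    }
}
```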

kyticka

You could use a Guava Multiset: add all the strings, then call Multisets.copyHighestCountFirst() and look only at the first 10 entries.

See this question for an example.
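A rough sketch of what that might look like (assumes Guava is on the classpath; the data values are made up): `Multisets.copyHighestCountFirst` returns a multiset whose entry set iterates in descending count order, so taking the first 10 entries gives the top 10.

```java
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.google.common.collect.Multisets;
import java.util.Arrays;

public class TopWithGuava {
    public static void main(String[] args) {
        // Build a multiset counting every string as it is added.
        Multiset<String> counts = HashMultiset.create(
                Arrays.asList("a", "b", "a", "c", "b", "a"));
        // Entries come out highest count first; stop after the first 10.
        int taken = 0;
        for (Multiset.Entry<String> e : Multisets.copyHighestCountFirst(counts).entrySet()) {
            if (taken++ == 10) break;
            System.out.println(e.getElement() + " x " + e.getCount());
        }
    }
}
```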

GuessBurger