
I can't imagine this hasn't been asked before, but I have searched everywhere and could not find the answer.

I have an iterable which contains duplicate elements. I want to count the number of times each element occurs in this iterable and return the n-th most frequent one.

I have working code which does exactly that, but I really doubt it's the most optimal way to achieve this.

use std::collections::{BinaryHeap, HashMap};

// returns n-th most frequent element in collection
pub fn most_frequent<T: std::hash::Hash + std::cmp::Eq + std::cmp::Ord>(array: &[T], n: u32) -> &T {
    // initialize empty hashmap
    let mut map = HashMap::new();

    // count the occurrences of each element in the iterable and save them as (value, count) in the hashmap
    for value in array {
        // taken from https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.entry
        // not exactly sure how this works
        let counter = map.entry(value).or_insert(0);
        *counter += 1;
    }

    // determine highest frequency of some element in the collection
    let mut heap: BinaryHeap<_> = map.values().collect();
    let mut max = heap.pop().unwrap();
    // get n-th largest value
    for _i in 1..n {
        max = heap.pop().unwrap();
    }

    // find that element (get key from value in hashmap)
    // taken from https://stackoverflow.com/questions/59401720/how-do-i-find-the-key-for-a-value-in-a-hashmap
    map.iter()
        .find_map(|(key, &val)| if val == *max { Some(key) } else { None })
        .unwrap()
}

Are there any better ways or more optimal std methods to achieve what I want? Or maybe there are some community-made crates that I could use.

Ach113
  • I think you are pretty close to an optimal solution here. The first step is to count the occurrences of all the elements like you did. Secondly, you want to find the kth smallest / largest element in the unsorted list. For that, I suggest you look at [this thread](https://stackoverflow.com/questions/251781/how-to-find-the-kth-largest-element-in-an-unsorted-array-of-length-n-in-on) and its references. – Niklas Mohrin Oct 08 '20 at 12:27
  • I don't know Rust at all. I was just wondering whether there is a way to perform a GroupBy and a Count on your collection? Ordering by the Count would then allow you to reference the item by index. – Captain Kenpachi Oct 08 '20 at 13:00
  • 1
    @Ach113 don't know if they're better but `BinaryHeap::into_sorted_vec()` would avoid repeatedly pop-ing the heap (which might not be great), alternatively you could collect() the hashmap into a vec of `(count, item)` then sort that by count, and get the nth element directly. – Masklinn Oct 08 '20 at 13:56
  • Using a BinaryHeap to find the most frequent element is useless; finding the max takes only O(n), while here you do O(n * log(n)) if not more, because you do a lot for not much. Also, preallocate the hashmap. – Stargateur Oct 08 '20 at 13:56
  • @Stargateur, if I wanted just the max element I would do just that; is there a way to get the n-th largest in O(n)? (without keeping track of n variables) – Ach113 Oct 08 '20 at 14:01
  • @Ach113 you return only one value so I don't follow you, n-th is not clear for me and what you said doesn't match your code – Stargateur Oct 08 '20 at 14:04
  • @Stargateur, I mean something like finding the 3rd largest element in an array (if n=3) and returning it. I'm not sure if that can be done in O(n) (without keeping track of 3 variables at all times) – Ach113 Oct 08 '20 at 14:09
  • ok my bad I didn't fully understand your code – Stargateur Oct 08 '20 at 14:11
  • I would use a min binary heap, only keep the n biggest values in it, and keep a reference to the item in the hashmap, not only the count – Stargateur Oct 08 '20 at 14:19
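
To illustrate the sort-based alternative Masklinn suggests in the comments above, here is a minimal sketch; the function name nth_most_frequent and the 1-based n parameter are my own assumptions, not code from the question:

use std::collections::HashMap;

// Hypothetical helper: count occurrences, sort by count (largest first),
// and return the element at position n (1-based), if there is one.
pub fn nth_most_frequent<T: std::hash::Hash + Eq>(array: &[T], n: usize) -> Option<&T> {
    let mut map: HashMap<&T, usize> = HashMap::new();
    for value in array {
        *map.entry(value).or_insert(0) += 1;
    }

    // Collect (count, element) pairs and sort them by count, descending.
    let mut pairs: Vec<(usize, &T)> = map.into_iter().map(|(k, v)| (v, k)).collect();
    pairs.sort_unstable_by(|a, b| b.0.cmp(&a.0));

    // n is assumed to be >= 1; n = 1 returns the most frequent element.
    pairs.get(n - 1).map(|&(_, element)| element)
}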

1 Answer


Your implementation has a time complexity of Ω(n log n), where n is the length of the array. The optimal solution to this problem runs in O(n log k) time for retrieving the k-th most frequent element. The usual implementation of this optimal solution indeed involves a binary heap, but not in the way you used it.

Here's a suggested implementation of the common algorithm:

use std::cmp::{Eq, Ord, Reverse};
use std::collections::{BinaryHeap, HashMap};
use std::hash::Hash;

pub fn most_frequent<T>(array: &[T], k: usize) -> Vec<(usize, &T)>
where
    T: Hash + Eq + Ord,
{
    let mut map = HashMap::new();
    for x in array {
        *map.entry(x).or_default() += 1;
    }

    let mut heap = BinaryHeap::with_capacity(k + 1);
    for (x, count) in map.into_iter() {
        heap.push(Reverse((count, x)));
        if heap.len() > k {
            heap.pop();
        }
    }
    heap.into_sorted_vec().into_iter().map(|r| r.0).collect()
}

(Playground)

I changed the prototype of the function to return a vector of the k most frequent elements together with their counts, since this is what you need to keep track of anyway. If you only want the k-th most frequent element, you can use result[k - 1].1.
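
For example, a small usage sketch, assuming the most_frequent function above is in scope (the input data is made up):

fn main() {
    let data = [3, 1, 3, 2, 3, 2];
    // The two most frequent elements, together with their counts, most frequent first.
    let top2 = most_frequent(&data, 2);
    // k-th most frequent element: index with [k - 1], then take the element with .1.
    let second = top2[2 - 1].1;
    println!("{:?} {}", top2, second); // [(3, 3), (2, 2)] 2
}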

The algorithm itself first builds a map of element counts the same way your code does – I just wrote it in a more concise form.

Next, we build a BinaryHeap of the most frequent elements. After each iteration, this heap contains at most k elements, which are the most frequent ones seen so far. If there are more than k elements in the heap, we drop the least frequent one. Since we always drop the least frequent element seen so far, the heap always retains the k most frequent elements seen so far. We need the Reverse wrapper to get a min-heap, as described in the documentation of BinaryHeap.
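
As a standalone illustration of that min-heap trick (not part of the answer's function), wrapping values in Reverse makes pop() return the smallest item first:

use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn main() {
    // BinaryHeap is a max-heap by default; Reverse flips the ordering.
    let mut heap = BinaryHeap::new();
    heap.push(Reverse(3));
    heap.push(Reverse(1));
    heap.push(Reverse(2));
    // pop() now yields the smallest value first, which is what lets the
    // answer discard the least frequent element when the heap grows past k.
    assert_eq!(heap.pop(), Some(Reverse(1)));
    assert_eq!(heap.pop(), Some(Reverse(2)));
    assert_eq!(heap.pop(), Some(Reverse(3)));
}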

Finally, we collect the results into a vector. The into_sorted_vec() function basically does this job for us, but we still want to unwrap the items from their Reverse wrappers – that wrapper is an implementation detail of our function and should not be returned to the caller.

(In Rust Nightly, we could also use the into_iter_sorted() method, saving one vector allocation.)

The code in this answer makes sure the heap is limited to essentially k elements, so an insertion into the heap has a complexity of O(log k). In your code, you push the counts of all elements onto the heap without limiting its size, so you end up with a complexity of O(log n) for insertions. You essentially use the binary heap to sort a list of counts, which works, but it's certainly neither the easiest nor the fastest way to achieve that, so there is little justification for going that route.

Sven Marnach
  • if n is big, this is better: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=938321fd198dfe64bdf0e1b491ecbe7b (in fact, peek_mut should be better) – Stargateur Oct 08 '20 at 16:27
  • here we go https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=ae2df47e6d345cd10dcd963ae68bbca4 – Stargateur Oct 08 '20 at 16:33
  • also wait a second... why is `(usize, &T)` ordered? we only want to compare the usize... – Stargateur Oct 08 '20 at 16:37
  • 1
    [Here's another variation](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=d02e88bf4551584fa85a7b560d0292c3) that orders by frequency and position in the original array, so you don't need `T: Ord`. (although if `T` had to be `Ord` anyway, you could use `BTreeMap` instead and eliminate `T: Hash`...) – trent Oct 08 '20 at 17:08
  • 1
    @Stargateur I didn't try to include micro-optimisations, but rather to keep the code as simple and readable as possible. The code compares the `(usize, &T)` pair because it doesn't really matter. I first introduced a custom `Item` struct with the count and a reference to the item and added a custom `PartialEq` implementation that only compares the count, but it was a lot more code to explain, and the only benefit was that we could drop `T: Ord`, which the OP had included already in the question. – Sven Marnach Oct 08 '20 at 19:39
  • @trentcl Storing the first index is nice as well. – Sven Marnach Oct 08 '20 at 19:40
  • @SvenMarnach sorry, but why would the Ord of T not be included? I don't get what the binary heap is comparing right now, (usize, &T), so does it order by the max of a str? Also, I don't think my version is a micro-optimisation. – Stargateur Oct 08 '20 at 19:52
  • @Stargateur Pairs are compared lexicographically, so the `&T`s are only compared when the counts are identical, in which case we don't have any specification of which element to pick, so we can pick any, including the one with the greater `&T`. It just doesn't matter. And I called your optimization a micro-optimization because it's rather low-level and subtle, and it doesn't change the algorithmic complexity of the code. It's completely fair if "micro-optimization" means something else to you. – Sven Marnach Oct 09 '20 at 09:10