This question is very similar to another question but with little twist. First of all, if somebody ask me this question I would ask a lot of questions. Do I know name of companies in advance? What is number of companies? Is there upper bound of their number? Do you mean time efficiency or memory consumption efficiency or mix of both? What is ratio of trades and top companies requests? It is not specified but I will assume high amount of trades and displaying Top 10 on demand or in some time interval. In case of requesting Top 10 after every trade arrival heap will be useless even for bigger N
than 10 and whole algorithm would can be simpler. I also assume time efficiency. Memory efficiency is then constrained by CPU cache behaviour so we should not waste it anyway.
So we will store top N
in some structure which will give me least member fast. This is for big N
obviously heap. I can use any heap implementation even those which does have bad IncKey
and Merge
operations or doesn't have them at all. I will need only Insert
, Peek
and Remove
operations. Number 10 is pretty small one and I would not even need heap for this especially in compiled languages with good compiler. I can use ordered array or list or even unordered one. So in every place where I will mention heap bellow, you can use ordered or unordered array or list. Heap is necessary only for bigger N
in Top N
.
So this is it, we will store Top N
companies name
and it's volume
when inserted in heap.
Then we need track company trade volume
in some K/V storage. Key is name
. K/V storage for this purpose can be hashmap, trie or Judy. It will be good if we know company names in advance. It will allow us to compute perfect hash for hashmap or construct optimal trie. Otherwise it will be nice if we know upper bound company number otherwise to choose good hash length and number of buckets. Otherwise we will have to make resizable hash or use Judy. There is not know trie implementation for dynamic K/V better than hashmap or Judy. All of this K/V storages has O(k)
access complexity, where k
is length of Key which is name
in this case. In every place, where I will mention hashmap bellow, you can use Judy or trie. You can use trie only when all of company names are known in advance and you can tailor super fast optimized code.
So we sill store company name
as Key and trade volume
so far and flag
indicating storing in heap in hashmap.
So there is algorithm here. We will have state which contain heap, number of companies in heap and hashmap. For each arrived company mane
and volume
we will increase volume
in hashmap. Then if companies in heap is less than N
(10) we will add this company name
and volume
from hashmap to the heap if is not there yet (according to flag and set this flag in hashmap). Otherwise if heap is full and current company is not in heap, we will peek into heap and if current company has less volume
traded so far (in hashmap) than company in heap we can finish this trade and go for next. Otherwise we have to update companies in heap first. While company in top of heap (it means with least volume
) has volume
in heap less than in current one and also different than in hashmap, we will update this volume
. It can be done by removing from heap, and insert right value. Then check again top of heap and so on. Note, that we don't need update all companies in heap and even not all top heap companies which are not up to date. It's pretty lazy. If current company has still bigger volume
than in top of heap, we will remove company from heap and insert current one and update flags in hashmap. That`s all.
Brief recapitulation:
min-heap storing top N
companies ordered by volume
and containing company name
or direct index into hashmap
volume
in heap can be out of date
hashmap with company name
as key and up-to-date volume
and flag indicating heap member as value
first update current company volume
in hashmap and remember
repeatedly update heap top if less than current traded company
remove heap top if still less than current company and add current one in heap
This algorithm gain advantage that trade volume
can be only positive number so volume
in heap can be only less than right value and if top of heap has least value from all of heap and still less than right value and still bigger than any company in hasmap everything is perfect. Otherwise we would have to store all companies in heap, use max heap instead min heap, implement IncKey
and perform this operation for all trades and keep back-references to heap in hashmap and everything is far more complicated.
Processing of new trade time complexity is nice O(1)
. O(1)
is hashmap lookup, O(1)
is Peek
in heap. Insert
and Delete
in heap are amortized O(1) or O(logN)
where N
is constant so still O(1)
. Number of updates in heap is O(N)
so O(1)
. You can also compute upper bound of processing time when there is upper bound of companies number (hashmap size problem mentioned at the beginning) so with good implementation you can consider it real time. Keep in mind that simpler solution (as ordered or unordered list, updating all Top members and so) can bring better performance in compiled code for small N
as 10 especially on modern HW.
This algorithm can be nicely implemented even in functional language except there is not pure functional hash table but trie should have O(1) behavior or there will be some impure module for this. For example Erlang implementation using ordered list as heap and dictionary for hashmap. (Mine favorite functional heap is pairing heap but for 10 it is overkill.)
-module(top10trade).
-record(top10, {
n = 0,
heap = [],
map = dict:new()
}).
-define(N, 10).
-export([new/0, trade/2, top/1, apply_list/2]).
new() ->
#top10{}.
trade({Name, Volume}, #top10{n = N, map = Map} = State)
% heap is not full
when N < ?N ->
case dict:find(Name, Map) of
% it's already in heap so update hashmap only
{ok, {V, true}} ->
State#top10{map = dict:store(Name, {V+Volume, true}, Map)};
% otherwise insert to heap
error ->
State#top10{
n = N+1,
heap = insert({Volume, Name}, State#top10.heap),
map = dict:store(Name, {Volume, true}, Map)
}
end;
% heap is full
trade({Name, Volume}, #top10{n = ?N, map = Map} = State) ->
% look-up in hashmap
{NewVolume, InHeap} = NewVal = case dict:find(Name, Map) of
{ok, {V, In}} -> {V+Volume, In};
error -> {Volume, false}
end,
if InHeap ->
State#top10{map = dict:store(Name, NewVal, Map)};
true -> % current company is not in heap so peek in heap and try update
update(NewVolume, Name, peek(State#top10.heap), State)
end.
update(Volume, Name, {TopVal, _}, #top10{map = Map} = State)
% Current Volume is smaller than heap Top so store only in hashmap
when Volume < TopVal ->
State#top10{map = dict:store(Name, {Volume, flase}, Map)};
update(Volume, Name, {TopVal, TopName}, #top10{heap = Heap, map = Map} = State) ->
case dict:fetch(TopName, Map) of
% heap top is up-to-date and still less than current
{TopVal, true} ->
State#top10{
% store current to heap
heap = insert({Volume, Name}, delete(Heap)),
map = dict:store( % update current and former heap top records in hashmap
Name, {Volume, true},
dict:store(TopName, {TopVal, false}, Map)
)
};
% heap needs update
{NewVal, true} ->
NewHeap = insert({NewVal, TopName}, delete(Heap)),
update(Volume, Name, peek(NewHeap), State#top10{heap = NewHeap})
end.
top(#top10{heap = Heap, map = Map}) ->
% fetch up-to-date volumes from hashmap
% (in impure language updating heap would be nice)
[ {Name, element(1, dict:fetch(Name, Map))}
|| {_, Name} <- lists:reverse(to_list(Heap)) ].
apply_list(L, State) ->
lists:foldl(fun apply/2, State, L).
apply(top, State) ->
io:format("Top 10: ~p~n", [top(State)]),
State;
apply({_, _} = T, State) ->
trade(T, State).
%%%% Heap as ordered list
insert(X, []) -> [X];
insert(X, [H|_] = L) when X < H -> [X|L];
insert(X, [H|T]) -> [H|insert(X, T)].
-compile({inline, [delete/1, peek/1, to_list/1]}).
delete(L) -> tl(L).
peek(L) -> hd(L).
to_list(L) -> L.
It performs nice 600k trades per second. I would expect few millions per second in C implementation depending of number of companies. More companies means slower K/V look-up and update.