Find out top 10 companies with highest volume trades

Question

I have been working on a problem from glass door that was being asked in one of the firm interviews by the firm that I ought to go to. The problem goes as :

If you have all the companies that are traded, and live inputs are coming of which company is being traded and what is the volume, how do you maintain the data, so that you can carry out operation of giving the top 10 most traded companies by volume of shares most efficiently

I thought of following solution for the same. Though I am not sure whether it is the efficient one or not: what about you maintain a binary search tree. With every insert you insert the company name and the volume of shares traded for it.

My basic node for the tree would then be:

class Node
{
String key; // company name
int volume; // volume
Node leftNode;
Node rightNode;
}

So at every new insert I will keep on inserting in the tree. And at the time of final retrieval , I can run the following code until the count of global count reaches 10.

traversal(Node a)
{
 if(a!=null)
  {
   traverse(a.getRightNode());
   System.out.println(a.getKey()+a.getValue());
   traverse(a.getLeftNode());
  }
}

What are your views on this solution?

@Luchian Grigore : Why did you negated the question ? I suppose it is a valid question. I had given my approach for the same as well. — Sameer Sawla, Nov 26 '13 at 09:56
I removed the C++ tag because it is in no way related to C++. — Luchian Grigore, Nov 26 '13 at 10:06
@LuchianGrigore : I could had been satisfied by the solution in C++ as well. Though the sample code that I had put was in Java. But I am working on C++ and Java both simultaneously. I thought it would be bad if I post two similar questions with different language requirements. — Sameer Sawla, Nov 26 '13 at 10:10
If there's no language requirement, don't add any language. You can put "language-agnostic" as the tag. Tags aren't a means to get more people to look at your question, and is in fact annoying. — Luchian Grigore, Nov 26 '13 at 10:12
@LuchianGrigore : Thanks for the "language-agnostic" tag info. Yes I can understand that. — Sameer Sawla, Nov 26 '13 at 10:15

score 0 · Answer 1 · answered Nov 26 '13 at 10:44

0

You can do it using min binary heap data structure where you maintain a heap of size 10 and delete the top element every time you have a new company which has greater volume than top and insert new company into heap. All the element currently in the heap are current top 10 companies.

Note: Add all the first 10 companies at the start.

answered Nov 26 '13 at 10:44

Vikram Bhat

6,106
3
20
19

I think the nature of the question implies a stream processing where each company can repeat more than once in the input, which will require an additional DS to achieve efficiency (for looking up by string). (which is basically what he has atm) – amit Nov 26 '13 at 11:21

score 0 · Answer 2 · edited May 23 '17 at 12:19

Well, there are trade-offs here. You are going to need to choose what you prefer - an efficient look-up (get top K) or an efficient insertion. As it seems, you cannot get both.

You can get O(logN) insertion and lookup by using two Data-structures:

Map<String,Node> - that maps from the company name a node in the second data structure. This will be a trie or a self balancing tree.
Map<Integer,String> - that maps from volumes to the company's name. This can be a map (hash/tree based) or it can also be a heap, since we have a link to the direct node, we can actually delete a node efficiently when needed.

Getting the top 10 can be done on the 2nd data structure in O(logN), and inserting each element requires looking by string - O(|S| * logN) (you can use a trie to get O(|S|) here) - and than modifying the second tree - which is O(logN)

Using a trie totals in O(|S|+logN) complexity for both get top K and insertions.

If number of data inserted is exponential in the number of getTopK() ops - it will be better to just keep a HashMap<String,Integer> and modify it as new data arrives, and when you get a findTopK() - do it in O(N) as described in this thread - using selection algorithm or a heap.

This results in O(|S|) insertion (on average) and O(N + |S|) get top K.

|S| is the length of the input/result string where it appears.
This answer assumes each company can appear more than once in the input stream.

You can have `O(|S|+|N|)` using heap and hash/trie/Judy without any trade of. — Hynek -Pichi- Vychodil, Nov 26 '13 at 19:07

score 0 · Accepted Answer · edited May 23 '17 at 11:56

This question is very similar to another question but with little twist. First of all, if somebody ask me this question I would ask a lot of questions. Do I know name of companies in advance? What is number of companies? Is there upper bound of their number? Do you mean time efficiency or memory consumption efficiency or mix of both? What is ratio of trades and top companies requests? It is not specified but I will assume high amount of trades and displaying Top 10 on demand or in some time interval. In case of requesting Top 10 after every trade arrival heap will be useless even for bigger N than 10 and whole algorithm would can be simpler. I also assume time efficiency. Memory efficiency is then constrained by CPU cache behaviour so we should not waste it anyway.

So we will store top N in some structure which will give me least member fast. This is for big N obviously heap. I can use any heap implementation even those which does have bad IncKey and Merge operations or doesn't have them at all. I will need only Insert, Peek and Remove operations. Number 10 is pretty small one and I would not even need heap for this especially in compiled languages with good compiler. I can use ordered array or list or even unordered one. So in every place where I will mention heap bellow, you can use ordered or unordered array or list. Heap is necessary only for bigger N in Top N.

So this is it, we will store Top N companies name and it's volume when inserted in heap.

Then we need track company trade volume in some K/V storage. Key is name. K/V storage for this purpose can be hashmap, trie or Judy. It will be good if we know company names in advance. It will allow us to compute perfect hash for hashmap or construct optimal trie. Otherwise it will be nice if we know upper bound company number otherwise to choose good hash length and number of buckets. Otherwise we will have to make resizable hash or use Judy. There is not know trie implementation for dynamic K/V better than hashmap or Judy. All of this K/V storages has O(k) access complexity, where k is length of Key which is name in this case. In every place, where I will mention hashmap bellow, you can use Judy or trie. You can use trie only when all of company names are known in advance and you can tailor super fast optimized code.

So we sill store company name as Key and trade volume so far and flag indicating storing in heap in hashmap.

So there is algorithm here. We will have state which contain heap, number of companies in heap and hashmap. For each arrived company mane and volume we will increase volume in hashmap. Then if companies in heap is less than N (10) we will add this company name and volume from hashmap to the heap if is not there yet (according to flag and set this flag in hashmap). Otherwise if heap is full and current company is not in heap, we will peek into heap and if current company has less volume traded so far (in hashmap) than company in heap we can finish this trade and go for next. Otherwise we have to update companies in heap first. While company in top of heap (it means with least volume) has volume in heap less than in current one and also different than in hashmap, we will update this volume. It can be done by removing from heap, and insert right value. Then check again top of heap and so on. Note, that we don't need update all companies in heap and even not all top heap companies which are not up to date. It's pretty lazy. If current company has still bigger volume than in top of heap, we will remove company from heap and insert current one and update flags in hashmap. That`s all.

Brief recapitulation:

min-heap storing top N companies ordered by volume and containing company name or direct index into hashmap
volume in heap can be out of date
hashmap with company name as key and up-to-date volume and flag indicating heap member as value
first update current company volume in hashmap and remember
repeatedly update heap top if less than current traded company
remove heap top if still less than current company and add current one in heap

This algorithm gain advantage that trade volume can be only positive number so volume in heap can be only less than right value and if top of heap has least value from all of heap and still less than right value and still bigger than any company in hasmap everything is perfect. Otherwise we would have to store all companies in heap, use max heap instead min heap, implement IncKey and perform this operation for all trades and keep back-references to heap in hashmap and everything is far more complicated.

Processing of new trade time complexity is nice O(1). O(1) is hashmap lookup, O(1) is Peek in heap. Insert and Delete in heap are amortized O(1) or O(logN) where N is constant so still O(1). Number of updates in heap is O(N) so O(1). You can also compute upper bound of processing time when there is upper bound of companies number (hashmap size problem mentioned at the beginning) so with good implementation you can consider it real time. Keep in mind that simpler solution (as ordered or unordered list, updating all Top members and so) can bring better performance in compiled code for small N as 10 especially on modern HW.

This algorithm can be nicely implemented even in functional language except there is not pure functional hash table but trie should have O(1) behavior or there will be some impure module for this. For example Erlang implementation using ordered list as heap and dictionary for hashmap. (Mine favorite functional heap is pairing heap but for 10 it is overkill.)

-module(top10trade).

-record(top10, {
    n = 0,
    heap = [],
    map = dict:new()
    }).

-define(N, 10).

-export([new/0, trade/2, top/1, apply_list/2]).

new() ->
  #top10{}.

trade({Name, Volume}, #top10{n = N, map = Map} = State)
    % heap is not full
    when N < ?N ->
  case dict:find(Name, Map) of
    % it's already in heap so update hashmap only
    {ok, {V, true}} ->
      State#top10{map = dict:store(Name, {V+Volume, true}, Map)};
    % otherwise insert to heap
    error ->
      State#top10{
        n = N+1,
        heap = insert({Volume, Name}, State#top10.heap),
        map = dict:store(Name, {Volume, true}, Map)
        }
  end;

% heap is full
trade({Name, Volume}, #top10{n = ?N, map = Map} = State) ->
  % look-up in hashmap
  {NewVolume, InHeap} = NewVal = case dict:find(Name, Map) of
    {ok, {V, In}} -> {V+Volume, In};
    error -> {Volume, false}
  end,
  if InHeap ->
      State#top10{map = dict:store(Name, NewVal, Map)};
    true ->  % current company is not in heap so peek in heap and try update
      update(NewVolume, Name, peek(State#top10.heap), State)
  end.

update(Volume, Name, {TopVal, _}, #top10{map = Map} = State)
    % Current Volume is smaller than heap Top so store only in hashmap
    when Volume < TopVal ->
  State#top10{map = dict:store(Name, {Volume, flase}, Map)};
update(Volume, Name, {TopVal, TopName}, #top10{heap = Heap, map = Map} = State) ->
  case dict:fetch(TopName, Map) of
    % heap top is up-to-date and still less than current
    {TopVal, true} ->
      State#top10{
        % store current to heap
        heap = insert({Volume, Name}, delete(Heap)),
        map = dict:store( % update current and former heap top records in hashmap
          Name, {Volume, true},
          dict:store(TopName, {TopVal, false}, Map)
          )
        };
    % heap needs update
    {NewVal, true} ->
      NewHeap = insert({NewVal, TopName}, delete(Heap)),
      update(Volume, Name, peek(NewHeap), State#top10{heap = NewHeap})
  end.

top(#top10{heap = Heap, map = Map}) ->
  % fetch up-to-date volumes from hashmap
  % (in impure language updating heap would be nice)
  [ {Name, element(1, dict:fetch(Name, Map))}
   || {_, Name} <- lists:reverse(to_list(Heap)) ].

apply_list(L, State) ->
  lists:foldl(fun apply/2, State, L).

apply(top, State) ->
  io:format("Top 10: ~p~n", [top(State)]),
  State;
apply({_, _} = T, State) ->
  trade(T, State).

%%%% Heap as ordered list

insert(X, []) -> [X];
insert(X, [H|_] = L) when X < H -> [X|L];
insert(X, [H|T]) -> [H|insert(X, T)].

-compile({inline, [delete/1, peek/1, to_list/1]}).

delete(L) -> tl(L).

peek(L) -> hd(L).

to_list(L) -> L.

It performs nice 600k trades per second. I would expect few millions per second in C implementation depending of number of companies. More companies means slower K/V look-up and update.

Find out top 10 companies with highest volume trades

3 Answers3