
What is the best way (algorithm / data structure) to get the top-k items ordered in a shopping site, when the relevant information is spread across the logs of its n servers?
I was thinking of an approach that maintains a doubly linked list of fixed size k, where each node has a count variable (or a count range) and a set of product IDs that share the same count. With the arrival of each event (productId), the list is traversed, the count is updated, and the product is elevated to the next higher count range if possible.
Is the above approach correct? What are some other, better solutions?

redzedi
  • 1. Can you get all logs in the same place, or do you need to analyse the log independently on the individual servers? 2. Can one product be in the logs of multiple servers? – Tobber Jan 29 '15 at 16:44
  • @Tobber 1. The logs are very large and need to be analysed separately, with their results combined. 2. A product can appear in multiple logs – redzedi Jan 29 '15 at 16:59

3 Answers


Your approach is incorrect. You said the list is of fixed size k, but that implies you already know which elements are the top k - which is obviously not the case. Assume you already have a populated list of size k and you have traversed half of the items; now the next item repeats for the entire remaining collection (n/2 repeats). It should obviously be in the top k, but you never put it in your list - so the result is wrong.

You can approach the problem in several ways, depending on the limitations (mainly, the size of your log files).

Approach 1: Build a histogram and find top k elements

First, iterate over the list and build a histogram (a hash- or tree-based map<item,int>). Once you know how many times each element occurs, the problem reduces to finding the top k elements, which is covered in detail in this thread.
Finding the top k is done by maintaining a min heap of size k: iterate over your collection and, for each item, check whether its count is higher than the minimal item in the heap; if it is, pop the minimal element from the heap and insert this item instead.

Building the histogram is done simply by:

histogram = new map<item,int>
for each element x in the list:
  count = (histogram contains x ? histogram.get(x) : 0) + 1
  histogram.put(x, count)

This approach's complexity is O(n log n) if using a tree-based map, or O(n log k) if using a hash-based map. This is pretty efficient, but if your log file contains trillions of entries, it might become impossible to finish in reasonable time on a single machine, and you'll need to distribute your work over several machines. This leads us to the next approach.
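
For concreteness, here is a minimal single-machine sketch of Approach 1 in Java; the class and method names are my own and not part of the original answer. It builds the histogram with a HashMap and then keeps the k most frequent products in a min heap keyed by count.

import java.util.*;

class TopKFromLog {
    static List<String> topK(List<String> events, int k) {
        // Histogram: productId -> number of occurrences in the log
        Map<String, Long> histogram = new HashMap<>();
        for (String productId : events) {
            histogram.merge(productId, 1L, Long::sum);
        }

        // Min heap of size k; the entry with the smallest count sits on top
        PriorityQueue<Map.Entry<String, Long>> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Long> entry : histogram.entrySet()) {
            if (heap.size() < k) {
                heap.offer(entry);
            } else if (entry.getValue() > heap.peek().getValue()) {
                heap.poll();       // evict the current minimum
                heap.offer(entry); // and keep the more frequent item instead
            }
        }

        // Drain the heap; reverse so the most frequent product comes first
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) result.add(heap.poll().getKey());
        Collections.reverse(result);
        return result;
    }
}

Calling topK(events, 3) on a list of product ids returns the three most frequent ones (ties broken arbitrarily), in O(n log k) overall after the O(n) histogram pass.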

Approach 2: map-reduce

This approach is for very large log files, and works by distributing the problem over a large cluster. It is more sophisticated - but for very large files, it might be impossible to find the top k elements using a single machine.

map(file):
  for each item in file:
      emit(item,1)
reduce(item,list):
  sum = 0
  for each x in list:
      sum = sum + x
  emit(item,sum)

At this stage you have processed the log and built a histogram. Now we need to find the top k: the idea is to split the (item, sum) data so each machine gets a portion and produces its local top k elements, and then send all #machines*k candidates to a single "master" machine that chooses the global top k. A sketch of this merge step follows below.
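
As a hedged sketch of that final merge (the class and method names are mine, and I assume each machine holds a disjoint partition of the global (item, sum) pairs produced by the reduce step): every machine picks its local top k candidates (shown here with a simple sort for brevity), and the master merges the at most #machines*k candidates into the global top k.

import java.util.*;
import java.util.stream.*;

class TopKMerge {
    // Local step, run on each machine over its partition of GLOBAL sums.
    // Each item lives on exactly one machine, so no count is split across machines.
    static List<Map.Entry<String, Long>> localTopK(Map<String, Long> partition, int k) {
        return partition.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toList());
    }

    // Master step: merge all machines' candidates (at most machines * k entries)
    // and choose the global top k.
    static List<Map.Entry<String, Long>> globalTopK(List<List<Map.Entry<String, Long>>> candidates, int k) {
        return candidates.stream()
                .flatMap(List::stream)
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .collect(Collectors.toList());
    }
}

Because each machine sees the full global sum for every item in its partition, an item it discards locally cannot belong to the global top k.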

amit
    The approach 2 may be wrong. Suppose k=1, Machine #1 calculates the tuples {(A, 10), (C, 7)}. Machine #2 calculates {(B, 10), (C, 8)}. The overall top 1 is C, with 15 hits, but both machines will send the wrong tuple to the master node. – Juan Lopes Jan 29 '15 at 17:05
  • thanks amit! Can you please elaborate on merging the top-k items from several servers (let's say p servers) on the master? I was wondering what happens if a single server can't handle p*k items and this has to be done in stages? – redzedi Jan 29 '15 at 17:05
  • @JuanLopes Approach 2 specifically says: first find the **global** (item,sum) using Map-Reduce, then find top K on each machine (in here, the local top K is calculated on the global sum), and then find global top K, are you familiar with map-reduce? – amit Jan 29 '15 at 17:07
  • @redzedi In here I assume `p` is the number of servers - this number is typically small (a few dozen usually; only the largest companies have "huge" clusters of sizes ~50,000, and that's not hard to handle either). I have never seen a case where `#machines * k` is too large to be processed on a single server. – amit Jan 29 '15 at 17:09

In addition to Amit's answer, there are also probabilistic data structures that handle this kind of query. They may sacrifice precision for less resource usage.

Here is a link to a paper about this:

Efficient Computation of Frequent and Top-k Elements in Data Streams

And a link to an implementation (in Java).

https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/StreamSummary.java
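
A brief usage sketch of that structure (the Space-Saving algorithm from the paper), assuming the StreamSummary class exposes offer and topK as in the linked source; the capacity and k values below are arbitrary:

import com.clearspring.analytics.stream.Counter;
import com.clearspring.analytics.stream.StreamSummary;
import java.util.List;

class ApproximateTopK {
    public static void main(String[] args) {
        // Capacity bounds memory use; a larger capacity gives better accuracy.
        StreamSummary<String> summary = new StreamSummary<>(1000);

        // Feed product ids from the log stream (hypothetical sample data).
        String[] events = {"p1", "p2", "p1", "p3", "p1", "p2"};
        for (String productId : events) {
            summary.offer(productId);
        }

        // Approximate top-2 products with estimated counts and error bounds.
        List<Counter<String>> top = summary.topK(2);
        for (Counter<String> c : top) {
            System.out.println(c.getItem() + " ~" + c.getCount() + " (+/- " + c.getError() + ")");
        }
    }
}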

Juan Lopes

Based on your comments I want to add a suggestion. If P is the number of products, I assume that

  • n and k are a lot smaller than P, and
  • you have a few popular products and a lot of low frequency products (heavy tailed) – this is usually the case for many natural datasets.
  • you do not have access to map-reduce (in which case Amit's solution 2 is easiest)

If that is the case, then your solution could be:

  1. Build a histogram and top-k list (as Amit explains) separately on each server.
  2. Let x be the local count of the k-th most frequent product.
  3. Send all elements with a local count lc >= x/n from the histogram to a central server.
  4. Merge them and find the temporary global top k elements.
  5. Let y be the count of the temporary k-th globally most frequent product.
  6. For all products with a global frequency f >= y/n, request all servers to send their local frequencies to the central server.
  7. Merge all elements and find the top k.

The reason you can filter by x/n and y/n is that any product with global count gc must have a local count lc >= gc/n on at least one server (by the pigeonhole principle, since its global count is spread over n servers). x and y are lower bounds on the global count of the least frequent product we want to find, so a product whose local count is below that threshold on every server cannot reach the bound.
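
A hedged sketch of steps 1-3 on a single server, in Java (the class and method names, and the event-list input, are my own assumptions): build the local histogram, compute x as the local count of the k-th most frequent product, and keep every product whose local count is at least x/n.

import java.util.*;
import java.util.stream.*;

class LocalFilterStep {
    // Returns the (product, local count) pairs this server sends to the master.
    static Map<String, Long> candidatesToSend(List<String> events, int k, int n) {
        // Local histogram: productId -> local count
        Map<String, Long> histogram = new HashMap<>();
        for (String productId : events) {
            histogram.merge(productId, 1L, Long::sum);
        }

        // x = local count of the k-th most frequent product on this server
        long x = histogram.values().stream()
                .sorted(Comparator.reverseOrder())
                .skip(k - 1)
                .findFirst()
                .orElse(0L);

        // Keep every product whose local count could still contribute to a global top-k item
        double threshold = (double) x / n;
        return histogram.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}

The master would then merge these candidate maps, summing counts per product, to obtain the temporary global top k of step 4.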

This approach will generate much less network traffic than the map-reduce model - but it will also take a lot longer to program.

If you plan to do more log analysis, I would definitely recommend looking into Hadoop (and Hive/Spark/SparkSQL) or maybe Google BigQuery. It will take a little while to set up, but the investment will quickly pay itself back in saved programming hours.

Tobber