2

Let't suppose we build Analytics for Web Sites and would like to display N most popular pages for today. The algorithm should satisfy two requirements - constant memory and moving counters.

Constant memory

There could be billions of pages, we don't want to keep counts for all of them. The algorithm should use some sort of smart probabilistic counters that use constant memory. There's Count–min sketch but it seems like it tries to estimate counts for all elements, here we don't care about all elements, only about top N, so maybe there's some better and simpler estimator?

Moving counters

Top N pages are different every day, today top 2 pages could be /cats.html and /dogs.html but tomorrow it could be something totally different like /pizza.html and /donuts.html. The simplest approach would be to re-start counters every day, and that's fine, but maybe there's some smarter approach, something like moving average?

Example of the stream of events:

[
 { page: '/cats.html',   time: 'today, 12:00' }, 
 { page: '/cats.html',   time: 'today, 11:00' }, 
 { page: '/dogs.html',   time: 'today, 10:00' }, 
 { page: '/dogs.html',   time: 'today, 09:00' }, 
 { page: '/donuts.html', time: 'today, 08:00' }, 
 { page: '/donuts.html', time: 'yesterday, 20:00' }, 
 { page: '/cats.html',   time: 'yesterday, 19:00' }, 
 ...
]
Alex Craft
  • 13,598
  • 11
  • 69
  • 133

1 Answers1

1

If I remember correctly, you can get the most frequent value with constant memory, but I don't think it will work for multiple values.

If approximate answers are good enough, you might want to look at the HyperLogLog algorithm. It is not the exact same problem, as it counts the number of unique values, but the techniques that are used there could be useful to solve your problem.

This question is also related but it does not have the constant memory constraint.

Community
  • 1
  • 1
Philipp Claßen
  • 41,306
  • 31
  • 146
  • 239