Let't suppose we build Analytics for Web Sites and would like to display N most popular pages for today. The algorithm should satisfy two requirements - constant memory and moving counters.
Constant memory
There could be billions of pages, we don't want to keep counts for all of them. The algorithm should use some sort of smart probabilistic counters that use constant memory. There's Count–min sketch but it seems like it tries to estimate counts for all elements, here we don't care about all elements, only about top N, so maybe there's some better and simpler estimator?
Moving counters
Top N pages are different every day, today top 2 pages could be /cats.html
and /dogs.html
but tomorrow it could be
something totally different like /pizza.html
and /donuts.html
. The simplest approach would be to re-start counters every day, and that's fine, but maybe there's some smarter approach, something like moving average?
Example of the stream of events:
[
{ page: '/cats.html', time: 'today, 12:00' },
{ page: '/cats.html', time: 'today, 11:00' },
{ page: '/dogs.html', time: 'today, 10:00' },
{ page: '/dogs.html', time: 'today, 09:00' },
{ page: '/donuts.html', time: 'today, 08:00' },
{ page: '/donuts.html', time: 'yesterday, 20:00' },
{ page: '/cats.html', time: 'yesterday, 19:00' },
...
]