4

My goal here is to generate a system similar to that of the front page of reddit.

I have things and for the sake of simplicity these things have votes. The best system I've generated is using time decay. With a halflife of 7 days, if a vote is worth 20 points today, then in seven days, it it worth 10 points, and in 14 days it will only be worth 5 points.

The problem is, that while this produces results I am very happy with, it doesn't scale. Every vote requires me to effectively recompute the value of every other vote.

So, I thought I might be able to reverse the idea. A vote today is worth 1 point. A vote seven days from now is worth 2 points, and 14 days from now is worth 4 points and so on. This works well because for each vote, I only have to update one row. The problem is that by the end of the year, I need a datatype that can hold fantastically huge numbers.

So, I tried using a linear growth which produced terrible rankings. I tried polynomial growth (squaring and cubing the number of days since site launch and submission) and it produced slightly better results. However, as I get slightly better results, I'm quickly re-approaching unmaintainable numbers.

So, I come to you stackoverflow. Who's got a genius idea or link to an idea on how to model this system so it scales well for a web application.

Piper Merriam
  • 2,774
  • 2
  • 24
  • 30
  • http://www.seomoz.org/blog/reddit-stumbleupon-delicious-and-hacker-news-algorithms-exposed is slightly helpful, but at a glance, none of it appears to scale either. – Piper Merriam Jan 06 '12 at 06:57
  • 1
    I don't see any system allowing non-linear decay where you won't have to recompute the score at some moment. The question is, do you need to do that on every voting, or maybe a background cron job will do? – a sad dude Jan 06 '12 at 07:01
  • cron jobs suck. It would do, but I'm kind of determined to find a non persistent process style solution. – Piper Merriam Jan 06 '12 at 07:02
  • From a different angle, I don't usually like to see others suggest a change in the tech you're using, let alone myself, but I will here as the simplest answer would be to use a computed-column/functional-index, which isn't provided by MySQL. Just on the off chance it's _that_ important to you and you have the possibility of using other databases. – ian Jan 06 '12 at 17:51
  • I'm on postgres. Being lazy and not looking things up myself, but if you've got a postgres solution, that'd be awesome. – Piper Merriam Jan 06 '12 at 17:57
  • does time of voting matter anything i mean some one vote a post 10 days after the creation of post – Ravinder Payal Oct 16 '15 at 07:36

4 Answers4

2

I've been trying to do this as well. I found what looks like a solution, but unfortunately, I forgot how to do math, so I'm having trouble understanding it.

The idea is to store the log of your score and sort by that, so the numbers won't overflow.

This doc describes the math. https://docs.google.com/View?id=dg7jwgdn_8cd9bprdr

And the comment where I found it is here: http://blog.notdot.net/2009/12/Most-popular-metrics-in-App-Engine#comment-25910828

aburgel
  • 83
  • 7
0

Okay, thought of one solution to do that on every vote. The catch is that it requires a linked list with atomic pop/push on both sides to store votes (e.g. Redis list, but you probably don't want it in RAM).

It also requires that decay interval is constant (e.g. 1 hour)

It goes like this:

  1. On every vote, update the score push the next time of decay of this vote to the tail of the list
  2. Then pop the first vote from the head of the list
  3. If it's not old enough to decay, push it back to the head
  4. Otherwise, subtract the required amount from the total score and push the updated information to the tail
  5. Repeat from step 2 until you hit a fresh enough vote (step 3)

You'll still have to check the heads in background to clear the posts that no one votes on anymore, of course.

a sad dude
  • 2,775
  • 17
  • 20
  • This idea has some merit, but it still seems to suffer from the need for a cron job to do cleanup, an architectural decision I tend to try and only touch with ten foot poles. I swear the gears in my head start turning faster when I run into a fun problem like this. – Piper Merriam Jan 06 '12 at 16:02
0

It's late here so I'm hoping someone can check my math. I think this is equivalent to exponential decay.

MySQL has a BIGINT max of 2^64

For simplicity, lets use 1 day as our time interval. Let n be the number of days since the site launched.

  1. Create an integer variable. Lets call it X and start it at 0
  2. If an add operation would bring a score over 2^64, first, update every score by dividing it by 2^n, then set X equal to n.
  3. On every vote, add 2^(n-X) to the score.

So, mentally, this makes better sense to me using base 10. As we add things up, our number gets longer and longer. We stop caring about the numbers in the lower digit places because the values we're incrementing scores by have a lot of digits. Which means that the lower digits kind of stop counting for very much. So if they don't count, why not just slide the decimal place over to a point that we care about and truncate the digits below the decimal place at some point. To do this, we need to slide the decimal place over on the amount we're adding each time as well.

I can't help but feel like there's something wrong with this.

Piper Merriam
  • 2,774
  • 2
  • 24
  • 30
  • 2
    First of all, what do you do while you're updating every row, and index, and X? Disable the site? :) Because otherwise if anyone votes in this fraction of a second (or more?), that post will be really lucky. – a sad dude Jan 06 '12 at 07:47
  • In my case, this could be a workable solution. The update can be run as a transaction, and my 'votes' don't come in at such a rate for this to be an issue. It would be nice however to find a solution that involved updating a single row for a single vote and that being the end of it. – Piper Merriam Jan 06 '12 at 15:57
  • There's no penalty here for an old story receiving new votes. If you had an old story that kept receiving up votes every day it would stay on the front page. – andy boot Jul 05 '12 at 20:22
0

Here are two possible pseudo queries that you could use. I know that they don't really address scalability, but I think that they do provide methods so that you can

SELECT article.title AS title, SUM(vp.point) AS points
FROM article
LEFT JOIN (SELECT 1 / DATEDIFF(NOW(), vote.created_at) as point, article_id
  FROM vote GROUP BY article_id) AS vp  
ON vp.article_id = article.id

or (not in a join, which will be a bit faster I think, but harder to hydrate),

SELECT SUM(1 / DATEDIFF(NOW(), created_at)) AS points, article_id
FROM vote
WHERE article_id IN (...) GROUP BY article_id

The benefit of these queries is that they can be run at any time with the same data and they will always return the same answers. They don't destroy any data.

If you need to, you can also run the queries in a background job and they will still give the same result.

rockymeza
  • 419
  • 3
  • 12