3

I'm curious how to normalize numbers for a ranking algorithm.

Let's say I want to rank a link based on importance, and I have two columns to work with.

So a table would look like:

url | comments | views

Now I want to rank comments higher than views, so my first thought is to use comments*3 or something similar to weight it. However, if there is a large view count like 40,000 and only 4 comments, the comment weight gets drowned out.

So I think I have to normalize those scores onto a more level playing field before I can weight them. Any ideas or pointers on how that's usually done?

thanks

James

3 Answers

5

For each url, you could first rescale the comments and views to a 0-to-1 range (this is min-max scaling; the names below use "percentile" loosely). For example,

 comment_percentile = (comments - min(comments)) / (max(comments) - min(comments))
 views_percentile = (views - min(views)) / (max(views) - min(views))

Then you could assign weights to each of the percentile values to compute the overall score.

 url_score = (comment_percentile_weight * comment_percentile) + (views_percentile_weight * views_percentile)

Additional strategies may involve eliminating outliers if the values cluster toward one end of the range.
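
The rescaling and weighted sum above can be sketched in Python as follows. This is only a minimal sketch: the function names, the sample rows, and the default weights (3 for comments, 1 for views) are assumptions chosen for illustration, not values from the answer.

```python
def min_max(values):
    """Rescale a list of numbers to the 0..1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: when every value is identical, treat them all as 0
        # rather than dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_urls(rows, comment_weight=3.0, view_weight=1.0):
    """rows is a list of (url, comments, views) tuples.

    Returns (url, score) pairs sorted from highest to lowest score.
    """
    comments = min_max([r[1] for r in rows])
    views = min_max([r[2] for r in rows])
    scored = [
        (row[0], comment_weight * c + view_weight * v)
        for row, c, v in zip(rows, comments, views)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample data, including the 40,000 views / 4 comments case.
rows = [
    ("a.example", 4, 40000),
    ("b.example", 30, 5000),
    ("c.example", 0, 100),
]
ranked = score_urls(rows)
```

With both columns on the same 0-to-1 scale, the 40,000-view page no longer swamps the comment weight, so the page with 30 comments ranks first here.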

btreat
  • I don't think that's how percentile works but I could be wrong – Joe Phillips Jun 17 '10 at 04:42
  • You are correct d03boy! Thanks for the catch. Hopefully the updated post works better. – btreat Jun 17 '10 at 04:47
  • Along the same lines, you could normalize each column to be equal to the % of the maximum, or even normalize them so that all items in a column sum to 1 (that is, make each one the % of total sum). – Justin L. Jun 17 '10 at 05:08
1

Importance is really a way of telling the user how interested they might be in a forum topic or a blog post. In this case, you can't just multiply two numbers by different factors and add :)

What can you say about a blog post with 2000 views and only one comment? Well, perhaps it's a spam post, or it was viewed by web crawlers, or it's so boring that no one decided to comment on it.

In this case, we might want to look at the ratio of comments to views. That blog post would have an "interest ratio" of 1/2000, while this post, which has 28 views and 1 comment right now, would get a score of 1/28.

The biggest ratio wins. By the way, if you are seeing ratios over one... well, start looking for bugs :)
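
As a rough sketch of the ratio idea in Python (the function name, the sample posts, and the choice to return 0 for a post with no views are assumptions made for illustration):

```python
def interest_ratio(comments, views):
    """Comments per view; defined as 0.0 when there are no views yet."""
    if views == 0:
        # Assumption: an unviewed post gets the lowest possible ratio.
        return 0.0
    return comments / views


# Hypothetical posts as (title, comments, views), using the numbers above.
posts = [("blog post", 1, 2000), ("this post", 1, 28)]
ranked = sorted(posts, key=lambda p: interest_ratio(p[1], p[2]), reverse=True)
```

Here "this post" sorts ahead of the blog post because 1/28 is larger than 1/2000.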

0

A similar problem was discussed a few weeks ago in this SO topic: "Algorithm to calculate a page importance based on its views / comments".

I'll give the same advice I offered there: use linear regression on a representative distribution of comment/view counts for web pages to work out a weighting function.

Community
Joel Hoff