Pearson Algorithm from Programming Collective Intelligence still not working

Question

I ran the code to calculate the Pearson Correlation Coefficient and the function (pasted below) stubbornly returns a 0.

In line with earlier suggestions on this issue here on SO (see #1, #2 below), I did make sure that the function is able to perform floating point calculations but that hasn't helped. I would appreciate some guidance with this.

    from __future__ import division
    from math import sqrt

    def sim_pearson(prefs,p1,p2):
    # Get the list of mutually rated items
       si={}
       for item in prefs[p1]:
          if item in prefs[p2]: si[item]=1


          # Find the number of elements
          n=float(len(si))


          # if they are no ratings in common, return 0
          if n==0: return 0


          # Add up all the preferences
          sum1=float(sum([prefs[p1][it] for it in si]))
          sum2=float(sum([prefs[p2][it] for it in si]))

          # Sum up the squares
          sum1Sq=sum([pow(prefs[p1][it],2) for it in si])
          sum2Sq=sum([pow(prefs[p2][it],2) for it in si])

          # Sum up the products
          pSum=sum([prefs[p1][it]*prefs[p2][it] for it in si])


          # Calculate Pearson score
          num=pSum-(1.0*sum1*sum2/n)
          den=sqrt((sum1Sq-1.0*pow(sum1,2)/n)*(sum2Sq-1.0*pow(sum2,2)/n))
          if den==0: return 0

          r=num/den

          return r

My Data Set:

 # A dictionary of movie critics and their ratings of a small
 # set of movies

 critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
       'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
       'The Night Listener': 3.0},
     'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
      'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
      'You, Me and Dupree': 3.5},
     'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
      'Superman Returns': 3.5, 'The Night Listener': 4.0},
     'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
      'The Night Listener': 4.5, 'Superman Returns': 4.0,
      'You, Me and Dupree': 2.5},
     'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
      'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
      'You, Me and Dupree': 2.0},
     'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
      'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
     'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}

Other Similar Questions:

Please include a sample data set. Presumably data sets exist for which the result should be zero. — Brian Cain, Nov 26 '12 at 04:12
You also might like to include links to the previous questions, to give more detailed context. — Jeff, Nov 26 '12 at 04:14
Yes, please provide the `prefs` that are defined in the `for` loop in line 3. If they are null, or if prefs[p1] is null, specifically, then `len(si)` will be 0 and hence n will be 0 as well. — rofls, Nov 26 '12 at 04:15
I found [this function](http://pastebin.com/k98e8pr1) in a statistics script I wrote for a class. It takes in a list of two-tuples and uses some formula I found online somewhere, and it gave me the right answer when I used it for homework. It might be useful to look at. — Waleed Khan, Nov 26 '12 at 04:25
@rofls I am new to Python programming. How do I provide those prefs defined in the for loop? — user1828663, Nov 26 '12 at 04:35
@user1828663, welcome to Python, and SO. BrianCain did that for you! — rofls, Nov 26 '12 at 06:47

rofls · Answer 1 · 2012-11-26T06:45:50.133

Thanks to everyone's help in the comments I've identified the problem. Just kidding. There were tons of problems. In the end, I noticed the for loop wasn't doubled up (in line 6) and it needed to be. One stage before the end, I frantically surrounded everything in float, sorry. Anyway, you want floats. Before that, there was the fact that he wasn't referencing the keys() for the critics, which he needed to be. Also, the pearson coefficient was calculated incorrectly, so much so that it required a mathematician to fix it (I have a BA in math). Now his tested example for Gene Seymour and Lisa Rose works correctly. Anyway, save this as pearson.py, or something:

from __future__ import division
from math import sqrt

def sim_pearson(prefs,p1,p2):
# Get the list of mutually rated items
   si={}
   for item in prefs[p1].keys():
      for item in prefs[p2].keys():
         if item in prefs[p2].keys():
            si[item]=1


      # Find the number of elements
      n=float(len(si))


      # if they are no ratings in common, return 0
      if n==0:
         print 'n=0'
         return 0


      # Add up all the preferences
      sum1=float(sum([prefs[p1][it] for it in si.keys()]))
      sum2=float(sum([prefs[p2][it] for it in si.keys()]))
      print 'sum1=', sum1, 'sum2=', sum2
      # Sum up the squares
      sum1Sq=float(sum([pow(prefs[p1][it],2) for it in si.keys()]))
      sum2Sq=float(sum([pow(prefs[p2][it],2) for it in si.keys()]))
      print 'sum1s=', sum1Sq, 'sum2s=', sum2Sq
      # Sum up the products
      pSum=float(sum([prefs[p1][it]*prefs[p2][it] for it in si.keys()]))


      # Calculate Pearson score
      num=(pSum/n)-(1.0*sum1*sum2/pow(n,2))
      den=sqrt(((sum1Sq/n)-float(pow(sum1,2))/float(pow(n,2)))*((sum2Sq/n)-float(pow(sum2,2))/float(pow(n,2))))
      if den==0:
         print 'den=0'
         return 0

      r=num/den

      return r

critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
   'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
   'The Night Listener': 3.0},
 'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
  'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
  'You, Me and Dupree': 3.5},
 'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
  'Superman Returns': 3.5, 'The Night Listener': 4.0},
 'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
  'The Night Listener': 4.5, 'Superman Returns': 4.0,
                                                                                                                                                             1,1

Then, type:

import pearson
pearson.sim_pearson(pearson.critics, pearson.critics.keys()[1], pearson.critics.keys()[2])

or simply:

import pearson
pearson.sim_pearson(pearson.critics, 'Lisa Rose', 'Gene Seymour')

Let me know if you have any problems getting it to work. I left the print statements that I used to troubleshoot it, just so you can see how I solved it, but they're obviously not needed.

If you run into anymore problems in this book and you can't fix them, with SO's help that is, email me: raphael[at]postacle.com, and I should be able to get back to you. I downloaded it too a while ago, just a little lazy ;)

Thank you very much for your help! I had an issue with your code where it worked for the Lisa Rose/Gene Seymour combo but it still gave me error msgs for other combos. The error messages looked like "KeyError: 'Lady in the Water'" for my input of "pearson.sim_pearson(pearson.critics, 'Toby', 'Gene Seymour')". That being said, after re-writing the denominator part in another way and fixing the for-loop issue, things started running a lot more smoothly. Thanks again for your time! — user1828663, Nov 26 '12 at 09:53

score 0 · Answer 2 · answered Nov 26 '12 at 09:46

@rofls was right -

For Loop: My for-loop was an issue. Floats: Some of the terms needed to be converted to a float type from the int type.
Keys: This is something I definitely missed the first time around and @rofls caught.
Pearson Coefficient: I tightened the denominator component of the Pearson coefficient to line it with the expression on the Wikipedia page; it's the last expression under the Mathematical Properties section.

The code works now. I tried it for different combinations of the inputs.

    from __future__ import division
    from math import sqrt

    def sim_pearson(prefs,p1,p2):
    # Get the list of mutually rated items
       si={}
       for item in prefs[p1].keys():
           if item in prefs[p2].keys():
               print 'item=', item
               si[item]=1

       # Find the number of elements
       n=float(len(si))
       print 'n=', n

       # if they are no ratings in common, return 0
       if n==0:
           print 'n=0'
           return 0

       # Add up all the preferences
       sum1=float(sum([prefs[p1][it] for it in si.keys()]))
       sum2=float(sum([prefs[p2][it] for it in si.keys()]))
       print 'sum1=', sum1, 'sum2=', sum2

       # Sum up the squares
       sum1Sq=float(sum([pow(prefs[p1][it],2) for it in si.keys()]))
       sum2Sq=float(sum([pow(prefs[p2][it],2) for it in si.keys()]))
       print 'sum1s=', sum1Sq, 'sum2s=', sum2Sq

       # Sum up the products
       pSum=float(sum([prefs[p1][it]*prefs[p2][it] for it in si.keys()]))
       print 'pSum=', pSum

       # Calculate Pearson score
       num=(n*pSum)-(1.0*sum1*sum2)
       print 'num=', num
       den1=sqrt((n*sum1Sq)-float(pow(sum1,2)))
       print 'den1=', den1
       den2=sqrt((n*sum2Sq)-float(pow(sum2,2)))
       print 'den2=', den2
       den=1.0*den1*den2

      if den==0:
           print 'den=0'
           return 0

       r=num/den
       return r

    critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
          'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
          'The Night Listener': 3.0},
         'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
          'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
          'You, Me and Dupree': 3.5},
         'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
          'Superman Returns': 3.5, 'The Night Listener': 4.0},
         'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
          'The Night Listener': 4.5, 'Superman Returns': 4.0,
          'You, Me and Dupree': 2.5},
         'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
          'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
          'You, Me and Dupree': 2.0},
         'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
          'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
         'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}

    # Done

Pearson Algorithm from Programming Collective Intelligence still not working

2 Answers2