
I am working on the following problem: matching users based on a compatibility score computed from data provided by filling out a profile indicating personality, lifestyle, interests, etc.

Each attribute is a tag (e.g. the attribute calm for personality) that is either true (1) or false (0). Let's assume we want to find the compatibility of two users.

(Extract from the pandas DataFrame for the personality category; image not shown.)

User 2's profile is subtracted from user 3's, the differences are squared, and their sum is divided by the maximum possible deviation (the number of attributes in a category, e.g. personality). One minus this ratio is then a similarity score. The same is done for all categories (e.g. lifestyle).

def similarityScore(pandaFrame, name1, name2):

    profile1 = pandaToArray(pandaFrame, name1)  # helper converting a DataFrame row to an array
    profile2 = pandaToArray(pandaFrame, name2)

    newArray = profile1 - profile2
    differences = 0
    for element in newArray:
        element = element ** 2
        differences += element
    maxDifference = len(profile1)
    similarity = 1 - (differences / maxDifference)
    return similarity
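For reference, the same per-category score can be computed without an explicit Python loop. A minimal NumPy sketch, assuming the two profiles are already available as 0/1 arrays (the helper above is not needed here):

```python
import numpy as np

def similarity_score(profile1, profile2):
    """Fraction of attributes on which two 0/1 profiles agree."""
    profile1 = np.asarray(profile1)
    profile2 = np.asarray(profile2)
    # For 0/1 data, the squared differences equal the absolute differences.
    differences = np.sum((profile1 - profile2) ** 2)
    return 1 - differences / len(profile1)
```

For example, `similarity_score([0,0,0,0,1], [0,0,1,1,0])` gives 1 - 3/5 = 0.4.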

Every user is compared with every other user in the DataFrame:

def scorecalc(fileName):
    data = csvToPanda(fileName)  # helper reading the CSV into a DataFrame
    # userList is a global list of all user names
    scorePanda = pd.DataFrame([], columns=userList, index=userList)
    for user1 in userList:
        for user2 in userList:
            score = similarityScore(data, user1, user2)
            scorePanda.iloc[[userList.index(user1)], [userList.index(user2)]] = score
    return scorePanda
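The pairwise loop above can also be expressed as a single broadcast operation over the whole 0/1 matrix, which avoids the per-pair Python overhead. A sketch, where `data` and `users` stand in for the DataFrame contents and the user list:

```python
import numpy as np
import pandas as pd

def score_matrix(data, users):
    """All-pairs similarity for an (n_users, n_attrs) 0/1 array in one shot."""
    X = np.asarray(data)
    # (n, 1, k) - (1, n, k) -> (n, n, k): pairwise attribute differences
    diffs = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    scores = 1 - diffs / X.shape[1]
    return pd.DataFrame(scores, index=users, columns=users)
```

For three users with profiles `[[0,0,0,0,1], [0,0,0,0,1], [0,0,1,1,0]]`, the first two score 1.0 against each other and 0.4 against the third.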

Based on how important a similarity in each category is to the user, the similarity score is weighted by multiplying it with a DataFrame of preferences:

def weightedScore(personality, lifestyle, preferences):

    personality = personality.multiply(preferences['personality'])
    lifestyle = lifestyle.multiply(preferences['lifestyle'])

    weightscore = personality + lifestyle
    return weightscore
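For the combined score to stay in the 0-1 range, the per-user category weights would need to sum to 1. A hypothetical sketch of such a normalization (the column names and rating scale are assumptions, not from the original code):

```python
import pandas as pd

# Hypothetical raw importance ratings per user on a 0-10 scale.
preferences = pd.DataFrame({'personality': [10, 4], 'lifestyle': [5, 6]},
                           index=['user1', 'user2'])

# Normalize each row so a user's category weights sum to 1.
weights = preferences.div(preferences.sum(axis=1), axis=0)
```

With weights like these, a weighted sum of per-category scores is a convex combination, so it stays between 0 and 1 whenever each category score does.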

The result would be a compatibility score ranging from 0 to 1.

It all works fine, but it takes quite a bit of time to run, especially as the number of users compared grows (100+). Any suggestions to speed this up or simplify the code?

  • Is all your data in binary indicator variable format, as it appears in the little photo you shared? – Dylan Jan 22 '19 at 16:01
  • @dylan Yes it is – pauly Jan 23 '19 at 19:20
  • While this is a lot of good work you have done, I would probably recommend moving towards nearest based neighbors solution with a distance metric appropriate for binary variables, such as dice sorensen. sklearn has quite a bit of c/cython optimization, and may help. If not that, perhaps a numpy solution? – Dylan Jan 23 '19 at 20:16

2 Answers


Hopefully I have the problem statement correct:

I have DataFrame X, of binary indicator variables. (0,1) For each row of X (which represents a different user) I would like to find the most similar user/rows among the other user/rows.

I will use the NearestNeighbors class from sklearn:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.neighbors import NearestNeighbors
X = np.array([[0,0,0,0,1],
              [0,0,0,0,1],
              [1,1,1,0,0],
              [1,0,0,1,1]])

Looking at X, we can see that rows 0 and 1 are the most similar: they match perfectly, so they should report one another as "most similar."

# two nbrs since first match is self match
nbrs = NearestNeighbors(n_neighbors=2, metric='dice').fit(X)
distances, indices = nbrs.kneighbors(X) 
print(indices) 

#remember first val in this array per line is self match
[[0 1]
 [0 1]
 [2 3]
 [3 1]]

To incorporate your weighted score, I am not entirely sure. My first idea would be to take your array of binary data, multiply it by the "how important this is to me" ratings, and then use a different metric in the nearest-neighbors search, such as "euclidean". It would require more detail about what exactly is contained in those other DataFrames.

So let's say users 1 and 2 (by their index locations) indicated that the 3rd column was super important (a "10" on a 0-10 scale), and that the third column was filled out here as such:

X = np.array([[0,0,0,0,1],
              [0,0,1,0,1],
              [1,1,1,0,0],
              [1,0,0,1,1]])
# notice they match now on that 3rd col, but disagree elsewhere

#ugly hack for replacing two vals
np.put(X[1], [2], [10]) # grab second row, third col, place [10]
np.put(X[2], [2], [10])

print(X)

[[ 0  0  0  0  1]
 [ 0  0 10  0  1]
 [ 1  1 10  0  0]
 [ 1  0  0  1  1]]

Now they both agree that this question is super important. Rerun the neighbors calculation with a different metric:

nbrs = NearestNeighbors(n_neighbors=2, metric='euclidean').fit(X)

d, i = nbrs.kneighbors(X)
print(d)
print(i)

[[0.         1.41421356]
 [0.         1.73205081]
 [0.         1.73205081]
 [0.         1.41421356]]
[[0 3]
 [1 2]
 [2 1]
 [3 0]]

With [1, 2] and [2, 1] indicating that the second and third rows are now closest to one another. (Remember, the first value in each row of i is the self match.)

There are fine details here that I am glossing over which may make nearest neighbors unsuitable, but you can read about them elsewhere.


@Dylan The only problem I had with NearestNeighbors is that it yields different results from the approach I have taken. An example:

from sklearn.neighbors import NearestNeighbors
import numpy as np

X = np.array([[0,0,0,0,1],
              [0,0,1,1,0]])

nbrs = NearestNeighbors(n_neighbors=2, metric='euclidean').fit(X)
distances, indices = nbrs.kneighbors(X)
print(distances)
print(1/ (1+distances)) # returns a similarity score between 0 and 1

The similarity score comes out at 0.366, whereas it should be 40%: the profiles differ on 3 of 5 variables, i.e. a 60% deviation and hence 40% similarity.
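As an aside, the score defined in the question is one minus the normalized Hamming distance (the fraction of attributes that differ), so sklearn's 'hamming' metric would seem to reproduce it exactly; a quick check with the same two profiles:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0, 0, 0, 0, 1],
              [0, 0, 1, 1, 0]])

# 'hamming' = fraction of positions that differ: 3/5 = 0.6 for this pair
nbrs = NearestNeighbors(n_neighbors=2, metric='hamming').fit(X)
distances, indices = nbrs.kneighbors(X)

similarity = 1 - distances  # cross-pair similarity: 1 - 0.6 = 0.4
```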

  • I think that is because you have used the euclidean distance metric on a boolean (1s and 0s array). Notice in my answer that I use the "dice" distance metric when working with pure 1s and 0s arrays. Try that perhaps? – Dylan Feb 04 '19 at 15:02