following problem: matching users based on a compatibility score through data provided by filling out a profile indicating personality, lifestyle, interests etc.
Each of the attributes are tags (e.g. attribute calm for personality) that are either true (1) or false (0). Let's assume we want to find the compatibility of two users.
Extract from panda DataFrame for personality
User 2 is subtracted from User 3, differences are squared and the sum of the differences is put in relation to the maximum possible deviation (number of attributes for a category etc. personality). The reciprocal is then a score of similarity. The same is done for all categories (e.g. lifestyle)
def similarityScore (pandaFrame, name1, name2):
profile1 = pandaToArray(pandaFrame, name1)#function changing DataFrane to array
profile2 = pandaToArray(pandaFrame, name2)
newArray = profile1 - profile2
differences = 0
for element in newArray:
element = (element)**2
differences += element
maxDifference = len(profile1)
similarity = 1 - (differences/maxDifference)
return similarity
Every user is compared with every other user in the DataFrame:
def scorecalc(fileName):
data = csvToPanda(fileName)
scorePanda = pd.DataFrame([], columns=userList, index=userList)
for user1 in userList:
firstUser = user1
for user2 in userList:
secondUser = user2
score = similarityScore(data, firstUser, secondUser)
scorePanda.iloc[[userList.index(firstUser)],[userList.index(secondUser)]] = score
return(scorePanda)
Based on how important it is for the user that there is a similarity for a specific category, the similarity score is weighted by multiplying the similarity score with a dataframe of preferences:
def weightedScore (personality, lifestyle,preferences):
personality = personality.multiply(preferences['personality'])
lifestyle = lifestyle.multiply(preferences['lifestyle'])
weightscore = (personality + lifestyle)
return(weightscore)
The result would be a compatibility score ranging from 0 to 1.
It works all fine, but takes quite a bit of time to run it especially if the number of users compared (100+) increases. Any suggestions to speed this up, make the code easier?