
I have a big data frame with over 1000 rows. For a given index, I can find the most similar rows using cosine similarity and weight them accordingly. So my similar_rows data frame looks like this:

e.g. similar_rows(60):

    A  B  C   Weight
0   5  6  7     0.2
1   8  3  2     0.3
2   1  4  6     0.1

I multiply each value by the weight column, and then find the average of all rows, so my result would be like so:

    A      B     C  
0  1.16  0.83  0.86

How can I apply this function to all 1000 rows so that I'm left with a data frame like this, for example:

      A       B     C
0    0.1     0.24  0.5
1    0.3     0.2   0.3 
.     .       .     . 
.     .       .     . 
1000  0.12   0.45  0.67

Thanks in advance...

toothsie

2 Answers


Look at the apply function of pandas.DataFrame:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html#pandas-dataframe-apply

You can pass it a function that computes whatever result you want for every row using the same operations (much like the built-in map function on lists).

Also note that the function is applied along an axis, so take care which one you choose: axis=0 passes each column to the function, axis=1 passes each row.
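As a rough sketch of how this could look for the question's use case: the asker's similar_rows function is not shown, so the version below is a hypothetical stand-in (cosine similarity over a made-up four-row frame, keeping the top k matches), and weighted_column_means is an illustrative helper name, not something from the thread. The point is only that with axis=1 each row arrives as a Series whose name is the row's index label, and returning a Series from the function makes apply build a data frame that keeps the original index.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the asker's 1000-row frame (values are made up).
df = pd.DataFrame(
    {"A": [5, 8, 1, 4], "B": [6, 3, 4, 4], "C": [7, 2, 6, 1]},
    index=[10, 20, 30, 40],
)


def similar_rows(idx, k=2):
    """Hypothetical stand-in for the asker's function: the k rows most
    cosine-similar to row `idx`, with the similarity stored as `Weight`."""
    values = df.to_numpy(dtype=float)
    target = df.loc[idx].to_numpy(dtype=float)
    sims = values @ target / (np.linalg.norm(values, axis=1) * np.linalg.norm(target))
    # Drop the row itself, then keep the k highest similarities.
    weights = pd.Series(sims, index=df.index).drop(idx).nlargest(k)
    return df.loc[weights.index].assign(Weight=weights)


def weighted_column_means(similar):
    """Multiply each value column by Weight, then average down the rows."""
    values = similar.drop(columns="Weight")
    return values.mul(similar["Weight"], axis=0).mean(axis=0)


# axis=1 passes each row as a Series; row.name is that row's index label.
result = df.apply(lambda row: weighted_column_means(similar_rows(row.name)), axis=1)
print(result)
```

Because the function returns a Series indexed by column name, the result has one row per original index label and one column per value column, however many columns there are.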

Nenri
  • Thanks, this makes sense but it is returning 'positional index is out of bounds'. Any ideas why this is? – toothsie Feb 25 '19 at 11:15
  • @toothsie Can you show me the code you're trying to execute? – Nenri Feb 25 '19 at 11:17
  • Write code as the answer. Please post links in the comment section – bigbounty Feb 25 '19 at 11:23
  • @bigbounty at the time I didn't have enough reputation (50) to post comments, so I had to post an answer, sorry. – Nenri Feb 25 '19 at 11:25
  • @bigbounty - it's fine to put links in the answer. I'd say, in fact, **do** put links in the answer - but only as additional material, and not as the main source. See, e.g., [this answer](https://stackoverflow.com/a/4366748/1364007). It has a link, and it has code. – Wai Ha Lee Feb 25 '19 at 12:26
  • When I run the function for, say, index 25, it returns a single row with index 0. Is it possible to return the same row but with index 25 instead? – toothsie Feb 25 '19 at 12:49
  • You can try this instead; it may work better: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.applymap.html#pandas.DataFrame.applymap Normally, though, the whole row is returned as is if you don't modify it inside your function – Nenri Feb 25 '19 at 13:19

You can refer to the code below:

import pandas as pd
#import numpy as np

df = pd.DataFrame({'A':[5,8,1],"B":[6,3,4],"C":[7,2,6],"Weight":[0.2,0.3,0.1]})
print(df)

   A  B  C  Weight
0  5  6  7     0.2
1  8  3  2     0.3
2  1  4  6     0.1

No need to use apply here:

temp = pd.DataFrame({'A':df['A']*df['Weight'],'B':df['B']*df['Weight'],'C':df['C']*df['Weight']})
print(temp)

     A    B    C
0  1.0  1.2  1.4
1  2.4  0.9  0.6
2  0.1  0.4  0.6

Next, apply the mean function (axis=1 averages across each row):

temp.mean(axis=1)

0    1.200000
1    1.300000
2    0.366667
dtype: float64

I have applied this to only 3 values per column.
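Since the question asks for the average of each column rather than each row, and the real frame has 40+ columns, here is a minimal sketch of a variant that avoids naming columns one by one. It assumes the weight column is called Weight, as in the example, and uses 60 as the index label only because that is the index in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [5, 8, 1], "B": [6, 3, 4], "C": [7, 2, 6], "Weight": [0.2, 0.3, 0.1]}
)

# Scale every non-weight column by Weight, row by row (axis=0 aligns on the index).
weighted = df.drop(columns="Weight").mul(df["Weight"], axis=0)

# axis=0 averages down each column, giving one value per column.
column_means = weighted.mean(axis=0)
print(column_means.round(2))

# One-row frame labelled with the source index (60 is the question's example).
row = column_means.to_frame().T
row.index = [60]
print(row.round(2))
```

For the three example rows this gives roughly 1.17, 0.83 and 0.87 for A, B and C, which matches the figures in the question up to rounding. Relabelling the one-row frame is one way to keep the source index instead of getting a row labelled 0.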

bigbounty
  • Good solution, but you've put your imports before the code block, and numpy isn't needed here as far as I can see – Nenri Feb 25 '19 at 11:21
  • Thanks, unfortunately this doesn't seem feasible as I actually have 40+ columns (the question was a simplified version of my data frame). I am also looking for the average of each column rather than each row. – toothsie Feb 25 '19 at 11:34
  • If you are looking for the average over columns, then change the `axis` argument to `0`. I'll write the code and repost the answer – bigbounty Feb 25 '19 at 12:18