3

I am new to Pandas and I am trying to do the following thing::

  • I have a dataframe called comms with columns articleID and commentScore (among others)
  • I have another dataframe called arts with column articleID

I need to create in arts a new column called articleScore. Each article must have the articleScore which is the sum of all commentScores related to that article (same articleID), divided by sqrt(n_comms + 1), where n_comms is the number of comments with that specific ID.

I already managed to do this but In a very inefficient way (pictured below)

for article in arts:
    n, tempScore = 0
    
    for i, value in comms.iterrows():
        if value['articleID'] == article['articleID']:
            tempScore + = value['commentScore']
            n += 1    
    article['articleScore'] /= math.sqrt(n+1)

Edit: Here's an example of what I would like to happen:

comms:
__________________________
| # | artID | commScore  |
| 0 | 1x5w  |     2      |
| 1 | 77k3  |     1      |
| 2 | 77k3  |    -1      |
| 3 | 3612  |     5      |
| 4 | 1x5w  |     3      |
--------------------------

arts:
___________________________
| # | artID | artScore (?) |
| 0 | 1x5w  |    2.89      |
| 1 | 77k3  |     0        |
| 2 | 3612  |    3.54      |
-------------------------

I need to (create and) fill the artScore column. Each artScore is the sum of the commentScores, but only of the comments with the same artID of the article, divided by sqrt(n+1).

Can anybody help me? Thanks a lot!

Andrea

Sala
  • 480
  • 4
  • 19

3 Answers3

1

I think you can use groupby followed by a merge on 'artID':

grpd = comms.groupby('artID')
to_merge = grpd.sum().divide(np.sqrt(grpd.count()+1)).reset_index().rename(columns={'commScore': 'artScore'})[['artID', 'artScore']]
arts.merge(to_merge, on='artID')
gofvonx
  • 1,370
  • 10
  • 20
  • Hi @gofvonx, thanks for your answer. I did what you recommended, but now on my *arts* data frame there are a lot of unwanted columns. Plus, I don't see in your code snippet where you assign a name to the new column articleScore – Sala Apr 17 '21 at 16:38
  • @Sala I have included explicit renaming of the column and also selected only the required sub-frame to merge. (Your sample data doesn't have any additional columns.) Does that work? – gofvonx Apr 17 '21 at 16:46
  • Yes. Thanks a lot! – Sala Apr 17 '21 at 20:51
1
#article count and sum
df = df.groupby('artID').agg(['sum', 'count'])

#create new column and utilize your formula
df['artScore'] = df['commScore']['sum'] / math.sqrt(df['commScore']['count']+1)


    commScore   artScore
       sum  count   
artID           
1x5w    5   2   5.0
3612    5   1   5.0
77k3    0   2   0.0
tyasird
  • 926
  • 1
  • 12
  • 29
1

You can use groupby with agg and a custom lambda function to apply to each group:

comms.groupby('artID').agg(
    {'commScore': lambda x: x.sum() / np.sqrt(len(x) + 1)}
).reset_index().rename(columns={'commScore': 'artScore'})

Result:

  artID  artScore
0  1x5w  2.886751
1  3612  3.535534
2  77k3  0.000000
PieCot
  • 3,564
  • 1
  • 12
  • 20