New Column Based on the Size of One Column

Question

I am trying to create a new column 'score/id.size' for my current dataframe

np.random.seed(1234)
test = pd.DataFrame({'id':np.random.randint(1,5,10),
                     'score':np.random.uniform(0,1,10)})

test = test.sort(['id'])

test
   id     score
4   1  0.875933
5   1  0.357817
6   1  0.500995
3   2  0.958139
7   2  0.683463
9   2  0.370251
2   3  0.801872
0   4  0.272593
1   4  0.276464
8   4  0.712702

I want my new dataframe to be this:

   id     score       score/id.size
4   1  0.875933       0.875933 / 3
5   1  0.357817       0.357817 / 3
6   1  0.500995       0.500995 / 3
3   2  0.958139       0.958139 / 3
7   2  0.683463       0.683463 / 3
9   2  0.370251       0.370251 / 3
2   3  0.801872       0.801872 / 1
0   4  0.272593       0.272593 / 3
1   4  0.276464       0.276464 / 3
8   4  0.712702       0.712702 / 3

Sorry if this question is too basic, I am new to Python.

Thanks!

do you want the first value in the new column literally to be '0.875933 / 3', or do you want it to be 0.29197766 . . . the result of dividing 0.87 . . . by 3? — abcd, Apr 21 '15 at 04:27
sorry for the confusion, i want it to be 0.29197766. I did it that way for visualization purposes/ — collarblind, Apr 21 '15 at 04:35
of relevance: http://stackoverflow.com/questions/27140860/count-occurrences-of-number-by-column-in-pandas-data-frame — abcd, Apr 21 '15 at 04:51

Javier de la Rosa · Answer 1 · 2015-04-21T04:50:53.160

1

As I see it, you need to group by id and count, and then use that as a key to do the operation in the new column.

counts = test.groupby("id").count()
test["score/id.size"] = test.apply(lambda x: x["score"] / float(counts[counts.index==x["id"]].score), axis=1)

test
   id     score  score/id.size
4   1  0.875933       0.291978
5   1  0.357817       0.119272
6   1  0.500995       0.166998
3   2  0.958139       0.319380
7   2  0.683463       0.227821
9   2  0.370251       0.123417
2   3  0.801872       0.801872
0   4  0.272593       0.090864
1   4  0.276464       0.092155
8   4  0.712702       0.237567

edited Apr 21 '15 at 04:50

answered Apr 21 '15 at 04:41

Javier de la Rosa

669
5
12

your column label is inverted -- he asks for `score / id.size`, not `id/score` (but your values are correct). – abcd Apr 21 '15 at 04:47
1

Fixed. He commented while I was answering :) – Javier de la Rosa Apr 21 '15 at 04:50

score 1 · Answer 2 · answered Apr 21 '15 at 04:44

1

This'll do the job:

test['score / id.size'] = test.score / [(test.id == i).sum() for i in test.id]

answered Apr 21 '15 at 04:44

abcd

10,215
15
51
85

score 1 · Accepted Answer · answered Apr 21 '15 at 05:59

I think this answer makes better use of panda's automagic grouping and alignment features than some of those already posted, just group and divide by the size of the group:

test['score_normalized'] = test.groupby('id', group_keys=False).apply(
    lambda g: g['score'] / len(g)
)

test
Out[9]: 
   id     score  score_normalized
4   1  0.875933          0.291978
5   1  0.357817          0.119272
6   1  0.500995          0.166998
3   2  0.958139          0.319380
7   2  0.683463          0.227821
9   2  0.370251          0.123417
2   3  0.801872          0.801872
0   4  0.272593          0.090864
1   4  0.276464          0.092155
8   4  0.712702          0.237567

score 1 · Answer 4 · answered Apr 21 '15 at 07:51

When you want to add a calculated column from a groupby you should use transform:

In [116]:

np.random.seed(1234)
test = pd.DataFrame({'id':np.random.randint(1,5,10),
                     'score':np.random.uniform(0,1,10)})

test = test.sort(['id'])
test
Out[116]:
   id     score
4   1  0.875933
5   1  0.357817
6   1  0.500995
3   2  0.958139
7   2  0.683463
9   2  0.370251
2   3  0.801872
0   4  0.272593
1   4  0.276464
8   4  0.712702
In [117]:

test['score/id.size'] = test.groupby('id')['score'].transform(lambda x: x / x.count())
test
Out[117]:
   id     score  score/id.size
4   1  0.875933       0.291978
5   1  0.357817       0.119272
6   1  0.500995       0.166998
3   2  0.958139       0.319380
7   2  0.683463       0.227821
9   2  0.370251       0.123417
2   3  0.801872       0.801872
0   4  0.272593       0.090864
1   4  0.276464       0.092155
8   4  0.712702       0.237567

transform returns a series aligned with the original df

New Column Based on the Size of One Column

4 Answers4