I am trying to compute the probability (relative frequency) of each testers_time value and add it back to the df as a new column. I have the following:

import numpy as np
import pandas as pd

dict = {'id': ['a','b','c','d'], 'testers_time': [10, 30, 15, None], 'stage_1_to_2_time': [30, None, 30, None], 'activated_time' : [40, None, 45, None],'stage_2_to_3_time' : [30, None, None, None],'engaged_time' : [70, None, None, None]}
df = pd.DataFrame(dict, columns=['id', 'testers_time', 'stage_1_to_2_time', 'activated_time', 'stage_2_to_3_time', 'engaged_time'])

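# np.unique already returns sorted values; divide by the number of non-null observations, not the number of unique values
unique, counts = np.unique(df['testers_time'].dropna(), return_counts=True)
print(pd.DataFrame(counts / counts.sum()))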

Expected output (last column):

  id  testers_time  stage_1_to_2_time  activated_time  stage_2_to_3_time  engaged_time      prob
0  a          10.0               30.0            40.0               30.0          70.0  0.333333
1  b          30.0                NaN             NaN                NaN           NaN  0.333333
2  c          15.0               30.0            45.0                NaN           NaN  0.333333
3  d           NaN                NaN             NaN                NaN           NaN       NaN

However, I am stuck on how to add this back into the df. Can you assist?

user8834780
  • Please show us the **precise** output you desire. This may help: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – jpp Jun 22 '18 at 16:37
  • I would also avoid naming your dictionary `dict` – rahlf23 Jun 22 '18 at 16:38

1 Answer

You likely want to map the normalized value_counts output back onto the column, like this.

df['prob'] = df['testers_time'].map(
    df.testers_time.value_counts(normalize=True))

df
  id  testers_time  stage_1_to_2_time  activated_time  stage_2_to_3_time  engaged_time      prob
0  a          10.0               30.0            40.0               30.0          70.0  0.333333
1  b          30.0                NaN             NaN                NaN           NaN  0.333333
2  c          15.0               30.0            45.0                NaN           NaN  0.333333
3  d           NaN                NaN             NaN                NaN           NaN       NaN
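
To see why this works, here is a minimal sketch on a smaller, made-up frame (the values below are illustrative, not the data from the question) in which one value repeats. It assumes the default behaviour of value_counts, which drops NaN before normalizing, so rows with a missing testers_time get no match in the lookup and stay NaN.

```python
import pandas as pd

# Hypothetical data: the value 10 appears twice, and one entry is missing.
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd', 'e'],
                   'testers_time': [10, 30, 10, None, 15]})

# Relative frequency of each non-missing value (NaN is excluded by default).
freq = df['testers_time'].value_counts(normalize=True)

# Look up each row's value in freq; the NaN row has no match and stays NaN.
df['prob'] = df['testers_time'].map(freq)

print(df)
```

Here the two rows with 10 should get 0.5 each (2 of the 4 non-missing values), the rows with 30 and 15 get 0.25 each, and the missing row keeps NaN.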
cs95
  • If plotting, would you say I should just use `testers_time` as x and `prob` as y and do `plt.plot(x, y, marker='.', linestyle='none')`? Or `plt.hist()` is a better idea? – user8834780 Jun 22 '18 at 16:52
  • @user8834780 it depends. Go for whatever is more self-descriptive in its visualization, or whatever clearly depicts what you are trying to visualise. As for choice of axes, time always on X. – cs95 Jun 22 '18 at 16:53