I have a dataframe, and I created a CDF of the days column:

...
#create DF from SQL
df = pd.read_sql_query(query, conn)

days = df['days'].dropna()

#create CDF definition
def ecdf(data):
    n = len(data)
    x = np.sort(data)
    y = np.arange(1.0, n+1) / n
    return x, y

#unpack x and y
x, y = ecdf(days)
sns.set()

#plot CDF
plt.plot(x, y, marker='.', linestyle='none')

#Overlay quartiles
percentiles = np.array([25, 50, 75])
x_p = np.percentile(days, percentiles)
y_p = percentiles / 100.0
plt.plot(x_p, y_p, marker='D', color='red', linestyle='none')

#get current axes and annotate each quartile point with its x value
ax = plt.gca()
for xq, yq in zip(x_p, y_p):  # separate names so the ECDF arrays x, y are not overwritten
    ax.annotate('%s' % xq, xy=(xq, yq), xytext=(15, 0), textcoords='offset points')

At the 50% mark, the point overlaid on the CDF shows 120, which I expected to be the average; however, print(np.mean(df['days_to_engaged'])) gives me 154.

Why the discrepancy?

print(df['days'].dropna()):

389
350
130
344
392
92
51
28
309
357
64
380
332
109
284
105
50
66
156
116
75
315
155
34
155
241
320
50
97
41
274
99
133
95
306
62
187
56
110
338
102
285
386
231
238
145
216
148
105
368
176
155
106
107
36
16
28
6
322
95
122
82
64
35
72
214
192
91
117
277
101
159
96
325
79
154
314
142
147
138
48
50
178
146
224
282
141
75
151
93
135
82
125
111
49
113
165
19
118
105
92
133
77
54
72
34
user8834780

1 Answer

You're comparing the median to the mean. This boils down to the following:

a = np.array([1, 1, 2, 4])

The point where the ECDF reaches 0.5 is just the second sorted element (1), while the mean is (4 + 2 + 1 + 1) / 4 == 2.
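
To see the three numbers side by side, here is a minimal sketch using the same four-element array. The variable names (`value_at_half`, `median`, `mean`) are just for illustration. Note that `np.median` (and `np.percentile(a, 50)` with its default linear interpolation) averages the two middle values when the array length is even, so it need not coincide exactly with a plotted ECDF point.

```python
import numpy as np

a = np.array([1, 1, 2, 4])

# ECDF: sorted values against cumulative fractions 1/n, 2/n, ..., 1
x = np.sort(a)
y = np.arange(1, len(a) + 1) / len(a)

# First sorted value at which the ECDF reaches 0.5
value_at_half = x[np.searchsorted(y, 0.5)]

median = np.median(a)  # averages the two middle values for even n
mean = np.mean(a)      # sum of all values divided by n

print(value_at_half, median, mean)
```

Neither the ECDF value at 0.5 nor the median equals the mean here. A long right tail (the single 4 in this array, or the values in the high 300s in the question's data) pulls the mean above the 50% point.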

Alex
  • Thank you @Alex! Is there a similar visual concept to CDF I can use so that I can show distribution where x at 50% is the mean? – user8834780 Feb 09 '18 at 00:43
  • @user8834780 You can just plot the values and add a marker where the mean is. – Alex Feb 09 '18 at 00:46
  • @user8834780 Google it :) https://stackoverflow.com/questions/16180946/drawing-average-line-in-histogram-matplotlib – Alex Feb 09 '18 at 00:48
  • That's perfect, thank you. I can't figure out how to add an annotation at the mean though. I need the y coordinate, i.e. P(x), and I'm not sure how to specify it in: `ax.annotate(x.mean(), xy=(x.mean(),???), xytext=(15,0), textcoords='offset points')` – user8834780 Feb 09 '18 at 14:46
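
One possible way to get that y coordinate, assuming the annotation goes on the ECDF plot from the question rather than on a histogram: the ECDF height at any value is the fraction of observations less than or equal to it, which `np.searchsorted` can count on the sorted array. The `days` values below are made up for illustration, and `y_mean` is a hypothetical name.

```python
import numpy as np

# Illustrative values only; substitute df['days'].dropna() from the question
days = np.array([28, 50, 92, 130, 344, 389])

x = np.sort(days)
n = len(x)

mean = days.mean()

# ECDF height at the mean: fraction of observations <= mean
y_mean = np.searchsorted(x, mean, side="right") / n

print(mean, y_mean)
```

With that in hand, `xy=(mean, y_mean)` would fill in the `???` in the `ax.annotate` call above. Because the mean is usually not itself one of the data points, the marker sits level with the nearest ECDF point to its left.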