I have a frequency table of test scores:

score    count
-----    -----
  77      1105
  78       940
  79      1222
  80      4339
etc

I want to show basic statistics and a boxplot for the sample summarized by the frequency table. (For example, the mean of the sample above is 79.16 and the median is 80.)
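The mean is just the count-weighted average of the scores; it can be checked with numpy, for instance:

    import numpy as np

    scores = [77, 78, 79, 80]
    counts = [1105, 940, 1222, 4339]
    np.average(scores, weights=counts)  # 79.156..., matching the mean quoted above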

Is there a way to do this in Pandas? All the examples I have seen assume a table of individual cases.

I suppose I could generate a list of individual scores, like this --

In [2]: s = pd.Series([77] * 1105 + [78] * 940 + [79] * 1222 + [80] * 4339)
In [3]: s.describe()
Out[3]: 
count    7606.000000
mean       79.156324
std         1.118439
min        77.000000
25%        78.000000
50%        80.000000
75%        80.000000
max        80.000000
dtype: float64

-- but I am hoping to avoid that; total frequencies in the real non-toy dataset are well up in the billions.

Any help appreciated.

(I think this is a different question from Using describe() with weighted data, which is about applying weights to individual cases.)

wleftwich
  • Possible duplicate of http://stackoverflow.com/questions/17689099/using-describe-with-weighted-data – Alicia Garcia-Raboso Sep 18 '16 at 15:02
  • I think it *is* the same as the question I linked to: you want weighted descriptive statistics of the `score` column with weights given by the `count` column. Alas, I don't think that question has a satisfactory answer. – Alicia Garcia-Raboso Sep 18 '16 at 15:32
  • I agree they are asking very similar things but I don't know how SAS proc works so I'll post my answer here as it may not satisfy those requirements. – ayhan Sep 18 '16 at 15:50

3 Answers

Here's a small function that calculates descriptive statistics for frequency distributions:

import numpy as np
import pandas as pd
# from __future__ import division  (needed on Python 2)

def descriptives_from_agg(values, freqs):
    values = np.array(values)
    freqs = np.array(freqs)
    # Sort by value so the cumulative frequencies below are monotone
    arg_sorted = np.argsort(values)
    values = values[arg_sorted]
    freqs = freqs[arg_sorted]
    count = freqs.sum()
    fx = values * freqs
    mean = fx.sum() / count
    variance = ((freqs * values**2).sum() / count) - mean**2
    variance = count / (count - 1) * variance  # dof correction for sample variance
    std = np.sqrt(variance)
    minimum = np.min(values)
    maximum = np.max(values)
    # Each quartile is the first value whose cumulative frequency reaches the target rank
    cumcount = np.cumsum(freqs)
    Q1 = values[np.searchsorted(cumcount, 0.25 * count)]
    Q2 = values[np.searchsorted(cumcount, 0.50 * count)]
    Q3 = values[np.searchsorted(cumcount, 0.75 * count)]
    idx = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']
    result = pd.Series([count, mean, std, minimum, Q1, Q2, Q3, maximum], index=idx)
    return result

A demo:

np.random.seed(0)

val = np.random.normal(100, 5, 1000).astype(int)

pd.Series(val).describe()
Out: 
count    1000.000000
mean       99.274000
std         4.945845
min        84.000000
25%        96.000000
50%        99.000000
75%       103.000000
max       113.000000
dtype: float64

vc = pd.value_counts(val)
descriptives_from_agg(vc.index, vc.values)

Out: 
count    1000.000000
mean       99.274000
std         4.945845
min        84.000000
25%        96.000000
50%        99.000000
75%       103.000000
max       113.000000
dtype: float64

Note that this doesn't handle NaNs and hasn't been thoroughly tested.
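The question also asked for a boxplot. One option (a sketch, assuming matplotlib is available) is to hand the precomputed quartiles to Axes.bxp, which draws a boxplot from summary statistics rather than from raw observations:

    import matplotlib.pyplot as plt

    stats = descriptives_from_agg([77, 78, 79, 80], [1105, 940, 1222, 4339])

    fig, ax = plt.subplots()
    ax.bxp([{'med': stats['50%'], 'q1': stats['25%'], 'q3': stats['75%'],
             # whiskers drawn at the data extremes in this sketch
             'whislo': stats['min'], 'whishi': stats['max'],
             'label': 'score'}],
           showfliers=False)
    plt.show()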

ayhan
  • Thanks! And your fast response saved me another couple of hours trying to find a built-in way to do it. – wleftwich Sep 18 '16 at 17:37

In my original question I said I didn't want to reconstruct raw values from the frequency table, but as long as it fits in memory I now think I will go that route, especially since my actual use case involves more columns.

In case anyone is interested, here is my function for transforming a frequency table into cases.

In [5]: def freqs2cases(df, freq_col, cases_cols):
   ...:     def itcases():
   ...:         for i, row in df.iterrows():
   ...:             for j in range(int(row[freq_col])):
   ...:                 yield row[cases_cols]
   ...:     return pd.DataFrame(itcases())
   ...: 

In [8]: freq_df
Out[8]: 
  course  score  freq
0   math     75     3
1   math     81     4
2   chem     92     2
3   chem     66     3

In [9]: freqs2cases(freq_df, 'freq', ['course', 'score'])
Out[9]: 
  course  score
0   math     75
0   math     75
0   math     75
1   math     81
1   math     81
1   math     81
1   math     81
2   chem     92
2   chem     92
3   chem     66
3   chem     66
3   chem     66
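If the expanded table fits in memory, a vectorized version avoids the Python-level loop. A minimal sketch (the freqs2cases_fast name is mine; it assumes the frequency column holds non-negative integers) using pandas' Index.repeat:

    import pandas as pd

    def freqs2cases_fast(df, freq_col, cases_cols):
        # Repeat each row label freq_col times, then select those rows
        return df.loc[df.index.repeat(df[freq_col]), cases_cols]

freqs2cases_fast(freq_df, 'freq', ['course', 'score']) should produce the same rows as the output above.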
wleftwich

You can do something like this:

  1. Group the DataFrame by 'score' using groupby.
  2. For each group, repeat the group's score list count times and append it to sum_add.
  3. sum_add is then a list of lists, so flatten it with itertools.chain.
  4. Wrap the flattened list in pd.Series() so you can call .describe() on it:

    import itertools

    sum_add = []
    for idx, grp in df.groupby('score'):
        sum_add.append(list(grp['score']) * grp['count'].iloc[0])
    pd.Series(list(itertools.chain.from_iterable(sum_add))).describe()

mtgarden
  • Hi, welcome to Stack Overflow. When answering a question that already has many answers, please be sure to add some additional insight into why the response you're providing is substantive and not simply echoing what's already been vetted by the original poster. This is especially important in "code-only" answers such as the one you've provided. – chb Mar 11 '19 at 02:47
  • Thanks for your advice. I am not fluent in English, but I will try. – mtgarden Mar 11 '19 at 03:09