
I have a list of tuples [(val1, freq1), (val2, freq2), ..., (valn, freqn)]. I need to compute measures of central tendency (mean, median) and measures of dispersion (variance, standard deviation) for this data. I would also like to draw a boxplot of the values.

I see that numpy has direct functions for getting the mean, median, and standard deviation (or variance) from a list of values.

Does numpy (or any other well-known library) have a direct way to operate on such a frequency distribution table?

Also: what is the best way to programmatically expand the above list of tuples into one flat list? For example, if the frequency distribution is [(1,3), (50,2)], what is the best way to get the list [1,1,1,50,50] so that I can call np.mean([1,1,1,50,50])?

I see a custom function here, but I would like to use a standard implementation if possible.

petezurich
jithu83
  • @ayhan I have attributed your solution to the description ... and clarified what I am looking for . Can you remove the duplicate tag ? – jithu83 Sep 07 '17 at 02:26

3 Answers

First, I'd change that messy list into two numpy arrays like @user8153 did:

val, freq = np.array(list_tuples).T

Then you can reconstruct the full array (using np.repeat to avoid looping):

data = np.repeat(val, freq)

And use numpy statistical functions on your data array.
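One caveat with the transpose trick: when the values are floats, np.array(list_tuples).T turns the frequencies into floats as well, and np.repeat expects integer repeat counts. A minimal sketch that keeps the two columns in separate dtypes, assuming hypothetical float-valued data in list_tuples:

```python
import numpy as np

# Hypothetical frequency table with float values and integer counts.
list_tuples = [(1.5, 3), (50.0, 2)]

# Split into two columns and give each its own dtype,
# so the counts stay integers for np.repeat.
val, freq = zip(*list_tuples)
val = np.asarray(val, dtype=float)
freq = np.asarray(freq, dtype=np.int64)

# Expand the table into the full list of observations.
data = np.repeat(val, freq)

mean = data.mean()
median = np.median(data)
var = data.var(ddof=1)   # sample variance; use ddof=0 for population variance
std = data.std(ddof=1)   # sample standard deviation
```

The ddof argument decides whether the divisor is N (ddof=0, the numpy default) or N - 1 (ddof=1).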


If that causes memory errors (or you just want to squeeze out as much performance as possible), you can also use some purpose-built functions:

def mean_(val, freq):
    return np.average(val, weights = freq)

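def median_(val, freq):
    # Assumes integer frequencies; averages the two middle values when the total count is even.
    order = np.argsort(val)
    cdf = np.cumsum(freq[order])
    n = cdf[-1]
    # The element at 0-based position p is the first value whose cumulative count exceeds p.
    lo = val[order][np.searchsorted(cdf, (n - 1) // 2, side='right')]
    hi = val[order][np.searchsorted(cdf, n // 2, side='right')]
    return (lo + hi) / 2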

def mode_(val, freq): #in the strictest sense, assuming unique mode
    return val[np.argmax(freq)]

def var_(val, freq):
    # Sample variance (divides by N - 1, like np.var(data, ddof=1)).
    avg = mean_(val, freq)
    dev = freq * (val - avg) ** 2
    return dev.sum() / (freq.sum() - 1)

def std_(val, freq):
    return np.sqrt(var_(val, freq))
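The question also asks for a boxplot, which needs quartiles in addition to the statistics above. The following is a minimal sketch, not part of the original answer, that computes quantiles directly from the frequency table, assuming integer frequencies. The names freq_quantile and box_stats are hypothetical helpers; the interpolation is meant to mirror the default linear rule of np.quantile on the expanded data.

```python
import numpy as np

def freq_quantile(val, freq, q):
    """Quantile q (between 0 and 1) of a frequency table, without expanding it."""
    val = np.asarray(val, dtype=float)
    freq = np.asarray(freq, dtype=np.int64)
    order = np.argsort(val)
    val, cdf = val[order], np.cumsum(freq[order])
    pos = (cdf[-1] - 1) * q                      # fractional 0-based position
    lo, hi = int(np.floor(pos)), int(np.ceil(pos))
    # The element at 0-based position p is the first value
    # whose cumulative count exceeds p.
    v_lo = val[np.searchsorted(cdf, lo, side="right")]
    v_hi = val[np.searchsorted(cdf, hi, side="right")]
    return v_lo + (v_hi - v_lo) * (pos - lo)     # linear interpolation

def box_stats(val, freq):
    """Five-number summary of a frequency table."""
    q1, med, q3 = (freq_quantile(val, freq, q) for q in (0.25, 0.5, 0.75))
    present = np.asarray(val, dtype=float)[np.asarray(freq) > 0]
    return {"min": present.min(), "q1": q1, "med": med,
            "q3": q3, "max": present.max()}

# Hypothetical example table: values 3, 1, 7, 5 with counts 2, 4, 1, 3.
stats = box_stats([3, 1, 7, 5], [2, 4, 1, 3])
```

A summary like this could then be handed to a plotting routine that accepts precomputed statistics (matplotlib's Axes.bxp is one such entry point, though that step is not shown here). If memory is not a concern, plotting np.repeat(val, freq) with an ordinary boxplot call is simpler.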
Daniel F
  • 13,620
  • 2
  • 29
  • 55
  • Error at "return dev.sum() / (freq.sum() - 1) " in function "var_" , 'Float64Index' object has no attribute 'sum' – ppau2004 Feb 15 '21 at 02:19
  • @ppau2004 I'm sure `pandas` has its own, much better-implemented versions of these. Probably want to make a question asking how if you can't find it (I'm not a `pandas` expert but there are plenty around here) – Daniel F Feb 15 '21 at 08:08
  • Actually, @AditjaRajgor below has a `pandas` answer – Daniel F Feb 15 '21 at 08:10
  • To convert the (value, frequency) list to a list of values:

    freqdist =  [(1,3), (50,2)]
    sum(([val,]*freq for val, freq in freqdist), []) 
    

    gives

    [1, 1, 1, 50, 50]
    
  • To compute the mean you can avoid building the list of values by using np.average, which takes a weights argument:

    vals, freqs = np.array(freqdist).T
    np.average(vals, weights = freqs)
    

    gives 20.6 as you would expect. I don't think this works for the median, variance, or standard deviation, though.
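Two small additions to the points above, offered as a sketch rather than a definitive recipe. First, sum(..., []) copies the growing list at every step, so for long tables itertools.chain is a common linear-time alternative. Second, although np.average has no variance option, applying it to the squared deviations from the weighted mean gives the population variance, which can be rescaled to the sample variance.

```python
from itertools import chain, repeat

import numpy as np

freqdist = [(1, 3), (50, 2)]

# Expand the table lazily, one run of repeated values at a time.
expanded = list(chain.from_iterable(repeat(v, f) for v, f in freqdist))

vals, freqs = np.array(freqdist).T
n = freqs.sum()

mean = np.average(vals, weights=freqs)
# Weighted average of squared deviations = population variance (divides by N).
pop_var = np.average((vals - mean) ** 2, weights=freqs)
# Rescale to the sample variance (divides by N - 1).
sample_var = pop_var * n / (n - 1)
```

Both variances agree with np.var on the expanded list, using ddof=0 and ddof=1 respectively.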

user8153
import pandas as pd
import math
import numpy as np

Frequency Distributed Data

    class   freq
0   60-65   3
1   65-70   150
2   70-75   335
3   75-80   135
4   80-85   4

Create a midpoint column for the classes

df[['Lower','Upper']] = df['class'].str.split('-', expand=True)
df['Xi'] = (df['Lower'].astype(float) + df['Upper'].astype(float)) / 2
df.drop(['Lower','Upper'], axis=1, inplace=True)

This gives

    class   freq  Xi
0   60-65   3     62.5
1   65-70   150   67.5
2   70-75   335   72.5
3   75-80   135   77.5
4   80-85   4     82.5

Mean

mean = np.average(df['Xi'], weights=df['freq'])
mean
72.396331738437

Standard Deviation (population form, dividing by the total frequency N)

std = np.sqrt(np.average((df['Xi']-mean)**2,weights=df['freq']))
std
3.5311919641103877
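The snippets above assume that df already exists. A self-contained sketch of the whole calculation follows, assuming the table is typed in by hand; the variable names bounds, n, pop_std and sample_std are illustrative, and the sample standard deviation is added only for comparison with the N - 1 convention.

```python
import numpy as np
import pandas as pd

# Grouped frequency table from the answer, entered by hand.
df = pd.DataFrame({
    "class": ["60-65", "65-70", "70-75", "75-80", "80-85"],
    "freq": [3, 150, 335, 135, 4],
})

# Class midpoints: split "lower-upper" and average the two bounds.
bounds = df["class"].str.split("-", expand=True).astype(float)
df["Xi"] = bounds.mean(axis=1)

n = df["freq"].sum()
mean = np.average(df["Xi"], weights=df["freq"])

# Population variance and standard deviation (divide by N).
pop_var = np.average((df["Xi"] - mean) ** 2, weights=df["freq"])
pop_std = np.sqrt(pop_var)

# Sample standard deviation (divide by N - 1).
sample_std = np.sqrt(pop_var * n / (n - 1))
```

Because every observation in a class is replaced by the class midpoint, these are approximations to the statistics of the underlying raw data.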