
I have data on a Name, the number of times the name came up (Count), and a Score for each name. I want to create a box-and-whisker plot of Score, weighting each name's Score by its Count.

The result should be the same as if I had the data in raw (not frequency) form. But I don't want to actually transform the data to such a form because it would blow up in size very quickly.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = {
    "Name":['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count":[20, 10, 5, 2, 5], 
    "Score": [2, 4, 7, 8, 7]
}
df = pd.DataFrame(data)
print(df)
   Count   Name  Score
0     20   Sara      2
1     10   John      4
2      5   Mark      7
3      2  Peter      8
4      5   Kate      7

I am not sure how to tackle this in Python. Any help is appreciated!

thewhitetie

2 Answers


Late to this question, but in case it's useful to anyone who comes across it--

When your weights are integers, you can use reindex to expand the rows by their counts and then call boxplot directly. I have done this on dataframes of several thousand rows that expand to a few hundred thousand without memory problems, particularly when the reindexed dataframe is created inside a second function so that it is never assigned to a variable of its own.

import pandas as pd
import seaborn as sns

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}
df = pd.DataFrame(data)

def reindex_df(df, weight_col):
    """Expand the dataframe so each row is repeated `weight_col` times.

    The result has one row per counted observation.
    """
    df = df.reindex(df.index.repeat(df[weight_col]))
    return df.reset_index(drop=True)

df = reindex_df(df, weight_col='Count')

sns.boxplot(x='Name', y='Score', data=df)

Or, if you are concerned about memory, build the expanded dataframe inside the plotting call so it is never bound to a name:

def weighted_boxplot(df, weight_col):
    sns.boxplot(x='Name', 
                y='Score', 
                data=reindex_df(df, weight_col = weight_col))
    
weighted_boxplot(df, 'Count')
  • This works great. I would add that if your weights are large and memory is an issue you can scale the weights by a constant to bring the reindexed data back to a usable size. – tharen Oct 13 '22 at 20:15
  • Great answer! Yep @tharen super easy to do - just change `df[weight_col]` to `df[weight_col] // 10` or whatever scaling factor you want. – gtmtg Nov 03 '22 at 15:37
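
The scaling idea from the comments has one catch worth seeing: with plain floor division, any row whose count is smaller than the factor gets a weight of 0 and disappears from the expanded data. The sketch below compares floor division with a rounded variant that keeps every row. The helper `scaled_counts` and its `factor` argument are hypothetical names for illustration, not part of pandas or seaborn.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7],
})


def scaled_counts(counts, factor):
    """Divide counts by `factor`, round to the nearest integer, never go below 1.

    Hypothetical helper for this sketch only.
    """
    scaled = np.rint(np.asarray(counts) / factor).astype(int)
    return np.maximum(scaled, 1)


floor_counts = df["Count"] // 10            # floor division, as in the comment
kept_counts = scaled_counts(df["Count"], 10)

# Rows whose scaled count is 0 are repeated zero times, so they vanish.
expanded_floor = df.reindex(df.index.repeat(floor_counts))
expanded_kept = df.reindex(df.index.repeat(kept_counts))

print(sorted(expanded_floor["Name"].unique()))  # only the names that survived
print(sorted(expanded_kept["Name"].unique()))   # every name is still present
```

Clamping to 1 keeps every row but distorts small weights (a count of 2 and a count of 5 both become 1 here), so the scaling factor should stay small relative to the smallest counts that matter.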

Here are two ways to approach the question. The first, computing the statistics directly from the frequencies, is probably what you expect, but it is not a complete solution: matplotlib computes the confidence intervals of the median by bootstrapping over the raw sample data, as the excerpt below from matplotlib/cbook/__init__.py shows, so the notches cannot be derived from the frequencies alone. The second is therefore the better choice, because it relies on matplotlib's well-tested code rather than on custom code.

def boxplot_stats(X, whis=1.5, bootstrap=None, labels=None,
                  autorange=False):
    def _bootstrap_median(data, N=5000):
        # determine 95% confidence intervals of the median
        M = len(data)
        percentiles = [2.5, 97.5]

        bs_index = np.random.randint(M, size=(N, M))
        bsData = data[bs_index]
        estimate = np.median(bsData, axis=1, overwrite_input=True)

First:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}

df = pd.DataFrame(data)
print(df)


def boxplot(values, freqs):
    values = np.array(values)
    freqs = np.array(freqs)
    arg_sorted = np.argsort(values)
    values = values[arg_sorted]
    freqs = freqs[arg_sorted]
    count = freqs.sum()
    fx = values * freqs
    mean = fx.sum() / count
    variance = ((freqs * values ** 2).sum() / count) - mean ** 2
    variance = count / (count - 1) * variance  # dof correction for sample variance
    std = np.sqrt(variance)
    minimum = np.min(values)
    maximum = np.max(values)
    cumcount = np.cumsum(freqs)

    print([std, variance])
    Q1 = values[np.searchsorted(cumcount, 0.25 * count)]
    Q2 = values[np.searchsorted(cumcount, 0.50 * count)]
    Q3 = values[np.searchsorted(cumcount, 0.75 * count)]

    # The interquartile range (IQR), also called the midspread, middle 50%, or
    # H-spread, is a measure of statistical dispersion: the difference between
    # the 75th and 25th percentiles, IQR = Q3 - Q1. It is a trimmed estimator
    # (the 25% trimmed range) and a commonly used robust measure of scale.

    IQR = Q3 - Q1

    # The fences lie 1.5 times the IQR above the 75th percentile (Q3) and
    # 1.5 times the IQR below the 25th percentile (Q1). For normally
    # distributed data they enclose about 99.3% of the values. Each whisker
    # ends at the most extreme observed value inside its fence; values
    # outside the fences are drawn as fliers.
    upper_fence = Q3 + 1.5 * IQR
    lower_fence = Q1 - 1.5 * IQR
    inside = (values >= lower_fence) & (values <= upper_fence)
    whishi = values[inside].max()
    whislo = values[inside].min()

    stats = [{
        'label': 'Scores',  # tick label for the boxplot
        'mean': mean,  # arithmetic mean value
        'iqr': IQR,  # interquartile range
        # 'cilo' and 'cihi' (notch limits around the median) are omitted
        'whishi': whishi,  # end of the upper whisker
        'whislo': whislo,  # end of the lower whisker
        'fliers': values[~inside],  # distinct values outside the fences
        'q1': Q1,  # first quartile (25th percentile)
        'med': Q2,  # 50th percentile
        'q3': Q3  # third quartile (75th percentile)
    }]

    fs = 10  # fontsize
    _, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6), sharey=True)
    axes.bxp(stats)
    axes.set_title('Default', fontsize=fs)
    plt.show()


boxplot(df['Score'], df['Count'])
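
The quartile lookup used above can be checked on its own, without plotting. This is a minimal sketch, assuming a hypothetical helper named `weighted_quantile` that repeats the same `searchsorted` step: it returns the smallest value whose cumulative frequency reaches the requested fraction of the total. The comparison against `np.percentile` on the expanded data is only there as a sanity check on this small dataset.

```python
import numpy as np


def weighted_quantile(values, freqs, q):
    """Smallest value whose cumulative frequency reaches q * total.

    Hypothetical helper for this sketch; it does not interpolate.
    """
    values = np.asarray(values)
    freqs = np.asarray(freqs)
    order = np.argsort(values)
    values, freqs = values[order], freqs[order]
    cumcount = np.cumsum(freqs)
    return values[np.searchsorted(cumcount, q * cumcount[-1])]


scores = [2, 4, 7, 8, 7]
counts = [20, 10, 5, 2, 5]

# Quartiles computed from the frequencies alone.
quartiles = [weighted_quantile(scores, counts, q) for q in (0.25, 0.5, 0.75)]

# Reference: expand the data and let NumPy compute the percentiles.
expanded = np.repeat(scores, counts)
interpolated = np.percentile(expanded, [25, 50, 75])

print(quartiles)
print(interpolated)
```

On this data both approaches agree. They can differ when a quantile falls exactly between two distinct values, because `np.percentile` interpolates linearly by default while the lookup picks an observed value.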

Second:

import pandas as pd
import matplotlib.cbook as cbook
import matplotlib.pyplot as plt


data = {
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7]
}

df = pd.DataFrame(data)
print(df)

labels = ['Scores']

data = df['Score'].repeat(df['Count']).tolist()

# compute the boxplot stats
stats = cbook.boxplot_stats(data, labels=labels, bootstrap=10000)

print(['stats :', stats])

fs = 10  # fontsize

fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(6, 6), sharey=True)
axes.bxp(stats)
axes.set_title('Boxplot', fontsize=fs)

plt.show()
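
The dictionary that `cbook.boxplot_stats` returns can also be inspected without drawing anything, which is a quick way to confirm what the plot will show. The sketch below assumes the same scores and counts as above, expands only the numeric column with `np.repeat` rather than the whole DataFrame, and skips the bootstrap. The variable names are illustrative.

```python
import numpy as np
from matplotlib import cbook

scores = [2, 4, 7, 8, 7]
counts = [20, 10, 5, 2, 5]

# Expand only the one numeric column instead of the whole DataFrame.
expanded = np.repeat(scores, counts)

# boxplot_stats returns one dict per dataset; there is a single dataset here.
stats = cbook.boxplot_stats(expanded)[0]

for key in ("q1", "med", "q3", "whislo", "whishi", "mean"):
    print(key, stats[key])
```

For this data the quartiles come out as 2, 4 and 7, the whiskers span 2 to 8, and there are no fliers.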


caot
  • Interesting. I am actually new to using weights. Would this be essentially what passing in an array of weights does? – thewhitetie Sep 22 '19 at 22:50
  • @thewhitetie Added `print(df)` behind the `sns.boxplot(...` to help you understand the dataframe. – caot Sep 22 '19 at 23:00
  • @thewhitetie DataFrame keeps data in a python dictionary. If you look into the source code `class DataFrame(NDFrame):` in `pandas/core/frame.py`, you will get it. – caot Sep 22 '19 at 23:08
  • EDIT: I mean I wanted a boxplot not at the name level, but at the AGGREGATE level -- a box and whisker plot showing mean, median, Q25 etc. In other words, I want to summarize the entire data. This shows a different thing. – thewhitetie Sep 22 '19 at 23:25
  • This, for example, would get the desired mean. Still not sure how to create boxplot from it: `desired_mean = sum((df['Count'] * df['Score'])) / sum(df['Count'])` – thewhitetie Sep 22 '19 at 23:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/199871/discussion-between-thewhitetie-and-caot). – thewhitetie Sep 23 '19 at 23:53
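
For the aggregate mean mentioned in the comments, the frequency-weighted formula and the mean of the expanded data should give the same number. A minimal sketch, assuming the same DataFrame as in the question, that checks this three ways (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ['Sara', 'John', 'Mark', 'Peter', 'Kate'],
    "Count": [20, 10, 5, 2, 5],
    "Score": [2, 4, 7, 8, 7],
})

# Formula from the comment: sum(count * score) / sum(count).
manual_mean = (df["Count"] * df["Score"]).sum() / df["Count"].sum()

# The same quantity via NumPy's weighted average.
weighted_mean = np.average(df["Score"], weights=df["Count"])

# Reference value from the expanded (raw-form) data.
expanded_mean = df["Score"].repeat(df["Count"]).mean()

print(manual_mean, weighted_mean, expanded_mean)
```

All three equal 166 / 42, so the mean needs no expansion at all; only the order statistics (quartiles, whiskers) need either the expanded data or a cumulative-frequency lookup.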