Weights for histogram in pandas

Question

I have a pandas dataframe (call it data) with categorical and continuous values that look like this:

INDEX  AGE  SEX  INCOME  COUNTRY  INSTANCE_WEIGHT
1      25   M    30000   USA      120
2      53   F    42000   FR       95
3      37   F    22000   USA      140
4      18   M    0       FR       110
.
.
.
15000  29   F    39000   USA      200

The instance weight indicates the number of people in the population that each record represents due to stratified sampling.

I would like to plot the distribution of each variable as a histogram. I can't simply plot a histogram of the dataframe as it stands, because the sample is not representative of the real distribution. To make it representative, I have to weight each row by its instance_weight before plotting. The problem sounds easy, but I can't find a good way of doing it.

One solution would be to duplicate each row instance_weight times, but the real dataframe has 300k rows and instance_weight is around 1000.

This is the code I have so far to plot a histogram of each column.

from math import ceil

import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(20, 70))
cols = 4
rows = ceil(float(data.shape[1]) / cols)
for i, column in enumerate(data.drop(["instance_weight", "index"], axis=1).columns):
    ax = fig.add_subplot(rows, cols, i + 1)
    ax.set_title(column)
    # Categorical (object dtype) columns get a bar chart, numeric ones a histogram
    if data[column].dtype == object:
        data[column].value_counts().plot(kind="bar", ax=ax,
                                         alpha=0.8, color=sns.color_palette(n_colors=1))
    else:
        data[column].hist(ax=ax, alpha=0.8)
        plt.xticks(rotation="vertical")
plt.subplots_adjust(hspace=1, wspace=0.2)

How can I take the weights into account?

You could multiply the numeric columns [like so](http://stackoverflow.com/a/22702814/1292641), but that won't help with the non-numeric ones... — Norman, Apr 13 '16 at 09:02

score 19 · Answer 1 · answered Aug 06 '18 at 15:13

You should use the 'weights' argument of the matplotlib 'hist' function, which is also available through the pandas 'plot' function.

In your example, to plot the distribution of the variable 'AGE' weighted by the variable 'INSTANCE_WEIGHT', you should do:

df["AGE"].plot(kind="hist", weights=df["INSTANCE_WEIGHT"])
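The same idea can be extended to the categorical columns, which 'weights' does not cover directly. The following is only a sketch under assumptions: the toy values are made up to mirror the question's columns, and the helper names weighted_category_counts and weighted_numeric_counts are hypothetical. For a categorical column it sums the weights per category (a weighted analogue of value_counts), and for a numeric column it uses numpy.histogram with its weights argument, which should give the same bin totals that hist(weights=...) draws.

```python
import numpy as np
import pandas as pd

# Toy data mirroring the question's columns (values are illustrative only).
data = pd.DataFrame({
    "AGE": [25, 53, 37, 18],
    "SEX": ["M", "F", "F", "M"],
    "COUNTRY": ["USA", "FR", "USA", "FR"],
    "INSTANCE_WEIGHT": [120, 95, 140, 110],
})


def weighted_category_counts(df, column, weight_col="INSTANCE_WEIGHT"):
    """Sum the weights per category: a weighted analogue of value_counts()."""
    return df.groupby(column)[weight_col].sum().sort_values(ascending=False)


def weighted_numeric_counts(df, column, bins=10, weight_col="INSTANCE_WEIGHT"):
    """Return (counts, bin_edges) where each row contributes its weight."""
    return np.histogram(df[column], bins=bins, weights=df[weight_col])


sex_counts = weighted_category_counts(data, "SEX")
age_counts, age_edges = weighted_numeric_counts(data, "AGE", bins=2)

# To draw the results (requires matplotlib):
# sex_counts.plot(kind="bar")
# data["AGE"].plot(kind="hist", weights=data["INSTANCE_WEIGHT"])
```

In the plotting loop from the question, this would amount to replacing value_counts() with the grouped sum in the categorical branch and passing weights=data["instance_weight"] to hist in the numeric branch, so no rows need to be duplicated.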
