I am trying to use matplotlib to graph the distribution of salary grouped by region, with the y-axis showing the % of people having that salary.

So far I have been able to come up with:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

records = pd.DataFrame(
    {'Name': ['John', 'Rachel', 'Tom', 'Stan', 'Jack', 'Ben', 'Joe', 'Juliet', 'Nigel', 'Veronica', 'Cam'],
     'salary': [40104, 29401, 57383, 38494, 99302, 44733, 40242, 49555, 13934, 44011, 88920],
     'Country': ['USA', 'USA', 'USA', 'France', 'France', 'France', 'China', 'China', 'Japan', 'Japan', 'France'],
     'Region': ['America', 'America', 'America', 'Europe', 'Europe', 'Europe', 'Asia', 'Asia', 'Asia', 'Asia', 'Europe']})
recordsgraph = records.pivot(columns='Region', values='salary')
recordsgraph.plot.density()


But I want the y-axis to show the percentage of the population for each region that has that salary. Any ideas on how to accomplish that?

Sean R

2 Answers

The pandas density plot draws a kernel density estimate (KDE), which approximates the probability density function. The curve is scaled so that the total area under it equals 1. The y-value is therefore a probability per unit on the x-axis rather than a probability. In this case, a value of 2.5e-5 means there is roughly a 0.0025 % chance that the salary lies within half a unit of a given x-value, e.g. between 44,220 and 44,221.
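As a quick check of that interpretation, here is a minimal sketch that builds a KDE directly with `scipy.stats.gaussian_kde`. It assumes the default bandwidth, which should match what the pandas plot draws, and it uses only the Europe salaries from the question. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Europe salaries from the question's data
europe = np.array([38494, 99302, 44733, 88920], dtype=float)

# Assumption: default bandwidth (Scott's rule), as in the pandas density plot
kde = gaussian_kde(europe)

# The total area under the estimated density is 1
total_area = kde.integrate_box_1d(-np.inf, np.inf)

# The density at x is (almost exactly) the probability of the
# one-unit-wide interval centred on x
x = 44220.5
density_at_x = kde(x)[0]
prob_unit_interval = kde.integrate_box_1d(x - 0.5, x + 0.5)

print(f"total area:            {total_area:.6f}")
print(f"density at x:          {density_at_x:.3e}")
print(f"P(44220 <= s <= 44221): {prob_unit_interval:.3e}")
```

The last two printed values should agree to many digits, because the density barely changes over a single unit of salary.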

You'll probably want a coarser range, say 5000. If you multiply the y-axis values by 5000 (and by 100), you get the approximate percentage probability that a salary falls within 2500 of a given x-value. The approximation treats the density as constant across that range. The code below multiplies the displayed y-values without touching the data, using a FuncFormatter.

import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
import pandas as pd

def percentage_base_5000(x, pos):
    return f'{x * 5000 * 100:.1f} %'

records = pd.DataFrame(
    {'Name': ['John', 'Rachel', 'Tom', 'Stan', 'Jack', 'Ben', 'Joe', 'Juliet', 'Nigel', 'Veronica', 'Cam'],
     'salary': [40104, 29401, 57383, 38494, 99302, 44733, 40242, 49555, 13934, 44011, 88920],
     'Country': ['USA', 'USA', 'USA', 'France', 'France', 'France', 'China', 'China', 'Japan', 'Japan', 'France'],
     'Region': ['America', 'America', 'America', 'Europe', 'Europe', 'Europe', 'Asia', 'Asia', 'Asia', 'Asia', 'Europe']})
recordsgraph = records.pivot(columns='Region', values='salary')
ax = recordsgraph.plot.density()
ax.yaxis.set_major_formatter(FuncFormatter(percentage_base_5000))
ax.set_ylabel('Percentage for a 5000-range salary')
plt.tight_layout()
plt.show()

resulting plot

Note that a side effect of a FuncFormatter is that the values shown in the status bar get the same transformation. So a cursor readout of x=3.67e4 y=11.7 % means that a salary between 36700-2500 and 36700+2500 has an estimated probability of about 11.7 %.
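Multiplying the density by 5000 is an approximation of the area under the curve over that range. The sketch below compares it with the integral of the KDE over the same band, per region. It assumes `scipy.stats.gaussian_kde` with its default bandwidth stands in for the curve pandas draws; `band_probability` is a hypothetical helper, not part of any library.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Salaries per region, taken from the question's data
salaries = {
    'America': [40104, 29401, 57383],
    'Europe': [38494, 99302, 44733, 88920],
    'Asia': [40242, 49555, 13934, 44011],
}

def band_probability(values, center, width=5000):
    """Return (density * width, integrated probability) for the band
    [center - width/2, center + width/2] under a Gaussian KDE."""
    kde = gaussian_kde(np.asarray(values, dtype=float))
    approx = kde(center)[0] * width
    exact = kde.integrate_box_1d(center - width / 2, center + width / 2)
    return approx, exact

results = {region: band_probability(vals, 36700) for region, vals in salaries.items()}

for region, (approx, exact) in results.items():
    print(f"{region:8s} density*5000 = {approx * 100:5.2f} %   integral = {exact * 100:5.2f} %")
```

For this data the two columns should be close, since the KDE bandwidth is considerably wider than 5000. With a much wider band the integral is the safer number to report.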

JohanC

If all you need is to change the y-axis to show percentages, you can apply a Formatter from matplotlib to the y-axis tick labels. Specifically, you can try the following:

from matplotlib.ticker import PercentFormatter

recordsgraph = records.pivot(columns='Region', values='salary')
ax = recordsgraph.plot.density()
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1))  # xmax is the data value that corresponds to 100 %

Now, since this is a kernel density plot, you will find that the percentages are very low: a density of 2.5e-5 is displayed as 0.0025 %.
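To see what `xmax` does without drawing anything, the sketch below calls `PercentFormatter.format_pct` directly. The `decimals=4` setting and the sample value are choices made for this illustration only.

```python
from matplotlib.ticker import PercentFormatter

# xmax is the data value that is displayed as 100 %
fraction_fmt = PercentFormatter(xmax=1, decimals=4)    # data are fractions (0..1)
percent_fmt = PercentFormatter(xmax=100, decimals=4)   # data are already percentages

density = 2.5e-5  # a typical y-value on the salary density plot

# The second argument (display range) is only used when decimals=None
label_fraction = fraction_fmt.format_pct(density, 1)
label_percent = percent_fmt.format_pct(density, 1)

print(label_fraction)  # density read as a fraction of 1
print(label_percent)   # density read as "percent already": not scaled up
```

With `xmax=1` the density is multiplied by 100 before display, which is the behaviour wanted here. With `xmax=100` the value is only given a percent sign.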

If you need more descriptive values (e.g. 10% of salaries fall into this bin), you might want to consider a histogram instead. The sample shown here has too few points to fill many bins, but if you have more data you can do it like this:

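# Drop the NaN cells that pivot() creates so each region keeps only its own salaries
plt.hist([recordsgraph[col].dropna() for col in recordsgraph.columns],
         bins=30,
         label=list(recordsgraph.columns))
plt.legend()  # show which colour belongs to which region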

(per the answers here)
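A plain histogram shows counts. To get the percentage of each region's population on the y-axis, one option is to pass per-sample `weights` so that every region's bars sum to 100. The following is a minimal sketch under that assumption; `bins=6` is chosen only because the sample is small.

```python
import matplotlib.pyplot as plt
import numpy as np

# Salaries per region, taken from the question's data
salaries = {
    'America': [40104, 29401, 57383],
    'Europe': [38494, 99302, 44733, 88920],
    'Asia': [40242, 49555, 13934, 44011],
}

data = [np.asarray(vals, dtype=float) for vals in salaries.values()]

# Each person contributes 100/n, so the bars of one region add up to 100 %
weights = [np.full(len(vals), 100.0 / len(vals)) for vals in data]

fig, ax = plt.subplots()
heights, bin_edges, _ = ax.hist(data, bins=6, weights=weights, label=list(salaries))
ax.set_xlabel('Salary')
ax.set_ylabel('% of people in region')
ax.legend()
```

Call `plt.show()` afterwards to display the figure. Each bar then reads directly as the share of that region's people whose salary falls in the bin.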

tania