I have a df with 70000 ages. I want to group them by age like this:
18-30
30-50
50-99
and compare them with another column that holds the revenue.
If you have a dataframe like this one:
N = 1000
df = pd.DataFrame({'age': np.random.randint(18, 99, N),
                   'revenue': 20 + 200*np.abs(np.random.randn(N))})
age revenue
0 69 56.776670
1 32 40.019089
2 89 38.045533
3 78 176.214654
4 38 527.738220
5 92 124.790533
6 92 137.617365
7 41 46.680172
8 20 234.199293
9 39 136.560120
You can cut the dataframe into age groups with pandas.cut:
df['group'] = pd.cut(df['age'], bins = [18, 30, 50, 99], include_lowest = True, labels = ['18-30', '30-50', '50-99'])
age revenue group
0 69 56.776670 50-99
1 32 40.019089 30-50
2 89 38.045533 50-99
3 78 176.214654 50-99
4 38 527.738220 30-50
5 92 124.790533 50-99
6 92 137.617365 50-99
7 41 46.680172 30-50
8 20 234.199293 18-30
9 39 136.560120 30-50
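One detail worth checking: with these bins, pandas.cut builds right-closed intervals by default, so an age of exactly 30 lands in '18-30' and an age of exactly 50 lands in '30-50'. Here is a minimal sketch of the difference, assuming you would rather have left-closed groups. The `ages` series, the upper edge of 100 and the labels '18-29', '30-49' are made up for illustration:

```python
import pandas as pd

# Hypothetical sample that sits exactly on the bin edges
ages = pd.Series([18, 29, 30, 49, 50, 99])

# Default behaviour (right=True): intervals are [18, 30], (30, 50], (50, 99]
default_groups = pd.cut(ages, bins=[18, 30, 50, 99], include_lowest=True,
                        labels=['18-30', '30-50', '50-99'])

# Left-closed alternative (right=False): intervals are [18, 30), [30, 50), [50, 100)
left_closed_groups = pd.cut(ages, bins=[18, 30, 50, 100], right=False,
                            labels=['18-29', '30-49', '50-99'])

# Side-by-side comparison of the two labelings
print(pd.DataFrame({'age': ages,
                    'default': default_groups,
                    'left_closed': left_closed_groups}))
```

Which variant is right depends on how you want the boundary ages to be counted; the rest of the answer works the same either way.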
Then you can group by the age groups with pandas.DataFrame.groupby:
df = df.groupby(by = 'group').mean()
age revenue
group
18-30 23.534091 184.895077
30-50 40.529183 185.348380
50-99 73.902998 170.889141
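If the goal is to compare the groups, the mean alone may hide how many rows each group contains. Here is a sketch using named aggregations in `groupby(...).agg`. The frame name `data`, the seed and the output column names (`n_rows`, `mean_revenue`, `median_revenue`) are assumptions for illustration, and `observed=True` is passed explicitly because `group` is a categorical column:

```python
import numpy as np
import pandas as pd

# Synthetic data in the same shape as the example (the seed is arbitrary)
rng = np.random.default_rng(0)
N = 1000
data = pd.DataFrame({'age': rng.integers(18, 99, N),
                     'revenue': 20 + 200 * np.abs(rng.standard_normal(N))})
data['group'] = pd.cut(data['age'], bins=[18, 30, 50, 99], include_lowest=True,
                       labels=['18-30', '30-50', '50-99'])

# One row per age group, with several revenue statistics side by side
summary = data.groupby('group', observed=True).agg(
    n_rows=('revenue', 'count'),
    mean_revenue=('revenue', 'mean'),
    median_revenue=('revenue', 'median'),
)
print(summary)
```

Keeping the result in a separate variable such as `summary` also leaves the original rows available for further analysis.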
Now, finally, you are ready to plot the data:
fig, ax = plt.subplots()
ax.bar(x = df.index, height = df['revenue'])
plt.show()
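The bar chart has no axis labels yet. Here is a minimal sketch of a labelled version, assuming a small hand-made `means` series in place of the grouped dataframe. Its values are the group means from the output above, rounded; the label text and the file name are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the grouped result: mean revenue per age group
means = pd.Series([184.90, 185.35, 170.89],
                  index=['18-30', '30-50', '50-99'], name='revenue')

fig, ax = plt.subplots()
ax.bar(x=means.index, height=means.values)
ax.set_xlabel('Age group')
ax.set_ylabel('Mean revenue')
ax.set_title('Mean revenue by age group')

# Write the figure to disk; in an interactive session use plt.show() instead
fig.savefig('revenue_by_age_group.png')
```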
Complete code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
N = 1000
df = pd.DataFrame({'age': np.random.randint(18, 99, N),
                   'revenue': 20 + 200*np.abs(np.random.randn(N))})
df['group'] = pd.cut(df['age'], bins = [18, 30, 50, 99], include_lowest = True, labels = ['18-30', '30-50', '50-99'])
df = df.groupby(by = 'group').mean()
fig, ax = plt.subplots()
ax.bar(x = df.index, height = df['revenue'])
plt.show()