So I have a used Audi car database from Kaggle.
Here is what my code looks like to import the dataset:
### headers ###
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
### path name ###
data_file_pathname = "etc.."
### import ###
my_data_frame = pd.read_csv(data_file_pathname, sep=",", header=0)
### preview of dataset ###
print(my_data_frame.head())
model year price transmission mileage fuelType tax mpg engineSize
0 A1 2017 12500 Manual 15735 Petrol 150 55.4 1.4
1 A6 2016 16500 Automatic 36203 Diesel 20 64.2 2.0
2 A1 2016 11000 Manual 29946 Petrol 30 55.4 1.4
3 A4 2017 16800 Automatic 25952 Diesel 145 67.3 2.0
4 A3 2019 17300 Manual 1998 Petrol 145 49.6 1.0
The years range from 1997 to 2021 and I have 3 fuelTypes (Petrol, Diesel and Hybrid). What I want is dataset with fuelType percentages per year.
Example:
year fuel percentage
2003 Petrol 20.00
Diesel 65.00
Hybrid 15.00
So far I've managed to get the number of cars, per fuelType, per year and I'm not even sure of how I did it so... Here is the code for that:
Fuel_Year = my_data_frame.groupby(['year', 'fuelType'])
df = pd.concat([Fuel_Year.fuelType.count()], axis=1, keys="Counts")
print(df)
C
year fuelType
1997 Petrol 1
1998 Petrol 1
2002 Petrol 2
2003 Diesel 4
Petrol 2
2004 Diesel 1
Petrol 4
2005 Diesel 2
Petrol 5
2006 Diesel 3
Petrol 6
2007 Diesel 7
Petrol 9
I've tried different things and landed on the transform('sum') function on many web pages I checked, but this hasn't been a success for me. My ultimate goal with this is to create some sort of stacked histogram plot with the years on the x-axis and percentages on the y-axis to show evolution of fuelType per year.
Could anybody please help me find the proper code to generate a column for fuelType percentages per year?
Thank you very much! :)