1

So I have a used Audi car database from Kaggle.

Here is what my code looks like to import the dataset:

### headers ###
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### path name ###
data_file_pathname = "etc.."

### import ###
my_data_frame = pd.read_csv(data_file_pathname, sep=",", header=0)

### preview of dataset ###
print(my_data_frame.head())

  model  year  price transmission  mileage fuelType  tax   mpg  engineSize
0    A1  2017  12500       Manual    15735   Petrol  150  55.4         1.4
1    A6  2016  16500    Automatic    36203   Diesel   20  64.2         2.0
2    A1  2016  11000       Manual    29946   Petrol   30  55.4         1.4
3    A4  2017  16800    Automatic    25952   Diesel  145  67.3         2.0
4    A3  2019  17300       Manual     1998   Petrol  145  49.6         1.0

The years range from 1997 to 2021 and I have 3 fuelTypes (Petrol, Diesel and Hybrid). What I want is dataset with fuelType percentages per year.

Example:

 year  fuel    percentage    
 2003  Petrol  20.00
       Diesel  65.00
       Hybrid  15.00

So far I've managed to get the number of cars, per fuelType, per year and I'm not even sure of how I did it so... Here is the code for that:

Fuel_Year = my_data_frame.groupby(['year', 'fuelType'])
df = pd.concat([Fuel_Year.fuelType.count()], axis=1, keys="Counts")
print(df)
                  C
year fuelType      
1997 Petrol       1
1998 Petrol       1
2002 Petrol       2
2003 Diesel       4
     Petrol       2
2004 Diesel       1
     Petrol       4
2005 Diesel       2
     Petrol       5
2006 Diesel       3
     Petrol       6
2007 Diesel       7
     Petrol       9

I've tried different things and landed on the transform('sum') function on many web pages I checked, but this hasn't been a success for me. My ultimate goal with this is to create some sort of stacked histogram plot with the years on the x-axis and percentages on the y-axis to show evolution of fuelType per year.

Could anybody please help me find the proper code to generate a column for fuelType percentages per year?

Thank you very much! :)

JacobMarlo
  • 87
  • 7

2 Answers2

3
df.groupby(['year'])['fuelType'].value_counts(normalize=True) * 100
    
year  fuelType
2016  Diesel       50.0
      Petrol       50.0
2017  Diesel       50.0
      Petrol       50.0
2019  Petrol      100.0
Name: fuelType, dtype: float64
EBDS
  • 1,244
  • 5
  • 16
  • Thank you for the answer, it's useful! The only problem is that it olny gives 3 different percentage values; 0.333333, 0.500000, 0.100000 which isn't the true percentage of each fuelType per year... could there be a step missing? – JacobMarlo Oct 07 '21 at 16:03
  • oh. so you want to mileage * mpg, then value count on this new var ? – EBDS Oct 07 '21 at 16:15
  • humm, I want the count of each fuelType per year / sum of counts of each fuelType per year * 100 to give a percentage (hopefully that is abit clearer) – JacobMarlo Oct 07 '21 at 16:18
1

You could leverage the value_counts() function of pandas dataframe for this purpose. It will be handy for you in many situations.

round((df.groupby(['year'])['fuelType'].value_counts()/df.groupby('year')['fuelType'].count()) * 100,2)

year  fuelType
1997  Petrol      100.00
1998  Petrol      100.00
2002  Petrol      100.00
2003  Diesel       66.67
      Petrol       33.33
2004  Petrol       80.00
      Diesel       20.00
2005  Petrol       71.43
      Diesel       28.57
2006  Petrol       66.67
      Diesel       33.33

Please let me know in case of any queries. Cheers!

  • 1
    Thanks!, feels like I'm getting there but it gives me about the same thing as the first answer, except this time with 2 decimal points (100.00, 50.00 and 33.33) ... maybe something is wrong with my compiler? – JacobMarlo Oct 07 '21 at 16:14
  • 1
    blueboy and @EBDS Problem resolved, both your answers worked! I was using my new df dataframe instead of the original dataframe to apply your functions haha. Thank you both for the help cheers! – JacobMarlo Oct 07 '21 at 16:25