1

I'm trying to get real prices for my data in pandas. Right now, I am just playing with one year's worth of data (3962050 rows) and it took me 443 seconds to inflate the values using the code below. Is there a quicker way to find the real value? Is it possible to use pooling? I have many more years and if would take too long to wait every time.

Portion of df:
    year    quarter fare
0   1994    1      213.98
1   1994    1      214.00   
2   1994    1      214.00
3   1994    1      214.50 
4   1994    1      214.50   
import cpi
import pandas as pd

def inflate_column(data, column):
    """
    Adjust for inflation the series of values in column of the   
    dataframe data. Using cpi library.
    """
    print('Beginning to inflate ' + column)
    start_time = time.time()
    
    df = data.apply(lambda x: cpi.inflate(x[column], 
                      x.year), axis=1)
    
    print("Inflating process took", time.time() - start_time, " seconds to run")  
    return df

df['real_fare'] = inflate_column(df, 'fare')
Roim
  • 2,986
  • 2
  • 10
  • 25
Danlo9
  • 133
  • 3
  • 12

1 Answers1

2

You have multiple values for each year: you can just call one for every year, store it in dict and then use the value instead of calling to cpi.inflate everytime.

all_years = df["year"].unique()
dict_years = {}
for year in all_years:
    dict_years[year] = cpi.inflate(1.0, year)

df['real_fare'] = # apply here: dict_years[row['year']]*row['fare'] 

You can fill the last line using apply, or try do it in some other way like df['real_fare']=df['fare']*...

Trenton McKinney
  • 56,955
  • 33
  • 144
  • 158
Roim
  • 2,986
  • 2
  • 10
  • 25
  • `df['real_fare'] = df.apply(dict_years[df.year]*df.fare)` What am I doing wrong here? – Danlo9 Jul 15 '20 at 17:55
  • apply works on each row/column, you can't use `df.year` inside of it. If you choose to use apply, do it with lambda function (like in your original code, but use dict_years instead of cpi.inflate) – Roim Jul 15 '20 at 19:03
  • It seems that df.aply() is taking VERY long time to calculate the inflation fares (It took 3.5 minutes to calculate one year's worth of data and I have almost 30 years). Is there a faster way? How would your alternative solution (not using apply) look like? I have `df['real_fare'] = df['fare'] * dict_years[df['year']]`, but I know this doesn't work. How do I get the year from that particular row to search for the multiplier from dict_years? – Danlo9 Jul 16 '20 at 15:33
  • You can use `.loc` inside a loop. I tried it in my pc and it seems it's about the same time. Another option is to use more cores: https://stackoverflow.com/questions/45545110/make-pandas-dataframe-apply-use-all-cores – Roim Jul 16 '20 at 16:32