I am playing around with a dataset of weather data (to reproduce: the data can be found here; unzip it and run the code below), and I wanted to normalize the data. To do this, I tried the second answer to this question:
Normalize columns of pandas data frame
which boils down to:

normalized_df = (df - df.mean(axis=0)) / df.std(axis=0)
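To make it concrete, this is what that normalization does on a tiny, made-up DataFrame (the column names below are just placeholders, not from the actual dataset):

import pandas as pd

toy_df = pd.DataFrame({"temp": [1.0, 2.0, 3.0], "pressure": [10.0, 20.0, 30.0]})
normalized_toy = (toy_df - toy_df.mean(axis=0)) / toy_df.std(axis=0)
print(normalized_toy)  # every column now has mean 0 and (sample) std 1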
However, it takes a very long time to execute this code on the full dataset. Therefore, I started investigating, and it seems that the time the df.mean() call takes increases exponentially with the number of rows.
I've used the following code to test run-times:
import pandas as pd
import time
jena_climate_df = pd.read_csv("jena_climate_2009_2016.csv")
start = time.time()
print(jena_climate_df[:200000].mean(axis=0))  # modify the number of rows here to observe the increase in time
stop = time.time()
print(f"{stop - start} Seconds for mean calc")
I ran some tests, gradually increasing the number of rows used for the mean calculation. See the results below (a loop that collects all of these timings in one run is sketched after the results):
0.004987955093383789 Seconds for mean calc ~ 10 observations
0.009006738662719727 Seconds for mean calc ~ 1000 observations
0.0837397575378418 Seconds for mean calc ~ 10000 observations
1.789750337600708 Seconds for mean calc ~ 50000 observations
7.518809795379639 Seconds for mean calc ~ 60000 observations
19.989460706710815 Seconds for mean calc ~ 70000 observations
71.97900629043579 Seconds for mean calc ~ 100000 observations
375.04513001441956 Seconds for mean calc ~ 200000 observations
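For completeness, here is a sketch of how these timings can be collected in one run; it just wraps the snippet above in a loop over the same row counts:

import time
import pandas as pd

jena_climate_df = pd.read_csv("jena_climate_2009_2016.csv")
for n in (10, 1000, 10000, 50000, 60000, 70000, 100000, 200000):
    start = time.time()
    _ = jena_climate_df[:n].mean(axis=0)  # same call as above, result discarded
    stop = time.time()
    print(f"{stop - start} Seconds for mean calc ~ {n} observations")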
It seems to me that the time is increasing exponentially. I don't know why this is happening; AFAIK, adding up all the values and dividing by the number of observations shouldn't be too computationally intensive, but maybe I am wrong here. Some explanation would be greatly appreciated!
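To illustrate why I expect the summation itself to be cheap, here is a quick check one could run: the same mean, but computed directly on the underlying NumPy array of the numeric columns (the select_dtypes call is not part of my original code; it is only there to get a plain float array):

import time
import pandas as pd

jena_climate_df = pd.read_csv("jena_climate_2009_2016.csv")
values = jena_climate_df[:200000].select_dtypes("number").to_numpy()  # numeric columns only, as a float array

start = time.time()
print(values.mean(axis=0))  # plain NumPy mean over the same 200000 rows
stop = time.time()
print(f"{stop - start} Seconds for NumPy mean calc")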