
I am playing around with a dataset of weather data (to reproduce: the data can be found here; unzip it and run the code below), and I wanted to normalize it. To do this, I tried the second answer to this question:

Normalize columns of pandas data frame

Which boils down to normalized_df=(df-df.mean(axis=0))/df.std(axis=0)
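For illustration, here is a minimal sketch of that formula on a toy frame. The data and column names are made up for the example, and restricting the calculation to numeric columns with `select_dtypes` is an assumption of this sketch, not part of the linked answer:

```python
import pandas as pd

# Hypothetical toy data; values and column names are illustrative only.
df = pd.DataFrame({
    "Date Time": ["01.01.2009 00:10:00", "01.01.2009 00:20:00", "01.01.2009 00:30:00"],
    "T (degC)": [-8.02, -8.41, -8.51],
    "p (mbar)": [996.52, 996.57, 996.53],
})

# Keep only numeric columns so the string column is not part of the arithmetic.
numeric = df.select_dtypes(include="number")

# Column-wise z-score: subtract each column's mean, divide by its standard deviation.
normalized_df = (numeric - numeric.mean(axis=0)) / numeric.std(axis=0)
print(normalized_df)
```

After this, each numeric column should have a mean of roughly 0 and a sample standard deviation of roughly 1.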

However, this code takes a very long time to execute. I started investigating, and it seems that the time the df.mean() call takes increases exponentially with the number of rows.

I've used the following code to test run-times:

import pandas as pd
import time

jena_climate_df = pd.read_csv("jena_climate_2009_2016.csv")
start = time.time()
print(jena_climate_df[:200000].mean(axis=0)) #Modify the number of rows here to observe the increase in time
stop = time.time()
print(f"{stop-start} Seconds for mean calc")

I ran some tests, gradually increasing the number of rows used for the mean calculation. See the results below:

0.004987955093383789 Seconds for mean calc ~ 10 observations
0.009006738662719727 Seconds for mean calc ~ 1000 observations
0.0837397575378418 Seconds for mean calc ~ 10000 observations
1.789750337600708 Seconds for mean calc ~ 50000 observations
7.518809795379639 Seconds for mean calc ~ 60000 observations
19.989460706710815 Seconds for mean calc ~ 70000 observations
71.97900629043579 Seconds for mean calc ~ 100000 observations
375.04513001441956 Seconds for mean calc ~ 200000 observations

It seems to me that the time is increasing exponentially. I don't know why this is happening; AFAIK, adding all values and dividing by the number of observations shouldn't be too computationally intensive, but maybe I am wrong here. Some explanation would be greatly appreciated!

Psychotechnopath
  • notice that ```axis=0``` means to calculate mean of every row and not column. Was that your intention? I guess not if your goal is to normalize. I know it is not an answer to your question, but take notice of that. Also, you can normalize using ```sklearn.preprocessing.StandardScaler``` – Roim May 11 '20 at 10:42
  • I know, but when I don't specify an argument (e.g. `normalized_df=(df-df.mean())/df.std()`) it also does this right? So the answer to the other question is also calculating means row-wise. Isn't that what normalization is supposed to do? – Psychotechnopath May 11 '20 at 11:07
  • Normalization should be feature-wise, that is, per column in your case. We normalize the data to be able to compare between features, for example in classifiers like KNN, which are sensitive to distances. Let's say you have a 2x2 set, one row filled with 1 and the second row filled with 2. If you normalize each row by its own mean, both rows end up filled with zeros. That is not something you want to do – Roim May 11 '20 at 11:55
  • Sorry, I made a mistake. ```axis=0``` is the right way to use it. It returns the mean for each column – Roim May 11 '20 at 12:01
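
The axis semantics discussed in these comments can be checked on the 2x2 example mentioned above. This is only a small sketch; the column names `x` and `y` are hypothetical:

```python
import pandas as pd

# 2x2 frame: the first row is filled with 1, the second row with 2.
df = pd.DataFrame([[1, 1], [2, 2]], columns=["x", "y"])

# axis=0 reduces along the rows, giving one mean per column.
col_means = df.mean(axis=0)

# axis=1 reduces along the columns, giving one mean per row.
row_means = df.mean(axis=1)

print(col_means)
print(row_means)
```

Here `col_means` is indexed by the column names and `row_means` by the row labels, which shows that `axis=0` is the per-column (feature-wise) mean.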

1 Answer


I did some tests, and it seems that the culprit, in this case, is "Date Time" - the non-numeric column.

First, when calculating the mean for each column on its own, there's clearly no exponential behavior (in the chart, the x-axis is the number of rows and the y-axis is time).

[Chart: time to compute the mean of individual columns vs. number of rows]

Second, I then tried to calculate means for the entire data frame in the following three scenarios (each with 80K rows), and timed it with %%timeit:

  • The full data frame, jena_climate_df[0:80000].mean(axis=0): 10.2 seconds.
  • Setting the date/time column as the index, jena_climate_df.set_index("Date Time")[0:80000].mean(axis=0): 40 ms (about 0.4% of the first test).
  • And finally, dropping the date/time column, jena_climate_df.drop("Date Time", axis=1)[0:80000].mean(axis=0): 19.8 ms (about 0.2% of the first test).
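
As a sketch of a related option, `DataFrame.mean` accepts a `numeric_only` argument, which skips non-numeric columns without having to drop them first. The synthetic frame below is an assumption made for illustration (it only mimics the shape of the real data), and how pandas treats string columns when `numeric_only` is not set differs between versions, so the slow path may not reproduce everywhere:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data: one string column, two numeric columns.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "Date Time": pd.date_range("2009-01-01", periods=n, freq="10min").strftime("%d.%m.%Y %H:%M:%S"),
    "T (degC)": rng.normal(size=n),
    "p (mbar)": rng.normal(loc=990.0, size=n),
})

# Ask pandas to consider numeric columns only.
means_numeric_only = df.mean(axis=0, numeric_only=True)

# Equivalent result obtained by dropping the string column first.
means_dropped = df.drop("Date Time", axis=1).mean(axis=0)

print(means_numeric_only)
```

Both approaches should give the same means for the numeric columns; `numeric_only=True` just avoids changing the frame.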

Hope this helps.

Nicolas
Roy2012
  • Of course it was the non-numeric DateTime column! How could I have missed something so obvious. TYVM. – Psychotechnopath May 12 '20 at 17:54
  • @Psychotechnopath my pleasure. You may want to change the subject line of the question to refer to pandas performance with non-numeric types, so others can find it in the future. – Roy2012 May 12 '20 at 18:29