I have two pieces of code that seem to do the same thing, but one is a few hundred times faster than the other.
This is the first piece:
t1 = time.time()
df[new_col] = np.where(df[col] < j, val_1, val_2)
t2 = time.time()
ts.append(t2 - t1)
In ts I have values like:
0.0007321834564208984, 0.0002918243408203125, 0.0002799034118652344
In contrast, this part of the code:
t1 = time.time()
df['new_col'] = np.where((df[col] >= i1) & (df[col] < i2), val, df.new_col)
t2 = time.time()
ts.append(t2 - t1)
This produces ts populated with values like:
0.11008906364440918, 0.09556794166564941, 0.08580684661865234
I cannot figure out what the essential difference is between the first and second assignments.
In both cases df should be the same.
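To check whether the two assignments themselves differ in cost, it can help to time both on the same data frame. The following is a minimal sketch, not the original code: the column name x, the thresholds, the values, and the helper time_call are all made up for illustration, and time.perf_counter is used instead of time.time because it is better suited to short intervals.

```python
import time

import numpy as np
import pandas as pd


def time_call(fn, repeats=5):
    """Return the best wall-clock time of fn() over several repeats."""
    best = float("inf")
    for _ in range(repeats):
        t1 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t1)
    return best


# Hypothetical data: one numeric column, 100,000 rows.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.random(100_000)})
df["new_col"] = 0.0


def first_form():
    # Same shape as the first snippet: a single comparison.
    df["new_col"] = np.where(df["x"] < 0.5, 1.0, 2.0)


def second_form():
    # Same shape as the second snippet: a range test that falls back
    # to the existing column values.
    in_range = (df["x"] >= 0.2) & (df["x"] < 0.4)
    df["new_col"] = np.where(in_range, 3.0, df["new_col"])


t_first = time_call(first_form)
t_second = time_call(second_form)
print(f"first: {t_first:.6f}s, second: {t_second:.6f}s")
```

If both forms take a similar amount of time on the same frame, the large gap most likely comes from the frame being operated on rather than from the np.where expressions.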
ADDED
It turned out that the essential difference was not in the place where I was looking. In the fast version of the code I had:
df = inp_df.copy()
at the beginning of the class method (where inp_df was the input data frame of the method). In the slow version, I was operating directly on the input data frame. It became fast after I copied the input data frame and operated on the copy.
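The effect of the copy can be measured in isolation. The sketch below assumes one possible situation, which the post does not confirm: that inp_df was itself derived from a larger frame (here by boolean filtering). The names big, x, and assign_timed are hypothetical, and whether any difference shows up depends on the pandas version and on how the input frame was actually created.

```python
import time
import warnings

import numpy as np
import pandas as pd


def assign_timed(frame, repeats=5):
    """Best wall-clock time for one np.where column assignment on frame."""
    best = float("inf")
    for _ in range(repeats):
        t1 = time.perf_counter()
        frame["new_col"] = np.where(frame["x"] < 0.5, 1.0, 2.0)
        best = min(best, time.perf_counter() - t1)
    return best


rng = np.random.default_rng(0)
big = pd.DataFrame({"x": rng.random(200_000)})

# Assumption: the method's input is a filtered subset of another frame.
inp_df = big[big["x"] > 0.1]

with warnings.catch_warnings():
    # Some pandas versions warn when assigning to a derived frame.
    warnings.simplefilter("ignore")
    t_copy = assign_timed(inp_df.copy())  # work on an independent copy
    t_direct = assign_timed(inp_df)       # work on the input directly

print(f"copy: {t_copy:.6f}s, direct: {t_direct:.6f}s")
```

Copying first also keeps the method from modifying the caller's data frame, which is a reason to do it regardless of timing.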