Suppose we have a function bar(df) that, given a DataFrame df, returns a NumPy array of length len(df).
Now consider the idiom:
    def foo(df):
        for i in range(N):
            df['FOO_' + str(i)] = bar(df)
        return df
After a recent pandas update, this code started to emit the following warning:
    PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
As far as I understand, one way to mitigate this is to change the above code to the following idiom:
    def foo2(df):
        frames = [df]
        for i in range(N):
            # Name each Series so the new columns match those produced by foo.
            frames.append(pd.Series(bar(df), index=df.index, name='FOO_' + str(i)))
        return pd.concat(frames, axis=1)
The above code fixes the warning, but results in much worse execution times:
    In [110]: %timeit foo()
    1.73 s ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    In [111]: %timeit foo2()
    2.51 s ± 25.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A fix that suppresses a performance warning but introduces extra overhead seems silly. Therefore my question is:
How can I fix the warning while also obtaining better performance? In other words, is there a way to improve the function foo2 so that it performs better than foo?
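
For concreteness, here is one variant that may be worth timing against foo and foo2. It is only a sketch under stated assumptions: the names foo3, bar, and N below are hypothetical stand-ins (the real bar is not shown here), and it assumes, as foo2 does, that bar does not need to see the previously added FOO_ columns. Instead of wrapping each result in its own Series, it stacks all results into a single 2-D NumPy array, builds one DataFrame from it, and calls pd.concat once with just two frames. Whether this is actually faster depends on bar and on the shape of df, so it needs to be measured.

```python
import numpy as np
import pandas as pd

N = 200  # hypothetical number of new columns


def bar(df):
    # Hypothetical stand-in for the real bar: returns one value per row.
    return df.to_numpy().sum(axis=1)


def foo3(df, n=N):
    # Compute every new column up front; bar only ever sees the original df.
    values = np.column_stack([bar(df) for _ in range(n)])
    new_cols = pd.DataFrame(
        values,
        index=df.index,
        columns=['FOO_' + str(i) for i in range(n)],
    )
    # A single concat of two frames adds all new columns in one step.
    return pd.concat([df, new_cols], axis=1)


df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), columns=list('abc'))
out = foo3(df)
```

Unlike foo, this returns a new DataFrame and leaves the input df unmodified, so callers would need to use the return value.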