I have two pandas DataFrames with the same column keys, say df1 and df2. The columns of df1 are time series, while the entries of df2 are coefficients for linear combinations of the entries of df1 on the day given by the index (the exact meaning isn't important for the logic).
I want to do the following: for each nonzero entry of df2 at position [day, "key"], compute

```python
df1["key"] * df2.at[day, "key"] + x
```

where x is some preordained vector.
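For concreteness, a toy setup with the shapes I'm working with (names and sparsity are made up for illustration):

```python
import numpy as np
import pandas as pd

n_days, n_keys = 1000, 1000
keys = [f"k{i}" for i in range(n_keys)]
days = pd.date_range("2020-01-01", periods=n_days)

# df1: each column is a time series of length n_days
df1 = pd.DataFrame(np.random.randn(n_days, n_keys), index=days, columns=keys)
# df2: coefficient table with the same index/columns, mostly zeros
df2 = pd.DataFrame(
    np.random.randn(n_days, n_keys) * (np.random.rand(n_days, n_keys) < 0.01),
    index=days, columns=keys,
)
# x: the preordained vector, same length as a column of df1
x = np.random.randn(n_days)
```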
I'm wondering what the fastest way to do this is. I'm not very familiar with vectorised operations in Python, and my Fortran instinct to reach for simple loops is obviously horrendous for performance here.
I want this as fast as possible: it is going to be run on the order of 10^5 to 10^6 times, and the tables are approximately 1000 by 1000, so the naive loops below are unbearably slow.
```python
for day in df1.index:
    row = df2.loc[day]
    # Series.nonzero() is gone in modern pandas; select the nonzero keys by mask
    for key in row.index[row.to_numpy() != 0]:
        result = df1[key] * df2.at[day, key] + x
```
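For reference, the broadcast-per-day version I'm considering as a replacement for the inner loop looks roughly like this (a sketch, assuming df1 and df2 share the same index and columns and x is a 1-D array of length n_days):

```python
import numpy as np

A = df1.to_numpy()              # (n_days, n_keys); each column is a time series
B = df2.to_numpy()              # coefficient table, mostly zeros
X = np.asarray(x)[:, None]      # column vector, so x broadcasts across keys

for i in range(B.shape[0]):
    nz = np.flatnonzero(B[i])   # positions of nonzero coefficients for this day
    # column j of `res` is df1[key_j] * df2.at[day, key_j] + x for each nonzero key_j
    res = A[:, nz] * B[i, nz] + X
```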
Is it faster to convert the DataFrames to numpy arrays and index by integer position, using the position of each key in df.columns? I already found a massive speedup simply by using df.at[day, key] instead of df.loc[day][key]. What's going on under the hood there? I assume some of the cleverness that goes into taking a slice of a DataFrame adds overhead and causes the performance problem?
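To make the comparison concrete, this is roughly the micro-benchmark I have in mind (a sketch; df2, day, and key as above, and absolute numbers will vary by machine):

```python
import timeit

day, key = df2.index[0], df2.columns[0]

# .at: scalar accessor, goes almost directly to the underlying value
t_at = timeit.timeit(lambda: df2.at[day, key], number=10_000)
# .loc[day][key]: chained indexing builds an intermediate Series for the whole row
t_loc = timeit.timeit(lambda: df2.loc[day][key], number=10_000)
# plain numpy: convert once, then index by integer position
B = df2.to_numpy()
i, j = df2.index.get_loc(day), df2.columns.get_loc(key)
t_np = timeit.timeit(lambda: B[i, j], number=10_000)

print(t_at, t_loc, t_np)
```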