2

I have two pandas data frames with the same column keys, e.g df1, df2. The columns of df1 are time series whereas the entries of df2 relate to linear combinations of the entries of df1 on the day given by the index (not very important for the logic).

I want to do the following: for each nonzero entry element in df2 with position e.g [day,"key"], compute

df1["key"]*df2[day,"key"]+x 

where x is some preordained vector.

I'm wondering what the fastest way to do this is. I'm not very familiar with vectorisation operations in Python & my instinct from Fortran to use simple loops is obviously horrendous for performance here.

I want to do this as fast as possible: this is going to be run most likely 10^5/10^6 times and the tables are approximately 1000 by 1000, so naive loops are unbearably slow.

    for day in df1.index:

        for key in df2.loc[day].nonzero():

            df1[key]*df2.at[day,key]+x

Is it faster to convert from the dataframes to a numpy array and index by using the index of the keys in df.columns? I discovered a massive speedup simply by using df.at[] vs df.loc[][]. What's going on under the hood here? I assume it must be some of the cleverness that goes into taking a slice of a dataframe causing bloat and performance issues?

kenlukas
  • 3,616
  • 9
  • 25
  • 36
D.Dog
  • 21
  • 3
  • have you try merge ? – BENY Jun 27 '19 at 21:38
  • do you mean merging the two dataframes and then running an apply? I think this would be very confusing to keep track of the indices as I want to be able to consider the columns of df2 as vectors – D.Dog Jun 27 '19 at 21:41
  • Please see [how to create good, reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and provide a [mcve] including sample input and output – G. Anderson Jun 27 '19 at 21:48

0 Answers0