Derived from another question, here.

I have a DataFrame with 2 million rows, similar to this:

import pandas as pd

final_df = pd.DataFrame.from_dict({
    'ts':    [0, 1, 2, 3, 4, 5],
    'speed': [5, 4, 1, 4, 1, 4],
    'temp':  [9, 8, 7, 8, 7, 8],
    'temp2': [2, 2, 7, 2, 7, 2],
    })

I need to run calculations on the values in each row and append the results as new columns, similar to the question in this link.

I know that there are a lot of repeated combinations of speed, temp, and temp2; if I drop_duplicates, the resulting DataFrame is only 50k rows long, which takes significantly less time to process using an apply function like this:

k1, k2 = 0.5, 1  # constants, matching the expected output below

def dafunc(row):
    row['r1'] = row['speed'] * row['temp'] * k1
    row['r2'] = row['speed'] * row['temp2'] * k2
    return row

nodup_df = final_df.drop_duplicates(['speed', 'temp', 'temp2'])
nodup_df = nodup_df.apply(dafunc, axis=1)

The above code is a greatly simplified version of what I actually do.

So far I've been trying to use a dictionary to store the results, with a string formed from the combination of values as the key; if the dictionary already holds the results for a combination, I fetch them instead of doing the calculations again.
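
Roughly like this (a simplified sketch of that caching idea; the tuple key and the cached_calc name are just for illustration):

# simplified sketch of the caching approach described above
cache = {}

def cached_calc(row):
    key = (row['speed'], row['temp'], row['temp2'])
    if key not in cache:
        cache[key] = (row['speed'] * row['temp'] * k1,
                      row['speed'] * row['temp2'] * k2)
    row['r1'], row['r2'] = cache[key]
    return row

final_df = final_df.apply(cached_calc, axis=1)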

Is there a more efficient way to do this using Pandas' vectorized operations?

EDIT: In the end, the resulting DataFrame should look like this:

#assuming k1 = 0.5, k2 = 1
resulting_df = pd.DataFrame.from_dict({
        'ts':     [0,1,2,3,4,5],
        'speed':  [5,4,1,4,1,4],
        'temp':   [9,8,7,8,7,8],
        'temp2':  [2,2,7,2,7,2],
        'r1':     [22.5,16,3.5,16,3.5,16],
        'r2':     [10,8,7,8,7,8],
        })
    I think this question would be easier to answer if you gave an example operation and what you wanted your output to look like for that input. – cs95 Sep 08 '17 at 23:16
  • The operation is very similar; just a lot more operations are executed for each row. I will edit the question to add the sample output I need. – ulitosCoder Sep 09 '17 at 03:51
  • Just do the operations on the deduped dataset and join it back to the original. – Paul Sep 09 '17 at 04:17
  • I need the values to appear in the same order as in the original DF, because the timestamp is important; the original DF also has other data in the rows that is not used in this example but will be used in the future. – ulitosCoder Sep 09 '17 at 05:08
  • Would it be possible to create a column of k-values? – brennan Sep 09 '17 at 11:48
  • Yes. Those k values will always be the same. But there are a lot more operations to do yet. – ulitosCoder Sep 09 '17 at 14:59
  • To restore original order after a join, put a row number column on before the join, then sort by that row number after the join. – Paul Sep 10 '17 at 02:21
  • And pandas apply is about the slowest thing you could possibly do. – Paul Sep 10 '17 at 12:53
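
A sketch of the dedupe-and-join approach Paul describes in the comments above (compute on the ~50k unique rows, merge back on the key columns, and restore the original order with a row number; the uniq and result names are illustrative):

keys = ['speed', 'temp', 'temp2']

# run the calculations on the unique combinations only
uniq = final_df[keys].drop_duplicates().copy()
uniq['r1'] = uniq['speed'] * uniq['temp'] * k1
uniq['r2'] = uniq['speed'] * uniq['temp2'] * k2

# keep a row number before the join, then sort by it afterwards
result = final_df.reset_index().merge(uniq, on=keys, how='left')
result = result.sort_values('index').set_index('index')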

1 Answer


Well, if you access the columns from the underlying numpy array by column index, it will be a lot faster, i.e.

# column order in final_df.values here is: speed, temp, temp2, ts
final_df['r1'] = final_df.values[:,0]*final_df.values[:,1]*k1
final_df['r2'] = final_df.values[:,0]*final_df.values[:,2]*k2

If you want to create multiple columns at once you can use a for loop, and the speed will be similar:

k = [0.5,1]
for i in range(1,3):
    final_df['r'+str(i)] = final_df.values[:,0]*final_df.values[:,i]*k[i-1]

If you first drop duplicates and join the results back (as discussed in the comments above), it will be much faster still.

Output:

  speed  temp  temp2  ts    r1    r2
0      5     9      2   0  22.5  10.0
1      4     8      2   1  16.0   8.0
2      1     7      7   2   3.5   7.0
3      4     8      2   3  16.0   8.0
4      1     7      7   4   3.5   7.0
5      4     8      2   5  16.0   8.0

For the small dataframe:

%%timeit
final_df['r1'] = final_df.values[:,0]*final_df.values[:,1]*k1
final_df['r2'] = final_df.values[:,0]*final_df.values[:,2]*k2

1000 loops, best of 3: 708 µs per loop

For a larger dataframe (60,000 rows):

%%timeit
ndf = pd.concat([final_df]*10000)
ndf['r1'] = ndf.values[:,0]*ndf.values[:,1]*k1
ndf['r2'] = ndf.values[:,0]*ndf.values[:,2]*k2

1 loop, best of 3: 6.19 ms per loop