Accelerate a pandas operation involving several dataframes

Question

Hello everyone

For a school project, I am stuck with the duration of an operation with Pandas Dataframe.

I have one dataframe df which shape is (250 000 000, 200). This dataframe contains values of variable describing the behaviours of sensors on a machine. They are organized by 'Cycle' (everytime the machine begins a new cycle, this variable is incremented by one). And in this cycle, 'CycleTime' describes the position of the row within the 'Cycle'.

In the 'mean' DataFrame, I compute the mean of each variables group by the 'CycleTime'

The 'anomaly_matrix' DataFrame represents the global anomaly of each cycle which is the sum of the square difference of each rows belonging to the Cycle with the mean of corresponding CycleTime.

An example of my code is below

df = pd.DataFrame({'Cycle': [0, 0, 0, 1, 1, 1, 2, 2], 'CycleTime': [0, 1, 2, 0, 1, 2, 0, 1], 'variable1': [0, 0.5, 0.25, 0.3, 0.4, 0.1, 0.2, 0.25], 'variable2':[1, 2, 1, 1, 2, 2, 1, 2], 'variable3': [100, 5000, 200, 900, 100, 2000, 300, 300]})
mean = df.drop(['Cycle'], axis = 1).groupby("CycleTime").agg('mean')
anomali_matrix = df.drop(['CycleTime'], axis = 1).groupby("Cycle").agg('mean')
anomaly_matrix = anomali_matrix - anomali_matrix

for index, row in df.iterrows():
    cycle = row["Cycle"]
    time = row["CycleTime"]
    anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2


>>>anomaly_matrix
   variable1    variable2   variable3
Cycle           
0   0.047014    0.25       1.116111e+07
1   0.023681    0.25       3.917778e+06
2   0.018889    0.00       2.267778e+06

This calculation for my (250 000 000, 200) DataFrame last 6 hours, it is due to anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2

I tried to improve by using an apply function but I do not succeed in adding other DataFrame in that apply function. Same thing trying to vectorize pandas.

Do you have any idea how to accelerate this process ? Thanks

Another name for "sum of the squared difference from the mean" is `variance`. Perhaps there are some relevant optimized functions? — tehhowch, Feb 23 '18 at 13:16
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.var.html — forayer, Feb 23 '18 at 13:29
`anomaly_matrix = anomali_matrix - anomali_matrix` is correct? — jezrael, Feb 23 '18 at 13:42
anomaly_matrix = anomali_matrix - anomali_matrix is certainly a bad way to set all values of my dataframe to 0 — laurent Bimont, Feb 23 '18 at 15:00
I tried with optimized function but I did not get good results for the moment. I am going to persevere in that way. — laurent Bimont, Feb 23 '18 at 15:04

jezrael · Accepted Answer · 2018-02-23T15:02:35.500

1

You can use:

df1 = df.set_index(['Cycle', 'CycleTime'])

mean = df1.sub(df1.groupby('CycleTime').transform('mean'))**2
df2 = mean.groupby('Cycle').sum()
print (df2)
       variable1  variable2     variable3
Cycle                                    
0       0.047014       0.25  1.116111e+07
1       0.023681       0.25  3.917778e+06
2       0.018889       0.00  2.267778e+06

edited Feb 23 '18 at 15:02

answered Feb 23 '18 at 13:37

jezrael

822,522
95
1,334
1,252

@laurentBimont - I use your solution, only omit `anomaly_matrix = anomali_matrix - anomali_matrix`, because it dont have sense. – jezrael Feb 23 '18 at 14:55
Thanks for your solution, I update my post with the result of anomaly_matrix. It does not seem to be equal to your df2. I am going to study more your solution to figure out if it can work. – laurent Bimont Feb 23 '18 at 14:59
@laurentBimont - thank you, then solution is more simplify ;) – jezrael Feb 23 '18 at 15:03

Accelerate a pandas operation involving several dataframes

1 Answers1