3

Hello everyone

For a school project, I am stuck with the duration of an operation with Pandas Dataframe.

I have one dataframe df which shape is (250 000 000, 200). This dataframe contains values of variable describing the behaviours of sensors on a machine. They are organized by 'Cycle' (everytime the machine begins a new cycle, this variable is incremented by one). And in this cycle, 'CycleTime' describes the position of the row within the 'Cycle'.

In the 'mean' DataFrame, I compute the mean of each variables group by the 'CycleTime'

The 'anomaly_matrix' DataFrame represents the global anomaly of each cycle which is the sum of the square difference of each rows belonging to the Cycle with the mean of corresponding CycleTime.

An example of my code is below

df = pd.DataFrame({'Cycle': [0, 0, 0, 1, 1, 1, 2, 2], 'CycleTime': [0, 1, 2, 0, 1, 2, 0, 1], 'variable1': [0, 0.5, 0.25, 0.3, 0.4, 0.1, 0.2, 0.25], 'variable2':[1, 2, 1, 1, 2, 2, 1, 2], 'variable3': [100, 5000, 200, 900, 100, 2000, 300, 300]})
mean = df.drop(['Cycle'], axis = 1).groupby("CycleTime").agg('mean')
anomali_matrix = df.drop(['CycleTime'], axis = 1).groupby("Cycle").agg('mean')
anomaly_matrix = anomali_matrix - anomali_matrix

for index, row in df.iterrows():
    cycle = row["Cycle"]
    time = row["CycleTime"]
    anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2


>>>anomaly_matrix
   variable1    variable2   variable3
Cycle           
0   0.047014    0.25       1.116111e+07
1   0.023681    0.25       3.917778e+06
2   0.018889    0.00       2.267778e+06

This calculation for my (250 000 000, 200) DataFrame last 6 hours, it is due to anomaly_matrix.loc[cycle] += (row - mean.loc[time])**2

I tried to improve by using an apply function but I do not succeed in adding other DataFrame in that apply function. Same thing trying to vectorize pandas.

Do you have any idea how to accelerate this process ? Thanks

1 Answers1

1

You can use:

df1 = df.set_index(['Cycle', 'CycleTime'])

mean = df1.sub(df1.groupby('CycleTime').transform('mean'))**2
df2 = mean.groupby('Cycle').sum()
print (df2)
       variable1  variable2     variable3
Cycle                                    
0       0.047014       0.25  1.116111e+07
1       0.023681       0.25  3.917778e+06
2       0.018889       0.00  2.267778e+06
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • @laurentBimont - I use your solution, only omit `anomaly_matrix = anomali_matrix - anomali_matrix`, because it dont have sense. – jezrael Feb 23 '18 at 14:55
  • Thanks for your solution, I update my post with the result of anomaly_matrix. It does not seem to be equal to your df2. I am going to study more your solution to figure out if it can work. – laurent Bimont Feb 23 '18 at 14:59
  • @laurentBimont - thank you, then solution is more simplify ;) – jezrael Feb 23 '18 at 15:03