I have a pandas dataframe that can vary in length but will have ~100 columns and anywhere from 40,000 to a couple hundred thousand rows.
We are analyzing automotive engine cylinder events, so each row is a cylinder event and the columns are various statistics associated with these events such as kurtosis, std deviation, etc. Each of these stats have 8 columns (one for each cylinder). The example table below shows just one of these statistics but the others follow the same pattern.
timestamps | Analysis_Cylinder | kurtosis_Tree[CylinderA] | kurtosis_Tree[CylinderB] | kurtosis_Tree[CylinderC] |
---|---|---|---|---|
0 | CylinderA | 1.7876638174057 | 2.2034273147583 | 1.73644828796387 |
0.0060031 | CylinderB | 1.7876638174057 | 2.2034273147583 | 1.73644828796387 |
0.0120792 | CylinderC | 1.7876638174057 | 2.2034273147583 | 1.73644828796387 |
This data is then fed into a decision tree function. And the stats that correspond to the "Analysis Tree" column are copied to a "results" dataframe.
My problem is that it is currently taking ~5+ minutes to get through the whole dataframe. I know .iterrows is not the fastest but I'm not sure what a faster way to do this would be.
Here's my current python code:
result = pd.DataFrame(columns=('Timestamp', 'Analysis Cylinder', 'Kurtosis', 'IMEP', 'Finite Difference', 'Fundamental Frequency', 'Standard Deviation', 'Actual Tree Node', 'Expected Tree Node', 'Expected Class'))
for index, rows in stats.iterrows():
data = []
#data_columns = ['Timestamp', 'Analysis Cylinder', 'Actual Tree Node', 'Expected Tree Node', 'Expected Class']
analysis_cyl = rows['Analysis_Cylinder']
expected_tree, expected_class = get_expected_decision_tree(analysis_cyl, num_nodes, calibrations, rows)
tree_node_label = [col for col in rows.index if ('TreeNodeID_Cyl' in col and analysis_cyl in col)]
kurtosis_col = [col for col in rows.index if ('Kurtosis' in col and 'Tree' in col and analysis_cyl in col)]
kurtosis_value = rows[kurtosis_col[0]]
imep_col = [col for col in rows.index if ('IMEP' in col and 'Tree' in col and analysis_cyl in col)]
imep_value = rows[imep_col[0]]
fd_col = [col for col in rows.index if ('FiniteDiff' in col and 'Tree' in col and analysis_cyl in col)]
fd_value = rows[fd_col[0]]
freq_col = [col for col in rows.index if ('FundamentalFrequency' in col and 'Tree' in col and analysis_cyl in col)]
freq_value = rows[freq_col[0]]
sd_col = [col for col in rows.index if ('Std' in col and 'Tree' in col and analysis_cyl in col)]
sd_value = rows[sd_col[0]]
data = [rows['timestamps'], analysis_cyl, kurtosis_value, imep_value, fd_value, freq_value, sd_value, rows[tree_node_label][0], expected_tree, expected_class]
result.loc[index] = data
Thanks!