
I have a pandas dataframe that can vary in length but will have ~100 columns and anywhere from 40,000 to a couple hundred thousand rows.

We are analyzing automotive engine cylinder events, so each row is a cylinder event and the columns are various statistics associated with these events, such as kurtosis, standard deviation, etc. Each of these stats has 8 columns (one for each cylinder). The example table below shows just one of these statistics, but the others follow the same pattern.

    timestamps  Analysis_Cylinder  kurtosis_Tree[CylinderA]  kurtosis_Tree[CylinderB]  kurtosis_Tree[CylinderC]
    0           CylinderA          1.7876638174057           2.2034273147583           1.73644828796387
    0.0060031   CylinderB          1.7876638174057           2.2034273147583           1.73644828796387
    0.0120792   CylinderC          1.7876638174057           2.2034273147583           1.73644828796387
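For reference, the example table can be reproduced as a small dataframe (values and column names are copied straight from the table above; the real data has ~100 columns and 8 cylinders per statistic):

```python
import pandas as pd

# Toy reproduction of the example table above; the real dataframe has
# ~100 columns (8 cylinders per statistic) and 40k+ rows.
stats = pd.DataFrame({
    "timestamps": [0, 0.0060031, 0.0120792],
    "Analysis_Cylinder": ["CylinderA", "CylinderB", "CylinderC"],
    "kurtosis_Tree[CylinderA]": [1.7876638174057] * 3,
    "kurtosis_Tree[CylinderB]": [2.2034273147583] * 3,
    "kurtosis_Tree[CylinderC]": [1.73644828796387] * 3,
})
```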

This data is then fed into a decision tree function, and the stats that correspond to the row's `Analysis_Cylinder` value are copied to a "results" dataframe.

My problem is that it currently takes 5+ minutes to get through the whole dataframe. I know `.iterrows` is not the fastest, but I'm not sure what a faster way to do this would be.

Here's my current python code:

    result = pd.DataFrame(columns=('Timestamp', 'Analysis Cylinder', 'Kurtosis', 'IMEP', 'Finite Difference', 'Fundamental Frequency', 'Standard Deviation', 'Actual Tree Node', 'Expected Tree Node', 'Expected Class'))
    for index, rows in stats.iterrows():
        analysis_cyl = rows['Analysis_Cylinder']
        expected_tree, expected_class = get_expected_decision_tree(analysis_cyl, num_nodes, calibrations, rows)

        # Find the columns that belong to the cylinder being analyzed
        tree_node_label = [col for col in rows.index if ('TreeNodeID_Cyl' in col and analysis_cyl in col)]

        kurtosis_col = [col for col in rows.index if ('Kurtosis' in col and 'Tree' in col and analysis_cyl in col)]
        kurtosis_value = rows[kurtosis_col[0]]

        imep_col = [col for col in rows.index if ('IMEP' in col and 'Tree' in col and analysis_cyl in col)]
        imep_value = rows[imep_col[0]]

        fd_col = [col for col in rows.index if ('FiniteDiff' in col and 'Tree' in col and analysis_cyl in col)]
        fd_value = rows[fd_col[0]]

        freq_col = [col for col in rows.index if ('FundamentalFrequency' in col and 'Tree' in col and analysis_cyl in col)]
        freq_value = rows[freq_col[0]]

        sd_col = [col for col in rows.index if ('Std' in col and 'Tree' in col and analysis_cyl in col)]
        sd_value = rows[sd_col[0]]

        data = [rows['timestamps'], analysis_cyl, kurtosis_value, imep_value, fd_value, freq_value, sd_value, rows[tree_node_label[0]], expected_tree, expected_class]
        result.loc[index] = data

Thanks!

mwglow
  • I tried running your code but got an error `KeyError: 'Analysis_Cylinder'`. With just a few tweaks I think you'll have a simple and reproducible input, code, and output that will help people answer this question – mitoRibo Mar 02 '22 at 19:20
  • Without looking at it too closely, you will probably need to vectorize `get_expected_decision_tree`. Can you run cProfile on this to check if `get_expected_decision_tree` is slow or this loop? – xjcl Mar 02 '22 at 19:24
  • isn't `iterrows` a bad practice, especially when performance is a concern? – Lei Yang Mar 03 '22 at 01:32

1 Answer

  1. Divide your df into N sub-dfs, keeping only the rows and columns relevant to one cylinder.

  2. Apply your for-loop to each sub-df. Inside the loop you can then drop all of the `[col for col in ...]` searches, because only the relevant columns exist in the sub-df.

  3. Refer to this SO post for a performance discussion around `iterrows`. `itertuples` can be an immediate improvement, but vectorization is certainly the best. How far you can vectorize depends on `get_expected_decision_tree`, which is a black box to us, so I cannot comment further.

  4. As an alternative, to use `apply` instead of `iterrows`, you may restructure your df so that you have one column per stat (instead of 3 columns for the same kurtosis_Tree). This is equivalent to `pd.concat`-ing the resulting sub-dfs from step 1 above.

Raymond Kwok