
I'm doing an optimization pass on some data science code and I have found a slow method. I would like some tips to improve it. Right now I'm testing with a data frame of 43,000 rows, and it takes around 50 seconds to execute.

I have read about the methods .loc, .iloc, .at, .iat, .iterrows and .itertuples for getting better performance when iterating over a data frame, and I think they would apply here, since the method currently runs in a for loop.

    import numpy as np

    def slow_method(sliced_data_frame, labels_nd_array):
        sliced_data_frame['column5'] = -1  # creating a new column
        for label in np.unique(labels_nd_array):
            sliced_data_frame['column5'][labels_nd_array == label] = label
        return sliced_data_frame

I'm also having a hard time understanding what is happening inside that for loop with that [labels_nd_array == label]. The first part, sliced_data_frame['column5'], selects the column just created, but the next part confuses me.

RonanFelipe
    The code in the question seems useless. I guess you may get the same result with just `sliced_data_frame['column5'] = labels_nd_array`. Correct me if I'm wrong or add sample data and expected result. – Poolka Aug 30 '19 at 14:14

1 Answer


I agree with Poolka's comment: the code in the question seems to do nothing more than sliced_data_frame['column5'] = labels_nd_array. This is because, to answer your doubt about [labels_nd_array == label]: you first select the newly created column, then access its positions where labels_nd_array == label is True, and there change the value from -1 to label.
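To make that concrete, here is a minimal sketch (with made-up sample data) showing that the loop ends up writing each element's own label back into the column, so it matches the one-line direct assignment. I use .loc for the masked write instead of the chained indexing in the question, to avoid the SettingWithCopy pitfall:

```python
import numpy as np
import pandas as pd

def slow_method(sliced_data_frame, labels_nd_array):
    sliced_data_frame['column5'] = -1
    for label in np.unique(labels_nd_array):
        # boolean mask: True where the array equals the current label
        sliced_data_frame.loc[labels_nd_array == label, 'column5'] = label
    return sliced_data_frame

labels = np.array([2, 0, 1, 1, 2, 0])
df = pd.DataFrame({'column1': range(6)})

looped = slow_method(df.copy(), labels)

# Direct assignment produces the same column in one vectorized step.
direct = df.copy()
direct['column5'] = labels

assert (looped['column5'] == direct['column5']).all()
```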

In general, looping over rows should be avoided in Pandas whenever possible; even DataFrame.iterrows() creates a Series for each row. As you noticed, this topic is commonly addressed on Stack Overflow, for example. And while here you are looping over a NumPy array rather than the DataFrame itself, the loop doesn't seem necessary at all, especially considering the condition being checked at each iteration.
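As a small illustration of why row loops hurt (a toy example, not your data), compare a per-row computation via iterrows() with the equivalent column-wise operation; the results are identical, but at 43,000 rows the per-row Series construction dominates the runtime:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5), 'b': np.arange(5, 10)})

# Row by row: each iteration materializes a Series, which is slow at scale.
row_sums = [row['a'] + row['b'] for _, row in df.iterrows()]

# Vectorized: one call operating on whole columns at once.
vec_sums = (df['a'] + df['b']).tolist()

assert row_sums == vec_sums  # same result, very different cost
```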

If there are other specific reasons for iterating over rows, I suggest using DataFrame.to_numpy() (or a similar option) and working in NumPy, where iterating over rows is generally faster by default, but always try to vectorize first. Finally, once in NumPy, you can use Numba if looping over rows is really necessary and performance is a priority.
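A minimal sketch of that workflow, assuming a hypothetical per-row computation (the column names and the product operation are just placeholders): drop to a plain ndarray once, write the loop as an ordinary function, and only then consider compiling it with Numba:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [4.0, 5.0, 6.0]})

# Convert once; all per-row work then happens on a contiguous ndarray.
arr = df.to_numpy()

def row_products(a):
    out = np.empty(len(a))
    for i in range(len(a)):  # explicit loop, but over raw NumPy rows
        out[i] = a[i, 0] * a[i, 1]
    return out

# If this loop is still the bottleneck, numba.njit(row_products) can
# compile it and the call site stays the same (assumes numba is installed).
result = row_products(arr)
assert np.allclose(result, df['x'] * df['y'])
```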

Peruz