How to improve the performance of looping a large dataframe over a large dataset

Question

I have a time point dataset with 1,174,697 rows and a dataframe with 876,923 rows. The dataframe has the following columns: time, target, type.

I want to iterate over the dataframe so that, for each row, I compare its "time" with the time points in the dataset, find all time points equal to "time", and from those pick the "target"-th one. For example, if 5 time points have the same value as "time" and target equals 3, it picks the 4th one counting from the beginning, because target acts as a 0-based index.

My code is below. The problem is that the two nested loops take forever. How can I improve the performance?
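
To make the selection rule concrete, here is a minimal sketch on made-up data. The array values and the helper name `pick_index` are illustrative assumptions, not part of the real dataset or code.

```python
import numpy as np

# Toy stand-in for the time point dataset (values are made up).
timepoints = np.array([10, 20, 20, 30, 20, 20, 20])

def pick_index(timepoints, time, target):
    """Return the position of the target-th (0-based) entry equal to `time`."""
    # Positions of all time points equal to `time`, in original order.
    matches = np.flatnonzero(timepoints == time)
    return int(matches[target])

# Five entries equal 20 (positions 1, 2, 4, 5, 6); target 3 picks the 4th.
result = pick_index(timepoints, time=20, target=3)
print(result)
```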

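    import pickle as pk
    import pandas as pd

    # `file` (presumably an open HDF5 file) and `track_df` are defined elsewhere.
    timepoint_ds = file['/timepoints']
    df = track_df.loc[:, ['time', 'target', 'type']]

    rows = []
    for index, row in df.iterrows():
        print("---Row--------------:", index)
        hdf_index = int(row["target"])
        label = row["type"]
        time = row["time"]

        # Collect the positions of every time point equal to this row's time.
        # This inner loop scans the whole dataset once per row, which is the slow part.
        image_index_list = []
        for i, value in enumerate(timepoint_ds):
            if value == time:
                image_index_list.append(i)

        # Keep the "target"-th match (0-based) together with the row's label.
        rows.append({'index': image_index_list[hdf_index], 'label': label})

    # Build the result once instead of creating a one-row DataFrame per iteration.
    label_imgindex_df = pd.DataFrame(rows)

    with open('/home/usr/label_imgindex_df.pkl', 'wb') as f:
        pk.dump(label_imgindex_df, f)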

I have figured out the solution: I have to remove the second loop and use numpy.where instead to find the indices whose value equals the time. - ga97rasl, Sep 13 '16 at 08:09
@ga97rasl, you should have provided a sample data set and a desired data set. You should avoid looping through your data sets when working with Pandas; it is extremely slow compared to vectorized methods. If you post a sample data set (DF) and the expected DF in your question, there is a high probability that you will get a proper answer that uses a vectorized approach. - MaxU - stand with Ukraine, Sep 13 '16 at 10:36
[how-to-make-good-reproducible-pandas-examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) - MaxU - stand with Ukraine, Sep 13 '16 at 10:37
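
Following the comments above, one possible vectorized approach goes a step beyond numpy.where and removes both loops: sort the time points once with a stable sort, then use numpy.searchsorted to locate each row's run of equal values. This is only a sketch under assumptions: the time points fit in memory as a 1-D NumPy array, "target" holds integers, and the function name `lookup_image_indices` and the -1 marker for rows without a valid match are hypothetical choices (the original loop would raise an IndexError in that case).

```python
import numpy as np
import pandas as pd

def lookup_image_indices(timepoints, df):
    """For each row of df, find the target-th position where timepoints == time."""
    timepoints = np.asarray(timepoints)

    # A stable sort keeps equal time points in their original relative order,
    # so the k-th element of a run is the k-th occurrence in the dataset.
    order = np.argsort(timepoints, kind="stable")
    sorted_tp = timepoints[order]

    times = df["time"].to_numpy()
    targets = df["target"].to_numpy().astype(np.int64)

    # [left, right) is the run of sorted entries equal to each row's time.
    left = np.searchsorted(sorted_tp, times, side="left")
    right = np.searchsorted(sorted_tp, times, side="right")

    # Position of the target-th match; valid only if it stays inside the run.
    pos = left + targets
    valid = (targets >= 0) & (pos < right)

    # Hypothetical convention: -1 marks rows with no valid match.
    image_index = np.full(len(df), -1, dtype=np.int64)
    image_index[valid] = order[pos[valid]]

    return pd.DataFrame(
        {"index": image_index, "label": df["type"].to_numpy()},
        index=df.index,
    )

# Made-up sample data, only for illustration.
timepoints = np.array([10, 20, 20, 30, 20, 20, 20])
df = pd.DataFrame({
    "time": [20, 10, 30, 20],
    "target": [3, 0, 0, 0],
    "type": ["a", "b", "c", "d"],
})
result = lookup_image_indices(timepoints, df)
print(result)
```

The sort costs O(n log n) once and each row then needs only two binary searches, instead of a full scan of the dataset per row.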
