Loop on dataframe takes a lot of time

Question

The dataframe subset feature is being used in a for loop across the dataframe rows. The result seems accurate however, the time taken to complete the loop on 2000 odd rows is more than 4 minutes. Any advice or guidance on the quality of the code?

Datasets:

DF1 input   customer_id 31-12-2019 00:00    31-12-2018 00:00    31-12-2017 00:00    31-12-2016 00:00    31-12-2015 00:00    31-12-2014 00:00    31-12-2013 00:00    31-12-2012 00:00    31-12-2011 00:00    31-12-2010 00:00
    70464016                                        
    70453975                                        
    79983381                                        
    76615995                                        
    73543785                                        
    78226476                                        
    70117143                                        
    76448285                                        
    73980212                                        
    74540790    

File input
upload_date customer_id date    rating  rating_agency
05-03-2019  70464016    31-Dec-18   3   INTERNAL
05-03-2019  70453975    31-Dec-18   4+  INTERNAL
05-03-2019  79983381    31-Dec-18   3   INTERNAL
05-03-2019  76615995    31-Dec-18   4   INTERNAL
05-03-2019  73543785    31-Dec-18   4   INTERNAL
05-03-2019  78226476    31-Dec-18   4   INTERNAL
05-03-2019  70117143    31-Dec-18   4-  INTERNAL
05-03-2019  76448285    31-Dec-18   4-  INTERNAL
05-03-2019  73980212    31-Dec-18   5   INTERNAL
05-03-2019  74540790    31-Dec-18   5   INTERNAL
05-03-2019  76241783    31-Dec-18   5   INTERNAL
05-03-2019  76323368    31-Dec-18   5+  INTERNAL
05-03-2019  70732832    31-Dec-18   5   INTERNAL
05-03-2019  70453263    31-Dec-18   4-  INTERNAL
05-03-2019  73807515    31-Dec-18   5   INTERNAL
05-03-2019  71584306    31-Dec-18   5+  INTERNAL
05-03-2019  71017190    31-Dec-18   5   INTERNAL
05-03-2019  79142410    31-Dec-18   5   INTERNAL
05-03-2019  70455229    31-Dec-18   5   INTERNAL

The code is as follows:

for j in df1.itertuples(index=True, name='Pandas'):
    for i in range(1,len(df1.columns)):
        #for j in range(len(df1)):
            flag = file[(file['customer_id'] == j.customer_id) & (file['year'] == df1.columns[i].year)]
            flag = flag[(flag['date']== flag['date'].max())]

            if len(flag) != 0:
                df1.iat[j.Index,i] = flag.rating.iloc[0]
            else:
                pass

[Please don't post images of code/data (or links to them)](http://meta.stackoverflow.com/questions/285551/why-may-i-not-upload-images-of-code-on-so-when-asking-a-question) — jezrael, Jun 13 '19 at 08:10
To me this question belongs more to [Code Review](https://codereview.stackexchange.com/) SE forum, not StackOverflow. — sophros, Jun 13 '19 at 08:13

score 0 · Accepted Answer · answered Jun 13 '19 at 11:20

0

I understood that you have some code that gets flags from somewhere else and you want to see what the flags are for every value in your dataframe. I would recommend writing a function that returns the flag from a value of a DataFrame and then using df.applymap to apply the function to every value of the DataFrame.

df.applymap returns a DataFrame and it should be significantly faster. Looping DataFrames in general is not very efficient and is usually avoidable.

def get_flags(val):
    flag = # Your code for the value of the flag here
    return flag

flags = df.applymap(get_flags)

If there is only one flag per row or column, use df.apply instead. More details in the docs.

answered Jun 13 '19 at 11:20

pnovotnyq

547
3
12

To explain. For e.g. in case of customer "70464016", the FILE input would be used to first identify 2018 as year and rating as 3. Once identified, this would get saved in dataset DF1 in row with customer ID 70464016 and column of 31-12-2018 00:00 – Zedi10 Jun 13 '19 at 11:35
I think I understand now. And the values of DF1 will be the ratings? in that case, you want to [pivot](https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe). `df1 = file_input.pivot(index='customer_id', columns='upload_date', values='rating')` Is this what you wanted? It assumes that for each `customer_id` and `upload_date`, there is only one rating. – pnovotnyq Jun 13 '19 at 11:46

Loop on dataframe takes a lot of time

1 Answers1