0

I have 2 dataframes. The first one (900 lines) contains corrections that have been applied to a deal. The second dataframe (140 000 lines) contains the list of deals with corrected values. What I am trying to do is to put the old value back.

To link the corrected deals to the corrections I have to compare a number of attributes. In the correction dataframe (900 lines) I have the old and the new value for each corrected attribute. But each correction can be corrected on a different attribute, therefore I check every possible corrected attribute (in the correction dataframe) to compare the new value with the old one and check if this attribute was corrected. If it was I put the old value back. I'm precise that a correction can apply on several deals that share the same data in the fields used to identify.

To finish, I create a new column on the Deals dataframe (140 000 lines) where I put a boolean that true when a deals has been uncorrected, false otherwise.

My code right now is quite gross, I wanted to factorize a bit but the iteration process blocked me. It is running but it has to go through 900*140 000 lines. I launched it on a quad core VM with 12Gb RAM and it went through through it in about 1h20min.

How can I improve performance? Is multithreading possible to use in this case ? Here is shorted version of my code, just imagine that the number of if statements is 10 times bigger.

def CreationUniqueid(dataframe,Correction):

#creating new column to mark the rows we un corrected
dataframe['Modified']=0
#getting the link between the corrections and deals
b=0
for index in Correction.index:
    b+=1 #just values to see progression of the program
    c=0
    for index1 in dataframe.index:
        c+=1
        a=0

        print('Handling correction '+str(b)+' and deal '+str(c)) # printing progress
        if (Correction.get_value(index,'BO Branch Code')==dataframe.get_value(index1,'Wings Branch') and Correction.get_value(index,'Profit Center')==dataframe.get_value(index1,'Profit Center'))
            print('level 1 success')
            if ((Correction.get_value(index,'BO Trade Id')==dataframe.get_value(index1,'Trade Id') and Correction.get_value(index,'BO Trade Id')!='#') or
                    (Correction.get_value(index,'Emetteur Trade Id')==dataframe.get_value(index1,'Emetteur Trade Id')=='#' and Correction.get_value(index,'BO Trade Id')==dataframe.get_value(index1,'Trade Id'))):
                print('identification success')
                # putting the dataframe to the old state, we need the data in the bad shape to make the computer learn what is a bad trade and what is normal
                if Correction.get_value(index,'Risk Category')!=Correction.get_value(index,'Risk Categgory _M') and Correction.get_value(index,'Risk Category _M')!='':
                    dataframe.set_value(index1,'Risk Category',Correction.get_value(index,'Risk Category'))
                    a=1
                    print('Corr 1 ')
                if Correction.get_value(index,'CEC Ricos')!=Correction.get_value(index,'CEC Ricos _M') and Correction.get_value(index,'CEC Ricos _M')!='':
                    dataframe.set_value(index1,'CEC Ricos',Correction.get_value(index,'CEC Ricos'))
                    a=1
                    print('Corr 2')
                if Correction.get_value(index,'Product Line')!= Correction.get_value(index,'Product Line _M') and Correction.get_value(index,'Product Line _M')!='':
                    dataframe.set_value(index1,'Product Line Code Ricos',Correction.get_value(index,'Product Line'))
                    a=1
                    print ('corr 3')


    return dataframe
Mayeul sgc
  • 1,964
  • 3
  • 20
  • 35
  • 1
    Read [my take on writing a good question](http://stackoverflow.com/a/38466059/2336654). This is definitely a question where my eyes glaze over. – piRSquared Mar 01 '17 at 08:23
  • Well thats the point of the question here, optimization of the performance and architecture isn't it... – Mayeul sgc Mar 01 '17 at 08:28
  • If that's the case, it may be off topic and better suited for the code review exchange site. – piRSquared Mar 01 '17 at 08:32
  • I putted in the code review for the architecture, here i am focused on the possibility to use multithreading for algorithm like mine. is the iteration I am doing compatible with this technique? – Mayeul sgc Mar 01 '17 at 08:34
  • 1
    I would boil it down to a minimal example. [MCVE](http://stackoverflow.com/help/mcve) – piRSquared Mar 01 '17 at 08:36
  • I modified the code as asked – Mayeul sgc Mar 01 '17 at 08:49
  • Pandas is built on top of numpy array and numpy array is performing operation on the whole array at once, never row-by-row. So when you do a `for loop`, you are taking away the speed. If you want to compare a `series/column` to another one, you can just do df1['data 1'] == df2['data 2'] which will produce a series of `True/False`. For more detail you can read [this question regarding looping](http://stackoverflow.com/q/7837722/1865106) – SSC Mar 01 '17 at 10:19
  • But how do you get the data to change then? – Mayeul sgc Mar 01 '17 at 10:28
  • Read this [this question](http://stackoverflow.com/q/17729853/1865106) and [this answer](http://stackoverflow.com/a/30208749/1865106), and you should get the idea. For more info, you should read the [`pandas`](http://pandas.pydata.org/pandas-docs/stable/indexing.html#) document regarding indexing and selecting especially the `.loc` – SSC Mar 01 '17 at 10:40
  • I am definitely not using `.loc` item, see my code review post about the improvements I did on my code [here](http://codereview.stackexchange.com/questions/156584/undoing-corrections-to-a-big-dataframe) – Mayeul sgc Mar 01 '17 at 10:44

0 Answers0