I have 2 dataframes. The first one (900 lines) contains corrections that have been applied to a deal. The second dataframe (140 000 lines) contains the list of deals with corrected values. What I am trying to do is to put the old value back.
To link the corrected deals to the corrections I have to compare a number of attributes. In the correction dataframe (900 lines) I have the old and the new value for each corrected attribute. But each correction can be corrected on a different attribute, therefore I check every possible corrected attribute (in the correction dataframe) to compare the new value with the old one and check if this attribute was corrected. If it was I put the old value back. I'm precise that a correction can apply on several deals that share the same data in the fields used to identify.
To finish, I create a new column on the Deals dataframe (140 000 lines) where I put a boolean that true when a deals has been uncorrected, false otherwise.
My code right now is quite gross, I wanted to factorize a bit but the iteration process blocked me. It is running but it has to go through 900*140 000 lines. I launched it on a quad core VM with 12Gb RAM and it went through through it in about 1h20min.
How can I improve performance? Is multithreading possible to use in this case ? Here is shorted version of my code, just imagine that the number of if statements is 10 times bigger.
def CreationUniqueid(dataframe,Correction):
#creating new column to mark the rows we un corrected
dataframe['Modified']=0
#getting the link between the corrections and deals
b=0
for index in Correction.index:
b+=1 #just values to see progression of the program
c=0
for index1 in dataframe.index:
c+=1
a=0
print('Handling correction '+str(b)+' and deal '+str(c)) # printing progress
if (Correction.get_value(index,'BO Branch Code')==dataframe.get_value(index1,'Wings Branch') and Correction.get_value(index,'Profit Center')==dataframe.get_value(index1,'Profit Center'))
print('level 1 success')
if ((Correction.get_value(index,'BO Trade Id')==dataframe.get_value(index1,'Trade Id') and Correction.get_value(index,'BO Trade Id')!='#') or
(Correction.get_value(index,'Emetteur Trade Id')==dataframe.get_value(index1,'Emetteur Trade Id')=='#' and Correction.get_value(index,'BO Trade Id')==dataframe.get_value(index1,'Trade Id'))):
print('identification success')
# putting the dataframe to the old state, we need the data in the bad shape to make the computer learn what is a bad trade and what is normal
if Correction.get_value(index,'Risk Category')!=Correction.get_value(index,'Risk Categgory _M') and Correction.get_value(index,'Risk Category _M')!='':
dataframe.set_value(index1,'Risk Category',Correction.get_value(index,'Risk Category'))
a=1
print('Corr 1 ')
if Correction.get_value(index,'CEC Ricos')!=Correction.get_value(index,'CEC Ricos _M') and Correction.get_value(index,'CEC Ricos _M')!='':
dataframe.set_value(index1,'CEC Ricos',Correction.get_value(index,'CEC Ricos'))
a=1
print('Corr 2')
if Correction.get_value(index,'Product Line')!= Correction.get_value(index,'Product Line _M') and Correction.get_value(index,'Product Line _M')!='':
dataframe.set_value(index1,'Product Line Code Ricos',Correction.get_value(index,'Product Line'))
a=1
print ('corr 3')
return dataframe