1

I currently have a working algorithm that allows me to update the lines of my database (BaseA) according of their ID and the value of their DateB and the values in the new export (BaseB). The problem is that my algorithm is highly unefficiant. This is just an example, the real code must be adaptable for any number of columns and no matter what their name is (the only columns found everywhere are ID, DateB and NbTreated) (it is possible to list them just too).

How can I make the calculation way faster ? (currently take almost an hour on the real data)

BaseA :

  ID      DateA         DateB          DateC    Nb Treated
  A     11/07/2017   11/07/2017     11/07/2017      1
  B     12/07/2017   10/05/2017     12/07/2017      1
  B     12/07/2017   12/07/2017     12/07/2017      2
  C     13/07/2017   13/07/2017     13/07/2017      1
  D     14/07/2017   14/07/2017     14/07/2017      1
  E     15/07/2017                  15/07/2017      0
  F     16/07/2017   16/07/2017     16/07/2017      1
  G     17/07/2017   17/07/2017     17/07/2017      2
  J     18/07/2017                  18/07/2017      0
  G     17/07/2017   15/09/2016     17/07/2017      1

BaseB :

  ID       DateA           DateB           DateC
  A     11/07/2017      11/07/2017      11/07/2017
  B     13/06/2017      13/06/2017      13/06/2017
  C     14/06/2017      14/06/2017      14/06/2017
  E     15/07/2017      15/07/2017      15/07/2017
  F     16/07/2017      16/07/2017      16/07/2017
  H     11/06/2017      11/06/2017
  I     12/06/2017      12/06/2017      12/06/2017

What I want to get :

  ID      DateA           DateB           DateC     Nb Treated
  A     11/07/2017      11/07/2017      11/07/2017      1
  B     13/06/2017      12/07/2017      13/06/2017      2
  B     13/06/2017      13/06/2017      13/06/2017      3
  B     13/06/2017      10/05/2017      13/06/2017      1
  C     14/06/2017      14/06/2017      14/06/2017      2
  C     14/06/2017      13/07/2017      14/06/2017      1
  D     14/07/2017      14/07/2017      14/07/2017      1
  E     15/07/2017      15/07/2017      15/07/2017      1
  F     16/07/2017      16/07/2017      16/07/2017      1
  G     17/07/2017      17/07/2017      17/07/2017      2
  G     17/07/2017      15/09/2016      17/07/2017      1
  H     11/06/2017                      11/06/2017      0
  I     12/06/2017      12/06/2017      12/06/2017      1
  J     18/07/2017                      18/07/2017      0

My code in general without the algorithm :

import pandas as pd
import numpy as np

database = pd.read_excel("baseA.xlsx")
dataset = pd.read_excel("baseB.xlsx")

 # INSERT THE ALGORITHM HERE

datater = pd.concat([database,dataset])
datater.drop_duplicates(["ID","DateB"], inplace = True)
datater["Nb Treated"] = np.where(pd.isnull(datater["Nb Treated"]) & pd.isnull(datater["DateB"]), 0,datater["Nb Treated"]) 
datatri = datater.groupby(["ID"], sort=False)["Nb Treated"].max()
dicoREFNbDevis = datatri.to_dict()
datater["Nb Treated"] = np.where(pd.isnull(datater["Nb Treated"]), datater["ID"],datater["Nb Treated"]) 
dicoREFNbDevis = {k: v+1 for k, v in dicoREFNbDevis.items()}
datater["Nb Treated"].replace(dicoREFNbDevis, inplace=True)
datater["Nb Treated"]=datater["Nb Treated"].fillna(1)

datater=datater.sort(["ID"])

datater=datater[["ID","DateA","DateB","DateC","Nb Treated"]]
writer = pd.ExcelWriter('NewBase.xlsx', engine='xlsxwriter') 
datater.to_excel(writer, sheet_name='Base', index=False)
writer.save()

Algorithm that I currently use and that is working :

database = database[~((database["ID"].isin(dataset["ID"].unique())) & (pd.isnull(database["DateB"])))]
for i in dataset :
    if i != "ID" and i != "DateB" and i != "Nb Treated" :
        dicoDate = dataset.set_index("ID")[i].to_dict() 
        database[i]=np.where(database["ID"].isin(dataset["ID"].unique()),database["ID"],database[i])
        database[i].replace(dicoDate, inplace=True)
        database[i]=database[i].apply(lambda x : pd.to_datetime(x))

Attempt almost working :

Inspired by Modifying a subset of rows in a pandas dataframe

database = database[~((database["ID"].isin(dataset["ID"].unique())) & (pd.isnull(database["DateB"])))]
database.ix[database["ID"].isin(dataset["ID"].unique()), ['DateA','DateC']] =  dataset.ix[dataset["ID"].isin(database["ID"].unique()), ['DateA','DateC']] 

That give me this output :

  ID       DateA          DateB            DateC     Nb Treated
  A     11/07/2017      11/07/2017      11/07/2017      1
  B     14/06/2017      12/07/2017      14/06/2017      2
  B     13/06/2017      13/06/2017      13/06/2017      3
  B     13/06/2017      10/05/2017      13/06/2017      1
  C                     13/07/2017                      1
  C     14/06/2017      14/06/2017      14/06/2017      2
  D     14/07/2017      14/07/2017      14/07/2017      1
  E     15/07/2017      15/07/2017      15/07/2017      1
  F                     16/07/2017                      1
  G     17/07/2017      15/09/2016      17/07/2017      1
  G     17/07/2017      17/07/2017      17/07/2017      2
  H     11/06/2017                      11/06/2017      0
  I     12/06/2017      12/06/2017      12/06/2017      1
  J     18/07/2017                      18/07/2017      0

Edit :

What the algorithm is supposed to go is to update the BaseA with the value of the new export BaseB. The value in BaseB are the newest version of the cases on the database. If I have the same ID in BaseA and in BaseB there could be different cases :
- Usually you delete the row from BaseA and put the one from BaseB instead - But if the row in BaseA has a DateB that is different from the one of BaseB, both rows should be in BaseA. But you always have to update the other columns every single time .

AZJB
  • 76
  • 8
  • You have not tried to explain your logic. We are left to pick apart your code and figure it out. I personally don't want to do that. I suspect that others don't either. My suggestion is to take the time to explain why you should get the results you are looking for in addition to providing the code. This should help get you the answer you're looking for. – piRSquared Aug 01 '17 at 15:57
  • Thanks, I have added the logic explanation – AZJB Aug 02 '17 at 06:57

0 Answers0