I currently have a working algorithm that allows me to update the lines of my database (BaseA) according of their ID
and the value of their DateB
and the values in the new export (BaseB). The problem is that my algorithm is highly unefficiant. This is just an example, the real code must be adaptable for any number of columns and no matter what their name is (the only columns found everywhere are ID
, DateB
and NbTreated
) (it is possible to list them just too).
How can I make the calculation way faster ? (currently take almost an hour on the real data)
BaseA :
ID DateA DateB DateC Nb Treated
A 11/07/2017 11/07/2017 11/07/2017 1
B 12/07/2017 10/05/2017 12/07/2017 1
B 12/07/2017 12/07/2017 12/07/2017 2
C 13/07/2017 13/07/2017 13/07/2017 1
D 14/07/2017 14/07/2017 14/07/2017 1
E 15/07/2017 15/07/2017 0
F 16/07/2017 16/07/2017 16/07/2017 1
G 17/07/2017 17/07/2017 17/07/2017 2
J 18/07/2017 18/07/2017 0
G 17/07/2017 15/09/2016 17/07/2017 1
BaseB :
ID DateA DateB DateC
A 11/07/2017 11/07/2017 11/07/2017
B 13/06/2017 13/06/2017 13/06/2017
C 14/06/2017 14/06/2017 14/06/2017
E 15/07/2017 15/07/2017 15/07/2017
F 16/07/2017 16/07/2017 16/07/2017
H 11/06/2017 11/06/2017
I 12/06/2017 12/06/2017 12/06/2017
What I want to get :
ID DateA DateB DateC Nb Treated
A 11/07/2017 11/07/2017 11/07/2017 1
B 13/06/2017 12/07/2017 13/06/2017 2
B 13/06/2017 13/06/2017 13/06/2017 3
B 13/06/2017 10/05/2017 13/06/2017 1
C 14/06/2017 14/06/2017 14/06/2017 2
C 14/06/2017 13/07/2017 14/06/2017 1
D 14/07/2017 14/07/2017 14/07/2017 1
E 15/07/2017 15/07/2017 15/07/2017 1
F 16/07/2017 16/07/2017 16/07/2017 1
G 17/07/2017 17/07/2017 17/07/2017 2
G 17/07/2017 15/09/2016 17/07/2017 1
H 11/06/2017 11/06/2017 0
I 12/06/2017 12/06/2017 12/06/2017 1
J 18/07/2017 18/07/2017 0
My code in general without the algorithm :
import pandas as pd
import numpy as np
database = pd.read_excel("baseA.xlsx")
dataset = pd.read_excel("baseB.xlsx")
# INSERT THE ALGORITHM HERE
datater = pd.concat([database,dataset])
datater.drop_duplicates(["ID","DateB"], inplace = True)
datater["Nb Treated"] = np.where(pd.isnull(datater["Nb Treated"]) & pd.isnull(datater["DateB"]), 0,datater["Nb Treated"])
datatri = datater.groupby(["ID"], sort=False)["Nb Treated"].max()
dicoREFNbDevis = datatri.to_dict()
datater["Nb Treated"] = np.where(pd.isnull(datater["Nb Treated"]), datater["ID"],datater["Nb Treated"])
dicoREFNbDevis = {k: v+1 for k, v in dicoREFNbDevis.items()}
datater["Nb Treated"].replace(dicoREFNbDevis, inplace=True)
datater["Nb Treated"]=datater["Nb Treated"].fillna(1)
datater=datater.sort(["ID"])
datater=datater[["ID","DateA","DateB","DateC","Nb Treated"]]
writer = pd.ExcelWriter('NewBase.xlsx', engine='xlsxwriter')
datater.to_excel(writer, sheet_name='Base', index=False)
writer.save()
Algorithm that I currently use and that is working :
database = database[~((database["ID"].isin(dataset["ID"].unique())) & (pd.isnull(database["DateB"])))]
for i in dataset :
if i != "ID" and i != "DateB" and i != "Nb Treated" :
dicoDate = dataset.set_index("ID")[i].to_dict()
database[i]=np.where(database["ID"].isin(dataset["ID"].unique()),database["ID"],database[i])
database[i].replace(dicoDate, inplace=True)
database[i]=database[i].apply(lambda x : pd.to_datetime(x))
Attempt almost working :
Inspired by Modifying a subset of rows in a pandas dataframe
database = database[~((database["ID"].isin(dataset["ID"].unique())) & (pd.isnull(database["DateB"])))]
database.ix[database["ID"].isin(dataset["ID"].unique()), ['DateA','DateC']] = dataset.ix[dataset["ID"].isin(database["ID"].unique()), ['DateA','DateC']]
That give me this output :
ID DateA DateB DateC Nb Treated
A 11/07/2017 11/07/2017 11/07/2017 1
B 14/06/2017 12/07/2017 14/06/2017 2
B 13/06/2017 13/06/2017 13/06/2017 3
B 13/06/2017 10/05/2017 13/06/2017 1
C 13/07/2017 1
C 14/06/2017 14/06/2017 14/06/2017 2
D 14/07/2017 14/07/2017 14/07/2017 1
E 15/07/2017 15/07/2017 15/07/2017 1
F 16/07/2017 1
G 17/07/2017 15/09/2016 17/07/2017 1
G 17/07/2017 17/07/2017 17/07/2017 2
H 11/06/2017 11/06/2017 0
I 12/06/2017 12/06/2017 12/06/2017 1
J 18/07/2017 18/07/2017 0
Edit :
What the algorithm is supposed to go is to update the BaseA with the value of the new export BaseB. The value in BaseB are the newest version of the cases on the database. If I have the same ID in BaseA and in BaseB there could be different cases :
- Usually you delete the row from BaseA and put the one from BaseB instead
- But if the row in BaseA has a DateB
that is different from the one of BaseB, both rows should be in BaseA. But you always have to update the other columns every single time .