I'm trying to build a history of my data where i add rows to my current dataset only if at least one column value has changed rather than adding every row everyday then clean the data by removing duplicates.
The code I wrote looks through every row of the new and current dataset and focuses on rows where IDs match. Then when ID matches, scans through every column (each dataset has the same columns) and looks for a difference. If there is one, the row is added to the current set. My current code adds too many rows and I can't figure out what to change to fix this. I also don't know how to add the new IDs that are not in the current dataset yet
Current dataset :
ID | Age | City | Last update |
---|---|---|---|
1 | 13 | London | 01/02/2022 |
2 | 19 | Bath | 01/01/2022 |
New dataset :
ID | Age | City |
---|---|---|
1 | 13 | London |
2 | 19 | Brighton |
3 | 15 | Leeds |
Current dataset after code update :
ID | Age | City | Last update |
---|---|---|---|
1 | 13 | London | 01/02/2022 |
2 | 19 | Bath | 01/01/2022 |
2 | 19 | Brighton | 22/02/2022 |
3 | 15 | Leeds | 22/02/2022 |
I add today's date to every row from the new dataset before running this loop, but when checking columns, i do not check it (as it is always different from last update).
Current code :
for i in range (len(current)):
count=0
for j in range (len(new)):
#ID match
if current.iloc[i,0] == new.iloc[j,0]:
#checking every column for discrepancy
for c in range (0,9):
if (current.iloc[i,c] != new.iloc[j,c]) & (current.iloc[i,c] == current.iloc[i,c]) & (new.iloc[j,c] == new.iloc[j,c]) : #A==A used to ignore nan
count+=1
if count != 0:
current = current.append(new.iloc[j,:])
From what I understand, the code works but if I have say ID 1 in two different rows already (because column 3 is different in first and second row for example), the code will work for both row and therefore add my new row twice. How can I fix this ? Also, how do I deal with IDs that are not in the current dataset yet ?
Thanks !