

Context
I have a database made of CSV files. I receive new files every month and need to update my database with them. However, I must not simply overwrite one important column, because I need to keep track of its history.

Problem
I have two dataframes that look like this (the first 3 columns are the key):

Database dataframe (DD)
ID1  ID2  ID3  important_data  some_date1  some_data2  date
1    2    3    10              X            Y          2019-09-19 14:53:16.107278
4    5    6    10              M            N          2019-07-15 14:53:16.107278

Client dataframe (DC)
ID1  ID2  ID3  important_data  some_date1  some_data2  date
1    2    3    15              A            B          2019-10-19 14:53:16.107278
4    5    6    10              O            P          2019-09-18 14:53:16.107278

The first dataframe is the one in my DB. The second one is the new one sent by my client.
If "important_data" has not changed compared to the most recent matching row (the case of IDs 4 5 6), I overwrite the old row with the new one:

New Database dataframe (DD)
ID1  ID2  ID3  important_data  some_date1  some_data2  date
1    2    3    10              X            Y          2019-09-19 14:53:16.107278
4    5    6    10              O            P          2019-09-18 14:53:16.107278

Otherwise (IDs 1 2 3, or if the row is completely new), I need to append a whole new row to the database frame:

New Database dataframe (DD)
ID1  ID2  ID3  important_data  some_date1  some_data2  date
1    2    3    10              X            Y          2019-09-19 14:53:16.107278
1    2    3    15              A            B          2019-10-19 14:53:16.107278
4    5    6    10              M            N          2019-07-15 14:53:16.107278

Question
I can do it with regular nested loops and ifs, like this pseudocode:

Select the last-in-date row in DD (for each ID triple) into a new dataframe DD_pending  // we only need the latest entries
For rowC in DC:
    new = true
    For rowD in DD_pending:
        If matching IDs then
            new = false
            If same "important_data" then overwrite everything
            Else create_new_row in DD
    If new then create_new_row in DD
Merge DD_pending back into DD, drop duplicates
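
For reference, here is a rough Python version of that loop (just a naive sketch assuming the column names from the examples above, ID1/ID2/ID3, important_data and date; update_db is an arbitrary helper name):

import pandas as pd

def update_db(dd: pd.DataFrame, dc: pd.DataFrame) -> pd.DataFrame:
    # Naive loop version: for each client row, overwrite the most recent
    # matching row when important_data is unchanged, otherwise append.
    keys = ["ID1", "ID2", "ID3"]
    dd = dd.copy()
    new_rows = []

    for _, row_c in dc.iterrows():
        # rows in the database sharing the same ID triple
        mask = (dd[keys] == row_c[keys].values).all(axis=1)
        if mask.any():
            idx = dd.loc[mask, "date"].idxmax()   # most recent matching row
            if dd.at[idx, "important_data"] == row_c["important_data"]:
                dd.loc[idx] = row_c               # unchanged -> overwrite in place
            else:
                new_rows.append(row_c)            # changed -> keep history, append
        else:
            new_rows.append(row_c)                # completely new key -> append

    return pd.concat([dd, pd.DataFrame(new_rows)], ignore_index=True)

It works, but it iterates over the client frame row by row, which is exactly what I'd like to avoid.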

But is it possible to simplify it, in terms of the resources needed by the machine, using pandas merging? I tried to look at this but it's not quite the same. I'm also looking at concat but I can't find a way to do it.

Thanks!

PS: I tried to be as clear as possible, with examples, but if it's not clear enough feel free to ask for clarification!

Rfayolle
  • https://stackoverflow.com/questions/48647534/python-pandas-find-difference-between-two-data-frames/48647840#48647840 – BENY Oct 02 '19 at 14:04

1 Answer


This should do it.

# imports
import pandas as pd
import datetime

# ignore (just for setting up the problem)
db_df = pd.DataFrame({
    "ID1": [1, 4],
    "ID2": [2, 5],
    "ID3": [3, 6],
    "important_data": [10, 10],
    "some_date1": ["X", "M"],
    "some_date2": ["Y", "N"],
    "date": [datetime.datetime.strptime("2019-09-19 14:53:16.107278", '%Y-%m-%d %H:%M:%S.%f'),
             datetime.datetime.strptime("2019-09-19 14:53:16.107278", '%Y-%m-%d %H:%M:%S.%f')]
})

# ignore (just for setting up the problem)
db_df_client = pd.DataFrame({
    "ID1": [1, 4],
    "ID2": [2, 5],
    "ID3": [3, 6],
    "important_data": [15, 10],
    "some_date1": ["A", "O"],
    "some_date2": ["B", "P"],
    "date": [datetime.datetime.strptime("2019-09-19 14:53:16.107278", '%Y-%m-%d %H:%M:%S.%f'),
             datetime.datetime.strptime("2019-09-19 14:53:16.107278", '%Y-%m-%d %H:%M:%S.%f')]
})

# the line you care about
pd.concat([db_df, db_df_client]).drop_duplicates(subset=['ID1','ID2','ID3','important_data'], keep='first')
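
Note that with keep='first', when important_data is unchanged the row already in the database wins. If you want the client's newer values to overwrite the old ones (O, P replacing M, N as in your example), a small variant is to keep the client frame last in the concat and use keep='last':

# variant: the client's row wins when important_data is unchanged
updated = pd.concat([db_df, db_df_client]).drop_duplicates(
    subset=['ID1', 'ID2', 'ID3', 'important_data'], keep='last')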
Ian
  • The client changed his mind and asked to keep all data, so I used @WeNYoBen solution, but your answer is right for my main question, thanks :) ! – Rfayolle Oct 03 '19 at 08:36