I need to check what has changed over time in two datasets with different date:
Prior date:
Date ID Value Category Subcategory
30-Nov 0001 100.00 A A100
30-Nov 0002 200.00 B B120
30-Nov 0003 300.00 C C300
30-Nov 0004 450.00 D D900
30-Nov 0005 500.00 D D900
Current date:
Date ID Value Category Subcategory
31-Dec 0001 100.00 A A100
31-Dec 0002 200.00 B B101
31-Dec 0003 300.00 C C300
31-Dec 0004 400.00 E E900
31-Dec 0006 600.00 D D900
Now I need to create 4 dataframes:
- Changes in Value:
Date ID Value Category Subcategory Prior Value
31-Dec 0004 400.00 E E900 450.00
- Changes in Category:
Date ID Value Category Subcategory Prior Category
31-Dec 0004 400.00 E E900 D
- Changes in Subcategory, but only if category didn't change:
Date ID Value Category Subcategory Prior Subcategory
31-Dec 0002 200.00 B B101 B120
- Items population change:
Date ID Value Category Subcategory
31-Dec 0006 600.00 D D900
30-Nov 0005 500.00 D D900
I think I should first run the population check and exclude those breaks, so I will have only two datasets with identical ID sets. I will follow the examples from here: Comparing two dataframes and getting the differences
For comparing 1to1 values I found a numpy piece of code but it compares it by default index, not by ID, how to do it using my ID column as record identifier? This is going to be a large dataset and I can't base it on default index.
value_df = current_df
value_df['prior value'] = np.where(prior_df['Value'] == current_df['Value'], 'Match', prior_df['Value'])
value_df = value_df[value_df['prior value'] != 'Match']
For multiple conditions do I have to filter it out step by step (first filter out category change, then filter subcategory change) or can I use AND to concatenate the conditions?
below is the code for creating dataframes:
prior_data = {'Date': ['30-Nov','30-Nov','30-Nov','30-Nov', '30-Nov'],
'ID': ['0001','0002','0003','0004', '0005'],
'Value' : [100.00, 200.00, 300.00, 450.00, 500.00],
'Category' : ['A','B','C','D','D'],
'Subcategory' : ['A100','B120','C300','D900','D900']}
current_data = {'Date': ['31-Dec','31-Dec','31-Dec','31-Dec','31-Dec'],
'ID': ['0001','0002','0003','0004', '0006'],
'Value' : [100.00, 200.00, 300.00, 400.00, 600.00],
'Category' : ['A','B','C','E','D'],
'Subcategory' : ['A100','B101','C300','E900','D900']}
prior_df = pd.DataFrame(prior_data)
current_df = pd.DataFrame(current_data)