I have an old dataframe with following columns and lot of rows and look like this
>old_df
date/time Name detect_ID category ID
12/1/2023 XXX 1 B 1400
12/1/2023 XXY 1,3,7 B 1402
12/1/2023 XXY 4 A 1403
12/1/2023 XXY 4 B 1407
.....
I have some information on new_df which has similar column and based on this I want to update the old_df. New dataframe is:
>new_df
date/time Name detect_ID category ID
13/1/2023 XXX 1 B 1400
14/1/2023 XXY 1,3,8 B 1402
14/1/2023 XXY 1 B 1405
.....
For updating I want following conditions:
- I want to iterate through the rows of old_df while checking every rows of new_df and its information. But I want to check only those rows of old_df where Category column value is 'B".
- First the program will keep in mind the first row's ID value of new_df for first iteration (and go through all rows of new_df with consecutive iteration). While iterating through rows of old_df. If the ID of this first row of new_df doesnt match matches with any ID of old_df it will take that full row from new_df and add it as a new row of old_df while creating a new column in old_df named Identify and assigning a value new.
If the ID of this new_df matches with any ID of old_df, it will go through that specific row's detect_ID column value. Now few things can happen :
A. If that specific detect_ID value of old_df matches with that first row's detect_ID value of new_df, it will take that specific row of new_df and replace the matched row of old_df, while the newly created column identify will have value updated. In this case. Also, as you can see here detect_ID has multiple values : 1,2,3 I want to check for each of them separately and some of these digits maybe integer. so basically split them with , and converting them to integer.
B. If that detect_ID value of old_df doesnt match with that first row's detect_ID value of new_df, it will take that full row from new_df and add it as a new row of old_df while going in column identify and assigning a value new.
- For the rows of old_df, that "ID" value did not match with any rows of new_df having same 'ID or same 'ID' but no match on ' detect_ID' value, will remain unchanged in old_df and have value unchanged in identify column.
I want this to iterate through all rows of old_df until every row of new_df is updated in old_df .
For the givene example, I want output dataframe like following:
>output
date/time Name detect_ID category ID identify
13/1/2023 XXX 1 B 1400 updated [Case A]
14/1/2023 XXY 1 B 1402 updated [Case A with multiple detect_ID]
14/1/2023 XXY 3 B 1402 updated
12/1/2023 XXY 7 B 1402 unchanged [Step 3, Id matches but detect_id do not ]
14/1/2023 XXY 8 B 1402 new [Case B]
12/1/2023 XXY 4 A 1403 unchanged
12/1/2023 XXY 4 B 1407 unchanged [Step3 , id not found in new_df]
I am using following code but it seems not working as the way I want. It gives a lot of duplicate and doesn't iterate through a lot of rows of old_df too.
old_df = pd.read_csv('old.csv')
new_df = pd.read_csv('new.csv')
# Create a set of tuples representing the unique (ID, Detector Id) pairs in the old dataframe
unique_pairs = set()
for _, row in old_df.iterrows():
detector_ids = [int(x) for x in str(row['Detect_ID']).split(',')]
for detector_id in detector_ids:
unique_pairs.add((row['ID'], detect_id))
# Iterate over the rows in the new dataframe and check if their (ID, Detector Id) pair is in the set of unique pairs
new_rows = []
updated_rows = []
for _, row in new_df.iterrows():
detector_ids = [int(x) for x in str(row['Detect_ID']).split(',')]
for detector_id in detector_ids:
if (row['ID'], detector_id) in unique_pairs:
old_row = old_df.loc[(old_df['ID'] == row['ID']) & (old_df['Detect_ID'].str.contains(str(detector_id)))]
if not old_row.empty:
old_row = old_row.iloc[0]
old_row['Date/Time'] = row['date/time']
old_df.loc[(old_df['ID'] == row['ID']) & (old_df['Detector_ID'].str.contains(str(detector_id))), 'date/time'] = old_row['date/time']
updated_rows.append(old_row)
else:
row['Identify'] = 'new'
new_rows.append(row)
unique_pairs.add((row['ID'], detector_id))
# Append the new rows to the old dataframe and write the updated dataframe to a new file
old_df = old_df.append(new_rows, ignore_index=True)
for row in updated_rows:
row['Identify'] = 'updated'
old_df = old_df.append(updated_rows, ignore_index=True)
old_df.to_csv('updated.csv', index=False)