I would like some advice on how to update/insert new data into an already existing data table using Python/Databricks:
# Inserting and updating already existing data
# Original data
import pandas as pd
source_data = {'Customer Number': ['1', '2', '3'],
'Colour': ['Red', 'Blue', 'Green'],
'Flow': ['Good', 'Bad', "Good"]
}
df1 = pd.DataFrame (source_data, columns = ['Customer Number','Colour', 'Flow'])
print(df1)
# New data
new_data = {'Customer Number': ['1', '4',],
'Colour': ['Blue', 'Blue'],
'Flow': ['Bad', 'Bad']
}
df2 = pd.DataFrame (new_data, columns = ['Customer Number','Colour', 'Flow'])
print(df2)
# What the updated table will look like
updated_data = {'Customer Number': ['1', '2', '3', '4',],
'Colour': ['Blue', 'Blue', 'Green', 'Blue',],
'Flow': ['Bad', 'Bad', "Good", 'Bad']
}
df3 = pd.DataFrame (updated_data, columns = ['Customer Number','Colour', 'Flow'])
print(df3)
What you can see here is that the original data has three customers. I then get 'new_data' which contains an update of customer 1's data and new data for 'customer 4', who was not already in the original data. Then if you look at 'updated_data' you can see what the final data should look like. Here 'Customer 1's data has been updated and customer 4s data has been inserted.
Does anyone know where I should start with this? Which module I could use?
I’m not expecting someone to solve this in terms of developing, just need a nudge in the right direction.
Edit: the data source is .txt or CSV, the output is JSON, but as I load the data to Cosmos DB it’ll automatically convert so don’t worry too much about that.
Thanks