I have a CSV file with transactions that I am trying to dump into MongoDB for later analysis. The file is regenerated every day and always contains the last 7 days of data, so each new file holds 1 new day plus 6 days of data I have already captured.
When I insert the DataFrame into MongoDB, I naturally keep generating duplicates over and over.
I have tried setting an index on the DataFrame, hoping that duplicates would be rejected, but that doesn't seem to work. I have also tried update_many(), but it asks for a filter that I cannot get to work (see the sketch after the code below).
import pandas as pd
from pymongo import MongoClient

df = pd.read_csv(current_file)
number_of_rows = df.shape[0]
df['Client'] = 'Myclient'

# Composite key that should identify one transaction row
df.set_index(["Client", "Transaction Date ID",
              "Transaction No. Transaction No.", "Transaction Time",
              "Transaction product", "Site Site_Num"], inplace=True)

client = MongoClient("mongodb://localhost:27017/")
db = client["TransactionDB"]
transactiontable = db["transactions"]

# Move the key columns back into the documents before inserting
df.reset_index(inplace=True)
data_dict = df.to_dict("records")
transactiontable.insert_many(data_dict)
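What I think I actually need is one upsert per row, with the composite key as the filter. This is a minimal sketch of that idea (key_cols is my own placeholder; the names in it are the column headers from my CSV):

from pymongo import UpdateOne

# The columns that together identify one transaction.
key_cols = ["Client", "Transaction Date ID",
            "Transaction No. Transaction No.",
            "Transaction Time", "Transaction product"]

# One upsert per row: the filter matches on the key, and $set writes
# the full row whether a matching document already existed or not.
ops = [UpdateOne({c: doc[c] for c in key_cols}, {"$set": doc}, upsert=True)
       for doc in data_dict]
transactiontable.bulk_write(ops, ordered=False)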
I would like any row whose transaction number, date, time, and product all match an existing document to be simply skipped, or to replace the existing document. I tried to create a unique ID based on the first couple of data points from the CSV, but that fell flat.
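If skipping is simpler than replacing, would a unique compound index plus an unordered insert do it? A sketch of what I have in mind (the error handling is my assumption):

from pymongo import ASCENDING
from pymongo.errors import BulkWriteError

key_cols = ["Client", "Transaction Date ID",
            "Transaction No. Transaction No.",
            "Transaction Time", "Transaction product"]

# One-time setup: a unique index on the composite key, so MongoDB
# itself rejects any row it has already stored.
transactiontable.create_index([(c, ASCENDING) for c in key_cols], unique=True)

try:
    # ordered=False keeps inserting the new day's rows even after
    # duplicates from the 6 overlapping days are rejected.
    transactiontable.insert_many(data_dict, ordered=False)
except BulkWriteError:
    pass  # duplicate rows were skipped; everything else went in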
Thanks