
I am planning to use pandas DataFrames to read a large CSV file and publish the data to a PostgreSQL database. I'll be performing this process every day and want to load only the changes into the database. Would pandas be able to identify the delta between what is present in the table and what is present in the DataFrame, and proceed with uploading only that?

Punter Vicky
  • Seems doable, yes. It depends on volumes, though, but I could easily imagine you load your `.csv` using an iterator and generate a hash of every row, then compare these hashes with hashes of your DB rows. – arnaud Mar 19 '20 at 20:00
  • Yes, you can do it by removing duplicate rows, though this will take more time as the database grows every day: join the database table and the CSV file and don't keep the common rows. https://stackoverflow.com/questions/23667369/drop-all-duplicate-rows-in-python-pandas – Hariprasad Mar 19 '20 at 21:19
  • Thanks @Arnaud & Hariprasad. The largest file that I have is 1 GB. – Punter Vicky Mar 20 '20 at 00:38
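The join-and-drop-common-rows approach from the comments can be sketched with a left anti-join via `DataFrame.merge(indicator=True)`. This is a minimal, hedged sketch: the column names and inline DataFrames are hypothetical stand-ins; in practice you would load `new_df` with `pd.read_csv(...)` and `existing_df` with `pd.read_sql(...)` against your PostgreSQL table, then append the delta with `to_sql(..., if_exists="append")`.

```python
import pandas as pd

# Hypothetical stand-ins for the daily CSV and the current DB table.
# Real usage: new_df = pd.read_csv("daily.csv")
#             existing_df = pd.read_sql("SELECT * FROM target_table", engine)
existing_df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
new_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Left anti-join: merge on all shared columns, keep only rows of new_df
# that have no exact match in existing_df.
delta = new_df.merge(existing_df, how="left", indicator=True)
delta = delta[delta["_merge"] == "left_only"].drop(columns="_merge")

# delta now holds only the rows absent from the table (here, id 3).
# Upload step (sketch): delta.to_sql("target_table", engine,
#                                    if_exists="append", index=False)
print(delta)
```

Note that this compares whole rows, so an updated row appears as a new row rather than an update; for large tables, the row-hashing idea from the first comment (e.g. `pd.util.hash_pandas_object`) avoids pulling every column out of the database for comparison.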

0 Answers