
I have multiple files that each contain 12 weeks' worth of sales data, and I receive a new one every month. I loop through the folder and append them into one DataFrame. Because each monthly file contains the past 12 weeks of data, there is some overlap between files.

Example:

File 1 (sales_mar2022.csv) contains Jan 2022 to Mar 2022
File 2 (sales_apr2022.csv) contains Feb 2022 to Apr 2022
File 3 (sales_jun2022.csv) contains Mar 2022 to Jun 2022, and so on.
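For context, this is roughly how I order the files before reading them; the helper `file_key` is made up for illustration and just parses the month/year out of the filename so the files can be processed oldest-first:

```python
from datetime import datetime

def file_key(name):
    # "sales_mar2022.csv" -> datetime(2022, 3, 1), used only as a sort key
    stem = name.removesuffix(".csv").split("_")[1]  # e.g. "mar2022"
    return datetime.strptime(stem, "%b%Y")

files = ["sales_jun2022.csv", "sales_mar2022.csv", "sales_apr2022.csv"]
print(sorted(files, key=file_key))
# ['sales_mar2022.csv', 'sales_apr2022.csv', 'sales_jun2022.csv']
```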

The file contains columns like

Date       | Product ID | Brand   | Sales     | Volume 
01/01/2022 | 1232       | Brand X | 22332.234 | 23424

The latest file always contains the most up-to-date values, because data gets back-dated. I want to perform something similar to SQL's upsert using pandas.

The index of each file is not the same, so pandas' `update` function doesn't really work well here.

I tried appending all files into one combined df and using `drop_duplicates` on Date and Product ID, but I'm unable to tell it which row to keep, i.e. the one from the most recent file.

The combined df contains about 10 million rows, so appending everything and then dropping duplicates might not be efficient either.
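Here is a minimal sketch of my current attempt, with toy DataFrames standing in for the real CSVs. As written, `drop_duplicates` defaults to `keep='first'`, so the stale row from the older file survives instead of the corrected row from the newer file:

```python
import pandas as pd

# Toy stand-ins for two overlapping monthly files (the real data comes from CSVs).
sales_mar2022 = pd.DataFrame({
    "Date": ["01/02/2022", "01/03/2022"],
    "Product ID": [1232, 1232],
    "Sales": [100.0, 200.0],
})
sales_apr2022 = pd.DataFrame({
    "Date": ["01/02/2022", "01/04/2022"],
    "Product ID": [1232, 1232],
    "Sales": [150.0, 300.0],  # 01/02/2022 was back-dated: 100.0 -> 150.0
})

combined = pd.concat([sales_mar2022, sales_apr2022], ignore_index=True)

# Problem: the default keep='first' retains the stale 100.0 row from the
# older file rather than the corrected 150.0 row from the newer file.
deduped = combined.drop_duplicates(subset=["Date", "Product ID"])
print(deduped.loc[deduped["Date"] == "01/02/2022", "Sales"].iloc[0])  # 100.0, want 150.0
```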

I'd appreciate any suggestions. Thanks!

  • Regarding `drop_duplicates`, it has a `keep` parameter you could set to `'last'`. Is that what you were looking for? – wjandrea Aug 15 '23 at 17:46
  • Please make a [mre] including some (more) example data and desired output. It also wouldn't hurt to provide your code. For specifics see [How to make good reproducible pandas examples](/q/20109391/4518341). BTW, welcome to Stack Overflow! Check out the [tour], and [ask] if you want more tips. – wjandrea Aug 15 '23 at 18:31
  • Possible duplicate: [pandas DataFrame concat / update ("upsert")?](/q/33001585/4518341) – wjandrea Aug 15 '23 at 18:50
