0

I have 2 dataframes that I want to be merged together.

Df1 : sales dataframe, only with sold products. if not sold, not there. ALL WEEKS 1 to 53 for 2019/2020/2021

Year   Week   Store   Article   Sales Volume
2019   11     SF      sku1      500
2021   16     NY      sku2      20
2020   53     PA      sku1      500
2021   01     NY      sku3      200
2019   11     SF      sku1      455
2021   16     NY      sku2      20

df2: is a stock dataframe. Entire product range, even if not sold, it appears. Only with stock at WEEK 16 for each 2019/2020/2021 year for each ALL products

Year   Week   Store   Article   Stock Volume
2019   16     SF      sku1      500
2021   16     NY      sku2      20
2020   16     PA      sku4      500
2021   16     NY      sku5      200
2019   16     SF      sku65      455
2021   16     NY      sku2000      20
                      ...

I have tried to merge both dfs by doing this (I wanted to get all Articles but the drawback is that I loose the other weeks):

merged = pd.merge(df1,df2, how = "right", right_on=["Article ID", "Year ID", "Week ID", "Store ID"], left_on=["Article", "Year", "Week", "Store"])

But I only get the sales value associated to week 16 stock and I lose all the other weeks.

So I tried a left join

merged = pd.merge(df1,df2, how = "left", right_on=["Article ID", "Year ID", "Week ID", "Store ID"], left_on=["Article", "Year", "Week", "Store"])

Now I have all the weeks but I am missing some products stocks

I need to keep ALL PRODUCTS of df2 while also keeping weeks of sales of df1. Is there a way to merge both dfs by keeping the entire stock depth and the entire sales weeks ?

Thanks for your help !!

2 Answers2

0

You can try this

merged = pd.merge(df1, df2, on='year')

Source: how to merge two data frames based on particular column in pandas python?

Eddy Piedad
  • 336
  • 1
  • 8
0

You need a full outer join in order to not lose any Sales from df1 or Product from df2:

merged = pd.merge(df1,df2, how = "outer", right_on=["Article ID", "Year ID", "Week ID", "Store ID"], left_on=["Article", "Year", "Week", "Store"])
IoaTzimas
  • 10,538
  • 2
  • 13
  • 30
  • Thanks for your answer. Now i get a memory error "unable to allocate 2.25GiB". Then len is bout 7.5 million rows. originally df1 is about 6.3M rows and df2 735K rows. I can't even print it to see if accurate or not. Do you have any ideas ? Thanks!! – Idris_rhn May 06 '21 at 07:30
  • Try with a subset to see if it works properly: `merged = pd.merge(df1[:100],df2[:100], how = "outer", right_on=["Article ID", "Year ID", "Week ID", "Store ID"], left_on=["Article", "Year", "Week", "Store"])` – IoaTzimas May 06 '21 at 10:48