0

suppose I have 2 dataframes of the same shape. In the first column of each dataframe I have a product id (parent of variant items), in the following columns I have some data (pre-processed product features / numbers), which is different in the two dataframes, and in the last column I have the total amount of variant items pro product (children to parent product id)

first dataframe:

dfaa = pd.DataFrame([['id1', 1, 2, 3, 3], ['id2',4, 5, 6,6 ], ['id3', 7, 8, 9,9]], columns=['prod_id','a', 'b','c','number of prod variants'])

second dataframe:

dfbb = pd.DataFrame([['id1', 1.1, 2.2, 3.3, 3], ['id2',4.4, 5.4, 6.6,6 ], ['id3', 7.7, 8.8, 9.9,9]], columns=['prod_id','a', 'b','c','number of prod variants'])

original dataframes

What I need to do is join these dataframes to form one dataframe with a multiindex like this:

dataframe(s) with multiindex

The first option would be an extra index level for each feature consisting of two columns on the lower level for the two values from 2 original dataframes. The second option I could think of is just concatenating the features along columns and then adding an extra index level which describes the numbers (non-NaN-values and unique values).

For the first option it could be neccessary to modify the names of the columns of the lower index level (e.g. instead of a and a I could work with a_vals and a_unique) - that would be no problem.

Trying real hard to get hold of working with data in python, I really appreciate your help.

anama
  • 1
  • 2
  • This is not how merging works. Please review [SO: pandas merging 101](https://stackoverflow.com/q/53645882/7758804) and [pandas User Guide: Merge, join, concatenate and compare](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html) – Trenton McKinney Feb 06 '21 at 16:34
  • thanks for the hint, I've just edited the code. but sorry for the rest: I haven't mentioned merge anywhere^^ – anama Feb 06 '21 at 16:40
  • _What I need to do is join these dataframes_ `join` and `merge` are essentially the same thing. – Trenton McKinney Feb 06 '21 at 16:42

1 Answers1

1

Looking at one of your target structures, it can be built by stack() and unstack()

dfaa = pd.DataFrame([['id1', 1, 2, 3, 3], ['id2',4, 5, 6,6 ], ['id3', 7, 8, 9,9]], columns=['prod_id','a', 'b','c','number of prod variants'])
dfbb = pd.DataFrame([['id1', 1.1, 2.2, 3.3, 3], ['id2',4.4, 5.4, 6.6,6 ], ['id3', 7.7, 8.8, 9.9,9]], columns=['prod_id','a', 'b','c','number of prod variants'])

def prepdf(df, cat):
    return (df.loc[:,[c for c in dfaa.columns if "number" not in c]]
     .set_index("prod_id")
     .stack()
     .to_frame()
     .assign(cat=cat)
    )

dfm = (pd.concat([
    prepdf(dfaa, "VALS"),
    prepdf(dfbb, "V_UN")])
 .set_index("cat", append=True)
 .unstack([2,1])
 .droplevel(0, axis=1)
 .join(dfbb.loc[:,["prod_id","number of prod variants"]]
       .set_index("prod_id")
      .rename(columns={"number of prod variants":("","number of prod variants")}))
)

output

cat     VALS           V_UN                                  
           a    b    c    a    b    c number of prod variants
prod_id                                                      
id1      1.0  2.0  3.0  1.1  2.2  3.3                       3
id2      4.0  5.0  6.0  4.4  5.4  6.6                       6
id3      7.0  8.0  9.0  7.7  8.8  9.9                       9
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30