1

Apologies in advance for duplicate question, but I have yet to find a solution that works from previous questions.

I am unable to join two data-frames with different MultiIndexes. I want to keep all the columns from both data-frames.

Given that df1 has ~300k rows and df2 has ~50k rows the join would be many:1 between df1:df2.

df1                  B  path_id
cust_id date                   
11      2015-02-24  10       13
28      2015-02-25  16       22
23      2015-02-26  21       19
15      2015-02-27  11       28
18      2015-02-28  29       10


df2               C
cust_id path_id    
11      13       10
28      22       26
23      19       22
15      28       27
18      10       18

The goal is to assign column C to all matching combinations of index cust_id & column path_id. See df3 below as an example.

df3                     B   C  path_id
cust_id date                       
11      2015-02-24  10  10       13
28      2015-02-25  16  26       22
23      2015-02-26  21  22       19
15      2015-02-27  11  27       28
18      2015-02-28  29  18       10

Appreciate any response on this. Thank you!

W.Powell
  • 23
  • 6
  • It would help greatly if you created some dummy dataframe with your structure and then also, showed the expected output. https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Scott Boston Jul 10 '20 at 17:21
  • 1
    Thanks for the advice. I have updated the post. – W.Powell Jul 10 '20 at 17:45
  • Since it appears that your index levels have names, you can use more on "cust_id". try `df1.merge(df2, on=['cust_id','path_id'])` – Scott Boston Jul 10 '20 at 17:58

3 Answers3

1

Well I figured it out. I'm not sure if this is the best way but I just reset the indexes of both data frames and merged on the columns. See code below.

df1.reset_index()
df2.reset_index()
df3 = df1.merge(df2, on=['cust_id', 'path_id'])

I then reassigned the indexes afterwards. If there is a better way please let me know.

Thanks!

W.Powell
  • 23
  • 6
  • Yep. This is one way to do it. I posted a solution below where you didn't have to do both index levels on both dataframes. – Scott Boston Jul 10 '20 at 18:05
0

Try:

df1.reset_index('date').merge(df2, on=['cust_id','path_id'])

Output:

               date   b  path_id   c
cust_id                             
11       2015-02-24  10       13  10
28       2015-02-25  16       22  26
23       2015-02-26  21       19  22
15       2015-02-27  11       28  27
18       2015-02-28  29       10  18
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Thanks. I believe my company is using an older version of pandas and I can't join on indexes. When I try to put an index label into the join array it returns that the column doesn't exist. – W.Powell Jul 10 '20 at 18:07
  • @W.Powell I do remember that being an issue with an older version of pandas. What version of pandas are you running out of curiosity? – Scott Boston Jul 10 '20 at 18:57
  • pd.__version__ = 0.25.1 – W.Powell Jul 10 '20 at 19:27
0

Based on what I could understand by your question, these are two options that could help you.

  1. The following code would perform a full (inner) merge based on two indexes i.e: cust_id and path_id and will store the rest data as columns of the dataframe.
df3 = pd.merge(df1, df2,on=["cust_id", "path_id"])
  1. The following code would perform the above as well as will make the column (cust_id) the index for the new dataframe.
df3 = pd.merge(df1, df2,on=["cust_id", "path_id"]).set_index('cust_id')