Say I have a dataframe indexed by 3 levels, created like this:
import pandas as pd
import numpy as np
arr = np.random.random((2, 4))
mdf = pd.DataFrame({'cid': [0, 1]})
pdf = pd.DataFrame({'doc_id': ['d1', 'd1', 'd2', 'd2'], 'passage_id': [0, 1, 0, 1]})
index = pd.MultiIndex.from_frame(mdf.join(pdf, how='cross'))
df = pd.DataFrame({'score': arr.flatten()}, index=index)
Also say I have an external dataframe that only involves levels 1 and 2 (doc_id and passage_id):
ddf = pd.DataFrame({'doc_id': ['d1', 'd1', 'd2', 'd2'], 'passage_id': [0, 1, 0, 1], 'wlen': [4, 3, 2, 1]}).set_index(['doc_id', 'passage_id'])
What I want is an efficient way to get the 'wlen' column into the score dataframe df. But ddf has no concept of 'cid'.
Right now I am doing something like this:
df['wlen'] = df.groupby(level=[1, 2])['score'].transform(lambda g: ddf.loc[g.name]['wlen'])
It works, but I am not very happy with it because I am transforming 'score', which is not even involved in the lookup. It seems very hacky.
I am looking for something that indexes ddf and then brings it back to df
ddf.loc[df.index.droplevel(0)]
seems to index ddf correctly. But bringing it back to df is a problem:
df['wlen'] = ddf.loc[df.index.droplevel(0)]['wlen']
throws an error saying "cannot handle a non-unique multi-index!".
What does the error mean? And is there a more elegant way of doing what I want?
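For reference, here is a minimal sketch of one possible approach, not necessarily the most idiomatic one. It assumes the likely cause of the error is index alignment: assigning a Series to a column makes pandas reindex that Series to df's index, and the Series returned by the ddf lookup carries a two-level index in which every (doc_id, passage_id) pair repeats once per cid. The names keys and merged are introduced only for illustration. Stripping the index with to_numpy() makes the assignment positional, so no alignment is attempted; the merge variant is an alternative that avoids the lookup entirely.

```python
import numpy as np
import pandas as pd

# Rebuild the frames from the question.
arr = np.random.random((2, 4))
mdf = pd.DataFrame({'cid': [0, 1]})
pdf = pd.DataFrame({'doc_id': ['d1', 'd1', 'd2', 'd2'], 'passage_id': [0, 1, 0, 1]})
index = pd.MultiIndex.from_frame(mdf.join(pdf, how='cross'))
df = pd.DataFrame({'score': arr.flatten()}, index=index)
ddf = pd.DataFrame({'doc_id': ['d1', 'd1', 'd2', 'd2'], 'passage_id': [0, 1, 0, 1],
                    'wlen': [4, 3, 2, 1]}).set_index(['doc_id', 'passage_id'])

# Each (doc_id, passage_id) pair repeats once per cid, so this index has duplicates.
keys = df.index.droplevel('cid')
print(keys.is_unique)  # False

# Look up wlen for every row of df (ddf's own index is unique, so reindex is allowed),
# then drop the lookup's index so the assignment is positional rather than aligned.
df['wlen'] = ddf['wlen'].reindex(keys).to_numpy()

# Alternative sketch: merge on the shared columns and restore the 3-level index.
merged = (df[['score']].reset_index()
          .merge(ddf.reset_index(), on=['doc_id', 'passage_id'], how='left')
          .set_index(['cid', 'doc_id', 'passage_id']))
```

Both variants rely on the row order of df being preserved, which holds for reindex with an explicit target and for a left merge. Depending on the pandas version, df.join(ddf) may also work here, since join can match two MultiIndexes on their overlapping level names, but the order of the resulting index levels has varied between releases, so it is worth checking the output.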