
Say I have a DataFrame indexed by 3 levels, created like this:

import pandas as pd
import numpy as np

arr = np.random.random((2, 4))
mdf = pd.DataFrame({'cid': [0, 1]})
pdf = pd.DataFrame({'doc_id': ['d1', 'd1', 'd2', 'd2'], 'passage_id': [0, 1, 0, 1]})
index = pd.MultiIndex.from_frame(mdf.join(pdf, how='cross'))
df = pd.DataFrame({'score': arr.flatten()}, index=index)

Also say I have an external DataFrame that only involves levels 1 and 2:

ddf = pd.DataFrame({'doc_id': ['d1', 'd1', 'd2', 'd2'], 'passage_id': [0, 1, 0, 1], 'wlen': [4, 3, 2, 1]}).set_index(['doc_id', 'passage_id'])

What I want is an efficient way to get the 'wlen' column into the score DataFrame df. But ddf has no concept of 'cid'.

Right now I am doing something like this:

df['wlen'] = df.groupby(level=[1, 2])['score'].transform(lambda g: ddf.loc[g.name]['wlen'])

It works, but I am not very happy with it because I am transforming 'score', which is not even involved. It seems very hacky.

I am looking for something that indexes ddf and then brings the result back to df.

`ddf.loc[df.index.droplevel(0)]` seems to index ddf correctly, but bringing it back to df is a problem: `df['wlen'] = ddf.loc[df.index.droplevel(0)]['wlen']` throws an error saying "cannot handle a non-unique multi-index!"

What does the error mean? And is there a more elegant way of doing what I want?

Vikash Balasubramanian
  • join would do it, `df.join(ddf, on=['doc_id', 'passage_id'], how='left')` – Ben.T Feb 11 '22 at 18:32
  • Just be very careful with `join`. The `on` arg **only** refers to the levels in the DataFrame **calling** the join. The index is used in the other regardless of naming and ordering: https://stackoverflow.com/questions/52373285/pandas-join-on-string-datatype/52373954#52373954. So for instance even though merge works regardless of ordering, if you just do `on=['passage_id', 'doc_id']` you will get an error – ALollz Feb 11 '22 at 18:36
  • thanks @ALollz, interesting to read. Maybe doing with reindex would be safer `df['wlen'] = ddf.reindex(df.index.droplevel(0))['wlen'].to_numpy()`? – Ben.T Feb 11 '22 at 19:22
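
For reference, the two suggestions from the comments can be put side by side as a minimal sketch. It reuses the setup from the question and assumes a pandas version that supports `how='cross'` in `DataFrame.join`. The error in the question most likely comes from index alignment: after `droplevel(0)` each `(doc_id, passage_id)` pair appears once per `cid`, so the resulting Series has a non-unique index that cannot be aligned back onto `df`. Assigning a plain array via `to_numpy()` sidesteps alignment and relies on row order instead.

```python
import numpy as np
import pandas as pd

# Setup copied from the question
arr = np.random.random((2, 4))
mdf = pd.DataFrame({'cid': [0, 1]})
pdf = pd.DataFrame({'doc_id': ['d1', 'd1', 'd2', 'd2'], 'passage_id': [0, 1, 0, 1]})
index = pd.MultiIndex.from_frame(mdf.join(pdf, how='cross'))
df = pd.DataFrame({'score': arr.flatten()}, index=index)

ddf = pd.DataFrame({
    'doc_id': ['d1', 'd1', 'd2', 'd2'],
    'passage_id': [0, 1, 0, 1],
    'wlen': [4, 3, 2, 1],
}).set_index(['doc_id', 'passage_id'])

# The dropped-level index repeats every (doc_id, passage_id) pair once per cid
dropped = df.index.droplevel(0)
has_duplicates = not dropped.is_unique

# Option 1 (Ben.T's first comment): join on the level names of df.
# The order in `on` has to match the order of ddf's index levels.
joined = df.join(ddf, on=['doc_id', 'passage_id'], how='left')

# Option 2 (Ben.T's second comment): look up by position, then assign a
# plain NumPy array so that no index alignment takes place.
df['wlen'] = ddf.reindex(dropped)['wlen'].to_numpy()
```

Option 1 returns a new DataFrame and leaves `df` untouched, while Option 2 adds the column in place. Option 2 depends on `ddf` having a unique index; a pair missing from `ddf` would show up as `NaN` rather than raise.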
