
I'm using the method .lookup() across two distinct dataframes, as in df2.lookup(df1.index, df1.column) (i.e., this is different from "Pandas - select column using other column value as column name").

Consider the following MWE:

import numpy as np
import pandas as pd

# Parameters
lo = -5
hi = 5
n = 4
idx = range(n)
rep = 2

# DF 1
idx_1 = np.tile(idx, rep) 
data_1 =  np.random.randint(lo, hi, n*rep)
df_1 = pd.DataFrame(data_1, index=idx_1, columns=['column']) 

# DF 2
idx_2 = idx
col_2 = range(lo, hi+1)
data_2 = np.random.rand(n, len(col_2))
df_2 = pd.DataFrame(data_2, index=idx_2, columns=col_2) 

# Result
result = df_2.lookup(df_1.index, df_1.column)

This is, in my opinion, very convenient and easy to understand. However, pandas warns:

FutureWarning: The 'lookup' method is deprecated and will be removed in a future version. You can use DataFrame.melt and DataFrame.loc as a substitute.

Unfortunately, I don't know how the suggested substitute works.

An intuitive but rather inefficient solution would be

result = [df_2.loc[df_1.index[i], df_1.iloc[i, 0]] for i in range(n*rep)]

Is there an easy-to-implement replacement for df.lookup() in the task above that uses only built-ins?

Mark Rotteveel
clueless
  • A quick look at the documentation should solve this issue: https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-lookup – VicN Feb 17 '22 at 10:26
  • I actually did look at the snippet, but I was not able to apply the provided solution to my problem. – clueless Feb 17 '22 at 10:28
  • Agreed, the documentation gives a solution that only seems to work if you want to look up all the rows, with a different column for each row. That covers only half of the lookup function's utility – Danny Mar 09 '23 at 22:07
  • Does this answer your question? [Pandas Lookup to be deprecated - elegant and efficient alternative](https://stackoverflow.com/questions/65882258/pandas-lookup-to-be-deprecated-elegant-and-efficient-alternative) – Danny Mar 09 '23 at 23:09

1 Answer


The following seems to work in about the same time (slightly faster) as df.lookup:

df_2.to_numpy()[df_2.index.get_indexer(df_1.index), df_2.columns.get_indexer(df_1.column)]

Or to put it in code that better matches the old df.lookup API:

df.to_numpy()[df.index.get_indexer(row_labels), df.columns.get_indexer(col_labels)]
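For illustration, the one-liner can be wrapped in a small helper and checked against the label-by-label loop from the question. This is only a sketch: the function name `lookup` and the tiny deterministic frames are made up for this example. One thing worth knowing is that `get_indexer` returns -1 for labels it cannot find, and NumPy would silently read -1 as "last row/column", so the sketch raises instead. It also assumes the looked-up frame has a unique index and unique columns, which `get_indexer` requires.

```python
import numpy as np
import pandas as pd


def lookup(df, row_labels, col_labels):
    """Hypothetical helper mimicking the old df.lookup(row_labels, col_labels)."""
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    # get_indexer marks labels it cannot find with -1, which NumPy would
    # silently treat as "last row/column", so fail loudly instead.
    if (rows == -1).any() or (cols == -1).any():
        raise KeyError("One or more row or column labels were not found")
    return df.to_numpy()[rows, cols]


# Small deterministic frames in the shape of the question's MWE
# (df_1 has a repeated index, df_2 has one column per possible value).
df_1 = pd.DataFrame({"column": [-1, 0, 1, 1, 0, -1]}, index=[0, 1, 2, 0, 1, 2])
df_2 = pd.DataFrame(
    np.arange(9).reshape(3, 3), index=[0, 1, 2], columns=[-1, 0, 1]
)

result = lookup(df_2, df_1.index, df_1["column"])

# Reference: the slow, label-by-label loop from the question
expected = [df_2.loc[r, c] for r, c in zip(df_1.index, df_1["column"])]
print(result.tolist() == [int(v) for v in expected])
```

With this in place, `lookup(df_2, df_1.index, df_1["column"])` reads much like the deprecated `df_2.lookup(df_1.index, df_1.column)`.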

I tested both the old lookup function and this new approach 100k times on a very small and a moderately large (100k x 4) DataFrame, and in both cases the alternative ran marginally faster (39 seconds compared to 41.5 seconds for lookup).
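The timing script itself is not shown here, so the following is only a rough sketch of how such a comparison could be set up with `timeit`. The column names, the number of looked-up labels (1,000) and the repeat count (100) are assumptions made for this example, not the settings behind the figures above. Since `DataFrame.lookup` no longer exists in recent pandas releases, it is only timed when the installed version still provides it.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols = 100_000, 4  # the "moderately large" size mentioned above

df = pd.DataFrame(rng.random((n_rows, n_cols)), columns=list("abcd"))
# Assumed workload: 1,000 random (row, column) label pairs
row_labels = rng.integers(0, n_rows, size=1_000)
col_labels = rng.choice(df.columns.to_numpy(), size=1_000)


def via_get_indexer():
    # Translate labels to positions, then index the underlying array
    return df.to_numpy()[
        df.index.get_indexer(row_labels), df.columns.get_indexer(col_labels)
    ]


candidates = {"get_indexer": via_get_indexer}
# Only time the deprecated method where the installed pandas still has it
if hasattr(pd.DataFrame, "lookup"):
    candidates["lookup"] = lambda: df.lookup(row_labels, col_labels)

timings = {name: timeit.timeit(fn, number=100) for name, fn in candidates.items()}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s for 100 calls")
```

Absolute numbers will depend on the machine, the pandas version and the workload, so treat any result as indicative only.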

Danny