6

I'm trying to use the value of one cell to find the value of a cell in another column. The first cell value ('source') dictates which column to lookup.

import pandas as pd

df = pd.DataFrame({'A': ['John', 'Andrew', 'Bob', 'Fred'], 'B': [
                  'Fred', 'Simon', 'Andrew', 'Andrew'], 'source': ['A', 'B', 'A', 'B']}, )

print(df)

        A       B source
0    John    Fred      A
1  Andrew   Simon      B
2     Bob  Andrew      A
3    Fred  Andrew      B

My required output value in the 'output' column is a lookup of the 'source':

        A       B source  output
0    John    Fred      A    John
1  Andrew   Simon      B   Simon
2     Bob  Andrew      A     Bob
3    Fred  Andrew      B  Andrew

Failed attempts

df['output'] = df[df['source']]

This results in a ValueError: Wrong number of items passed 4, placement implies 1 because the df['source'] passes in a Series, not a string. I tried converting to a string using:

df['output'] = df[df['source'].convertDTypes(convert_string=True)]

which gave error AttributeError: 'Series' object has no attribute 'convertDTypes'.

Working solution

I found a solution might by iterating through the rows using:

for index, row in df.iterrows():
    column = df.loc[index, 'source']
    df.at[index, 'output'] = df.loc[index, column]

However, this post suggests iterating is a bad idea. The code doesn't seem very elegant, either.

I feel I've missed something basic here; this really should not be that hard.

smci
  • 32,567
  • 20
  • 113
  • 146
ITSM-79
  • 63
  • 1
  • 5
  • Does this answer your question? [Pandas - select column using other column value as column name](https://stackoverflow.com/questions/45487312/pandas-select-column-using-other-column-value-as-column-name) – Leif Metcalf Jan 26 '22 at 02:01

5 Answers5

7

Let us do numpy way since lookup will not longer work in the future version

df['new'] = df.values[df.index,df.columns.get_indexer(df.source)]
df
Out[339]: 
        A       B source     new
0    John    Fred      A    John
1  Andrew   Simon      B   Simon
2     Bob  Andrew      A     Bob
3    Fred  Andrew      B  Andrew
BENY
  • 317,841
  • 20
  • 164
  • 234
4

Use numpy.where

df['output'] = np.where(df.source == 'A', df.A, df.B)

If you have more columns, use numpy.select

conditions = [df.source == 'A', df.source == 'B']
values = [df.A, df.B]
df['output'] = np.select(conditions, values)
Vishnudev Krishnadas
  • 10,679
  • 2
  • 23
  • 55
  • 2
    Thanks. In my actual problem, the 'source' column has at least eight different values which could all reference a different column. Using any kind of hard referencing to 'A' or 'B' will not work. The solution needs to go from reading the 'source' to looking up the correct column, and the finding the value in the same row as the 'source' to generate the 'output'. – ITSM-79 Mar 20 '21 at 19:25
2

Stack and then loc with multiindex for recent versions:

df['output'] = df.stack().loc[zip(df.index,df['source'])].droplevel(-1)

or:

df['output'] = (df.stack().loc[pd.MultiIndex.from_arrays((df.index,df['source']))]
                .droplevel(1))

For earlier versions of pandas:

df['output'] =  df.lookup(df.index,df['source'])

        A       B source  output
0    John    Fred      A    John
1  Andrew   Simon      B   Simon
2     Bob  Andrew      A     Bob
3    Fred  Andrew      B  Andrew
anky
  • 74,114
  • 11
  • 41
  • 70
  • 1
    Well, I'm going back to the drawing board to figure out this one. It works. and worked on my full size working DataFrame, but it will take me a while to figure out all the difference elements as I have never used stack or MultiIndex. Thanks for your help! – ITSM-79 Mar 20 '21 at 19:46
  • @ITSM-79 it would be easy if you print `df.stack()` and `pd.MultiIndex.from_arrays((df.index,df['source']))` seperately, then its just matching the indexes and dropping the extra index :) – anky Mar 20 '21 at 19:52
  • @ITSM-79 Added another method with `zip` – anky Mar 20 '21 at 20:54
2

Try this:

df['output'] = df.apply(lambda x: x[x.source], axis=1)

Output:

    A         B source  output
0   John    Fred    A   John
1   Andrew  Simon   B   Simon
2   Bob     Andrew  A   Bob
3   Fred    Andrew  B   Andrew
ashkangh
  • 1,594
  • 1
  • 6
  • 9
  • 2
    Love this solution for my simple problem. It's noted elsewhere that using 'apply' is slow, but this works here and is way better than my iterrows solution. – ITSM-79 Mar 20 '21 at 19:44
0

You can also do this simply by enumerate() ,list comprehension and loc[] accessor

df['output']=[df.loc[x,y] for x,y in enumerate(df['source'])]

Now If you print df you will get your desired output:

      A     B       source      output
0   John    Fred        A       John
1   Andrew  Simon       B       Simon
2   Bob     Andrew      A       Bob
3   Fred    Andrew      B       Andrew
Anurag Dabas
  • 23,866
  • 9
  • 21
  • 41