
I have a dataframe containing a column, let's call it "names", whose values are the names of other columns. I would like to add a new column that, for each row, holds the value from the column named in that row's "names" entry.

Example:

Input dataframe: pd.DataFrame.from_dict({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4], "names": ['a', 'b', 'a', 'b']})

|  a  |  b  | names |
| --- | --- | ----- |
|  1  | -1  | 'a'   |
|  2  | -2  | 'b'   |
|  3  | -3  | 'a'   |
|  4  | -4  | 'b'   |

Output dataframe: pd.DataFrame.from_dict({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4], "names": ['a', 'b', 'a', 'b'], "new_col": [1, -2, 3, -4]})

|  a  |  b  | names | new_col |
| --- | --- | ----- | ------- |
|  1  | -1  | 'a'   |    1    |
|  2  | -2  | 'b'   |   -2    |
|  3  | -3  | 'a'   |    3    |
|  4  | -4  | 'b'   |   -4    |

5 Answers


You can use lookup:

df['new_col'] = df.lookup(df.index, df.names)
df
#    a   b names  new_col
# 0  1  -1     a        1
# 1  2  -2     b       -2
# 2  3  -3     a        3
# 3  4  -4     b       -4

EDIT

lookup has been deprecated; here's the currently recommended solution:

import numpy as np

idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
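A complete, runnable sketch of the factorize recipe on the example frame (the comments are my own reading of how it works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4],
                   "names": ["a", "b", "a", "b"]})

# factorize maps each label in 'names' to an integer code (idx)
# and returns the unique labels in order of appearance (cols)
idx, cols = pd.factorize(df["names"])

# reorder the columns to match the factorized labels, then pick one
# value per row with (row, column) integer indexing on the ndarray
df["new_col"] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df["new_col"].tolist())  # → [1, -2, 3, -4]
```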

Because DataFrame.lookup is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt:

df['new_col'] = (
    df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False)
      .query('names == variable')
      .loc[df.index, 'value']
)

Output:

>>> df
   a  b names  new_col
0  1 -1     a        1
1  2 -2     b       -2
2  3 -3     a        3
3  4 -4     b       -4

Can this be simplified? For correctness, the index must not be ignored.
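To see why the query works, it helps to look at the intermediate melt: with ignore_index=False the original index is kept, so each row appears once per value column, and exactly one of those copies has names == variable. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4],
                   "names": ["a", "b", "a", "b"]})

# One row per (original row, value column) pair: 8 rows here,
# with the original index 0..3 repeated once per value column.
melted = df.melt(id_vars="names", value_vars=["a", "b"], ignore_index=False)

# Keep only the copies where the row's 'names' label matches the
# melted column label, then align back to df via the preserved index.
df["new_col"] = melted.query("names == variable").loc[df.index, "value"]
print(df["new_col"].tolist())  # → [1, -2, 3, -4]
```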


  • It looks like very low-level and incomprehensible code. This is a horrible, overkill API for a simple lookup target. It saddens me; I would not like to maintain code like this. – iroln Feb 13 '21 at 19:47
  • @iroln I agree. I have to reluctantly write such low-level code with Pandas all the time due to its limited APIs. If you look at many of my other answers that use Pandas, about half of them are comparably low-level. I wish there were cleaner alternatives. If it might help, you can add a link to this answer in your code as a reference. – Asclepius Feb 13 '21 at 20:26

Solution using pd.factorize (from a pandas issue):

import numpy as np

idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

With the straightforward lookup solution deprecated, another alternative to the pandas-based approaches proposed here is to convert df into a NumPy array and use NumPy indexing:

df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]

Let me explain what this does. df.values is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer function to convert the pandas row and column index names to index numbers that can be used with numpy:

>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)

>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)

(In this case, where the row index is already numerical, you could get away with simply using df.index as the first argument inside the bracket, but this does not work generally.)
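To illustrate that caveat, here is a small sketch with a non-numeric row index (a hypothetical variation of the example, not from the answer), where get_indexer is genuinely required:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4],
                   "names": ["a", "b", "a", "b"]},
                  index=["w", "x", "y", "z"])  # non-numeric row labels

rows = df.index.get_indexer(df["names"].index)  # positional row numbers
cols = df.columns.get_indexer(df["names"])      # positional column numbers
df["new_col"] = df.values[rows, cols]
print(df["new_col"].tolist())  # → [1, -2, 3, -4]
```

Here df.index itself could not be used as the first index, since "w".."z" are not valid positions in the array.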


Here's a short solution using df.melt and df.merge:

df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names'])

Outputs:

   key_0  a  b names  value
0      0  1 -1     a      1
1      1  2 -2     b     -2
2      2  3 -3     a      3
3      3  4 -4     b     -4

There's a redundant key_0 column which you need to drop with df.drop.
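As a sketch of the cleanup, the variant below is my own adaptation (not part of the answer): it merges on an explicit index column instead of the None trick, then drops the helper column and renames value to new_col:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4],
                   "names": ["a", "b", "a", "b"]})

# Melt every column; ignore_index=False keeps the original row index,
# which we surface as an 'index' column to merge on explicitly.
melted = df.melt(var_name="names", ignore_index=False)

out = (df.reset_index()
         .merge(melted.reset_index(), on=["index", "names"])
         .drop(columns="index")
         .rename(columns={"value": "new_col"}))
print(out["new_col"].tolist())  # → [1, -2, 3, -4]
```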
