
I have a dataframe containing a column, let's call it "names", whose values are the names of other columns. I would like to add a new column that, for each row, holds the value from the column named in that row's "names" entry.

Example:

Input dataframe: pd.DataFrame.from_dict({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4], "names": ['a', 'b', 'a', 'b']})

|  a  |  b  | names |
| --- | --- | ----- |
|  1  | -1  | 'a'   |
|  2  | -2  | 'b'   |
|  3  | -3  | 'a'   |
|  4  | -4  | 'b'   |

Output dataframe: pd.DataFrame.from_dict({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4], "names": ['a', 'b', 'a', 'b'], "new_col": [1, -2, 3, -4]})

|  a  |  b  | names | new_col |
| --- | --- | ----- | ------- |
|  1  | -1  | 'a'   |    1    |
|  2  | -2  | 'b'   |   -2    |
|  3  | -3  | 'a'   |    3    |
|  4  | -4  | 'b'   |   -4    |

5 Answers


You can use lookup:

df['new_col'] = df.lookup(df.index, df.names)
df
#    a   b names  new_col
# 0  1  -1     a        1
# 1  2  -2     b       -2
# 2  3  -3     a        3
# 3  4  -4     b       -4

EDIT

lookup has been deprecated; here's the currently recommended solution:

import numpy as np

idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
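A complete, runnable sketch of the factorize recipe on the example frame (the comments are my own reading of how it works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4],
                   "names": ["a", "b", "a", "b"]})

# factorize maps each label in 'names' to an integer code (idx)
# and returns the unique labels in order of appearance (cols)
idx, cols = pd.factorize(df["names"])

# reorder the columns to match the factorized labels, then pick one
# value per row with (row, column) integer indexing on the ndarray
df["new_col"] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
print(df["new_col"].tolist())  # → [1, -2, 3, -4]
```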

Because DataFrame.lookup is deprecated as of Pandas 1.2.0, the following is what I came up with using DataFrame.melt:

df['new_col'] = (
    df.melt(id_vars='names', value_vars=['a', 'b'], ignore_index=False)
      .query('names == variable')
      .loc[df.index, 'value']
)

Output:

>>> df
   a  b names  new_col
0  1 -1     a        1
1  2 -2     b       -2
2  3 -3     a        3
3  4 -4     b       -4

Can this be simplified? For correctness, the index must not be ignored.
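To see why the query works, it helps to look at the intermediate melt: with ignore_index=False the original index is kept, so each row appears once per value column, and exactly one of those copies has names == variable. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4],
                   "names": ["a", "b", "a", "b"]})

# One row per (original row, value column) pair: 8 rows here,
# with the original index 0..3 repeated once per value column.
melted = df.melt(id_vars="names", value_vars=["a", "b"], ignore_index=False)

# Keep only the copies where the row's 'names' label matches the
# melted column label, then align back to df via the preserved index.
df["new_col"] = melted.query("names == variable").loc[df.index, "value"]
print(df["new_col"].tolist())  # → [1, -2, 3, -4]
```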


  • It looks like very low-level and incomprehensible code. This is a horrible, overkill API for a simple lookup target. It saddens me; I would not like to maintain code like this. – iroln Feb 13 '21 at 19:47
  • @iroln I agree. I have to reluctantly write such low-level code with Pandas all the time due to its limited APIs. If you look at many of my other answers that use Pandas, about half of them are comparably low-level. I wish there were cleaner alternatives. If it might help, you can add a link to this answer in your code as a reference. – Asclepius Feb 13 '21 at 20:26

Solution using pd.factorize (from a pandas issue):

import numpy as np

idx, cols = pd.factorize(df['names'])
df['new_col'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]

With the straightforward lookup solution deprecated, another alternative to the pandas-based approaches proposed here is to convert df into a NumPy array and use NumPy indexing:

df['new_col'] = df.values[df.index.get_indexer(df['names'].index), df.columns.get_indexer(df['names'])]

Let me explain what this does. df.values is a numpy array based on the DataFrame. As numpy arrays have to be indexed numerically, we need to use the get_indexer function to convert the pandas row and column index names to index numbers that can be used with numpy:

>>> df.index.get_indexer(df['names'].index)
array([0, 1, 2, 3], dtype=int64)

>>> df.columns.get_indexer(df['names'])
array([0, 1, 0, 1], dtype=int64)

(In this case, where the row index is already numerical, you could get away with simply using df.index as the first argument inside the bracket, but this does not work generally.)
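To illustrate that caveat, here is a small sketch with a non-numeric row index (a hypothetical variation of the example, not from the answer), where get_indexer is genuinely required:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4],
                   "names": ["a", "b", "a", "b"]},
                  index=["w", "x", "y", "z"])  # non-numeric row labels

rows = df.index.get_indexer(df["names"].index)  # positional row numbers
cols = df.columns.get_indexer(df["names"])      # positional column numbers
df["new_col"] = df.values[rows, cols]
print(df["new_col"].tolist())  # → [1, -2, 3, -4]
```

Here df.index itself could not be used as the first index, since "w".."z" are not valid positions in the array.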


Here's a short solution using df.melt and df.merge:

df.merge(df.melt(var_name='names', ignore_index=False), on=[None, 'names'])

Outputs:

   key_0  a  b names  value
0      0  1 -1     a      1
1      1  2 -2     b     -2
2      2  3 -3     a      3
3      3  4 -4     b     -4

There's a redundant key_0 column which you need to drop with df.drop.
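As a sketch of the cleanup, the variant below is my own adaptation (not part of the answer): it merges on an explicit index column instead of the None trick, then drops the helper column and renames value to new_col:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [-1, -2, -3, -4],
                   "names": ["a", "b", "a", "b"]})

# Melt every column; ignore_index=False keeps the original row index,
# which we surface as an 'index' column to merge on explicitly.
melted = df.melt(var_name="names", ignore_index=False)

out = (df.reset_index()
         .merge(melted.reset_index(), on=["index", "names"])
         .drop(columns="index")
         .rename(columns={"value": "new_col"}))
print(out["new_col"].tolist())  # → [1, -2, 3, -4]
```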
