
I have a pandas DataFrame in this format:

                           TRUTH  A  B  C  CLASS
2020-01-01 00:00:00+00:00      1  1  2  1      A
2020-01-02 00:00:00+00:00      2  1  2  2      B
2020-01-03 00:00:00+00:00      3  2  2  3      C
2020-01-04 00:00:00+00:00      4  4  3  3      A
2020-01-05 00:00:00+00:00      3  8  3  3      C
...

The columns A, B and C represent predictions and TRUTH is the actual value. The column CLASS tells which prediction is the preferred prediction.

I want to build the final prediction by taking, for each row, the value from the column named in CLASS. That means the value from column A (1), then the value from B (2), then the value from C (3), then the value from A (4), then the value from C (3).

The result would be this:

                           TRUTH  PREDICTION  A  B  C  CLASS
2020-01-01 00:00:00+00:00      1           1  1  2  1      A
2020-01-02 00:00:00+00:00      2           2  1  2  2      B
2020-01-03 00:00:00+00:00      3           3  2  2  3      C
2020-01-04 00:00:00+00:00      4           4  4  3  3      A
2020-01-05 00:00:00+00:00      3           3  8  3  3      C
...

I have sample code that does this, but it's a little slow:

df["PREDICTION"] = [df.loc[i, col] for i, col in zip(df.index, df["CLASS"])]

There is most likely a better way to do this kind of manipulation, but I don't know what it is.

Anton
    Standard approach is covered in [my answer here](https://stackoverflow.com/a/69352473/15497888) and in [the docs](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-lookup). `factorize` -> `idx, col = pd.factorize(df['CLASS'])`. Then use numpy indexing to create the new column `df["PREDICTION"] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]` – Henry Ecker Feb 11 '22 at 14:10
    Could also use [insert](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.insert.html) instead of `=` if you wanted the column to appear exactly where it is in the expected output: `df.insert(1, 'PREDICTION', df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx])` – Henry Ecker Feb 11 '22 at 14:12
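
For reference, the factorize-and-index approach from the comments can be put together as a small runnable sketch. The sample frame below is rebuilt from the five rows shown in the question, and the variable names `values` and `prediction` are illustrative only. It assumes every label in CLASS is the name of an existing column.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the five rows shown in the question.
df = pd.DataFrame(
    {
        "TRUTH": [1, 2, 3, 4, 3],
        "A": [1, 1, 2, 4, 8],
        "B": [2, 2, 2, 3, 3],
        "C": [1, 2, 3, 3, 3],
        "CLASS": ["A", "B", "C", "A", "C"],
    },
    index=pd.date_range("2020-01-01", periods=5, freq="D", tz="UTC"),
)

# idx[i] is the integer code of the label in row i;
# cols holds the unique labels in order of first appearance.
idx, cols = pd.factorize(df["CLASS"])

# Reorder the prediction columns so that column position matches the codes,
# then pick one value per row with NumPy integer indexing.
values = df.reindex(columns=cols).to_numpy()
prediction = values[np.arange(len(df)), idx]

# Insert at position 1 so the column sits right after TRUTH.
df.insert(1, "PREDICTION", prediction)
print(df)
```

This replaces one `df.loc` lookup per row with a single array indexing step, which is likely where most of the speedup over the list comprehension comes from. If CLASS could contain a label with no matching column, `reindex` would fill that column with NaN, so it may be worth validating the labels first.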
