import numpy as np
import pandas as pd
df2 = df.set_index('ID').apply(lambda x: (np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.name)]].to_dict().items())))).reshape(-1,2)), axis=1)
df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]).mean())
output:
ID
id4 0.01
id1 0.02
id3 0.08
dtype: float64
UPDATE:
or you can:
df2 = df.apply(lambda x: (np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items())))).reshape(-1,2)), axis=1)
df["MAE"] = df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]).mean())
well, well, well... now the explanation... let's go :))
notice: I explain the UPDATE part (the first version works the same way, except that ID is the index there, so it uses x.name instead of x.ID)
What is the problem? You have some columns and rows... and for each row, you want only the corresponding columns, meaning only the columns whose names start with that row's ID.
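To make it concrete, the input is assumed to look roughly like this (reconstructed from the outputs below; the 0.0 values in the non-matching cells are just placeholders):
# rough sketch of the assumed input: only the column layout and the
# cells matching each row's ID matter, the 0.0 values are placeholders
df = pd.DataFrame({
    'ID': ['id4', 'id1', 'id3'],
    'id1_signal1_true': [0.0, 0.41, 0.0],
    'id1_signal2_true': [0.0, 0.44, 0.0],
    'id4_signal1_true': [0.21, 0.0, 0.0],
    'id3_signal1_true': [0.0, 0.0, 0.54],
    'id3_signal2_true': [0.0, 0.0, 0.11],
    'id1_signal1_pred': [0.0, 0.41, 0.0],
    'id1_signal2_pred': [0.0, 0.48, 0.0],
    'id4_signal1_pred': [0.2, 0.0, 0.0],
    'id3_signal1_pred': [0.0, 0.0, 0.5],
    'id3_signal2_pred': [0.0, 0.0, 0.23],
})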
...Hmm, new idea...
df.apply(lambda x: x.index.str.startswith(x.ID), axis=1)
output:
0 [False, False, False, True, False, False, Fals...
1 [False, True, True, False, False, False, True,...
2 [False, False, False, False, True, True, False...
As you see, for every row this tells you, for each column, whether its name starts with (i.e. corresponds to) that row's ID or not (notice: the ID values are id4, id1, id3).
OK, next you must get all the corresponding columns, using:
df.apply(lambda x: (ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)], axis=1)
output:
1 2 3 4 5 6 7 8 9 10
0 NaN NaN id4_signal1_true NaN NaN NaN NaN id4_signal1_pred NaN NaN
1 id1_signal1_true id1_signal2_true NaN NaN NaN id1_signal1_pred id1_signal2_pred NaN NaN NaN
2 NaN NaN NaN id3_signal1_true id3_signal2_true NaN NaN NaN id3_signal1_pred id3_signal2_pred
As you know, you can pass a list of booleans as an index to a pandas Series and get back only the elements where the mask is True.
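A tiny illustration on a toy Series, just for this point:
s = pd.Series([0.2, 0.5, 0.7], index=['a', 'b', 'c'])
s[[True, False, True]]   # keeps only 'a' and 'c', the positions marked True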
...
wait a minute, it can be simpler... (because x.index is itself Series-like and supports .str directly, there is no need to build a new pd.Series from it):
df.apply(lambda x: x[x.index[x.index.str.startswith(x.ID)]], axis=1)
OK, we get all the corresponding columns, so what? Nothing yet; as you see, some columns are NaN and we must get rid of them, so convert each row's data to a list of name-value pairs using to_dict().items():
df.apply(lambda x: x[x.index[x.index.str.startswith(x.ID)]].to_dict().items(), axis=1)
output:
0 ((id4_signal1_true, 0.21), (id4_signal1_pred, ...
1 ((id1_signal1_true, 0.41), (id1_signal2_true, ...
2 ((id3_signal1_true, 0.54), (id3_signal2_true, ...
dtype: object
Why do we need the names? Because we need to calculate the MAE between the correct pairs...
OK, now we have the pairs, but in the wrong order... how can we sort them? We know that the two members of a correct pair have the same name except for the last part: pred and true... so let's sort them by name:
df.apply(lambda x: sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items()), axis=1)
output:
0 [(id4_signal1_pred, 0.2), (id4_signal1_true, 0...
1 [(id1_signal1_pred, 0.41), (id1_signal1_true, ...
2 [(id3_signal1_pred, 0.5), (id3_signal1_true, 0...
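Why does sorting pair them up? Within one pair the two names differ only in the last word, and pred sorts before true alphabetically, so each pred lands right next to its true; for example (names taken from the id1 row):
sorted(['id1_signal2_true', 'id1_signal1_pred', 'id1_signal1_true', 'id1_signal2_pred'])
# -> ['id1_signal1_pred', 'id1_signal1_true', 'id1_signal2_pred', 'id1_signal2_true']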
Oh yes, they are in the correct order now, so we can calculate the MAE for each pair. That means we can get rid of the names, so map over each list and keep only the second element of every tuple:
df.apply(lambda x: list(map(lambda t: t[1], sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))), axis=1)
output:
0 [0.2, 0.21]
1 [0.41, 0.41, 0.48, 0.44]
2 [0.5, 0.54, 0.23, 0.11]
dtype: object
OK... now we can calculate the MAE for each pair, but how do we convert each flat list into a list of pairs... hmmm... NumPy!!! Using .reshape(-1,2) we turn it into pairs (this is just the inner part of the df2 expression from the UPDATE above):
(np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items())))).reshape(-1,2))
output:
0 [[0.2, 0.21]]
1 [[0.41, 0.41], [0.48, 0.44]]
2 [[0.5, 0.54], [0.23, 0.11]]
dtype: object
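A quick illustration of what .reshape(-1,2) does here (values taken from the id1 row above):
np.array([0.41, 0.41, 0.48, 0.44]).reshape(-1, 2)
# -> [[0.41, 0.41],
#     [0.48, 0.44]]   each row is one (pred, true) pair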
Wait a minute... we are already using NumPy... so why not use it further?
df.apply(lambda x: np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2), axis=1)
Convert the sorted output to a numpy.array (note: an array built from (name, value) tuples ends up with a string dtype, which is why we need .astype(float)) and take the second column using [:,1]. Now, just calculate the absolute difference within each pair:
df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]))
output:
0 [0.009999999999999981]
1 [0.0, 0.03999999999999998]
2 [0.040000000000000036, 0.12000000000000001]
dtype: object
We calculate the absolute difference for each pair... and, again, we can make it simpler:
df.apply(lambda x: np.abs(np.diff(np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2))), axis=1)
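Here np.diff takes the difference along the last axis, i.e. inside each pair, so it replaces the abs(a[0] - a[1]) comprehension above; roughly:
np.diff(np.array([[0.2, 0.21], [0.5, 0.54]]))
# -> approximately [[0.01], [0.04]]   (true - pred for each pair)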
And, finally, we calculate the mean of each numpy.array.
The third, simpler and faster way:
df.apply(lambda x: np.abs(np.diff(np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2))).mean(), axis=1)
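And, just like in the UPDATE above, you can attach the result straight to the original frame:
df["MAE"] = df.apply(lambda x: np.abs(np.diff(np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2))).mean(), axis=1)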
I attempted to explain it in simple words; I hope it is helpful.