import numpy as np
import pandas as pd
df2 = df.set_index('ID').apply(lambda x: (np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.name)]].to_dict().items())))).reshape(-1,2)), axis=1)
df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]).mean())
output:
ID
id4 0.01
id1 0.02
id3 0.08
dtype: float64
UPDATE:
or you can:
df2 = df.apply(lambda x: (np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items())))).reshape(-1,2)), axis=1)
df["MAE"] = df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]).mean())
well, well, well... now the explanation... let's go :))
notice: I explain the UPDATE part (the first version works the same way, except that ID is the index there, so it uses x.name instead of x.ID)
What is the problem? You have some columns and rows... and for each row, you want only the corresponding columns, meaning only the columns whose names start with that row's ID.
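To make it concrete, the input is assumed to look roughly like this (reconstructed from the outputs below; the 0.0 values in the non-matching cells are just placeholders):
# rough sketch of the assumed input: only the column layout and the
# cells matching each row's ID matter, the 0.0 values are placeholders
df = pd.DataFrame({
    'ID': ['id4', 'id1', 'id3'],
    'id1_signal1_true': [0.0, 0.41, 0.0],
    'id1_signal2_true': [0.0, 0.44, 0.0],
    'id4_signal1_true': [0.21, 0.0, 0.0],
    'id3_signal1_true': [0.0, 0.0, 0.54],
    'id3_signal2_true': [0.0, 0.0, 0.11],
    'id1_signal1_pred': [0.0, 0.41, 0.0],
    'id1_signal2_pred': [0.0, 0.48, 0.0],
    'id4_signal1_pred': [0.2, 0.0, 0.0],
    'id3_signal1_pred': [0.0, 0.0, 0.5],
    'id3_signal2_pred': [0.0, 0.0, 0.23],
})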
...Hmm, new idea...
df.apply(lambda x: x.index.str.startswith(x.ID), axis=1)
output:
0 [False, False, False, True, False, False, Fals...
1 [False, True, True, False, False, False, True,...
2 [False, False, False, False, True, True, False...
As you see, for every row this tells you, for each column, whether its name starts with (i.e. corresponds to) that row's ID or not (notice: the ID values are id4, id1, id3).
OK, next you must get all the corresponding columns, using:
df.apply(lambda x: (ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)], axis=1)
output:
1 2 3 4 5 6 7 8 9 10
0 NaN NaN id4_signal1_true NaN NaN NaN NaN id4_signal1_pred NaN NaN
1 id1_signal1_true id1_signal2_true NaN NaN NaN id1_signal1_pred id1_signal2_pred NaN NaN NaN
2 NaN NaN NaN id3_signal1_true id3_signal2_true NaN NaN NaN id3_signal1_pred id3_signal2_pred
As you know, you can pass a list of booleans as an index to a pandas Series and get back only the elements where the mask is True.
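A tiny illustration on a toy Series, just for this point:
s = pd.Series([0.2, 0.5, 0.7], index=['a', 'b', 'c'])
s[[True, False, True]]   # keeps only 'a' and 'c', the positions marked True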
...
wait a minute, it can be simpler... (because x.index is itself Series-like and supports .str directly, there is no need to build a new pd.Series from it):
df.apply(lambda x: x[x.index[x.index.str.startswith(x.ID)]], axis=1)
OK, we get all the corresponding columns, so what? Nothing yet; as you see, some columns are NaN and we must get rid of them, so convert each row's data to a list of name-value pairs using to_dict().items():
df.apply(lambda x: x[x.index[x.index.str.startswith(x.ID)]].to_dict().items(), axis=1)
output:
0 ((id4_signal1_true, 0.21), (id4_signal1_pred, ...
1 ((id1_signal1_true, 0.41), (id1_signal2_true, ...
2 ((id3_signal1_true, 0.54), (id3_signal2_true, ...
dtype: object
Why do we need the names? Because we need to calculate the MAE between the correct pairs...
OK, now we have the pairs, but in the wrong order... how can we sort them? We know that the two members of a correct pair have the same name except for the last part: pred and true... so let's sort them by name:
df.apply(lambda x: sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items()), axis=1)
output:
0 [(id4_signal1_pred, 0.2), (id4_signal1_true, 0...
1 [(id1_signal1_pred, 0.41), (id1_signal1_true, ...
2 [(id3_signal1_pred, 0.5), (id3_signal1_true, 0...
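Why does sorting pair them up? Within one pair the two names differ only in the last word, and pred sorts before true alphabetically, so each pred lands right next to its true; for example (names taken from the id1 row):
sorted(['id1_signal2_true', 'id1_signal1_pred', 'id1_signal1_true', 'id1_signal2_pred'])
# -> ['id1_signal1_pred', 'id1_signal1_true', 'id1_signal2_pred', 'id1_signal2_true']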
Oh yes, they are in the correct order now, so we can calculate the MAE for each pair. That means we can get rid of the names, so map over each list and keep only the second element of every tuple:
df.apply(lambda x: list(map(lambda t: t[1], sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))), axis=1)
output:
0 [0.2, 0.21]
1 [0.41, 0.41, 0.48, 0.44]
2 [0.5, 0.54, 0.23, 0.11]
dtype: object
OK... now we can calculate the MAE for each pair, but how do we convert each flat list into a list of pairs... hmmm... NumPy!!! Using .reshape(-1,2) we turn it into pairs (this is just the inner part of the df2 expression from the UPDATE above):
(np.array(list(map(lambda t: t[1], sorted(x[(ser:=pd.Series(x.index.to_list()))[ser.str.startswith(x.ID)]].to_dict().items())))).reshape(-1,2))
output:
0 [[0.2, 0.21]]
1 [[0.41, 0.41], [0.48, 0.44]]
2 [[0.5, 0.54], [0.23, 0.11]]
dtype: object
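A quick illustration of what .reshape(-1,2) does here (values taken from the id1 row above):
np.array([0.41, 0.41, 0.48, 0.44]).reshape(-1, 2)
# -> [[0.41, 0.41],
#     [0.48, 0.44]]   each row is one (pred, true) pair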
Wait a minute... we are already using NumPy... so why not use it further?
df.apply(lambda x: np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2), axis=1)
Convert the sorted output to a numpy.array (note: an array built from (name, value) tuples ends up with a string dtype, which is why we need .astype(float)) and take the second column using [:,1]. Now, just calculate the absolute difference within each pair:
df2.apply(lambda arr: np.array([abs(a[0] - a[1]) for a in arr]))
output:
0 [0.009999999999999981]
1 [0.0, 0.03999999999999998]
2 [0.040000000000000036, 0.12000000000000001]
dtype: object
We calculate the absolute difference for each pair... and, again, we can make it simpler:
df.apply(lambda x: np.abs(np.diff(np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2))), axis=1)
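Here np.diff takes the difference along the last axis, i.e. inside each pair, so it replaces the abs(a[0] - a[1]) comprehension above; roughly:
np.diff(np.array([[0.2, 0.21], [0.5, 0.54]]))
# -> approximately [[0.01], [0.04]]   (true - pred for each pair)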
And, finally, we calculate the mean of each numpy.array.
The third, simpler and faster way:
df.apply(lambda x: np.abs(np.diff(np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2))).mean(), axis=1)
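And, just like in the UPDATE above, you can attach the result straight to the original frame:
df["MAE"] = df.apply(lambda x: np.abs(np.diff(np.array(sorted(x[x.index[x.index.str.startswith(x.ID)]].to_dict().items()))[:,1].astype(float).reshape(-1,2))).mean(), axis=1)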
I attempted to explain it in simple words; I hope it is helpful.