I am trying to target-encode the categorical columns of a feature array X using a 0-1 target array y, i.e. replace each level of a feature column x_i with the mean value of the target (the fraction of 1's) for that level.
The following code is likely to be inefficient because of the two nested loops that mimic the group-by. Is there any room for improvement in this implementation (while avoiding the slow pandas group-by)? Thank you.
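To make the goal concrete, here is a tiny illustration of the intended result on a single column. The arrays `col` and `target` are made up for this example and are not part of my real data.

```python
import numpy as np

# Hypothetical toy data: one categorical column and a 0-1 target
col = np.array(["a", "b", "a", "c", "b", "a"])
target = np.array([1, 0, 0, 1, 1, 1])

# Mean of the target per level, i.e. the fraction of 1's for that level
level_means = {lvl: target[col == lvl].mean() for lvl in np.unique(col)}

# Replace every entry of the column with the mean of its level
encoded = np.array([level_means[v] for v in col])
```

Here level "a" appears three times with targets 1, 0, 1, so every "a" becomes 2/3; "b" becomes 0.5 and "c" becomes 1.0.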
import numpy as np
np.random.seed(9)
rows, cols = 10_000, 500
X = np.random.choice(['a', 'b', 'c', 'd', 'e', 'f', 'g'], size=(rows, cols))
y = np.random.choice([0, 1], size=(rows, 1))
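# learn encoding
maps_num = {}  # column index (as str) -> {level: mean target}
for colum in range(X.shape[1]):
    c = X[:, colum]
    if c.dtype.kind == "U":  # only string (categorical) columns
        unique = np.unique(c)
        tmap_num = {}
        for uni in unique:
            tmap_num[uni] = y[c == uni].mean()
        maps_num[str(colum)] = tmap_num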
# apply encoding
X = X.astype('<U32')  # widen the string dtype so the encoded values fit
for col, tmap in maps_num.items():
    vals = np.full(X.shape[0], np.nan)
    for val, mean_target in tmap.items():
        vals[X[:, int(col)] == val] = mean_target
    X[:, int(col)] = vals
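One direction that might remove the inner loop over levels is to let `np.unique(..., return_inverse=True)` assign an integer code to each level and then compute the per-level sums and counts with `np.bincount`. The following is only a sketch under some assumptions: the function name `target_encode` and the demo arrays are hypothetical, every column is assumed to be categorical, the encoding is learned and applied on the same data, and the result is returned as a float array instead of being written back into the string array.

```python
import numpy as np

def target_encode(X, y):
    """Return a float array where each level is replaced by its target mean."""
    y = np.asarray(y, dtype=float).ravel()
    encoded = np.empty(X.shape, dtype=float)
    for j in range(X.shape[1]):
        # inverse[i] is the position of X[i, j] among the sorted unique levels
        levels, inverse = np.unique(X[:, j], return_inverse=True)
        inverse = inverse.ravel()
        # Per-level sum of the target and per-level row count
        sums = np.bincount(inverse, weights=y, minlength=levels.size)
        counts = np.bincount(inverse, minlength=levels.size)
        # Look up each row's level mean through its integer code
        encoded[:, j] = (sums / counts)[inverse]
    return encoded

# Small hypothetical demo, much smaller than the real problem
rng = np.random.default_rng(9)
X_demo = rng.choice(list("abcdefg"), size=(1000, 5))
y_demo = rng.integers(0, 2, size=(1000, 1))
X_enc = target_encode(X_demo, y_demo)
```

This still loops over columns in Python, but the work per column is done by NumPy calls instead of one boolean mask per level. I have not benchmarked it on the full 10,000 x 500 array, so any speed-up is an expectation and not a measured result.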