
I am trying to target-encode the categorical columns of a feature array X using a 0-1 target array y, i.e. replace each level of a categorical feature x_i with the mean of the target (the fraction of 1's) over the rows that have that level.
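
To make the definition concrete, here is a tiny worked example on a single column. The toy values are made up for illustration and are not the data used in the code further down:

```python
import numpy as np

# Toy column and binary target (illustrative values only).
col = np.array(["a", "b", "a", "b", "a", "c"])
target = np.array([1, 0, 0, 1, 1, 0])

# Mean of the target per level, i.e. the fraction of 1's among rows with that level.
encoding = {str(level): float(target[col == level].mean()) for level in np.unique(col)}

# Replace each level by its mean target.
encoded = np.array([encoding[v] for v in col])
```

Here level "a" has targets 1, 0, 1 and is encoded as 2/3, "b" has targets 0, 1 and is encoded as 0.5, and "c" has a single target 0 and is encoded as 0.0.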

The following code is likely inefficient because of the two nested loops that mimic a group-by. Is there room for improvement in this implementation (while avoiding the slow pandas group-by)? Thank you.

import numpy as np

np.random.seed(9)
rows, cols = 10_000, 500
X = np.random.choice(['a', 'b', 'c', 'd', 'e', 'f', 'g'], size=(rows, cols))
y = np.random.choice([0, 1], size=(rows, 1))

# learn encoding: for each string column, map each level to the mean of y
maps_num = {}
for colum in range(X.shape[1]):
    c = X[:, colum]
    if c.dtype.kind == "U":
        unique = np.unique(c)
        tmap_num = {}
        for uni in unique:
            tmap_num[uni] = y[c == uni].mean()
        maps_num[str(colum)] = tmap_num

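# apply encoding (in place; the encoded values end up stored as strings)
X = X.astype('<U32')  # widen the dtype so the formatted floats are not truncated
for col, tmap in maps_num.items():
    vals = np.full(X.shape[0], np.nan)
    for val, mean_target in tmap.items():
        vals[X[:, int(col)] == val] = mean_target
    X[:, int(col)] = vals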
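
One possible direction, offered as a rough sketch rather than a benchmarked answer: the inner loop over levels can be replaced with `np.unique(..., return_inverse=True)` plus `np.bincount` when learning, and with `np.searchsorted` when applying. The helper names `fit_target_encoding` and `apply_target_encoding` are hypothetical. The sketch assumes every column of X is categorical, and it writes the result to a new float array instead of overwriting X as strings, which differs from the code above.

```python
import numpy as np

def fit_target_encoding(X, y):
    """Learn per-column level means (hypothetical helper, assumes all columns are categorical)."""
    y = np.asarray(y, dtype=float).ravel()
    maps = {}
    for j in range(X.shape[1]):
        # levels is sorted; inverse holds the level index of every row.
        levels, inverse = np.unique(X[:, j], return_inverse=True)
        inverse = inverse.ravel()
        sums = np.bincount(inverse, weights=y, minlength=len(levels))
        counts = np.bincount(inverse, minlength=len(levels))
        maps[j] = (levels, sums / counts)  # counts > 0 because levels come from the column
    return maps

def apply_target_encoding(X, maps):
    """Encode X into a float array; levels not seen during fitting become NaN."""
    out = np.full(X.shape, np.nan)
    for j, (levels, means) in maps.items():
        col = X[:, j]
        # Position of each value in the sorted levels, clipped to a valid index.
        idx = np.clip(np.searchsorted(levels, col), 0, len(levels) - 1)
        known = levels[idx] == col
        out[known, j] = means[idx[known]]
    return out
```

With the arrays defined above, this would be called as `maps = fit_target_encoding(X, y)` followed by `X_enc = apply_target_encoding(X, maps)`. The loop over columns remains, so any speed-up would come only from removing the per-level masking and would need to be measured.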
  • Does this answer your question? [Vectorized groupby with NumPy](https://stackoverflow.com/questions/49141969/vectorized-groupby-with-numpy) – Daniel F Jun 10 '22 at 11:35
  • Thank you for your reply. Actually, it doesn't: those groupby implementations apply a function to the grouped observations, which is not my case. My observations are categorical, and I just want to learn and apply an encoding. – Prettymath77 Jun 10 '22 at 13:48
