Background
The scikit-learn API is based on stateful objects that take 2D numpy arrays as input, compute a transformation (stored internally, in the object's state), and later apply it to other 2D arrays. For example:
import numpy as np
import sklearn.preprocessing

arr = np.arange(4).reshape(2, 2)
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(arr)        # updates the scaler's internal state; returns the fitted scaler itself
scaler.transform(arr)  # returns a transformed copy of arr
My Question
I want to apply a transformation to data stored in a pandas
DataFrame, and put the transformed data back into the same DataFrame.
The problem is that df.apply(scaler.transform)
feeds the data into the scaler column by column (as 1D arrays), whereas the scaler expects a 2D array.
Following the answers here and here, I'm currently doing:
transformed_array = scaler.transform(df.values)
transformed_df = pd.DataFrame(data=transformed_array, index=df.index, columns=df.columns)
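For concreteness, here is the round-trip above as a self-contained, runnable sketch (the small example DataFrame is invented for illustration): scale the underlying array, then rebuild a DataFrame with the original index and columns.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative data; any all-numeric DataFrame works the same way.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
    index=["x", "y", "z"],
)

scaler = StandardScaler()
scaler.fit(df.values)  # fit on the 2D array underlying the DataFrame

# The round-trip: transform the array, then restore index and columns.
transformed_array = scaler.transform(df.values)
transformed_df = pd.DataFrame(
    data=transformed_array, index=df.index, columns=df.columns
)

# Each column now has mean 0; index and column labels are preserved.
```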
But that seems rather clunky and inefficient, and I suspect there is a corner case where I'll lose the DataFrame's metadata.
Is there a better way?