I am trying to encode and scale my DataFrame using sklearn's pipelines, but the result comes back as a NumPy array instead of a DataFrame. Instead of making a hacky solution (which I am best at!), I was hoping there was an easier/standard way to get an encoded/scaled DataFrame back.

Here's a sample of the code I'm using to encode/scale:

from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer


# Numeric columns: every column that is not of object dtype
num_attributes = list(train_set.select_dtypes(exclude=['object']))
# Categorical columns: every column of object dtype
cat_attributes = list(train_set.select_dtypes(include=['object']))

cat_pipeline = Pipeline([ 
    ('imputer', SimpleImputer(fill_value='none', strategy='constant')),
    ('one_hot', OneHotEncoder())
    ])

full_pipeline = ColumnTransformer([
    ('num', StandardScaler(), num_attributes),
    ('cat', cat_pipeline, cat_attributes)
])

train_set_prepared = full_pipeline.fit_transform(train_set)

The result is an array, not a DataFrame, and prints as (row, column) value pairs:

  (0, 0)    nan
  (0, 1)    -0.002676506826924531
  (0, 2)    nan
  (0, 3)    -0.03350622836892517
  (0, 4)    nan
  (0, 5)    -0.03294496247236749
  (0, 6)    0.002534826949104915

Is there an easy way to turn it back into a DataFrame that is scaled/encoded?

Lostsoul
  • does [this](https://stackoverflow.com/a/54045636/9243482) help – Ando Aug 27 '20 at 16:02
  • @YukiShioriii I tried to wrap my command with the command in the answer and got this error - ValueError: Shape of passed values is (490546, 1), indices imply (490546, 110) - I did this command - df_scaled = pd.DataFrame(full_pipeline.fit_transform(train_set),columns = train_set.columns) – Lostsoul Aug 27 '20 at 16:07
  • try removing the `columns` parameter @Lostsoul – Ando Aug 27 '20 at 16:11
  • It didn't work, either, everything is in one column - 0 (0, 0)\tnan\n (0, 1)\t-0.002676506826924531... 1 (0, 0)\tnan\n (0, 1)\t-0.002676506826924531... 2 (0, 0)\tnan\n (0, 1)\t-0.002676506826924531... 3 (0, 0)\tnan\n (0, 1)\t-0.002676506826924531... – Lostsoul Aug 27 '20 at 16:20
  • it seems to be putting all results into one column array. – Lostsoul Aug 27 '20 at 16:21
  • You want the result to have 3 columns, right? – Ando Aug 27 '20 at 16:23
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/220568/discussion-between-yukishioriii-and-lostsoul). – Ando Aug 27 '20 at 16:24

0 Answers