I am trying to apply a Box-Cox transformation to a single column of a DataFrame, but the code below does not work. Can somebody help me with this issue?

from sklearn.datasets import fetch_california_housing
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.preprocessing import PowerTransformer

california_housing = fetch_california_housing(as_frame=True).frame
california_housing

power = PowerTransformer(method='box-cox', standardize=True)
california_housing['MedHouseVal']=power.fit_transform(california_housing['MedHouseVal'])
Bad Coder

1 Answer

power.fit_transform expects 2D input: for a single feature, the data must have shape (n, 1) rather than (n,). california_housing['MedHouseVal'] is a pd.Series and therefore has shape (n,). This can be fixed either by reshaping, i.e. by replacing

power.fit_transform(california_housing['MedHouseVal'])

with

power.fit_transform(california_housing['MedHouseVal'].to_numpy().reshape(-1, 1))

or, alternatively and a bit more readably, by selecting a list of columns, california_housing[['MedHouseVal']], which gives a pd.DataFrame, instead of a single column, california_housing['MedHouseVal'], which gives a pd.Series, i.e. by using

power.fit_transform(california_housing[['MedHouseVal']])

Note that

print(california_housing['MedHouseVal'].shape)
print(california_housing[['MedHouseVal']].shape)

prints

(20640,)
(20640, 1)
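
To make the shape requirement concrete, here is a minimal sketch on synthetic data, so it runs without downloading the housing dataset. The DataFrame df and the column names value and value_boxcox are made up for illustration; only the PowerTransformer usage mirrors the code above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical stand-in for the housing column: strictly positive and
# right-skewed, since Box-Cox only accepts positive values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.lognormal(mean=0.0, sigma=1.0, size=500)})

power = PowerTransformer(method="box-cox", standardize=True)

# A Series has shape (n,), which scikit-learn's input validation rejects.
try:
    power.fit_transform(df["value"])
    series_accepted = True
except ValueError:
    series_accepted = False

# A one-column DataFrame has shape (n, 1) and is accepted.
transformed = np.asarray(power.fit_transform(df[["value"]]))

# Flatten the (n, 1) result back to 1D before storing it as a column.
df["value_boxcox"] = transformed.ravel()

print(series_accepted)
print(transformed.shape)
```

Because standardize=True, the transformed column should come out with approximately zero mean and unit variance.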

Another option is to use scipy.stats.boxcox:

from sklearn.datasets import fetch_california_housing
from scipy.stats import boxcox

california_housing = fetch_california_housing(as_frame=True).frame
california_housing['MedHouseVal'] = boxcox(california_housing['MedHouseVal'])[0]
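
The [0] in the last line selects the transformed array, because boxcox returns a tuple when no lambda is passed. The following sketch, again on made-up synthetic data (the array x is an assumption, not the housing column), unpacks that tuple and uses the fitted lambda to invert the transform with scipy.special.inv_boxcox.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox

# Hypothetical strictly positive, right-skewed sample.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# Without an explicit lmbda, boxcox returns (transformed_array, fitted_lambda).
result = boxcox(x)
transformed, fitted_lambda = result

# With an explicit lmbda, boxcox returns only the transformed array.
same_transform = boxcox(x, lmbda=fitted_lambda)

# The fitted lambda is what is needed to map values back to the original scale.
recovered = inv_boxcox(transformed, fitted_lambda)

print(len(result))
print(np.allclose(recovered, x))
```

Unlike PowerTransformer with standardize=True, boxcox does not rescale the result to zero mean and unit variance, so the two approaches are not interchangeable without an extra standardization step.
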
Michael Hodel
  • Thank you for the code. But what is wrong with the above code? – Bad Coder Aug 22 '22 at 22:30
  • Data needs to be reshaped with `.reshape(-1, 1)`, as there is only a single feature (as the error message tells you). See edited answer. – Michael Hodel Aug 22 '22 at 22:33
  • Thank you so much. In this approach, why do we need the `[0]` at the end? `california_housing['MedHouseVal'] = boxcox(california_housing['MedHouseVal'])[0]` – Bad Coder Aug 22 '22 at 22:48
  • `scipy.stats.boxcox()` returns a tuple, where the first element, i.e. `boxcox(...)[0]`, is the transformed array, the second is the lambda that maximizes the log-likelihood, etc. See the documentation for details: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html – Michael Hodel Aug 22 '22 at 22:53