
I have a large data frame with MANY columns. I want to normalize a few columns which are all numeric, and then plot two using regression. I thought the code below would do it for me.

from sklearn import preprocessing
# Select the numeric columns to scale and convert their values to floats
modDF = df[['WeightedAvg','Score','Co','Score', 'PeerGroup', 'TimeT', 'Ter', 'Spread']].values.astype(float)
# Create a min-max scaler object
min_max_scaler = preprocessing.MinMaxScaler()
# Fit the scaler to the data and rescale each column to the range [0, 1]
x_scaled = min_max_scaler.fit_transform(modDF)
# Wrap the scaled array in a new dataframe
df_normalized = pd.DataFrame(x_scaled)


import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(x="WeightedAvg", y="Spread", data=modDF)

However, I am getting the following error: IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

I did a regression without normalizing, using sns.regplot, and it worked, but it looked weird, so I want to see it with normalization applied. I know how the regression works. I just don't know how the normalization works.

ASH
  • It's not clear from your example where the error is occurring. In your example you create 'modDF' but then scale 'x'. In general, though, if I create a numeric ndarray for x, your code does seem to work – David Waterworth Jan 16 '20 at 01:16
  • Oh, nice catch. I just changed it, so it's right now, and I re-ran the code, and I'm getting the same error. – ASH Jan 16 '20 at 01:32
  • As soon as you call `.values` on the dataframe, it becomes a numpy array. Can you try just `df.loc[:, ['your columns']].astype(float)` – Mark Moretto Jan 16 '20 at 01:39
  • Does this answer your question? [Linear Regression on Pandas DataFrame using Sklearn ( IndexError: tuple index out of range)](https://stackoverflow.com/questions/29934083/linear-regression-on-pandas-dataframe-using-sklearn-indexerror-tuple-index-ou) – AMC Feb 08 '20 at 01:09
  • https://stackoverflow.com/questions/34952651/only-integers-slices-ellipsis-numpy-newaxis-none-and-intege – AMC Feb 08 '20 at 01:09
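
The point raised in the comments about `.values` can be sketched as follows. This is only an illustration under assumptions: the small `df` here is made-up stand-in data (the real frame is not shown in the question), and only two of the question's columns are used. The idea is that `fit_transform` returns a plain numpy array, so the column names have to be put back before anything can look columns up by name.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in data; the real dataframe has many more columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "WeightedAvg": rng.uniform(10, 50, size=20),
    "Spread": rng.uniform(100, 900, size=20),
    "Label": ["a"] * 20,  # a non-numeric column that is left alone
})

cols = ["WeightedAvg", "Spread"]

# fit_transform returns a numpy array without column names
scaled = MinMaxScaler().fit_transform(df[cols].astype(float))

# Rebuild a dataframe so the columns can be addressed by name again
df_normalized = pd.DataFrame(scaled, columns=cols, index=df.index)
```

With the names restored, a call such as `sns.regplot(x="WeightedAvg", y="Spread", data=df_normalized)` receives a dataframe instead of an array, which is what name-based lookup needs.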

1 Answer


There is no need to use the command: df_normalized = pd.DataFrame(x_scaled).

If you want to run a linear regression, this should work:

import pandas as pd
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Feature columns to convert and scale
cols = ['WeightedAvg', 'Score', 'Co', 'PeerGroup', 'TimeT', 'Ter', 'Spread']
# Coerce to numeric; values that cannot be parsed become NaN
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')

X = df[cols]
# Select your target variable (replace 'target' with your own column)
y = df[['target']]
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Create a min-max scaler object
min_max_scaler = preprocessing.MinMaxScaler()
# Fit the scaler on the training data only, then apply it to both sets
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_test_scaled = min_max_scaler.transform(X_test)
# Fit the linear regression on the scaled training data
reg = LinearRegression().fit(X_train_scaled, y_train)
# Predict for the test set
y_predict = reg.predict(X_test_scaled)

If you work with a train/test split, it is important to fit the scaler on the training data only, because the test data is unknown at that point in time. On the test data you are only allowed to call transform.
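
A minimal sketch of what fitting on the training data only means in practice. The tiny arrays are made up for illustration. The scaler learns its minimum and maximum from `X_train` alone, so test values outside that range end up outside [0, 1] (with the default `clip=False`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up one-column data: training values span 0 to 10
X_train = np.array([[0.0], [5.0], [10.0]])
# Test values fall outside the training range on both sides
X_test = np.array([[-2.0], [12.0]])

scaler = MinMaxScaler()
# fit_transform learns min and max from the training data
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the training min and max; it does not refit
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.ravel())  # training values mapped into [0, 1]
print(X_test_scaled.ravel())   # roughly -0.2 and 1.2, outside [0, 1]
```

Refitting the scaler on the test set would instead map -2 and 12 to 0 and 1, so the model would see inputs on a different scale than the one it was trained on.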

PV8