I am using pandas 0.24.2 to prepare some data for machine learning. To set up the data, I used scikit-learn's StandardScaler() to normalize the features. However, I am getting this odd warning:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The odd thing is that I am already using the .iloc[] indexer on the dataframes. Here is the code. Note that x_train and x_test are dataframes.

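import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dat = pd.read_csv('/path/data.csv')
x_train, x_test = train_test_split(dat,
                                   test_size=0.2,
                                   random_state=42)
scaler = StandardScaler()
# Scale every column except the last one (the label); these two lines raise the warning
x_train.iloc[:, :-1] = scaler.fit_transform(x_train.iloc[:, :-1])
x_test.iloc[:, :-1] = scaler.transform(x_test.iloc[:, :-1])

# Per-feature mean and standard deviation learned from the training set
x_mean = scaler.mean_
x_std = scaler.scale_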

Can anyone figure out what the actual problem is? I avoided rescaling the last column in the dataframe because it is the label column.

petezurich
krishnab
  • I am unable to reproduce the issue. When I run the code example (using pandas 0.24.2) no warnings appear. – Xukrao Jun 07 '19 at 18:44
  • `x_train` and `x_test` were likely sliced improperly. – cs95 Jun 07 '19 at 18:47
  • My guess is `x_train` and `x_test` are already copies of a bigger dataset, say `train_df`. So really, you should/could do `train_df.loc[:, :-1] = scaler.fit_transform(x_train.loc[:, :-1])` – Quang Hoang Jun 07 '19 at 18:48
  • @QuangHoang sorry, let me add some additional code. I created `x_train`, `x_test` from the `train_test_split` function. – krishnab Jun 07 '19 at 18:54
  • @QuangHoang So `x_train, x_test` came from a larger dataset that I split into 2 chunks. I tried using just `.loc`, but that generated a different error `TypeError: cannot do slice indexing on with these indexers [-1] of ` – krishnab Jun 07 '19 at 19:00
  • Try instead: `scaler.fit(x_train.iloc[:, :-1])` then `dat.iloc[:, :-1] = scaler.transform(dat.iloc[:,:-1])` should transform both `x_train` and `x_test`. – Quang Hoang Jun 07 '19 at 19:03
  • @QuangHoang Yes this worked. Thanks for the idea. I think the problem has something to do with the indexing that comes out of the `train_test_split()` function. I wonder if reindexing might also solve the problem. But your idea did fix it. – krishnab Jun 07 '19 at 19:08
  • Nope, rescaling did not work. This could just be a false positive warning and I can ignore it. Oh well. But thanks to everyone for their help. – krishnab Jun 07 '19 at 19:18
  • @QuangHoang actually I realized that we should not run `scaler` on the original full `dat` data. In principle, we want to fit the scaler to the training data only, since we don't know the test data at training time. If we fit to the full `dat` data, the scaling is different from fitting on `x_train` alone and then applying that scaling to `x_test`. It is a subtle but important point: just because the code works does not mean you should follow it. Haha :). – krishnab Jun 07 '19 at 19:44
  • Please read my comment again. Yes, `fit` on the `x_train` and `transform` the whole dataset. – Quang Hoang Jun 07 '19 at 19:48
  • Oh yes, I see that now. Haha. So you were ahead of the curve :). Good catch. – krishnab Jun 07 '19 at 19:49
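
The fit-on-train, transform-separately idea from the comments can be sketched as below. This is only a sketch under assumptions: the random frame is a hypothetical stand-in for `data.csv`, the column names `f1`, `f2`, `f3`, `label` and the `*_scaled` variable names are made up for illustration, and it assumes the warning comes from assigning into the frames returned by `train_test_split`. Writing the scaled values into explicit copies avoids that assignment altogether.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for data.csv: three float features plus a label column.
rng = np.random.default_rng(42)
dat = pd.DataFrame(rng.normal(loc=5.0, scale=2.0, size=(100, 3)),
                   columns=["f1", "f2", "f3"])
dat["label"] = rng.integers(0, 2, size=100)

x_train, x_test = train_test_split(dat, test_size=0.2, random_state=42)

# Every column except the last one (the label) is a feature.
feature_cols = dat.columns[:-1]

# Fit on the training rows only, so no test-set statistics reach the scaler.
scaler = StandardScaler().fit(x_train[feature_cols])

# Write the scaled values into explicit copies instead of into the split results.
x_train_scaled = x_train.copy()
x_test_scaled = x_test.copy()
x_train_scaled[feature_cols] = scaler.transform(x_train[feature_cols])
x_test_scaled[feature_cols] = scaler.transform(x_test[feature_cols])
```

The label column is carried over unchanged, and `scaler.mean_` and `scaler.scale_` still hold the training-set statistics, as in the original code.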

0 Answers