I have a dataset of 5 features. Two of these features are very similar but do not have the same min and max values.

... | feature 2 | feature 3 | ...
--------------------------------
..., 208.429993, 206.619995, ...
..., 207.779999, 205.050003, ...
..., 206.029999, 203.410004, ...
..., 204.429993, 202.600006, ...
..., 206.429993, 204.25, ...

Feature 3 is always smaller than feature 2, and it is important that it stays that way after scaling. But since feature 2 and feature 3 do not have exactly the same min and max values, each is scaled independently and ends up with 0 and 1 as its min and max by default. This removes the relationship between the values. In fact, after scaling, the first sample becomes:

 ... | feature 2 | feature 3 | ...
--------------------------------
 ...,  0.00268,   0.00279, ...

This is something that I do not want. I cannot seem to find a way to manually set the min and max values used by MinMaxScaler. There are ugly hacks, such as combining feature 2 and feature 3 into one column for the scaling and splitting them again afterward. But I would first like to know whether sklearn offers a built-in solution, such as using the same min and max for multiple features.

Otherwise, the simplest workaround would do.

desertnaut
bcsta
  • Scalers have a fit and a transform method, which you can call independently. So, you could fit on column 1 and then transform column 1 and column 2. – warped Jun 04 '20 at 21:31
  • Wouldn't that make some values in column 2 be lower than 0? Is that a problem? – bcsta Jun 04 '20 at 21:36

2 Answers

Fit the scaler on one column, then transform both. Using the data you posted:

    feature_1   feature_2
0   208.429993  206.619995
1   207.779999  205.050003
2   206.029999  203.410004
3   204.429993  202.600006

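from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# Fit on feature_2 only, so its min and max define the scale.
scaler.fit(df['feature_2'].values.reshape(-1, 1))

# Recent scikit-learn versions reject input whose column count differs from the fit,
# so pass all values as a single column and restore the original shape afterwards.
scaler.transform(df.values.reshape(-1, 1)).reshape(df.shape)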

array([[1.45024949, 1.        ],
       [1.288559  , 0.60945366],
       [0.85323442, 0.20149259],
       [0.45522189, 0.        ]])

If you scale data that are outside of the range you used to fit the scaler, the scaled data will be outside of [0,1].

With a scaler fitted on a single column, the only way to keep every column inside [0,1] is to scale each column individually.

Whether or not this is a problem depends on what you want to do with the data after scaling.
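To see where the values above 1 come from, the transform can be reproduced by hand. This is a minimal sketch in plain Python, not scikit-learn code: it assumes the default feature range of [0, 1] and uses the four rows shown above, and the names `lo`, `hi` and `scale` are only illustrative.

```python
# Rows copied from the example above: (feature_1, feature_2).
rows = [
    (208.429993, 206.619995),
    (207.779999, 205.050003),
    (206.029999, 203.410004),
    (204.429993, 202.600006),
]

# "Fit" on feature_2 only: its min and max define the scale.
lo = min(f2 for _, f2 in rows)
hi = max(f2 for _, f2 in rows)


def scale(x):
    """Min-max formula with the fitted bounds: (x - lo) / (hi - lo)."""
    return (x - lo) / (hi - lo)


# Apply the same map to both columns.
scaled = [(scale(f1), scale(f2)) for f1, f2 in rows]

for f1, f2 in scaled:
    print(f"{f1:.8f}  {f2:.8f}")
```

Both columns go through the same increasing map, so feature_2 stays below feature_1 in every row, even though feature_1 ends up outside [0,1].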

warped

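You can scale those two features as one feature by flattening them into a single column first, then reshaping the result back:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
values = df[['f2', 'f3']].to_numpy()  # a DataFrame has no reshape(), so use the array
scaled_df = df.copy()
# Flatten to one column so both features share one min and max, then restore the shape.
scaled_df[['f2', 'f3']] = scaler.fit_transform(values.reshape(-1, 1)).reshape(values.shape)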
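As a rough end-to-end check, the sketch below applies the flattening idea to the five sample rows from the question and verifies that the ordering survives. The column names `f2` and `f3` follow the snippet above; the DataFrame construction and the `order_kept` variable are assumptions added for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample rows from the question; the names f2 and f3 follow the snippet above.
df = pd.DataFrame({
    "f2": [208.429993, 207.779999, 206.029999, 204.429993, 206.429993],
    "f3": [206.619995, 205.050003, 203.410004, 202.600006, 204.25],
})

values = df[["f2", "f3"]].to_numpy()

# Stack both features into one column so the scaler learns a single min and max.
scaler = MinMaxScaler()
flat_scaled = scaler.fit_transform(values.reshape(-1, 1))

scaled_df = df.copy()
scaled_df[["f2", "f3"]] = flat_scaled.reshape(values.shape)

# The same increasing map is applied to both columns, so f3 stays below f2.
order_kept = bool((scaled_df["f3"] < scaled_df["f2"]).all())
print(scaled_df)
print("f3 < f2 in every row:", order_kept)
```

With a shared min and max, both columns stay within [0, 1], but neither column necessarily spans the full range on its own.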