Scaling only the numeric values in a data frame that contains strings

Question

I am working in Python and trying to scale a data frame:

subject_id hour_measure         urinecolor   blood pressure                  
3          1.00                 red          40
           1.15                 red          high
4          2.00              yellow          low

Because it contains both numeric and text columns, the following code gives me an error:

from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler for the data
scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
X = scaler.fit_transform(X)

It fails because the data frame contains strings. How can I tell Python to scale only the columns that contain numbers, and also to scale the numeric values in string columns?

Do you want to scale the numbers in the mixed columns, like "blood pressure", or only in the numeric columns? If it is just the numeric columns, you could subset them, e.g. `X[['hour_measure',...]] = scaler.fit_transform(X[['hour_measure',...]])` — Lutz, Dec 07 '19 at 09:02
I want to scale all numeric columns, and also the numeric values in string columns (e.g. blood pressure in the example above) — mayaaa, Dec 07 '19 at 09:03

jezrael · Accepted Answer · 2019-12-07T09:32:45.297

Convert the non-numeric values to missing values, scale with a manual min-max formula instead of MinMaxScaler, and finally fill the missing values back in from the original data:

print (df)
   subject_id  hour_measure urinecolor blood pressure
0           3          1.00        red             40
1           3          1.15        red           high
2           4          2.00     yellow            low
3           5          5.00     yellow            100

df = df.set_index('subject_id')

df1 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df2 = (df1 - df1.min()) / (df1.max() - df1.min())

df = df2.combine_first(df)
print (df)
            hour_measure urinecolor blood pressure
subject_id                                        
3                 0.0000        red              0
3                 0.0375        red           high
4                 0.2500     yellow            low
5                 1.0000     yellow              1
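
The same idea can be wrapped in a small helper that also covers two follow-up cases raised in the comments: leaving some columns unscaled and scaling to a range other than 0 to 1. This is only a sketch. The function name `scale_numeric_entries` and its parameters `columns`, `lo` and `hi` are made up for illustration and are not part of pandas.

```python
import pandas as pd


def scale_numeric_entries(df, columns=None, lo=0.0, hi=1.0):
    """Min-max scale numeric entries of `columns` to [lo, hi]; keep text as-is.

    Hypothetical helper written for this example, not a pandas function.
    """
    out = df.copy()
    for col in (df.columns if columns is None else columns):
        num = pd.to_numeric(df[col], errors="coerce")  # text becomes NaN
        span = num.max() - num.min()
        if pd.isna(span) or span == 0:
            continue  # no numeric entries, or all of them equal: leave column alone
        scaled = (num - num.min()) / span * (hi - lo) + lo
        if num.notna().all():
            out[col] = scaled  # fully numeric column stays numeric
        else:
            # mixed column: keep the text where the value was not numeric
            out[col] = df[col].astype(object).where(num.isna(), scaled)
    return out


df = pd.DataFrame({
    "subject_id": [3, 3, 4, 5],
    "hour_measure": [1.00, 1.15, 2.00, 5.00],
    "urinecolor": ["red", "red", "yellow", "yellow"],
    "blood pressure": [40, "high", "low", 100],
})

# scale only 'blood pressure', to the range 0..52, and leave 'hour_measure' alone
result = scale_numeric_entries(df, columns=["blood pressure"], lo=0, hi=52)
print(result)
```

With `columns=None` every column with at least two distinct numeric entries is scaled, including `subject_id` if it is not set as the index first, so passing the column list explicitly is usually the safer choice.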

First solution:

I suggest replacing the text values with numbers using a dictionary, like:

dbp = {'high': 150, 'low': 60}

df['blood pressure'] = df['blood pressure'].replace(dbp)

All together:

#if the subject_id values are numeric, convert them to the index
df = df.set_index('subject_id')

dbp = {'high': 150, 'low': 60}
#replace the text values with numbers and convert to integers
df['blood pressure'] = df['blood pressure'].replace(dbp).astype(int)

print (df)
            hour_measure urinecolor  blood pressure
subject_id                                         
3                   1.00        red              40
3                   1.15        red             150
4                   2.00     yellow              60

print (df.dtypes)
hour_measure      float64
urinecolor         object
blood pressure      int32
dtype: object

import numpy as np
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(copy=True, feature_range=(0, 1))
#select only numeric columns
X = scaler.fit_transform(df.select_dtypes(np.number))
print (X)
[[0.         0.        ]
 [0.15       1.        ]
 [1.         0.18181818]]

Detail:

print (df.select_dtypes(np.number))
            hour_measure  blood pressure
subject_id                              
3                   1.00              40
3                   1.15             150
4                   2.00              60
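
`fit_transform` returns a plain NumPy array, so the column names and the index are lost. If the scaled values should stay in the data frame, one option (the column subsetting mentioned in the comments on the question) is to assign the result back to the selected columns. A minimal sketch, rebuilding the three-row frame from above by hand:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# the frame after replacing 'high' and 'low' with numbers, typed in by hand
df = pd.DataFrame({
    "hour_measure": [1.00, 1.15, 2.00],
    "urinecolor": ["red", "red", "yellow"],
    "blood pressure": [40, 150, 60],
}, index=pd.Index([3, 3, 4], name="subject_id"))

# names of the numeric columns only
num_cols = df.select_dtypes(np.number).columns

# scale those columns and write them back; 'urinecolor' is untouched
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
print(df)
```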

Thank you for your time, but this does not provide a way to scale only the numeric values in string columns — mayaaa, Dec 07 '19 at 09:17
Kindly, if you don't mind, add the previous solution to the answer as well, so I can test both solutions — mayaaa, Dec 07 '19 at 09:31
In the solution where you scale manually, I don't want to scale the hour_measure column. How can I exclude it from scaling? — mayaaa, Dec 09 '19 at 20:32
@Nora for scaling between `0 and 52` use `min1 = 0; max1 = 52; df2 = (df1 - df1.min()) / (df1.max() - df1.min()); df2 = df2 * (max1 - min1) + min1` — jezrael, Dec 10 '19 at 06:18
I tried making both columns the index in your solution: *df = data.set_index(['subject_id','hour_measure']); df1 = df.apply(lambda x: pd.to_numeric(x, errors='coerce')); df2 = (df1 - df1.min()) / (df1.max() - df1.min()); df = df2.combine_first(df)* — mayaaa, Dec 10 '19 at 07:27

score 0 · Answer 2 · answered Dec 07 '19 at 12:14

Another approach is as follows (I've added a new row to see the scaled values in blood pressure):

       hour_measure urinecolor blood pressure  temp_column
0          1.00        red             40           40
1          1.15        red           high            0
2          2.00     yellow            low            0
3          3.00     yellow             20           20

df['temp_column'] = df['blood pressure'].values
df['temp_column'] = df['temp_column'].apply(lambda x: 0 if str(x).isalpha() else x)

This creates a new temp_column holding the numeric values of the blood pressure column, with 0 in place of each text value.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(copy=True, feature_range=(0, 1))
df['hour_measure'] = scaler.fit_transform(df['hour_measure'].values.reshape(-1, 1))
df['temp_column'] = scaler.fit_transform(df['temp_column'].values.reshape(-1 ,1))

I applied MinMaxScaler to temp_column, which contains the numeric values of blood pressure, and then put the scaled values back into the blood pressure column.

numeric_rows = pd.to_numeric(df['blood pressure'], errors='coerce').dropna().index.tolist()
print('Index of numeric values in blood pressure column: ', numeric_rows)
# assign with .loc rather than chained .iloc indexing, which may write to a copy
df.loc[numeric_rows, 'blood pressure'] = df.loc[numeric_rows, 'temp_column']
df = df.drop(['temp_column'], axis=1)

Result:

   hour_measure urinecolor blood pressure
0         0.000        red              1
1         0.075        red           high
2         0.500     yellow            low
3         1.000     yellow            0.5
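
In the result above, 20 maps to 0.5 rather than 0 because the placeholder 0 written into temp_column takes part in the min/max computation. If the text rows should not influence the scaling, one alternative is to fit the scaler only on the rows that are numeric. A sketch under that assumption, using a boolean mask instead of a temporary column:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "hour_measure": [1.00, 1.15, 2.00, 3.00],
    "urinecolor": ["red", "red", "yellow", "yellow"],
    "blood pressure": [40, "high", "low", 20],
})

num = pd.to_numeric(df["blood pressure"], errors="coerce")  # text becomes NaN
mask = num.notna()  # True where the entry is a number

# fit the scaler on the numeric entries only
scaled = MinMaxScaler().fit_transform(num[mask].to_numpy().reshape(-1, 1)).ravel()

# write the scaled numbers back; the text rows are untouched
df.loc[mask, "blood pressure"] = scaled
print(df)
```

Here 40 maps to 1 and 20 maps to 0, since they are the only values the scaler sees.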
