I have this code that normalizes a pandas dataframe.

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import preprocessing


df = pd.read_csv('DS/RS_DS/final_dataset.csv')
rec_df = df.drop(['person_id','encounter_id','birthdate','CN','HN','DN','DIAG_DM','DIAG_NONDM','TPN'], axis=1)

#normalize values from 0 to 1
df_val = rec_df.values
min_max_scaler = preprocessing.MinMaxScaler()
df_val_scaled = min_max_scaler.fit_transform(df_val)
df_scaled = pd.DataFrame(df_val_scaled)


df_flask = pd.DataFrame([[42.8,151,73,79,0,1,74]],columns=['weight','height','wc','hc','isMale','isFemale','age'])
df_flask_val = df_flask.values
df_flask_val_scaled = min_max_scaler.fit_transform(df_flask_val)
df_flask_scaled = pd.DataFrame(df_flask_val_scaled)

df_scaled is the normalized dataframe. df_flask is a single-row dataframe that I want to normalize with the same scaling as df_scaled, so that I can compare the two. However, df_flask_scaled contains all 0s, so I think it was not normalized based on the original dataframe. Is there any way to normalize the single-row dataframe against the original data?

Or should I append this row to the original dataframe and then normalize everything together?

Reub
  • Your question isn't clear. Normalize the single row df? What's that? – YOLO Apr 08 '18 at 20:11
  • When you use an already fitted (learnt) model on new data, never call `fit()` or any method with `fit` in its name, such as `fit_transform()`. Doing so re-fits the scaler on the new data, so whatever was learnt from the old data is forgotten. Only call `transform()` on new data. – Vivek Kumar Apr 09 '18 at 09:08
  • I have two dataframes in the code. rec_df and df_flask which is my single row dataframe @YOLO – Reub Apr 09 '18 at 17:22
  • that makes better sense @VivekKumar – Reub Apr 09 '18 at 17:22

1 Answer

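I think you should do fit and transform separately. Fit the scaler once on the reference data (rec_df), then use that same fitted scaler to transform both rec_df and the new row. This way the new row is scaled with the per-column minimum and maximum learnt from the original data, rather than with values re-learnt from the single row itself.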

# initialise scaler
min_max_scaler = preprocessing.MinMaxScaler()

# fit here
min_max_scaler.fit(rec_df.values)

# apply transformation
df_val_scaled = min_max_scaler.transform(rec_df.values)
df_flask_val_scaled = min_max_scaler.transform(df_flask_val)
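
To make the difference concrete, here is a minimal self-contained sketch. The three rows standing in for rec_df are made-up values (the real final_dataset.csv is not available here), and the column order is assumed to match df_flask. It shows that re-fitting on a single row yields all zeros, because each column's minimum equals its maximum, while transforming with the scaler fitted on the reference data gives usable values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

cols = ['weight', 'height', 'wc', 'hc', 'isMale', 'isFemale', 'age']

# Hypothetical stand-in for rec_df; these rows are invented for illustration.
rec_df = pd.DataFrame(
    [[40.0, 150, 70, 75, 0, 1, 60],
     [60.0, 160, 80, 90, 1, 0, 70],
     [80.0, 170, 90, 105, 1, 0, 80]],
    columns=cols,
)
df_flask = pd.DataFrame([[42.8, 151, 73, 79, 0, 1, 74]], columns=cols)

# Re-fitting on the single row: min == max in every column, so all values map to 0.
refit = MinMaxScaler().fit_transform(df_flask.values)

# Fit once on the reference data, then only transform.
scaler = MinMaxScaler().fit(rec_df.values)
df_scaled = pd.DataFrame(scaler.transform(rec_df.values), columns=cols)
# Select columns explicitly so the new row uses the same order as the fitted data.
df_flask_scaled = pd.DataFrame(scaler.transform(df_flask[cols].values), columns=cols)

print(refit)
print(df_flask_scaled)
```

Note that a new row with values outside the range seen during fitting will be scaled to values below 0 or above 1, which is expected with this approach.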
YOLO