Pandas defining Z_score function to be called when creating new columns in different dataframes

Question

def z_score(df, column, mean, std):
    return #  ?????

mean = history_df['distances'].mean()
std = history_df['distances'].std()
training_df['distances_normal'] = z_score(training_df, 'distances', mean, std)
testing_df['distances_normal'] = z_score(testing_df, 'distances', mean, std)

Hello, any suggestions on what the z_score function should return, so that the new 'distances_normal' columns I create further down in the training and testing dataframes are normalized using the mean and std of the history dataframe's 'distances' column?

Thanks in advance.

score 0 · Accepted Answer · answered Nov 30 '20 at 23:03

You do not need to define a z_score function, as the calculation is simple and can be carried out directly on the dataframe column:

training_df['distances_normal'] = (training_df['distances'] - mean)/ std

If you still want to use a z_score function, you can define it to take one element at a time, and then use apply to run it on each element of the dataframe column in turn:

def z_score(x, mean, std):
    return (x - mean)/std

training_df['distances_normal'] = training_df['distances'].apply(lambda x: z_score(x, mean, std))

The end result is the same, but the first version is faster because it uses vectorized operations on the whole column instead of calling a Python function once per element.
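If you want to keep the exact signature from the question, z_score(df, column, mean, std), the vectorized expression can simply go inside the function. The following is only a sketch: the three small dataframes are made-up toy data standing in for your history, training and testing dataframes, and note that pandas .std() returns the sample standard deviation (ddof=1) by default.

```python
import pandas as pd


def z_score(df, column, mean, std):
    """Return the z-scores of df[column] as a Series, using the given mean and std."""
    # Vectorized: operates on the whole column at once, no apply needed.
    return (df[column] - mean) / std


# Hypothetical toy data, only here so the example runs on its own.
history_df = pd.DataFrame({"distances": [2.0, 4.0, 6.0, 8.0]})
training_df = pd.DataFrame({"distances": [2.0, 5.0, 8.0]})
testing_df = pd.DataFrame({"distances": [5.0, 11.0]})

# Statistics come from the history dataframe only.
mean = history_df["distances"].mean()
std = history_df["distances"].std()  # sample std (ddof=1) by default

# The same mean/std are reused for both training and testing.
training_df["distances_normal"] = z_score(training_df, "distances", mean, std)
testing_df["distances_normal"] = z_score(testing_df, "distances", mean, std)

print(training_df)
print(testing_df)
```

Because mean and std are computed once from history_df and passed in, both new columns are on the same scale, which is usually what you want when the testing data must be normalized the same way as the training data.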

You can also use some standard library tools for this, as it is quite a common operation; see e.g. this question.
