
What is an efficient equivalent of R's scale function in pandas? E.g., how would

newdf <- scale(df)

be written in pandas? Is there an elegant way using transform?

  • A nice feature request for `pandas` might be something similar to R's `sweep` function. – Phillip Cloud Aug 09 '13 at 22:47
  • Possible duplicate of [Keep pandas structure with numpy/scikit functions](http://stackoverflow.com/questions/14813289/keep-pandas-structure-with-numpy-scikit-functions) – ariddell Nov 10 '15 at 14:22

2 Answers


Scaling is very common in machine learning tasks, so it is implemented in scikit-learn's preprocessing module. You can pass a pandas DataFrame to its scale function.

The only "problem" is that the returned object is no longer a DataFrame but a numpy array, which is usually not a real issue if you want to pass it to a machine learning model anyway (e.g. SVM or logistic regression). If you want to keep the DataFrame, you need a small workaround:

from sklearn.preprocessing import scale
from pandas import DataFrame

newdf = DataFrame(scale(df), index=df.index, columns=df.columns)

See also here.
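
If you would rather stay in pandas, a minimal sketch along these lines keeps the DataFrame structure without the round trip through numpy. The helper name `standardize` and the sample data are made up for illustration. One thing to watch: R's scale divides by the sample standard deviation (n - 1 in the denominator), which is also the pandas default (ddof=1), whereas scikit-learn's scale uses the population standard deviation (ddof=0), so the two results differ by a constant factor.

```python
import numpy as np
import pandas as pd


def standardize(df, ddof=1):
    """Center each column and divide by its standard deviation.

    ddof=1 follows R's scale(); ddof=0 follows scikit-learn's convention.
    (Hypothetical helper, not part of pandas.)
    """
    return (df - df.mean()) / df.std(ddof=ddof)


# Illustrative data only
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})

r_like = standardize(df, ddof=1)   # sample standard deviation, as in R
sk_like = standardize(df, ddof=0)  # population standard deviation

# The same R-like result written with transform, one column at a time
via_transform = df.transform(lambda col: (col - col.mean()) / col.std())
```

Both versions return a DataFrame with the original index and columns, so no re-wrapping is needed.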

herrfz

I don't know R, but from reading the documentation it looks like the following would do the trick (albeit in a slightly less general way):

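import numpy as np

def scale(y, c=True, sc=True):
    x = y.copy()  # work on a copy so the caller's data is left untouched

    if c:
        x -= x.mean()  # center each column on its mean
    if sc and c:
        x /= x.std()  # centered: the RMS equals the sample standard deviation
    elif sc:
        # not centered: divide by the root mean square, sqrt(sum(x^2) / (n - 1))
        x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))
    return x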

For the more general version you'd probably need to do some type/length checking.
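
A rough sketch of what that more general version might look like, assuming center and scale may each be either a boolean or one number per column (as I read the R docs). The names `scale_general` and `_per_column` are made up here, and this has not been checked against R on edge cases:

```python
import numpy as np
import pandas as pd


def _per_column(values, columns, name):
    """Check that a user-supplied vector has one value per column and align it."""
    values = np.asarray(values, dtype=float)
    if values.ndim != 1 or len(values) != len(columns):
        raise ValueError(f"{name} must have one value per column ({len(columns)})")
    return pd.Series(values, index=columns)


def scale_general(df, center=True, scale=True):
    x = df.astype(float)  # float copy, so integer input is not modified in place

    if isinstance(center, bool):
        if center:
            x = x - x.mean()  # subtract the column means
    else:
        x = x - _per_column(center, x.columns, "center")

    if isinstance(scale, bool):
        if scale:
            # RMS of the (possibly centered) columns; equals std() when centered
            x = x / np.sqrt(x.pow(2).sum().div(x.count() - 1))
    else:
        x = x / _per_column(scale, x.columns, "scale")
    return x


# Illustrative data only
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 6, 8]})
default = scale_general(df)
custom = scale_general(df, center=[1, 4], scale=[1, 2])
```

With the defaults this should give the same result as the shorter function above; the explicit vectors are subtracted from and divided into the matching columns.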

EDIT: Added explanation of the denominator in elif sc: clause

From the R docs:

 ... If ‘scale’ is
 ‘TRUE’ then scaling is done by dividing the (centered) columns of
 ‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the
 root mean square otherwise.  If ‘scale’ is ‘FALSE’, no scaling is
 done.

 The root-mean-square for a (possibly centered) column is defined
 as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing
 values and n is the number of non-missing values.  In the case
 ‘center = TRUE’, this is the same as the standard deviation, but
 in general it is not.

The line np.sqrt(x.pow(2).sum().div(x.count() - 1)) computes the root mean square using that definition: it squares x (the pow method), sums each column, divides by the number of non-NaN values in that column minus one (the count method), and finally takes the square root.
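
As a quick sanity check of the statement in the R docs, here is a small sketch on made-up data (the `rms` helper name is just for illustration). After centering, the RMS matches std(), including in a column with a missing value; without centering it generally does not:

```python
import numpy as np
import pandas as pd


def rms(x):
    # sqrt(sum(x^2) / (n - 1)) per column, where n counts the non-NaN values
    return np.sqrt(x.pow(2).sum().div(x.count() - 1))


# Illustrative data only; column "a" has a missing value
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})

centered = df - df.mean()

rms_centered = rms(centered)  # same as df.std() (ddof=1, NaN skipped)
rms_raw = rms(df)             # differs from df.std() because the mean is not removed
```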

As a side note, the reason I didn't simply compute the RMS after centering is that the std method calls bottleneck for faster computation in the special case where you want the standard deviation rather than the more general RMS.

You could instead compute the RMS after centering. It might be worth benchmarking: now that I'm writing this, I'm not actually sure which is faster.
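
For anyone who wants to try it, a minimal timing sketch could look like the following. The frame size and repeat count are arbitrary choices, and the timings will depend on your pandas version and on whether bottleneck is installed, so treat the numbers as indicative only:

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary test data: 10,000 rows by 10 columns, already centered
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.standard_normal((10_000, 10)))
x = x - x.mean()


def via_std():
    return x.std()


def via_rms():
    return np.sqrt(x.pow(2).sum().div(x.count() - 1))


t_std = timeit.timeit(via_std, number=20)
t_rms = timeit.timeit(via_rms, number=20)
print(f"std: {t_std:.4f}s  rms: {t_rms:.4f}s")
```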

Phillip Cloud
  • could you explain `/= np.sqrt(x.pow(2).sum().div(x.count() - 1))`? –  Aug 01 '13 at 23:21
  • The root-mean-square being calculated according to the R docs will be standard deviation if it is centered, so you can remove the middle if statement, leaving `if c: x -= x.mean(); if sc: x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))` – machow Aug 02 '13 at 20:07
  • @Closed Sure. I commented as such. – Phillip Cloud Aug 03 '13 at 23:01
  • Oops! I misread that the first go around. Carry on, good sir :). – machow Aug 04 '13 at 01:45
  • This is not exactly R's scale. R's scale takes row-wise means and row-wise standard deviations, but you take the standard deviation of the entire array and the mean of the entire array. – Gwang-Jin Kim Jan 21 '20 at 09:05