
Background

The scikit-learn API is built around stateful objects: they take 2D NumPy arrays as input, compute and store a transformation internally, and can later apply it to other 2D arrays. For example:

import numpy as np
import sklearn.preprocessing

arr = np.arange(4).reshape(2, 2)
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(arr)  # updates the scaler's internal state (and returns the scaler itself)
scaler.transform(arr)  # returns a transformed copy of arr

My Question

I want to apply a transformation to data stored in a pandas DataFrame, and put the transformed data back into the same DataFrame.

The problem is that df.apply(scaler.transform) feeds data into the scaler column-by-column (1D arrays), where scaler expects a 2D array.
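To make the mismatch concrete, here is a minimal sketch. The frame and its values are made up for illustration; the only assumption is that the scaler was fitted on the full 2D block.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical two-column frame, used only for illustration
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
scaler = StandardScaler().fit(df.values)  # fitted on the full 2D array

try:
    # apply() hands each column to scaler.transform as a 1D Series
    df.apply(scaler.transform)
    apply_failed = False
except ValueError:
    # scikit-learn rejects 1D input; it expects shape (n_samples, n_features)
    apply_failed = True

# Passing the whole 2D block at once works
scaled = scaler.transform(df.values)
```

The ValueError comes from scikit-learn's input validation, not from pandas.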

Following the answers here and here, I'm currently doing:

transformed_array = scaler.transform(df.values)
transformed_df = pd.DataFrame(data=transformed_array, index=df.index, columns=df.columns)

But that seems rather clunky and inefficient. I also suspect there's a corner case where I'd lose some of the DataFrame's metadata.

Is there a better way?

OmerB

2 Answers

0

You can assign to the whole frame through iloc[:, :].

According to the documentation:

Pandas provides a suite of methods in order to get purely integer based indexing. The semantics follow closely python and numpy slicing. These are 0-based indexing. When slicing, the start bounds is included, while the upper bound is excluded. Note that setting works as well.

Example:

df = pd.DataFrame([[1, 2.], [3, 4.]], columns=['a', 'b'])
df2 = pd.DataFrame([[3, 4.], [5, 6.]], columns=['c', 'd'])

df.iloc[:,:]=df2.values
print(df)
     a    b
0  3.0  4.0
1  5.0  6.0

So in your case, it will be:

df.iloc[:,:] = scaler.transform(df.values) # On an already fitted scaler
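
For completeness, a self-contained sketch of the same idea. The frame below is hypothetical and deliberately all-float with a non-default index; how integer columns are handled when floats are assigned through iloc may depend on the pandas version, so that case is not covered here.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical all-float frame with a non-default index
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
    index=["x", "y", "z"],
)
scaler = StandardScaler().fit(df.values)

# Overwrite the values; index and column labels are left untouched
df.iloc[:, :] = scaler.transform(df.values)
```

After the assignment, df still carries its original index and column labels, and only the cell values have changed.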
Vivek Kumar
  • Thanks, do you know if assigning like this is more/less efficient than using the constructor? Also, is iloc better than loc in this sense? – OmerB Apr 10 '18 at 10:29
  • @OmerB No, I'm sorry, I don't know about performance. But `.loc` cannot be used for this, because that's for label-based indexing. In `.loc` you cannot specify the index of entries. – Vivek Kumar Apr 10 '18 at 10:45
  • but I can do `.loc[:,:]` or even just `df[:]`... They might be all equivalent, but I'll wait to see if someone has a definitive answer for that... – OmerB Apr 10 '18 at 12:52
  • 1
    @OmerB they are not equivalent perfomance wise: https://stackoverflow.com/a/45983830/4016674 – hellpanderr Apr 10 '18 at 19:23
0

Consider the following demo:

In [198]: df = (pd.DataFrame(np.random.randint(10**5, size=(5,3)), columns=list('abc'))
                  .assign(d=list('abcde')))

In [199]: df
Out[199]:
       a      b      c  d
0  17821  80092  11803  a
1  91198  19663  78665  b
2  77674  46347  72550  c
3  67390  63699  16347  d
4  50445  31346  95608  e

In [200]: cols = ['a','b','c']

In [201]: df[cols] = scaler.fit_transform(df[cols])

In [202]: df
Out[202]:
          a         b         c  d
0 -1.701325  1.466854 -1.259806  a
1  1.196186 -1.315108  0.690414  b
2  0.662151 -0.086660  0.512053  c
3  0.256056  0.712172 -1.127267  d
4 -0.413068 -0.777259  1.184605  e
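
The session above can be reproduced as a script along these lines. This is a sketch under assumptions: the seed and the use of np.random.default_rng are my own choices, so the numbers will differ from the output shown above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
df = pd.DataFrame(
    rng.integers(10**5, size=(5, 3)), columns=list("abc")
).assign(d=list("abcde"))
original_d = df["d"].copy()

# Scale only the numeric columns; the string column 'd' is left alone
cols = ["a", "b", "c"]
scaler = StandardScaler()
df[cols] = scaler.fit_transform(df[cols])
```

Selecting the columns by name keeps non-numeric columns out of the scaler, and the row index is preserved because the result is assigned back into the same frame.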
MaxU - stand with Ukraine