Suppose we have a Pandas dataframe and a scikit-learn model that was trained (fit) on that dataframe. Is there a way to do row-wise prediction? The use case is to use the model's predict function to fill in empty values in the dataframe.

I expected that this would be possible using the pandas apply function (with axis=1), but I keep getting dimensionality errors.

Using Pandas version '0.22.0' and sklearn version '0.19.1'.

Simple example:

import pandas as pd
from sklearn.cluster import KMeans

data = [[x,y,x*y] for x in range(1,10) for y in range(10,15)]

df = pd.DataFrame(data,columns=['input1','input2','output'])

model = KMeans()
# KMeans is unsupervised: the second argument (y) is accepted but ignored
model.fit(df[['input1','input2']],df['output'])

df['predictions'] = df[['input1','input2']].apply(model.predict,axis=1)

The resulting dimensionality error:

ValueError: ('Expected 2D array, got 1D array instead:\narray=[ 1. 
10.].\nReshape your data either using array.reshape(-1, 1) if your data has 
a single feature or array.reshape(1, -1) if it contains a single sample.', 
'occurred at index 0')

Running predict on the whole column works fine:

df['predictions'] = model.predict(df[['input1','input2']])

However, I want the flexibility to use this row-wise.

I've tried various approaches to reshape the data first, for example:

import numpy as np
# with axis=1, apply passes each row to the function as a Series
def reshape_predict(row):
    return model.predict(np.reshape(row.values,(1,-1)))

df[['input1','input2']].apply(reshape_predict,axis=1)

This just returns the input with no error, whereas I expected it to return a single column of output values (as an array).

SOLUTION:

Thanks to Yakym for providing a working solution! Trying a few variants based on his suggestion, the easiest solution was to simply wrap the row values in square brackets (I tried this previously, but without the 0 index for the prediction, with no luck).

df['predictions'] = df[['input1','input2']].apply(lambda x: model.predict([x])[0],axis=1)
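For the original use case (filling in only the empty values), a minimal sketch might look like the following. The `cluster` column, the every-other-row gap pattern, and the `n_clusters=3`, `n_init=10`, `random_state=0` settings are assumptions made for illustration and are not part of the question. The model is fit on a plain array here so that predict can be given plain arrays as well.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = [[x, y, x * y] for x in range(1, 10) for y in range(10, 15)]
df = pd.DataFrame(data, columns=["input1", "input2", "output"])
features = ["input1", "input2"]

# Assumed settings: 3 clusters and a fixed seed, only to make the run repeatable.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(df[features].values)

# Hypothetical column with gaps: the label is pre-filled for every other row only.
df["cluster"] = np.nan
known = df.index % 2 == 0
df.loc[known, "cluster"] = model.predict(df.loc[known, features].values)

# Row-wise fill: call predict once per row, but only where the value is missing.
# [row.values] wraps the 1D row into a single-sample 2D input; [0] unwraps the result.
missing = df["cluster"].isna()
df.loc[missing, "cluster"] = df.loc[missing, features].apply(
    lambda row: model.predict([row.values])[0], axis=1
)
```

Calling predict once per row is much slower than one call on the whole selection, so `model.predict(df.loc[missing, features].values)` is probably the better choice whenever per-row flexibility is not needed.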
Vivek Kumar
user3304496

1 Answer

Slightly more verbosely, you can turn each row into a 2D array by adding a new axis to the values. You then have to access the prediction at index 0:

df["predictions"] = df[["input1", "input2"]].apply(
    lambda s: model.predict(s.values[None])[0], axis=1
)
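To see why this helps, it may be useful to look at the shapes involved. The sketch below uses a hand-built row (the values are taken from the first row of the example data) and compares the 1D array that apply passes along with three equivalent ways of turning it into a single-sample 2D array.

```python
import numpy as np
import pandas as pd

# A single row, as apply(..., axis=1) would pass it to the function.
row = pd.Series({"input1": 1, "input2": 10})

flat = row.values                          # 1D: this is what predict rejects
with_new_axis = row.values[None]           # new leading axis -> one sample, two features
with_reshape = row.values.reshape(1, -1)   # the reshape suggested by the error message
with_brackets = np.asarray([row.values])   # wrapping in a list, as in the question's solution

print(flat.shape)           # (2,)
print(with_new_axis.shape)  # (1, 2)
print(with_reshape.shape)   # (1, 2)
print(with_brackets.shape)  # (1, 2)
```

All three 2D forms hold the same data, so predict returns a one-element array for each of them, which is why the `[0]` index is needed to get a scalar per row.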
hilberts_drinking_problem