Create submodels with pandas groupby and locate each model with test data

Question

I have a pandas dataframe in which values in a column are used as the group-by basis to create submodels.

import pandas as pd
from sklearn.linear_model import Ridge

data = pd.DataFrame({"Name": ["A", "A", "A", "B", "B", "B"], "Score": [90, 80, 90, 92, 87, 80], "Age": [10, 12, 14, 9, 11, 12], "Training": [0, 1, 2, 0, 1, 2]})

"Name" is used as the basis to create submodel for each individual. I want o use variable "Age" and "Training" to predict "Score" of one individual "Name" (i.e "A" and "B" in this case). That is, if I have "A" and know the "Age" and "Training" of "A", I would love to use "A", "Age", "Training" to predict "Score". However, "A" should be used to access to the model that "A" belongs to other than other model.

grouped_df = data.groupby(['Name'])
for key, item in grouped_df:
    Score = grouped_df['Score']
    Y = grouped_df['Age', 'Training']
    Score_item = Score.get_group(key)
    Y_item = Y.get_group(key)
    model = Ridge(alpha = 1.2)
    modelfit = model.fit(Y_item, Score_item)
    modelpred = model.predict(Y_item)
    modelscore = model.score(Y_item, Score_item)
    print modelscore

Up to this point, I have built a simple Ridge model for each of the sub-groups A and B.

My question is, with test data as below:

test_data = [u"A, 13, 0", u"B, 12, 1", u"A, 10, 0"]  # each element holds `Name`, `Age` and `Training`, in that order

How do I feed this data to the prediction models? So far I have:

line = test_data
# split on commas and strip the space that follows each comma
Name = [line[i].split(",")[0].strip() for i in range(len(line))]
Age = [line[i].split(",")[1].strip() for i in range(len(line))]
Training = [line[i].split(",")[2].strip() for i in range(len(line))]
Y = pd.DataFrame({"Name": Name, "Age": Age, "Training": Training})

This gives me a pandas dataframe of the test data. However, I am not sure how to proceed and feed the test data to the right model. I would highly appreciate your help. Thank you!!
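A minimal sketch of another way to build that test frame with numeric columns, assuming every field in each string is separated by a comma. The variable name `test_df` and the sample strings are illustrative only; `pd.read_csv` reads the joined lines from an in-memory buffer, and `skipinitialspace=True` drops the blank after each comma.

```python
import io

import pandas as pd

# Illustrative test records; each field is assumed to be comma-separated.
test_data = ["A, 13, 0", "B, 12, 1", "A, 10, 0"]

# Join the strings into CSV text and let pandas infer the numeric columns.
test_df = pd.read_csv(
    io.StringIO("\n".join(test_data)),
    header=None,
    names=["Name", "Age", "Training"],
    skipinitialspace=True,
)

print(test_df.dtypes)
```

With numeric `Age` and `Training` columns, the rows can be passed to a fitted model without a separate string-to-float conversion.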

UPDATE

After adopting Parfait's code, the code looks better. However, I did not create another pandas dataframe for the test data (I am not sure how to handle its rows). Instead, I feed in the test values by splitting the strings. I obtained the error shown below. I searched and found a related post, "Preprocessing in scikit learn - single sample - Depreciation warning". I tried to reshape the test data, but it is a list, so it has no reshape attribute. I think I am misunderstanding something. I would highly appreciate it if you could let me know how to fix this error. Thank you.

import pandas as pd
from sklearn.linear_model import Ridge
import numpy as np

data = pd.DataFrame({"Name": ["A", "A", "A", "B", "B", "B"], "Score": [90, 80, 90, 92, 87, 80], "Age": [10, 12, 14, 9, 11, 12], "Training": [0, 1, 2, 0, 1, 2]})


modeldict = {}                                           # INITIALIZE DICT
grouped_df = data.groupby(['Name'])

for key, item in grouped_df:
    Score = grouped_df['Score']
    Y = grouped_df['Age', 'Training']
    Score_item = Score.get_group(key)
    Y_item = Y.get_group(key)
    model = Ridge(alpha = 1.2)
    modelfit = model.fit(Y_item, Score_item)
    modelpred = model.predict(Y_item)
    modelscore = model.score(Y_item, Score_item)
    modeldict[key] = modelfit                            # SAVE EACH FITTED MODEL TO DICT


line = [u"A, 13, 0", u"B, 12, 1", u"A, 10, 0"]
Name = [line[i].split(",")[0] for i in range(len(line))]
Age = [line[i].split(",")[1] for i in range(len(line))]
Training = [line[i].split(",")[2] for i in range(len(line))]


for i in range(len(line)):
    Name = line[i].split(",")[0]
    Age = line[i].split(",")[1]
    Training = line[i].split(",")[2]
    model = modeldict[Name]
    ip = [float(Age), float(Training)]
    score = model.predict(ip)

    print score

ERROR

/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)
86.6666666667
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
83.5320600273
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
86.6666666667
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
[ 86.66666667]
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)
[ 83.53206003]
/opt/conda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)
[ 86.66666667]
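The warning asks for two-dimensional input of shape (samples, features). A minimal sketch of what that looks like for a single sample held in a list, using the rows for Name "A" from the question as training data. The names `X_train`, `y_train`, `sample`, `pred` and `pred_alt` are illustrative, not from the original code.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Training rows for Name "A": columns are Age and Training.
X_train = np.array([[10, 0], [12, 1], [14, 2]], dtype=float)
y_train = np.array([90, 80, 90], dtype=float)
model = Ridge(alpha=1.2).fit(X_train, y_train)

ip = [13.0, 0.0]  # one sample as a flat list

# A list has no .reshape, so convert it to an array first, then make it 1 x 2.
sample = np.array(ip).reshape(1, -1)
pred = model.predict(sample)  # array holding one prediction

# Wrapping the list in another list produces the same 2-D shape.
pred_alt = model.predict([ip])

print(sample.shape, pred[0])
```

Either form gives `predict` one row with two features, which is the layout the message describes for "a single sample".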

score 0 · Answer 1 · answered Aug 07 '16 at 01:28

Consider saving the submodels in a dictionary keyed by Name, then use pandas.DataFrame.apply() to run the prediction on each row, matching the row's Name to the corresponding model.

NOTE: The code below is untested but hopefully gives a general idea that you can adjust as needed. The main issue might be the input and output of model.predict() in the function runModel, which is passed to apply(). A two-dimensional numpy array holding the row's Age and Training values is passed to model.predict(), which should return a numpy array with one prediction per sample (here, one per row). See Ridge model:

import numpy as np

modeldict = {}                                           # INITIALIZE DICT
grouped_df = data.groupby('Name')                        # STRING KEY SO EACH GROUP KEY IS A PLAIN NAME

for key, item in grouped_df:
    Score_item = item['Score']                           # TARGET FOR THIS NAME
    Y_item = item[['Age', 'Training']]                   # FEATURES FOR THIS NAME
    model = Ridge(alpha = 1.2)
    modelfit = model.fit(Y_item, Score_item)
    modelpred = model.predict(Y_item)
    modelscore = model.score(Y_item, Score_item)
    print(modelscore)

    modeldict[key] = modelfit                            # SAVE EACH FITTED MODEL TO DICT

line = [u"A, 13, 0", u"B, 12, 1", u"A, 10, 0"]
fields = [[f.strip() for f in s.split(",")] for s in line]   # SPLIT ON COMMAS, STRIP SPACES
Name = [f[0] for f in fields]
Age = [float(f[1]) for f in fields]                      # CONVERT STRINGS TO NUMBERS
Training = [float(f[2]) for f in fields]

testdata = pd.DataFrame({"Name": Name, "Age": Age, "Training": Training})

def runModel(row):
    # LOCATE MODEL BY NAME KEY
    model = modeldict[row['Name']]
    # PREDICT FROM A 2-D ARRAY HOLDING ONE SAMPLE
    score = model.predict(np.array([[row['Age'], row['Training']]]))
    # RETURN SCALAR FROM score ARRAY
    return score[0]

testdata['predictedScore'] = testdata.apply(runModel, axis=1)
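As a possible alternative to the row-wise apply(), the test rows can themselves be grouped by Name so that each model predicts all of its rows in one call. This is only a sketch under the assumption that the test data is already a dataframe with numeric Age and Training columns; the `features` list and the loop structure are illustrative and not part of the code above.

```python
import pandas as pd
from sklearn.linear_model import Ridge

data = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "B"],
    "Score": [90, 80, 90, 92, 87, 80],
    "Age": [10, 12, 14, 9, 11, 12],
    "Training": [0, 1, 2, 0, 1, 2],
})
features = ["Age", "Training"]

# One fitted Ridge model per Name, keyed by the group label.
modeldict = {
    name: Ridge(alpha=1.2).fit(group[features], group["Score"])
    for name, group in data.groupby("Name")
}

testdata = pd.DataFrame({
    "Name": ["A", "B", "A"],
    "Age": [13.0, 12.0, 10.0],
    "Training": [0.0, 1.0, 0.0],
})

# Predict each Name's rows in a single call and write them back by index label.
testdata["predictedScore"] = float("nan")
for name, group in testdata.groupby("Name"):
    testdata.loc[group.index, "predictedScore"] = modeldict[name].predict(group[features])

print(testdata)
```

A Name in the test data that has no fitted model would raise a KeyError here, so unseen names need separate handling.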
