
I am using the iris data set from sklearn. I need to split the data, sample the training set without replacement according to a list of proportions, fit a Naive Bayes classifier, record the scores, and return a dictionary that maps each sample size used to fit the model (the key) to the corresponding scores (a tuple of training score and test score).

I need some help with the part that returns the dictionary. Below is what I have done to build it. I am unsure whether it is correct or whether there is a better way to do this.

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Two separate lists; `score_list=shape_list=[]` would bind both names to ONE list
score_list, shape_list = [], []
iris = load_iris()
props = [0.2, 0.5, 0.7, 0.9]
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
y = df['target']               # labels as a Series
X = df.drop(columns='target')  # the four feature columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    train_size=0.7)
for i in props:
    # Draw row labels from the training set without replacement
    ix = np.random.choice(X_train.index, size=int(i*len(X_train)), replace=False)
    sampleX = X_train.loc[ix]
    sampleY = y_train.loc[ix]
    modelNB = MultinomialNB()
    modelNB.fit(sampleX, sampleY)
    train_score = modelNB.score(sampleX, sampleY)
    test_score = modelNB.score(X_test, y_test)
    score_list.append((train_score, test_score))
    shape_list.append(sampleX.shape[0])
# Map each sample size to its (train_score, test_score) tuple
print(dict(zip(shape_list, score_list)))
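A side note on setting up the two lists, as a small plain-Python sketch: a chained assignment of the form `a = b = []` binds both names to one list object, so anything appended through one name also shows up under the other. The variable names below are illustrative only and are not taken from the code above.

```python
# Chained assignment: both names refer to the SAME list object.
a = b = []
a.append("score")
b.append("size")
print(a is b)  # True
print(a)       # ['score', 'size']

# Separate lists: each name gets its own object.
c, d = [], []
c.append("score")
d.append("size")
print(c is d)  # False
print(c, d)    # ['score'] ['size']
```

Zipping such a shared list with itself maps every element to itself, which is not a size-to-score mapping, so the two lists need to be created separately.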
desertnaut
freshman_2021
  • Welcome to Stack Overflow, please read [tour] and [mre] and in this case also: [how-to-make-good-reproducible-pandas-examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Andreas Aug 08 '21 at 18:45
  • Multiple variables/objects are not defined, e.g. `MultinomialNB`, `size`, `both`, ... – Andreas Aug 08 '21 at 18:45
  • I'll make the necessary edits – freshman_2021 Aug 08 '21 at 18:49

2 Answers


This should do it: build the dictionary directly inside the loop, using the sample size as the key.

# Create a dictionary to collect the results
results = {}
for i in props:
    size = int(i*len(X_train))
    ix = np.random.choice(X_train.index, size=size, replace = False)
    sampleX = X_train.loc[ix]
    sampleY = y_train.loc[ix]
    modelNB = MultinomialNB()
    modelNB.fit(sampleX, sampleY)
    train_score = modelNB.score(sampleX,sampleY)
    test_score = modelNB.score(X_test,y_test)

    # insert the scores into the dictionary, using size as the key
    results[size] = (train_score, test_score)
    
print(results)
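If the task calls for returning the dictionary rather than printing it, the same loop can be wrapped in a function. The following is only a sketch: the function name `scores_by_sample_size`, the `seed` parameter and `random_state=0` are my own assumptions, added so that repeated runs give the same result. It works on NumPy arrays with positional indices instead of the DataFrame index.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def scores_by_sample_size(X_train, y_train, X_test, y_test, props, seed=0):
    """Return {sample_size: (train_score, test_score)} for each proportion."""
    rng = np.random.default_rng(seed)  # seeded generator (assumption, for repeatability)
    results = {}
    for p in props:
        size = int(p * len(X_train))
        # Positional row indices, drawn without replacement
        ix = rng.choice(len(X_train), size=size, replace=False)
        sample_X, sample_y = X_train[ix], y_train[ix]
        model = MultinomialNB().fit(sample_X, sample_y)
        results[size] = (model.score(sample_X, sample_y),
                         model.score(X_test, y_test))
    return results


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)  # random_state is an assumption

props = [0.2, 0.5, 0.7, 0.9]
results = scores_by_sample_size(X_train, y_train, X_test, y_test, props)
print(results)
```

Because the sample size is the dictionary key, two proportions that round to the same size would overwrite each other. That does not happen with the four proportions used here.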
Abhishek Prajapat

Maybe a DataFrame view of the results would suit you better:

list_size = list()
list_train_score = list()
list_test_score = list()
for i in props:
    size = int(i*len(X_train))
    ix = np.random.choice(X_train.index, size=size, replace = False)
    sampleX = X_train.loc[ix]
    sampleY = y_train.loc[ix]
    modelNB = MultinomialNB()
    modelNB.fit(sampleX, sampleY)
    train_score = modelNB.score(sampleX,sampleY)
    test_score = modelNB.score(X_test,y_test)

    list_size.append(size)
    list_train_score.append(train_score)
    list_test_score.append(test_score)
    


df = pd.DataFrame(list(zip(list_size, list_train_score, list_test_score)), 
                  columns =['size', 'train_score', 'test_score'])

df

Output:

[image: the resulting DataFrame with columns size, train_score and test_score; not reproduced here]
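If the scores are already held in a dictionary keyed by sample size, the same table can probably be built from it directly with `pd.DataFrame.from_dict`. This is a rough sketch only: the numbers in `results` are placeholders made up for illustration, not real scores, and the name `scores` is my own.

```python
import pandas as pd

# Placeholder values for illustration only; these are NOT real results.
results = {21: (0.90, 0.80), 52: (0.85, 0.82)}

# Each dictionary key becomes a row label, each tuple becomes one row.
scores = pd.DataFrame.from_dict(results, orient="index",
                                columns=["train_score", "test_score"])

# Turn the row labels into a regular 'size' column.
scores = scores.rename_axis("size").reset_index()
print(scores)
```

This keeps a single data structure (the dictionary) as the source of truth instead of three parallel lists.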

I'mahdi