
I am trying to build a machine learning model to predict whether the output will be +50000 or -50000, using 11 string features with a random forest classifier. Since RandomForestClassifier requires numeric input, I am using DictVectorizer to convert the string features to numbers. But for different rows in the data, DictVectorizer creates a different number of features (240-260). This causes an error when predicting output from the model. One sample input row is:

{'detailed household summary in household': ' Spouse of householder',
 'tax filer stat': ' Joint both under 65',
 'weeks worked in year': ' 52',
 'age': '32', 
 'sex': ' Female',
 'marital status': ' Married-civilian spouse present',
 'full or part time employment stat': ' Full-time schedules',
 'detailed household and family stat': ' Spouse of householder', 
 'education': ' Bachelors degree(BA AB BS)',
 'num persons worked for employer': ' 3',
 'major occupation code': ' Adm support including clerical'}
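
The varying column count comes from the way DictVectorizer one-hot encodes string values: every distinct `feature=value` pair it sees becomes its own column, so the column set depends entirely on which values it was fitted on. A minimal illustration using a trimmed-down version of the row above (the printed feature names are just what this small example produces):

    from sklearn.feature_extraction import DictVectorizer

    # trimmed-down version of the sample row above
    row = {'age': '32', 'sex': ' Female', 'education': ' Bachelors degree(BA AB BS)'}

    vec = DictVectorizer()
    vec.fit_transform([row]).toarray()
    # each distinct string value gets its own 0/1 column
    print(vec.feature_names_)
    # e.g. ['age=32', 'education= Bachelors degree(BA AB BS)', 'sex= Female']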

Is there some way I can convert the input so that I can use a random forest classifier to predict the output?

Edit: The code I am using to do so is:

    import csv
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X,Y=[],[]
    features=[0,4,7,9,12,15,19,22,23,30,39]
    with open("census_income_learn.csv","r") as fl:
        reader=csv.reader(fl)
        for row in reader:
            data={}
            for i in features:
                data[columnNames[i]]=str(row[i])
            X.append(data)
            Y.append(str(row[41]))

    X_train, X_validate, Y_train, Y_validateActual = train_test_split(X, Y, test_size=0.2, random_state=32)

    vec = DictVectorizer()
    X_train=vec.fit_transform(X_train).toarray()
    X_validate=vec.fit_transform(X_validate).toarray()
    print("data ready")

    forest = RandomForestClassifier(n_estimators = 100)
    forest = forest.fit( X_train, Y_train )
    print("model created")

    Y_predicted=forest.predict(X_validate)
    print(Y_predicted)

So if I print the first elements of the training and validation sets, I get 252 features in X_train[0], whereas there are 249 features in X_validate[0].
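
Note that both calls above use fit_transform, so the vectorizer is refitted on the validation rows and learns a different set of columns. A minimal sketch (same variable names as above) of fitting it once on the training split and only transforming the validation split, which keeps the column count identical:

    vec = DictVectorizer()
    X_train = vec.fit_transform(X_train).toarray()    # learn the feature set from the training rows
    X_validate = vec.transform(X_validate).toarray()  # reuse that feature set; unseen values are silently ignored

    print(X_train.shape[1] == X_validate.shape[1])    # True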

sohil

1 Answer


Try this:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    cols = [0,4,7,9,12,15,19,22,23,30,39,  41]
    names = [
     'detailed household summary in household',
     'sex',
     'full or part time employment stat',
     'age',
     'detailed household and family stat',
     'weeks worked in year',
     'num persons worked for employer',
     'major occupation code',
     'tax filer stat',
     'education',
     'marital status',
     'TARGET'
    ]

    fn = r'D:\temp\.data\census_income_learn.csv'
    data = pd.read_csv(fn, header=None, usecols=cols, names=names)

    # http://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
    df = data.apply(LabelEncoder().fit_transform)

    X, Y = np.split(df, [11], axis=1)
    X_train, X_validate, Y_train, Y_validateActual = train_test_split(X, Y, test_size=0.2, random_state=32)

    forest = RandomForestClassifier(n_estimators = 100)
    forest = forest.fit( X_train, Y_train )

    Y_predicted = forest.predict(X_validate)
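
A possible follow-up, assuming the same variable names as above (Y_validateActual is the single-column frame produced by the split), to check how the model does on the held-out data:

    from sklearn.metrics import accuracy_score

    print(accuracy_score(Y_validateActual.values.ravel(), Y_predicted))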
MaxU - stand with Ukraine
  • This worked for me. I'm getting a warning **DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel(). forest = forest.fit( X_train, Y_train )**. Thanks. – sohil Jan 12 '17 at 23:41
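
As the warning itself suggests, passing the target as a 1-d array makes it go away; a small tweak, assuming Y_train is the single-column frame produced by np.split above:

    forest = forest.fit(X_train, Y_train.values.ravel())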