I am writing a machine learning model to predict whether the output will be +50000 or -50000. I use 11 string features with a random forest classifier. Since RandomForestClassifier requires numeric input, I use DictVectorizer to convert the string features to numbers. However, DictVectorizer creates a different number of features (240-260) for different rows in the data, and this causes an error when predicting output from the model. One sample input row is:
{'detailed household summary in household': ' Spouse of householder',
'tax filer stat': ' Joint both under 65',
'weeks worked in year': ' 52',
'age': '32',
'sex': ' Female',
'marital status': ' Married-civilian spouse present',
'full or part time employment stat': ' Full-time schedules',
'detailed household and family stat': ' Spouse of householder',
'education': ' Bachelors degree(BA AB BS)',
'num persons worked for employer': ' 3',
'major occupation code': ' Adm support including clerical'}
Is there some way to convert the input so that I can use a random forest classifier to predict the output?
Edit: The code which I am using to do so is:
import csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

# columnNames (defined elsewhere) holds the CSV column names
X, Y = [], []
features = [0, 4, 7, 9, 12, 15, 19, 22, 23, 30, 39]
with open("census_income_learn.csv", "r") as fl:
    reader = csv.reader(fl)
    for row in reader:
        # Build one dict of the selected string features per row
        data = {}
        for i in features:
            data[columnNames[i]] = str(row[i])
        X.append(data)
        Y.append(str(row[41]))
X_train, X_validate, Y_train, Y_validateActual = train_test_split(X, Y, test_size=0.2, random_state=32)
vec = DictVectorizer()
X_train = vec.fit_transform(X_train).toarray()
X_validate = vec.fit_transform(X_validate).toarray()
print("data ready")
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(X_train, Y_train)
print("model created")
Y_predicted = forest.predict(X_validate)
print(Y_predicted)
If I print the first element of the training set and of the validation set, X_train[0] has 252 features, whereas X_validate[0] has only 249.
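For reference, the mismatch can be reproduced on a tiny scale. The sketch below is only an illustration: the rows are made-up toy data, not taken from census_income_learn.csv, and it assumes the behaviour documented for scikit-learn's DictVectorizer, where each fit learns its own vocabulary and transform silently ignores values that were not seen during fit. It compares fitting separately on each subset with fitting once on the training rows and reusing that vectorizer.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy rows, chosen only to illustrate the effect
train_rows = [
    {"sex": " Female", "education": " Bachelors degree(BA AB BS)"},
    {"sex": " Male", "education": " High school graduate"},
    {"sex": " Female", "education": " Masters degree"},
]
validate_rows = [
    {"sex": " Male", "education": " High school graduate"},
    {"sex": " Female", "education": " Doctorate degree"},
]

# Fitting separately: each subset gets its own vocabulary,
# so the number of columns depends on the values present in that subset
separate_train = DictVectorizer().fit_transform(train_rows).toarray()
separate_validate = DictVectorizer().fit_transform(validate_rows).toarray()
print(separate_train.shape[1], separate_validate.shape[1])

# Fitting once on the training rows and only transforming the validation rows:
# both arrays share the training vocabulary and therefore the same columns
vec = DictVectorizer()
shared_train = vec.fit_transform(train_rows).toarray()
shared_validate = vec.transform(validate_rows).toarray()
print(shared_train.shape[1], shared_validate.shape[1])
```

If this is indeed the cause, the widths in the first pair differ because the two subsets contain different category values, while the second pair always matches the training width.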