
I am trying to build a machine learning model to predict whether the output will be +50000 or -50000, using 11 string features with a random forest classifier. Since RandomForestClassifier requires numeric input, I am using DictVectorizer to convert the string features to numbers. But for different rows in the data, DictVectorizer creates a different number of features (240-260). This causes an error when predicting output from the model. One sample input row is:

{'detailed household summary in household': ' Spouse of householder',
 'tax filer stat': ' Joint both under 65',
 'weeks worked in year': ' 52',
 'age': '32', 
 'sex': ' Female',
 'marital status': ' Married-civilian spouse present',
 'full or part time employment stat': ' Full-time schedules',
 'detailed household and family stat': ' Spouse of householder', 
 'education': ' Bachelors degree(BA AB BS)',
 'num persons worked for employer': ' 3',
 'major occupation code': ' Adm support including clerical'}
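
The varying column count comes from the way DictVectorizer one-hot encodes string values: every distinct `feature=value` pair it sees becomes its own column, so the column set depends entirely on which values it was fitted on. A minimal illustration using a trimmed-down version of the row above (the printed feature names are just what this small example produces):

    from sklearn.feature_extraction import DictVectorizer

    # trimmed-down version of the sample row above
    row = {'age': '32', 'sex': ' Female', 'education': ' Bachelors degree(BA AB BS)'}

    vec = DictVectorizer()
    vec.fit_transform([row]).toarray()
    # each distinct string value gets its own 0/1 column
    print(vec.feature_names_)
    # e.g. ['age=32', 'education= Bachelors degree(BA AB BS)', 'sex= Female']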

Is there some way I can convert the input so that I can use a random forest classifier to predict the output?

Edit: The code I am using to do so is:

    import csv
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X,Y=[],[]
    features=[0,4,7,9,12,15,19,22,23,30,39]
    with open("census_income_learn.csv","r") as fl:
        reader=csv.reader(fl)
        for row in reader:
            data={}
            for i in features:
                data[columnNames[i]]=str(row[i])
            X.append(data)
            Y.append(str(row[41]))

    X_train, X_validate, Y_train, Y_validateActual = train_test_split(X, Y, test_size=0.2, random_state=32)

    vec = DictVectorizer()
    X_train=vec.fit_transform(X_train).toarray()
    X_validate=vec.fit_transform(X_validate).toarray()
    print("data ready")

    forest = RandomForestClassifier(n_estimators = 100)
    forest = forest.fit( X_train, Y_train )
    print("model created")

    Y_predicted=forest.predict(X_validate)
    print(Y_predicted)

So if I print the first elements of the training and validation sets, I get 252 features in X_train[0], whereas there are 249 features in X_validate[0].
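
Note that both calls above use fit_transform, so the vectorizer is refitted on the validation rows and learns a different set of columns. A minimal sketch (same variable names as above) of fitting it once on the training split and only transforming the validation split, which keeps the column count identical:

    vec = DictVectorizer()
    X_train = vec.fit_transform(X_train).toarray()    # learn the feature set from the training rows
    X_validate = vec.transform(X_validate).toarray()  # reuse that feature set; unseen values are silently ignored

    print(X_train.shape[1] == X_validate.shape[1])    # True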

sohil

1 Answer


Try this:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    cols = [0,4,7,9,12,15,19,22,23,30,39,  41]
    names = [
     'detailed household summary in household',
     'sex',
     'full or part time employment stat',
     'age',
     'detailed household and family stat',
     'weeks worked in year',
     'num persons worked for employer',
     'major occupation code',
     'tax filer stat',
     'education',
     'marital status',
     'TARGET'
    ]

    fn = r'D:\temp\.data\census_income_learn.csv'
    data = pd.read_csv(fn, header=None, usecols=cols, names=names)

    # http://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
    df = data.apply(LabelEncoder().fit_transform)

    X, Y = np.split(df, [11], axis=1)
    X_train, X_validate, Y_train, Y_validateActual = train_test_split(X, Y, test_size=0.2, random_state=32)

    forest = RandomForestClassifier(n_estimators = 100)
    forest = forest.fit( X_train, Y_train )

    Y_predicted = forest.predict(X_validate)
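
A possible follow-up, assuming the same variable names as above (Y_validateActual is the single-column frame produced by the split), to check how the model does on the held-out data:

    from sklearn.metrics import accuracy_score

    print(accuracy_score(Y_validateActual.values.ravel(), Y_predicted))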
MaxU - stand with Ukraine
  • This worked for me. I'm getting a warning **DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel(). forest = forest.fit( X_train, Y_train )**. Thanks. – sohil Jan 12 '17 at 23:41
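
As the warning itself suggests, passing the target as a 1-d array makes it go away; a small tweak, assuming Y_train is the single-column frame produced by np.split above:

    forest = forest.fit(X_train, Y_train.values.ravel())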