Attributes mismatch between training and testing data in sklearn - linear regression

Question

I am trying to train a linear regression model with sklearn to predict the number of likes a tweet will get. I use the following features (attributes):

 ['id', 'month', 'hour', 'text', 'hasMedia', 'hasHashtag', 'followers_count', 'retweet_count', 'favourite_count', 'sentiment', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust', ......keywords............]

I use TfidfVectorizer to extract keywords. The problem is that the number of keywords depends on the size of the training data, so the number of independent attributes differs too. This causes a mismatch between the training and testing attributes, and I get ValueError: Shape of passed values is (1, 1678), indices imply (1, 1928).
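A minimal sketch of why the column counts differ, using toy token lists rather than real tweets. The identity_tokenizer defined here is an assumption (the question uses one but does not show it), and the corpora are made up for illustration. Refitting the vectorizer learns a new vocabulary each time, while fitting once and calling transform() on later data keeps the training width.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity_tokenizer(tokens):
    # Assumed helper: documents are already token lists, so return them unchanged.
    return tokens


def make_vectorizer():
    return TfidfVectorizer(tokenizer=identity_tokenizer, lowercase=False)


# Toy corpora standing in for training sets of different sizes.
small_corpus = [["cat", "sat"], ["dog", "ran"]]
large_corpus = small_corpus + [["bird", "flew", "south"]]

# Refitting on each corpus learns a different vocabulary, hence a different width.
small_width = make_vectorizer().fit_transform(small_corpus).shape[1]
large_width = make_vectorizer().fit_transform(large_corpus).shape[1]

# Fitting once and calling transform() on new data keeps the training width.
tfidf = make_vectorizer().fit(large_corpus)
new_tweet = [["cat", "unseen_word"]]
new_width = tfidf.transform(new_tweet).shape[1]

print(small_width, large_width, new_width)
```

Tokens that were not in the training vocabulary are ignored by transform(), so a single tweet always maps to the same columns the model was trained on.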

It works fine when I split the same data into train and test sets and predict on the test set, as below.

Program for training and prediction

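import os

import joblib
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def train_favourite_prediction(result):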
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()

    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

    # now you can save it to a file
    joblib.dump(regressor, os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))

    return None


def predict_favourites(result):
    result = result.drop(['retweet_count'], axis=1)
    result = result.dropna()

    X = result.loc[:, result.columns != 'favourite_count']
    y = result['favourite_count']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Load the regressor saved by train_favourite_prediction
    regressor = joblib.load(os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))

    coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])

    print(coeff_df)

    y_pred = regressor.predict(X_test)

    df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

    print(df)
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

    print("the large training just finished")

    return None

Code for fit vectorization

Have a look at Applying Tfidfvectorizer on list of pos tags gives ValueError to understand the format of my 'text' column.

def ready_for_training(dataset):
    # Copy the slice so the column assignments below do not trigger SettingWithCopyWarning
    dataset = dataset.head(1000).copy()
    dataset['text'] = dataset.text.apply(lambda x: literal_eval(x))
    dataset['text'] = dataset['text'].apply(
        lambda row: [item for sublist in row for item in sublist])


    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)

    keyword_response = tfidf.fit_transform(dataset['text'])
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    keyword_matrix = pd.DataFrame(keyword_response.todense(), columns=tfidf.get_feature_names_out())
    keyword_matrix = keyword_matrix.loc[:, (keyword_matrix != 0).any(axis=0)]


    dataset['sentiments'] = dataset['sentiments'].map(eval)
    dataset = pd.concat([dataset.drop(['sentiments'], axis=1), dataset['sentiments'].apply(pd.Series)], axis=1)
    dataset = dataset.drop(['neg', 'neu','pos'], axis=1)

    dataset['emotions'] = dataset['emotions'].map(eval)
    dataset = pd.concat([dataset.drop(['emotions'], axis=1), dataset['emotions'].apply(pd.Series)], axis=1)

    dataset = dataset.drop(['id', 'month', 'text'], axis=1)

    result = pd.concat([dataset, keyword_matrix], axis=1, sort=False)

    return result

What I need is to predict 'favourite_count' for a single new tweet. Extracting keywords from that one tweet yields only a few, whereas the model was trained with 1000+ keyword columns. I have stored the trained model in a .pkl file. How should I handle this mismatch of attributes? To fill in the missing columns for the test tweet, as in Keep same dummy variable in training and testing data, I may need the training set as a DataFrame. But I have only stored the trained model as a .pkl file, so I cannot access the columns it was trained on.
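One possible way around this, sketched under assumptions: store the training column list in the same .pkl as the regressor, then align each new tweet to it with DataFrame.reindex. The toy data, the kw_ column names and the temporary file path are all hypothetical and only illustrate the idea; this is not the code from the question.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training frame; the column names are illustrative only.
X_train = pd.DataFrame({
    "hour": [1, 5, 9, 13],
    "followers_count": [10, 200, 35, 80],
    "kw_cat": [0.0, 0.7, 0.0, 0.2],
    "kw_dog": [0.5, 0.0, 0.3, 0.0],
})
y_train = pd.Series([3, 40, 8, 15])

regressor = LinearRegression().fit(X_train, y_train)

# Store the model together with the exact column order it was fitted on.
# The temporary directory is a stand-in for the real knowledge_base path.
bundle_path = os.path.join(tempfile.mkdtemp(), "knowledge_favourite.pkl")
joblib.dump({"model": regressor, "columns": list(X_train.columns)}, bundle_path)

# Later, at prediction time, load both pieces back.
bundle = joblib.load(bundle_path)

# A new tweet has only some keyword columns, plus one never seen in training.
new_tweet = pd.DataFrame(
    [{"hour": 7, "followers_count": 120, "kw_cat": 0.4, "kw_bird": 0.9}]
)

# reindex adds missing training columns as 0, drops unseen ones, and fixes the order.
aligned = new_tweet.reindex(columns=bundle["columns"], fill_value=0)
prediction = bundle["model"].predict(aligned)
print(aligned)
print(prediction)
```

Saving the fitted TfidfVectorizer in the same way and calling its transform() on the new tweet would avoid recomputing keywords with a different vocabulary in the first place; the reindex step then only guards the remaining columns.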

For y_pred = regressor.predict(X_test) I need X_test. Should I take it from the previous method instead of splitting the data again? - Kabilesh, Dec 13 '18 at 05:15
Can you also add your TfidfVectorizer code? Only then can we understand the feature mismatch. - Venkatachalam, Dec 13 '18 at 06:00
