I am trying to train a linear regression model using sklearn to predict likes of given tweets. I have the following as features/ attributes.
['id', 'month', 'hour', 'text', 'hasMedia', 'hasHashtag', 'followers_count', 'retweet_count', 'favourite_count', 'sentiment', 'anger', 'anticipation', 'disgust', 'fear', 'joy', 'sadness', 'surprise', 'trust', ......keywords............]
I use tfidfvectorizer for extracting keywords. The problem is, depending on the size of the training data, the number of keywords differ and therefore, the number of independent attributes differ. Because of this there is a mismatch of attributes between training and testing data. I get ValueError: Shape of passed values is (1, 1678), indices imply (1, 1928).
It works fine when I split the same data into train and test and predict with test as below.
Program for training and prediction
def train_favourite_prediction(result):
result = result.drop(['retweet_count'], axis=1)
result = result.dropna()
X = result.loc[:, result.columns != 'favourite_count']
y = result['favourite_count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# now you can save it to a file
joblib.dump(regressor, os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
return None
def predict_favourites(result):
result = result.drop(['retweet_count'], axis=1)
result = result.dropna()
X = result.loc[:, result.columns != 'favourite_count']
y = result['favourite_count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
# and later you can load it
regressor = joblib.load(os.path.join(dirname, '../../knowledge_base/knowledge_favourite.pkl'))
coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
print(coeff_df)
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("the large training just finished")
return None
Code for fit vectorization
Have a look at Applying Tfidfvectorizer on list of pos tags gives ValueError to understand the format of my 'text' column.
def ready_for_training(dataset):
dataset = dataset.head(1000)
dataset['text'] = dataset.text.apply(lambda x: literal_eval(x))
dataset['text'] = dataset['text'].apply(
lambda row: [item for sublist in row for item in sublist])
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
keyword_response = tfidf.fit_transform(dataset['text'])
keyword_matrix = pd.DataFrame(keyword_response.todense(), columns=tfidf.get_feature_names())
keyword_matrix = keyword_matrix.loc[:, (keyword_matrix != 0).any(axis=0)]
dataset['sentiments'] = dataset['sentiments'].map(eval)
dataset = pd.concat([dataset.drop(['sentiments'], axis=1), dataset['sentiments'].apply(pd.Series)], axis=1)
dataset = dataset.drop(['neg', 'neu','pos'], axis=1)
dataset['emotions'] = dataset['emotions'].map(eval)
dataset = pd.concat([dataset.drop(['emotions'], axis=1), dataset['emotions'].apply(pd.Series)], axis=1)
dataset = dataset.drop(['id', 'month', 'text'], axis=1)
result = pd.concat([dataset, keyword_matrix], axis=1, sort=False)
return result
What I need is to predict 'favourite_count' when a new single Tweet is given. When I get the keywords for this tweet I get only a few. While training I trained with 1000+ keywords. I have stored the trained knowledge in a .pkl file. How should I handle this mismatch of attributes? To fill the missing columns in testing tweet as in Keep same dummy variable in training and testing data I may need the training set as a dataframe. But I have stored the trained knowledge as .pkl. and won't be able to access the columns in the trained knowledge.