I am using the method below to train a linear regressor that predicts the retweet count of tweets. I use 'text' as the only feature and 'retweet_count' as the target. However, my data has several additional numerical features, such as hasMedia, hasHashtag, followers_count and sentiment. How can I combine these features with 'text' after it has been converted to a TF-IDF vector?
I already tried concatenating pandas DataFrames, but when I then pass in new test data, the features no longer match. Please see my question "Attributes mismatch between training and testing data in sklearn - linear regression".
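import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def predict_retweets(dataset):
    # identity_tokenizer is my own helper, defined elsewhere
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
    # Only the TF-IDF vectors of 'text' are used as features so far
    keyword_response = tfidf.fit_transform(dataset['text']).toarray()
    X = keyword_response
    y = dataset['retweet_count']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    # Compare actual and predicted retweet counts on the held-out split
    df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
    print(df)
    return None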
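For illustration, here is a minimal sketch of one possible way to combine the TF-IDF vector of 'text' with the numerical columns, using scikit-learn's ColumnTransformer inside a Pipeline. It is not the method above: it assumes 'text' holds raw strings (so the default tokenizer is used instead of identity_tokenizer), it scales the numerical columns with StandardScaler, and the toy rows, `build_model`, `NUMERIC_COLS` and `FEATURE_COLS` are made up for the example. Because the vocabulary and the scaling are learned once in `fit`, new data passed to `predict` is mapped onto the same columns, which should avoid the feature mismatch.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Column names taken from the question; the grouping constants are hypothetical.
NUMERIC_COLS = ["hasMedia", "hasHashtag", "followers_count", "sentiment"]
FEATURE_COLS = ["text"] + NUMERIC_COLS


def build_model():
    """Pipeline that vectorizes 'text' and scales the numerical columns."""
    features = ColumnTransformer([
        # A single column name (not a list) hands the vectorizer a 1-D column of strings.
        ("text", TfidfVectorizer(), "text"),
        # Assumption: scaling keeps followers_count from dominating the TF-IDF values.
        ("numeric", StandardScaler(), NUMERIC_COLS),
    ])
    return Pipeline([("features", features), ("regressor", LinearRegression())])


# Toy data, invented purely so the sketch runs end to end.
train = pd.DataFrame({
    "text": ["great game tonight", "new phone released", "rain again today", "big win for the team"],
    "hasMedia": [1, 0, 0, 1],
    "hasHashtag": [1, 1, 0, 0],
    "followers_count": [1200, 300, 50, 8000],
    "sentiment": [0.8, 0.1, -0.4, 0.9],
    "retweet_count": [15, 2, 0, 40],
})

model = build_model()
model.fit(train[FEATURE_COLS], train["retweet_count"])

# New tweets go through the already fitted vocabulary and scaler.
new = pd.DataFrame({
    "text": ["what a game", "phone battery died"],
    "hasMedia": [0, 1],
    "hasHashtag": [0, 0],
    "followers_count": [500, 90],
    "sentiment": [0.6, -0.7],
})
predictions = model.predict(new[FEATURE_COLS])
print(predictions)
```

Words in the new tweets that were not seen during training are simply ignored by the fitted vectorizer, so the feature matrix keeps the same width as at training time.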
Sample of data