How to combine additional features with tfidf vector

Question

I am using the following method to train a linear regressor that predicts the retweet count of tweets. I use 'text' as the feature and 'retweet_count' as the target. However, my data has several additional numerical features, such as hasMedia, hasHashtag, followers_count and sentiment. How can I combine these features with 'text' after it has been converted to a TF-IDF vector?

I already tried concatenating pandas DataFrames, but when I then pass in new test data, the features mismatch. Please see my question "Attributes mismatch between training and testing data in sklearn - linear regression".

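import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def predict_retweets(dataset):
    # identity_tokenizer is defined elsewhere in my code (not shown here)
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)

    # Convert the text column to a dense TF-IDF matrix
    keyword_response = tfidf.fit_transform(dataset['text']).toarray()

    # Only the TF-IDF features are used here; the other columns are not included yet
    X = keyword_response
    y = dataset['retweet_count']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    regressor = LinearRegression()

    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)

    # Compare actual and predicted retweet counts side by side
    df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

    print(df)

    return None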
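
One possible way to approach this, offered as a sketch under assumptions rather than a confirmed fix for the linked question, is to put the TF-IDF step and the numeric columns into a single scikit-learn `ColumnTransformer` inside a `Pipeline`. The pipeline is fitted once on the training rows and then reused unchanged on new rows. Only the column names below come from the question; the toy rows, the `numeric_cols` list, the `StandardScaler` step and the default tokenizer (used here instead of `identity_tokenizer`, assuming 'text' holds raw strings) are illustrative choices.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column names follow the question, values are made up.
train = pd.DataFrame({
    "text": ["great game tonight", "new phone released",
             "rainy day again", "love this song"],
    "hasMedia": [1, 0, 0, 1],
    "hasHashtag": [1, 1, 0, 0],
    "followers_count": [1200, 50, 300, 9800],
    "sentiment": [0.8, 0.1, -0.4, 0.9],
    "retweet_count": [15, 0, 1, 42],
})
numeric_cols = ["hasMedia", "hasHashtag", "followers_count", "sentiment"]
feature_cols = ["text"] + numeric_cols

# The text column goes through TF-IDF, the numeric columns through a scaler;
# ColumnTransformer stacks both outputs side by side into one feature matrix.
# Note that "text" is passed as a plain string so the vectorizer receives a
# 1-D column of documents.
features = ColumnTransformer([
    ("tfidf", TfidfVectorizer(), "text"),
    ("num", StandardScaler(), numeric_cols),
])

model = Pipeline([
    ("features", features),
    ("regressor", LinearRegression()),
])

# fit() learns the vocabulary and scaling from the training rows only.
model.fit(train[feature_cols], train["retweet_count"])

# New rows, including words never seen during training, go through the
# already fitted transformers.
new_tweets = pd.DataFrame({
    "text": ["great song tonight", "completely unseen words here"],
    "hasMedia": [0, 1],
    "hasHashtag": [1, 0],
    "followers_count": [700, 40],
    "sentiment": [0.5, -0.2],
})
predictions = model.predict(new_tweets[feature_cols])

# Width of the combined feature matrix for training rows and for new rows.
train_width = model.named_steps["features"].transform(train[feature_cols]).shape[1]
new_width = model.named_steps["features"].transform(new_tweets[feature_cols]).shape[1]
print(train_width, new_width)
print(predictions)
```

Because the vocabulary is learned once during `fit`, words that appear only in the new data are ignored at prediction time, so the combined matrix has the same number of columns for training and test data. Calling `fit_transform` again on the new data would instead build a different vocabulary, which is one common cause of a feature mismatch.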

Sample of data

Possible duplicate of [Append tfidf to pandas dataframe](https://stackoverflow.com/questions/45961747/append-tfidf-to-pandas-dataframe) — Mohamed Ali JAMAOUI, Dec 14 '18 at 08:30
I already tried that. Please check my issue in https://stackoverflow.com/questions/53747463/attributes-mismatch-between-training-and-testing-data-in-sklearn-linear-regres — Kabilesh, Dec 14 '18 at 08:32
Perhaps it would be easier to help with some samples of the data — yatu, Dec 14 '18 at 08:53

0 Answers