I am using the following method to train a linear regressor that predicts the retweet count of tweets, with 'text' as the feature and 'retweet_count' as the target. My data also has several numerical features, such as hasMedia, hasHashtag, followers_count, and sentiment. How can I combine these features with the TF-IDF vector built from 'text'?

I already tried concatenating pandas DataFrames, but when I pass in new test data, the features no longer match those seen during training. See my related question, "Attributes mismatch between training and testing data in sklearn - linear regression".

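import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# identity_tokenizer is defined elsewhere in my code (not shown here)

def predict_retweets(dataset):
    tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)

    # Convert the 'text' column to a dense TF-IDF matrix
    keyword_response = tfidf.fit_transform(dataset['text']).toarray()

    X = keyword_response
    y = dataset['retweet_count']

    # Hold out 20% of the rows for evaluation
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    regressor = LinearRegression()

    regressor.fit(X_train, y_train)
    y_pred = regressor.predict(X_test)

    # Compare actual and predicted retweet counts side by side
    df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})

    print(df)

    return None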
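
To make the goal concrete, here is a minimal sketch of one way the features might be combined, using scikit-learn's ColumnTransformer inside a Pipeline so that the same fitted TF-IDF vocabulary is reused on new data. It is only an illustration under assumptions: the 'text' column is assumed to hold raw strings (so the default tokenizer is used instead of identity_tokenizer), the numeric columns are assumed to be passed through unscaled, and the toy rows and the names NUMERIC_COLS, build_model, toy, and new_tweets are made up for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Numeric columns named in the question; assumed to be already numeric
NUMERIC_COLS = ['hasMedia', 'hasHashtag', 'followers_count', 'sentiment']

def build_model():
    # TF-IDF is applied to the single 'text' column (passed as a string so the
    # vectorizer receives a 1-D sequence); numeric columns pass through as-is
    features = ColumnTransformer([
        ('text', TfidfVectorizer(), 'text'),
        ('numeric', 'passthrough', NUMERIC_COLS),
    ])
    return Pipeline([('features', features), ('regressor', LinearRegression())])

# Made-up toy data, only for illustration
toy = pd.DataFrame({
    'text': ['great match today', 'new phone release', 'match highlights video',
             'phone review and photos', 'weather is great', 'video of the release'],
    'hasMedia': [0, 1, 1, 1, 0, 1],
    'hasHashtag': [1, 0, 1, 0, 0, 1],
    'followers_count': [120, 3000, 450, 9800, 60, 700],
    'sentiment': [0.8, 0.1, 0.5, 0.3, 0.6, 0.2],
    'retweet_count': [3, 40, 12, 95, 1, 15],
})

X = toy.drop(columns='retweet_count')
y = toy['retweet_count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = build_model()
model.fit(X_train, y_train)   # vocabulary is learned from the training rows only
y_pred = model.predict(X_test)

# New data goes through the same fitted transformers, so the feature count
# matches training even when the tweets contain unseen words
new_tweets = pd.DataFrame({
    'text': ['brand new unseen words', 'great video'],
    'hasMedia': [0, 1],
    'hasHashtag': [1, 1],
    'followers_count': [200, 1500],
    'sentiment': [0.4, 0.9],
})
new_preds = model.predict(new_tweets)
print(new_preds)
```

Because the vectorizer is fitted once inside the pipeline and only transform is applied at prediction time, this layout would presumably avoid the mismatch I get when concatenating DataFrames by hand.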

Sample of data (image not included)

Kabilesh