I have a problem with my Naive Bayes project. I load the labeled data into variables so that the tweets are separated by sentiment category. After the TF-IDF scoring, when I split the data and fit the classifier, part of the code does not run properly and raises ValueError: y should be a 1d array, got an array of shape (402, 9) instead.
Preprocessed Dataset (spreadsheet)
import pandas as pd
df = pd.read_csv('dataset_preprocessed.csv')
df
# Keep only the rows whose Sentimen value is not the literal text "Sentimen"
sentimen = df[df.Sentimen.apply(lambda x: x != "Sentimen")]
TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()
text_tf = tf.fit_transform(df['Tweet_Stemmed'].astype('U'))
print(text_tf)
import pandas as pd
df = pd.DataFrame(text_tf.todense().T,
                  index=tf.get_feature_names(),
                  columns=[f'D{i+1}' for i in range(len(df['Tweet_Stemmed']))])
df.head(20)
Train Test Split
# Train Test Split (test_size=0.7 holds out 70% of the rows for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    text_tf, sentimen, test_size=0.7, random_state=3000
)
Predict Sentiment with Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
clf = MultinomialNB()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
print("MultinomialNB Accuracy:", accuracy_score(y_test,predicted))
ERROR
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-118-90624fc1891c> in <module>
      4 from sklearn.metrics import confusion_matrix
      5 clf = MultinomialNB()
----> 6 clf.fit(X_train, y_train)
      7 predicted = clf.predict(X_test)
      8

5 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
   1037
   1038     raise ValueError(
-> 1039         "y should be a 1d array, got an array of shape {} instead.".format(shape)
   1040     )
   1041

ValueError: y should be a 1d array, got an array of shape (402, 9) instead.
I have tried changing the values of test_size and random_state, but the error stays the same. I then thought about taking the labels from the Sentimen column of df instead. Unfortunately, at that point df only contains the TF-IDF scores and no longer has the Sentimen and Tweet_Stemmed columns.
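For reference, here is a reduced sketch on toy data that seems to show the same behaviour. The tweets, labels, the Extra column, and the try_fit helper are all made up for illustration. It assumes the error comes from passing a whole multi-column DataFrame as y, and that selecting only the Sentimen column gives the 1d labels the classifier expects.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for dataset_preprocessed.csv (made-up rows, hypothetical Extra column)
toy = pd.DataFrame({
    "Tweet_Stemmed": ["bagus sekali", "buruk sekali", "sangat bagus", "sangat buruk",
                      "bagus", "buruk", "suka bagus", "benci buruk"],
    "Sentimen": ["positif", "negatif"] * 4,
    "Extra": range(8),
})

# Same vectorizing step as in the question
X = TfidfVectorizer().fit_transform(toy["Tweet_Stemmed"].astype("U"))


def try_fit(y):
    """Hypothetical helper: fit MultinomialNB and return 'ok' or the error text."""
    try:
        MultinomialNB().fit(X, y)
        return "ok"
    except ValueError as exc:
        return str(exc)


# y is the whole DataFrame (several columns), as sentimen is in the question
print(try_fit(toy))

# y is a single column, i.e. a 1d Series of labels
print(try_fit(toy["Sentimen"]))
```

If this assumption holds for the real data, the labels would need to come from the single Sentimen column of the filtered frame, taken before df is overwritten by the TF-IDF table, and text_tf would need to be built from the same filtered rows so that both have the same length.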