The task is to encode all the text and categorical features and then combine them into a single data matrix, but I am getting an "incompatible row dimensions" error.
My work so far:
Encode the categorical feature Round using LabelEncoder:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
enc.fit(x_train[' Round'])
round_train_le = enc.transform(x_train[' Round'])
round_test_le = enc.transform(x_test[' Round'])
Encode the text feature Category using TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer1 = TfidfVectorizer(max_features=500)
vectorizer1.fit(x_train[' Category'])
category_train_enc = vectorizer1.transform(x_train[' Category'])
category_test_enc = vectorizer1.transform(x_test[' Category'])
print(category_train_enc.shape)
print(category_test_enc.shape)
Encode the text feature Question using TfidfVectorizer:
vectorizer2 = TfidfVectorizer(max_features=5000)
vectorizer2.fit(x_train[' Question'])
question_train_enc = vectorizer2.transform(x_train[' Question'])
question_test_enc = vectorizer2.transform(x_test[' Question'])
print(question_train_enc.shape)
print(question_test_enc.shape)
Encode the text feature Answer using TfidfVectorizer:
vectorizer3 = TfidfVectorizer(max_features=1000)
vectorizer3.fit(x_train[' Answer'])
answer_train_enc = vectorizer3.transform(x_train[' Answer'])
answer_test_enc = vectorizer3.transform(x_test[' Answer'])
print(answer_train_enc.shape)
print(answer_test_enc.shape)
Combining the encoded features:
from scipy.sparse import hstack
x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))
print("Final Data matrix")
print(x_tr.shape, y_train.shape)
print(x_te.shape, y_test.shape)
I then get the following error:
ValueError Traceback (most recent call last)
<ipython-input-60-12e131ba4df4> in <module>
1 # merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
2 from scipy.sparse import hstack
----> 3 x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
4 x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))
5
~\anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
463
464 """
--> 465 return bmat([blocks], format=format, dtype=dtype)
466
467
~\anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
584 exp=brow_lengths[i],
585 got=A.shape[0]))
--> 586 raise ValueError(msg)
587
588 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 145341, expected 1.
What change do I need to make in the code to resolve the error?
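A likely cause, going by the error message: LabelEncoder.transform returns a 1-D array of shape (n_samples,), and SciPy reads a 1-D array as a single row, whereas the TF-IDF matrices have one row per sample (145341 here). The sketch below illustrates this with small made-up data; the 5-row arrays and the names `as_row` and `round_col` are assumptions for illustration, not taken from the real dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins for the real encoded features (5 rows instead of 145341).
n_rows = 5
round_le = np.array([0, 1, 2, 1, 0])             # like LabelEncoder output: 1-D, shape (5,)
category_enc = csr_matrix(np.ones((n_rows, 3)))  # like TF-IDF output: 2-D sparse, shape (5, 3)

# A 1-D array converted to a sparse matrix becomes ONE row with 5 columns,
# which is why its row count (1) does not match the other blocks.
as_row = csr_matrix(round_le)
print(round_le.shape, as_row.shape)   # (5,) (1, 5)

# Reshaping to a column vector gives one row per sample, so the blocks line up.
round_col = round_le.reshape(-1, 1)   # shape (5, 1)
combined = hstack((round_col, category_enc)).tocsr()
print(combined.shape)                 # (5, 4)
```

If this is indeed the cause, using round_train_le.reshape(-1, 1) and round_test_le.reshape(-1, 1) in the two hstack calls should make the row dimensions agree.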