The task is to encode all the text and categorical features and then combine them into a single data matrix, but I am getting an "incompatible row dimensions" error.

My work so far:

Encode the categorical feature Round using LabelEncoder

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()

enc.fit(x_train[' Round'])

round_train_le = enc.transform(x_train[' Round'])
round_test_le = enc.transform(x_test[' Round'])

Encode the text feature Category using TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer1 = TfidfVectorizer(max_features=500)

vectorizer1.fit(x_train[' Category'])

category_train_enc = vectorizer1.transform(x_train[' Category'])
category_test_enc = vectorizer1.transform(x_test[' Category'])

print(category_train_enc.shape)
print(category_test_enc.shape)

Encode the text feature Question using TfidfVectorizer

vectorizer2 = TfidfVectorizer(max_features=5000)

vectorizer2.fit(x_train[' Question'])

question_train_enc = vectorizer2.transform(x_train[' Question'])
question_test_enc = vectorizer2.transform(x_test[' Question'])

print(question_train_enc.shape)
print(question_test_enc.shape)

Encode the text feature Answer using TfidfVectorizer

vectorizer3 = TfidfVectorizer(max_features=1000)

vectorizer3.fit(x_train[' Answer'])

answer_train_enc = vectorizer3.transform(x_train[' Answer'])
answer_test_enc = vectorizer3.transform(x_test[' Answer'])

print(answer_train_enc.shape)
print(answer_test_enc.shape)

Combining the encoded features:

from scipy.sparse import hstack
x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))

print("Final Data matrix")
print(x_tr.shape, y_train.shape)
print(x_te.shape, y_test.shape)

I then get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-60-12e131ba4df4> in <module>
      1 # merge two sparse matrices: https://stackoverflow.com/a/19710648/4084039
      2 from scipy.sparse import hstack
----> 3 x_tr = hstack((round_train_le, category_train_enc, question_train_enc, answer_train_enc))
      4 x_te = hstack((round_test_le, category_test_enc, question_test_enc, answer_test_enc))
      5 

~\anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
    463 
    464     """
--> 465     return bmat([blocks], format=format, dtype=dtype)
    466 
    467 

~\anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
    584                                                     exp=brow_lengths[i],
    585                                                     got=A.shape[0]))
--> 586                     raise ValueError(msg)
    587 
    588                 if bcol_lengths[j] == 0:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 145341, expected 1.

Please suggest what change I need to make in the code to resolve the error.

Satyam Anand

1 Answer

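When using scipy.sparse.hstack(), every block you stack must have the same size along axis 0, i.e., the same number of rows. Note that a 1-D array is treated as a single row, which is why the two arrays of different lengths below can still be stacked. See the following example: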

import numpy as np
from scipy.sparse import hstack

a = np.array([1, 2, 3, 4, 5])
b = np.array([1, 2, 3, 5])

c = hstack([a, b])
print(c)

Output:

  (0, 0)    1
  (0, 1)    2
  (0, 2)    3
  (0, 3)    4
  (0, 4)    5
  (0, 5)    1
  (0, 6)    2
  (0, 7)    3
  (0, 8)    5
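
To see why the example above stacks without complaint, it can help to look at the shape a 1-D array gets once it is converted to a sparse matrix. The following is a minimal sketch using scipy.sparse.coo_matrix directly; it only illustrates the conversion, it is not taken from your code:

```python
import numpy as np
from scipy.sparse import coo_matrix

a = np.array([1, 2, 3, 4, 5])

# A 1-D array converts to a sparse matrix with a single row.
row = coo_matrix(a)
print(row.shape)  # (1, 5)

# Reshaping to a column first gives one row per element instead.
col = coo_matrix(a.reshape(-1, 1))
print(col.shape)  # (5, 1)
```

So a 1-D array of n values contributes 1 row and n columns, not n rows.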

On the other hand, when the number of rows does not match, you get the error you are seeing:

import numpy as np
from scipy.sparse import hstack

a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([[1, 2, 3], [4, 5, 6]])

c = hstack([a, b])
print(c)

Output:

ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 2, expected 1.

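So you should check that all your blocks have the same number of rows before stacking them side by side. In your case the message says the first block, round_train_le, has 1 row while the next block has 145341: LabelEncoder.transform() returns a 1-D array, which hstack() treats as a single row, so it has to be reshaped into a column with one row per sample.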
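
Applied to your variables, the fix could look roughly like the sketch below. The arrays here are small hypothetical stand-ins (4 samples, made-up values) rather than your real encoded features, and round_train_col is a name introduced only for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins for the real encoded features (4 samples).
round_train_le = np.array([0, 2, 1, 0])             # 1-D, like LabelEncoder.transform() output
category_train_enc = csr_matrix(np.ones((4, 3)))    # stands in for a TF-IDF matrix with 4 rows
question_train_enc = csr_matrix(np.ones((4, 5)))    # stands in for a second TF-IDF matrix

# Turn the 1-D label array into a (4, 1) sparse column so its row count matches.
round_train_col = csr_matrix(round_train_le.reshape(-1, 1))

x_tr = hstack((round_train_col, category_train_enc, question_train_enc))
print(x_tr.shape)  # (4, 9)
```

The same reshape would be needed for round_test_le before building x_te.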

Cheers.

Michael