0

I have a dataset like the following with about a thousand of observations. I want to create a SVM hyperplane plot classifying between negative and positive classes. I use the subsequent codes which I basically adopted from the Scikit-learn website. However, it gives me this error:

ValueError: Found input variables with inconsistent numbers of samples: [2, 214]

 df_1.head() 
                                              text   cat  lexicon
   0   [india, education, commission, report, ma...   1   negative
   1   [national, education, policy, 2017, by, min... 2   positive
              ......
 1000  [india, education, commission, report, ma...

I use following steps to

from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm

vectorizer = CountVectorizer(stop_words='english', min_df = 10)

text = df_1.text.tolist()
x = vectorizer.fit_transform(text)
y = df_1.lexicon.tolist()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, 
random_state=0)

fig, ax = plt.subplots()
model = svm.LinearSVC(C=1).fit(x_train, y_train)

# get the separating hyperplane
w = model.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - (model.intercept_[0]) / w[1]

# create a mesh to plot in
x_min, x_max = x[:, 0].min() - 1, x[:, 0].max() + 1
y_min, y_max = x[:, 1].min() - 1, x[:, 1].max() + 1
xx2, yy2 = np.meshgrid(np.arange(x_min, x_max, .2),
                 np.arange(y_min, y_max, .2))
Z = model.predict(np.c_[xx2.ravel(), yy2.ravel()])

Z = Z.reshape(xx2.shape)
ax.contourf(xx2, yy2, Z, cmap=plt.cm.coolwarm, alpha=0.3)
ax.scatter(x[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, s=25)
ax.plot(xx,yy)

ax.axis([x_min, x_max,y_min, y_max])
plt.show()

After getting the error as shown above, I also tried use the following code to transform x and y but I get the similar error. Can anybody suggest a solution? Thank you.

_, y = np.unique(y, return_inverse=True)
x = np.unique(x, return_inverse=True)

The detailed error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-57-abc7d45a71c2> in <module>()
      3 from sklearn import svm
      4 
----> 5 x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=0)
      6 
      7 

~\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py in train_test_split(*arrays, **options)
   2029         test_size = 0.25
   2030 
-> 2031     arrays = indexable(*arrays)
   2032 
   2033     if shuffle is False:

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in indexable(*iterables)
    227         else:
    228             result.append(np.array(X))
--> 229     check_consistent_length(*result)
    230     return result
    231 

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_consistent_length(*arrays)
    202     if len(uniques) > 1:
    203         raise ValueError("Found input variables with inconsistent numbers of"
--> 204                          " samples: %r" % [int(l) for l in lengths])
    205 
    206 

ValueError: Found input variables with inconsistent numbers of samples: [2, 214]

But if I remove x_train and y_train I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-58-1009080f56c2> in <module>()
     22 
     23 fig, ax = plt.subplots()
---> 24 model = svm.LinearSVC(C=1).fit(x,y)
     25 
     26 # get the separating hyperplane

~\Anaconda3\lib\site-packages\sklearn\svm\classes.py in fit(self, X, y, sample_weight)
    225 
    226         X, y = check_X_y(X, y, accept_sparse='csr',
--> 227                          dtype=np.float64, order="C")
    228         check_classification_targets(y)
    229         self.classes_ = np.unique(y)

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    571     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    572                     ensure_2d, allow_nd, ensure_min_samples,
--> 573                     ensure_min_features, warn_on_dtype, estimator)
    574     if multi_output:
    575         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

~\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    431                                       force_all_finite)
    432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:

ValueError: setting an array element with a sequence.
James
  • 75
  • 7
  • Your scatter plot uses `x` and `X`, and for the latter I don't see any definition. As the error says, you're trying to plot two x-coordinates against 214 y-coordinates. – Thomas Lang Dec 04 '18 at 05:46
  • Hi @ThomasLang thank you. But x contains all the text data from the dataframe and y has lexicon categories. I am actually new to hyperplotting. Could you ask more specific information that I can describe? Thanks again. – James Dec 04 '18 at 05:51
  • Can you add the stack trace into your question? This would help identify the real error easier. – Thomas Lang Dec 04 '18 at 05:54
  • Yes, I have added it. – James Dec 04 '18 at 06:01
  • Ah, okay, so the error is actually happening during training/testing and not during plotting. Check out [here](https://stackoverflow.com/questions/30813044/sklearn-found-arrays-with-inconsistent-numbers-of-samples-when-calling-linearre). – Thomas Lang Dec 04 '18 at 06:03
  • No, even when I remove training and testing it also results in error. I also added the error without training and testing – James Dec 04 '18 at 06:05
  • Well, the error message clearly says that it happens inside the `fit` function in the latter case. For fitting, you need `x` and `y` to be of same length, which does not seem to be the case here. Hence `x` should be of same length, but 2D. – Thomas Lang Dec 04 '18 at 06:07

0 Answers0