I have this code, which works fine:

    df_amazon = pd.read_csv("datasets/amazon_alexa.tsv", sep="\t")

    X = df_amazon['variation'] # the features we want to analyze
    ylabels = df_amazon['feedback'] # the labels, or answers, we want to test against

    X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

    # Create pipeline using Bag of Words
    pipe = Pipeline([('cleaner', predictors()),
                     ('vectorizer', bow_vector),
                     ('classifier', classifier)])

    pipe.fit(X_train,y_train)

But if I try to add one more feature to the model, replacing

    X = df_amazon['variation']

by

    X = df_amazon[['variation','verified_reviews']] 

I get this error message from scikit-learn when I call fit:

    ValueError: Found input variables with inconsistent numbers of samples: [2, 2205]

So fit works when X_train and y_train have the shapes (2205,) and (2205,), but not when the shapes are (2205, 2) and (2205,).

What's the best way to deal with that?

tawab_shakeel
Marcel

2 Answers

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    df = pd.DataFrame(
        data=[
            ['Heather Gray Fabric', 'I received the echo as a gift.', 1],
            ['Sandstone Fabric', 'Without having a cellphone, I cannot use many of her features', 0],
        ],
        columns=['variation', 'review', 'feedback'],
    )

    vect = CountVectorizer()
    # Passing a DataFrame makes the vectorizer iterate over the column names,
    # so it sees two "documents": 'variation' and 'review'.
    vect.fit_transform(df[['variation', 'review']])

    # The vocabulary holds only the column names, not the column contents.
    print(vect.vocabulary_)
    # Output:
    # {'variation': 1, 'review': 0}

    # So combine variation and review into a single text column and pass that
    # to the model. The ' ' separator keeps the last word of variation from
    # fusing with the first word of review.
    df['variation_review'] = df['variation'] + ' ' + df['review']

    vect.fit_transform(df['variation_review'])
    print(vect.vocabulary_)
    # Output:
    # {'heather': 9, 'gray': 7, 'fabric': 4, 'received': 13, 'the': 15,
    #  'echo': 3, 'as': 0, 'gift': 6, 'sandstone': 14, 'without': 17,
    #  'having': 8, 'cellphone': 2, 'cannot': 1, 'use': 16, 'many': 11,
    #  'of': 12, 'her': 10, 'features': 5}
qaiser
  • Indeed, `df['variation_review'] = df['variation'] + df['review']` solves the problem, but I don't know if that's a good solution, since "variation" is a category and "review" is text. What do you think, qaiser? – Marcel Jun 29 '19 at 15:21
  • check this link, https://stackoverflow.com/questions/39121104/how-to-add-another-feature-length-of-text-to-current-bag-of-words-classificati – qaiser Jun 29 '19 at 15:45
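
Following up on the concern in the comments, one possible way to keep "variation" as a category and "review" as text is scikit-learn's `ColumnTransformer`, which applies a different transformer to each column and stacks the results side by side. This is only a sketch under assumptions: the toy rows are the ones from the answer above, the column is called `review` as in that answer (the question's dataset calls it `verified_reviews`), and `LogisticRegression` is a stand-in because the question does not show which `classifier` it uses.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data reused from the answer above (illustrative only).
df = pd.DataFrame(
    data=[
        ['Heather Gray Fabric', 'I received the echo as a gift.', 1],
        ['Sandstone Fabric', 'Without having a cellphone, I cannot use many of her features', 0],
    ],
    columns=['variation', 'review', 'feedback'],
)

preprocess = ColumnTransformer([
    # Category column: one-hot encode it. The column is given as a list
    # because OneHotEncoder expects 2-D input.
    ('variation', OneHotEncoder(handle_unknown='ignore'), ['variation']),
    # Text column: bag of words. The column is given as a plain string
    # because CountVectorizer expects a 1-D sequence of documents.
    ('review', CountVectorizer(), 'review'),
])

pipe = Pipeline([
    ('preprocess', preprocess),
    ('classifier', LogisticRegression()),  # assumed stand-in classifier
])

X = df[['variation', 'review']]  # shape (n_samples, 2)
y = df['feedback']
pipe.fit(X, y)

# Inspect the combined feature matrix: one-hot columns followed by word counts.
features = pipe.named_steps['preprocess'].transform(X)
print(features.shape)
print(pipe.predict(X))
```

With this layout the two-column DataFrame can go straight into `fit`, so `train_test_split(X, ylabels, test_size=0.3)` from the question would not need to change.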
0

The data must have shape (n_samples, n_features). Try transposing X (X.T).

andre
  • If I try to transpose X, `X = df_amazon[['variation','verified_reviews']].T`, the error changes to ValueError: Found input variables with inconsistent numbers of samples: [2, 3150] – Marcel Jun 29 '19 at 14:17
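
A small check of the shapes may help explain the comment above. The sketch below uses three made-up rows (not the real dataset) and assumes the same column names as the first answer. It shows that the two-column DataFrame is already (n_samples, n_features), that iterating over it yields the two column names (a likely source of the "2" in the original error), and that transposing only moves the mismatch into `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy rows, for illustration only.
df = pd.DataFrame({
    'variation': ['Heather Gray Fabric', 'Sandstone Fabric', 'Charcoal Fabric'],
    'review': ['I received the echo as a gift.', 'Works well.', 'Love it.'],
    'feedback': [1, 0, 1],
})

X = df[['variation', 'review']]
y = df['feedback']

print(X.shape)    # rows are samples, columns are features
print(list(X))    # iterating over a DataFrame yields its column labels
print(X.T.shape)  # after transposing, rows and columns are swapped

# Transposed X has 2 rows but y has 3 labels, so the split itself fails.
try:
    train_test_split(X.T, y, test_size=0.3)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```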