Attempting to 'Fit_Transform()' DataFrame results in "Input Variables with Inconsistent Numbers of Samples" Error

Question

I'm reading two columns of a .csv file into a Pandas dataframe using pandas.read_csv(). The head of the Dataframe is shown below:

        Year    cleaned
    0   1909    acquaint hous receiv follow letter clerk crown...
    1   1909    ask secretari state war whether issu statement...
    2   1909    i beg present petit sign upward motor car driv...
    3   1909    i desir ask secretari state war second lieuten...
    4   1909    ask secretari state war whether would introduc...

Following this, I call df.dropna(inplace=True) (thanks to Brad Solomon) so that the subsequent fit/transform calls can proceed without producing a 'MemoryError', as shown in my previous question here.

Now that I have a memory-friendly DataFrame, I use scikit-learn's train_test_split() to create the four sets of data that I intend to fit and transform with a Pipeline.

X_train, X_test, y_train, y_test = train_test_split(df, df['Year'], test_size=0.25)

The shape of these variables is:

[IN] X_train.shape [OUT] (1785, 2)
[IN] X_test.shape  [OUT] (595, 2)
[IN] y_train.shape [OUT] (1785,)
[IN] y_test.shape  [OUT] (595,)

So, I have my data split into appropriate subsections for testing and training. I then create my Pipeline, which makes use of TfidfVectorizer, SelectKBest and LinearSVC as shown below:

pipeline = Pipeline(
    [('vectorizer', TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1,2), sublinear_tf=True)),
     ('chi2', SelectKBest(chi2, k=1000)),
     ('classifier', LinearSVC(C=1.0, penalty='l1', max_iter=3000, dual=False))
    ])

Finally, I come across the error mentioned in the title when I attempt to call fit_transform() on the aforementioned X and y training data:

model = pipeline.fit_transform(X_train, y_train)

...which in turn produces the error:

ValueError: Found input variables with inconsistent numbers of samples: [2, 1785]

The full Traceback can be viewed here.

Are you sure the data whose shape you printed is the same data you are passing in? Also, even if you solve this error, LinearSVC doesn't have a `transform()` method. Calling `fit_transform()` on the pipeline will try to invoke the `transform()` method of LinearSVC, and you will get an error. – Vivek Kumar, Mar 16 '18 at 06:32
Thanks for commenting. I'm not sure. I was previously using an alternate `train_test_split()` as shown: `X_train, X_test, y_train, y_test = train_test_split(df['cleaned'].tolist(), df['Year'], test_size=0.25)`, although the fact that it never completes leads me to believe it will result in a `MemoryError`, i.e. it appears to run 'forever'. I'm heading out for a bit but will leave the process to run in the meantime. – Dbercules, Mar 16 '18 at 13:52
I've arrived back some ~2 hours later to discover the process still running and hogging a vast amount of my system's RAM (almost 6GB). I'll attempt it without LinearSVC this time around and update you afterwards. Just to clarify, I'm using the alternate `train_test_split()` call shown in the above comment, as opposed to the one in my original question. Cheers. – Dbercules, Mar 16 '18 at 16:11
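
The two points raised in these comments (split on the text column, and avoid `fit_transform()` when the last step is a classifier) can be sketched together. This is a minimal illustration under stated assumptions, not the asker's actual setup: the toy DataFrame is hypothetical, `k="all"` replaces the question's `k=1000` only because the toy vocabulary is tiny, and `random_state` and `stratify` are added just to keep the tiny split reproducible.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the question's CSV.
df = pd.DataFrame({
    "Year": [1909, 1909, 1909, 1909, 1910, 1910, 1910, 1910],
    "cleaned": [
        "ask secretari state war whether issu statement",
        "beg present petit sign upward motor car",
        "desir ask secretari state war second lieuten",
        "acquaint hous receiv follow letter clerk crown",
        "ask chancellor exchequ whether budget tax",
        "beg move bill read second time",
        "ask presid board trade whether railway",
        "desir call attent hous matter educ",
    ],
})

# Split the text column (not the whole DataFrame) against the labels.
X_train, X_test, y_train, y_test = train_test_split(
    df["cleaned"], df["Year"], test_size=0.25, random_state=0, stratify=df["Year"]
)

pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    # k="all" only because the toy vocabulary is tiny; the question uses k=1000.
    ("chi2", SelectKBest(chi2, k="all")),
    ("classifier", LinearSVC(C=1.0, penalty="l1", max_iter=3000, dual=False)),
])

# The final step is a classifier with no transform(), so fit the pipeline
# and then predict, rather than calling fit_transform().
has_transform = hasattr(pipeline.named_steps["classifier"], "transform")
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(has_transform, len(predictions))
```

Whether this also avoids the long runtime and memory use described above depends on the real data size and vocabulary, which the toy example does not reproduce.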

score 0 · Answer 1 · answered Mar 16 '18 at 00:34

inconsistent numbers of samples: [2, 1785] seems to indicate the rows and columns have been flipped in the pipeline.

Try:

pipeline.fit_transform(X_train.T,
                       y_train.reshape((1785, 1)))

You may need to reshape y_train (see this similar question). Bear in mind that the same transform will need to be applied to X_test and y_test before use.
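
Another possible reading of the `[2, 1785]` in the error, offered here as a hedged sketch rather than a confirmed diagnosis: iterating over a pandas DataFrame yields its column labels, so a text vectorizer that is handed the whole two-column DataFrame would see two "documents" instead of one per row. The toy DataFrame below is hypothetical and only mirrors the question's column names.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the question's DataFrame (same column names, toy rows).
df = pd.DataFrame({
    "Year": [1909, 1909, 1910, 1910],
    "cleaned": [
        "ask secretari state war",
        "beg present petit sign",
        "desir ask secretari state",
        "acquaint hous receiv follow letter",
    ],
})

# Iterating over a DataFrame yields its column labels, not its rows.
print(list(df))  # ['Year', 'cleaned']

# The vectorizer therefore sees one "document" per column label.
whole_frame = TfidfVectorizer().fit_transform(df)
print(whole_frame.shape[0])  # 2

# Passing only the text column gives one document per row.
text_only = TfidfVectorizer().fit_transform(df["cleaned"])
print(text_only.shape[0])  # 4
```

Under this reading, passing `X_train['cleaned']` (a single column of strings) rather than the full `X_train` would give the vectorizer as many samples as there are labels in `y_train`.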

stacksonstacks


Thanks for answering. I implemented your suggestion but was subsequently greeted by `AttributeError: 'int' object has no attribute 'lower'`. I've altered the `train_test_split()` to use `df['cleaned'].tolist()` as the first parameter as shown in the original post. I'm waiting for that to process the `fit_transform()` but I suspect it will result in a `MemoryError`. – Dbercules Mar 16 '18 at 13:49
This is actually a new problem, which is progress! It looks like one of your columns is an int type rather than a str type. Try `df.column.astype(str)` or similar. – stacksonstacks Mar 16 '18 at 19:09
Prior to fitting the training data on the Pipeline, I've entered `y_train.reshape((1785, 1)).astype(str)` which converts the dtype to `' – Dbercules Mar 17 '18 at 15:39
