I'm reading two columns of a .csv file into a pandas DataFrame using pandas.read_csv(). The head of the DataFrame is shown below:

        Year    cleaned
    0   1909    acquaint hous receiv follow letter clerk crown...
    1   1909    ask secretari state war whether issu statement...
    2   1909    i beg present petit sign upward motor car driv...
    3   1909    i desir ask secretari state war second lieuten...
    4   1909    ask secretari state war whether would introduc...

Following this, I call df.dropna(inplace=True) (thanks to Brad Solomon) so that the subsequent fit/transform calls can proceed without producing a 'MemoryError', as shown in my previous question here.

Now that I have a memory-friendly DataFrame, I use scikit-learn's train_test_split() to create the four sets of data that I intend to fit the Pipeline on.

X_train, X_test, y_train, y_test = train_test_split(df, df['Year'], test_size=0.25)

The shape of these variables is:

[IN] X_train.shape [OUT] (1785, 2)
[IN] X_test.shape  [OUT] (595, 2)
[IN] y_train.shape [OUT] (1785,)
[IN] y_test.shape  [OUT] (595,)

So, I have my data split into appropriate subsections for testing and training. I then create my Pipeline, which makes use of TfidfVectorizer, SelectKBest and LinearSVC as shown below:

pipeline = Pipeline(
    [('vectorizer', TfidfVectorizer(decode_error='replace', encoding='utf-8', stop_words='english', ngram_range=(1,2), sublinear_tf=True)),
     ('chi2', SelectKBest(chi2, k=1000)),
     ('classifier', LinearSVC(C=1.0, penalty='l1', max_iter=3000, dual=False))
    ])

Finally, I hit the error mentioned in the title when I call fit_transform() on the X and y training data:

model = pipeline.fit_transform(X_train, y_train)

...which in turn produces the error:

ValueError: Found input variables with inconsistent numbers of samples: [2, 1785]

The full Traceback can be viewed here.

Dbercules
  • Are you sure the data whose shape you printed is the same data you are passing to the pipeline? Also, even if you solve this error, LinearSVC doesn't have a `transform()` method, so calling `fit_transform()` on the pipeline will try to invoke the `transform()` method of LinearSVC and you will get an error. – Vivek Kumar Mar 16 '18 at 06:32
  • Thanks for commenting. I'm not sure, I was previously using an alternate `train_test_split()` as shown: `X_train, X_test, y_train, y_test = train_test_split(df['cleaned'].tolist(), df['Year'], test_size=0.25)`, although a lack of completion leads me to believe it will result in a `MemoryError` i.e. it appears to run 'forever'. I'm heading out for a bit but will leave the process to run in the meantime. – Dbercules Mar 16 '18 at 13:52
  • I've arrived back some ~2 hours later to discover the process still running and hogging a vast amount of my system's RAM (almost 6GB), I'll attempt it without LinearSVC this time around and update you afterwards. Just to clarify, I'm using an alternate `train_test_split()` method as shown in the above comment, as opposed to that present in my original question. Cheers. – Dbercules Mar 16 '18 at 16:11
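
To make the two points in these comments concrete, here is a minimal sketch that splits on the text column alone and calls `fit()` followed by `predict()` rather than `fit_transform()`. The frame `toy_df`, its contents, the name `toy_pipeline`, `random_state=0` and `k=2` are all assumptions made for illustration (the original uses `k=1000` on a much larger vocabulary), so treat it as one possible arrangement rather than a confirmed fix for the memory problem.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real 'cleaned' and 'Year' columns.
toy_df = pd.DataFrame({
    "Year": [1909, 1909, 1909, 1909, 1910, 1910, 1910, 1910],
    "cleaned": [
        "ask secretari state war issu statement",
        "ask secretari state war second lieuten",
        "ask secretari state war would introduc",
        "acquaint hous receiv follow letter clerk",
        "beg present petit sign upward motor car",
        "present petit motor car driver licens",
        "petit sign motor car owner tax",
        "beg present petit motor tax road",
    ],
})

# Split the text column itself (a 1-D Series), not the whole DataFrame.
X_train, X_test, y_train, y_test = train_test_split(
    toy_df["cleaned"], toy_df["Year"], test_size=0.25, random_state=0
)

toy_pipeline = Pipeline([
    ("vectorizer", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("chi2", SelectKBest(chi2, k=2)),  # k kept tiny to suit the toy vocabulary
    ("classifier", LinearSVC(C=1.0, penalty="l1", max_iter=3000, dual=False)),
])

# The final step is a classifier, so check whether it offers transform() at all.
has_transform = hasattr(LinearSVC(), "transform")

# Fit the whole pipeline, then predict on the held-out text.
toy_pipeline.fit(X_train, y_train)
predictions = toy_pipeline.predict(X_test)

print(has_transform)
print(len(predictions))
```

With a classifier as the last step, `fit()` returns the fitted pipeline itself, and `predict()` or `score()` are the calls to use on the test split.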

1 Answer

The message 'inconsistent numbers of samples: [2, 1785]' seems to indicate that the rows and columns have been flipped on the way into the pipeline.

Try:

pipeline.fit_transform(X_train.T,
                       y_train.reshape((1785, 1)))

You may need to reshape y_train (see this similar question). Bear in mind that the same transform will need to be applied to X_test and y_test before use.
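
One hedged way to see where the 2 could come from: iterating over a pandas DataFrame yields its column labels rather than its rows, so a vectorizer handed the whole frame sees only two "documents". The sketch below uses a small made-up frame (`toy_df` and its contents are hypothetical, not the original data) to compare passing the frame with passing the text column.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the real data: same two columns, three rows.
toy_df = pd.DataFrame({
    "Year": [1909, 1909, 1910],
    "cleaned": [
        "ask secretari state war",
        "beg present petit sign",
        "desir ask secretari state",
    ],
})

# Iterating a DataFrame walks over column labels, not rows.
labels_seen = list(toy_df)

# So the vectorizer treats the two labels as the only documents.
frame_matrix = TfidfVectorizer().fit_transform(toy_df)

# Passing the text column gives one row per document instead.
column_matrix = TfidfVectorizer().fit_transform(toy_df["cleaned"])

print(labels_seen)             # column labels
print(frame_matrix.shape[0])   # number of "documents" seen in the frame
print(column_matrix.shape[0])  # number of rows in the text column
```

If this reading applies, passing `X_train['cleaned']` (or splitting on `df['cleaned']` in the first place) would give the vectorizer one sample per row without transposing anything.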

stacksonstacks
  • Thanks for answering. I implemented your suggestion but was subsequently greeted by `AttributeError: 'int' object has no attribute 'lower'`. I've altered the `train_test_split()` to use `df['cleaned'].tolist()` as the first parameter as shown in the original post. I'm waiting for that to process the `fit_transform()` but I suspect it will result in a `MemoryError`. – Dbercules Mar 16 '18 at 13:49
  • This is actually a new problem, which is progress! It looks like one of your columns is an int type rather than a str type. Try `df.column.astype(str)` or similar. – stacksonstacks Mar 16 '18 at 19:09
  • Prior to fitting the training data on the Pipeline, I've entered `y_train.reshape((1785, 1)).astype(str)` which converts the dtype to `' – Dbercules Mar 17 '18 at 15:39
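
The `AttributeError: 'int' object has no attribute 'lower'` reported above is consistent with integer values reaching the vectorizer as documents, for instance the `Year` values once the frame has been transposed. This is an assumption about the cause, not something the thread confirms. A minimal sketch with hypothetical toy values (`int_documents` and `str_documents` are illustrative names):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical inputs: integers where the vectorizer expects strings.
int_documents = [1909, 1910, 1911]

try:
    TfidfVectorizer().fit_transform(int_documents)
    error_message = None
except AttributeError as exc:
    # The default preprocessing lowercases each document by calling .lower() on it.
    error_message = str(exc)

print(error_message)

# Converting to str lets vectorizing proceed, although year strings
# are not the text feature the pipeline is meant to learn from.
str_documents = [str(value) for value in int_documents]
matrix = TfidfVectorizer().fit_transform(str_documents)
print(matrix.shape[0])
```

Note that it is the documents passed as X that get lowercased, so casting `y_train` to str would not be expected to affect this particular error.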