I am trying to apply KNN to the Pima diabetes data. To split my data set into training and testing sets, I select the features and the target with iloc, as shown in the code below. When I run this code, I get test data shapes that look really weird to me. Can anyone please explain what I am doing wrong here?

Here is the code:

from sklearn.model_selection import train_test_split

# first 8 columns (positions 0 to 7) are used as features
X = dataset.iloc[:, 0:8]
# the 9th column (position 8) is the target
y = dataset.iloc[:, 8]
# let's split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# let us check the shape of all of these
print("X_train shape is : ", X_train.shape)
print("X_test shape  is : ", X_test.shape)
print("y_train shape is : ", y_train.shape)
print("y_test shape is : ", y_test.shape)

This is the output I am getting:
X_train shape is :  (614, 8)
X_test shape  is :  (154, 8)
y_train shape is :  (614,)
y_test shape is :  (154,)
  • It seems perfectly fine to me, what is the `weird` thing? The size is correct. The dataset contains 768 rows, and 20% of that is 154 (rounded up). X has 8 columns, while y is a Series. – Celius Stingher Jan 17 '20 at 17:45
  • Does this answer your question? [Difference between numpy.array shape (R, 1) and (R,)](https://stackoverflow.com/questions/22053050/difference-between-numpy-array-shape-r-1-and-r) – G. Anderson Jan 17 '20 at 17:51
  • I'm voting to close this question as off-topic because there is no question to be asked. There is no problem in the code. – Celius Stingher Jan 17 '20 at 17:54

2 Answers

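When you pass pandas objects to train_test_split, you get pandas objects back (a DataFrame for X, a Series for y), and their .shape follows the same convention as numpy arrays: a one-dimensional object has a single entry in its shape, such as (614,), while a two-dimensional one has two. Here are a couple of examples with numpy arrays: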

import numpy as np
np.array([0, 1, 2]).shape

## (3,)

np.array([[0, 1, 2], [3, 4, 5]]).shape

## (2, 3)
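
To connect this back to the question, here is a small sketch of how the style of iloc selection decides whether the result is one- or two-dimensional. The frame `toy_df` and its column names are made up for illustration and are not the diabetes data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset: 4 rows, 3 columns
toy_df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["a", "b", "target"])

y_series = toy_df.iloc[:, 2]    # single integer position -> 1-D Series
y_frame = toy_df.iloc[:, [2]]   # list of positions -> 2-D DataFrame
y_column = y_series.to_numpy().reshape(-1, 1)  # explicit 2-D column array

print(y_series.shape)  # (4,)
print(y_frame.shape)   # (4, 1)
print(y_column.shape)  # (4, 1)
```

So a shape like (154,) just means the target was selected as a single column by integer position; scikit-learn estimators generally expect y in this one-dimensional form.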
Oriol Mirosa

Your code is right. Your dataset has 768 rows (614 + 154) and presumably 9 columns, where the last one is the column to be predicted. When you use iloc you are splitting the features (the first 8 columns, positions 0 to 7) from the response (the 9th column, position 8). train_test_split then separates your data (X and y) into a train set and a test set.

The shapes that you are getting are right:

X_train shape is :  (614, 8)   614 rows and 8 columns
X_test shape  is :  (154, 8)   154 rows and 8 columns
y_train shape is :  (614,)     614 values, one-dimensional (no column axis)
y_test shape is :  (154,)      154 values, one-dimensional (no column axis)
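
As a quick sanity check, the row counts can be reproduced by hand. This is only a sketch of the arithmetic, assuming that for a float test_size scikit-learn rounds the test count up and gives the remaining rows to the training set:

```python
import math

n_samples = 768  # 614 + 154, the total implied by the shapes above
test_size = 0.2

# Assumption: the test count is rounded up, the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 614 154
```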
Filipe Lauar