I'm doing this tutorial on machine learning in which the following code is used:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('breast-cancer-wisconsin.data.csv')
df.replace('?', -99999, inplace=True)  # '?' marks a missing value in the data
df.drop(columns=['id'], inplace=True)
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])

# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y)

Here is a sample from the csv file:

    id,clump_thickness,unif_cell_size,unif_cell_shape,marg_adhesion,single_epith_cell_size,bare_nuclei,bland_chrom,norm_nucleoli,mitoses,class
    1000025,5,1,1,1,2,1,3,1,1,2
    1002945,5,4,4,5,7,10,3,2,1,2
    1015425,3,1,1,1,2,2,3,1,1,2
    1016277,6,8,8,1,3,4,3,7,1,2
    1017023,4,1,1,3,2,1,3,1,1,2
    1017122,8,10,10,8,7,10,9,7,1,4
    1018099,1,1,1,1,2,10,3,1,1,2
    1018561,2,1,2,1,2,1,3,1,1,2
    1033078,2,1,1,1,2,1,1,1,5,2
    1033078,4,2,1,1,2,1,2,1,1,2
    1035283,1,1,1,1,1,1,3,1,1,2
    1036172,2,1,1,1,2,1,2,1,1,2
    1041801,5,3,3,3,2,3,4,4,1,4
    1043999,1,1,1,1,2,3,3,1,1,2
    1044572,8,7,5,10,7,9,5,5,4,4
    1047630,7,4,6,4,6,1,4,3,1,4
    1048672,4,1,1,1,2,1,2,1,1,2
    1049815,4,1,1,1,2,1,3,1,1,2
    1050670,10,7,7,6,4,10,4,1,2,4
    1050718,6,1,1,1,2,1,3,1,1,2
    1054590,7,3,2,10,5,10,5,4,4,4
    1054593,10,5,5,3,6,7,7,10,1,4
    1056784,3,1,1,1,2,1,2,1,1,2
    1057013,8,4,5,1,2,?,7,3,1,4
    1059552,1,1,1,1,2,1,3,1,1,2
    1065726,5,2,3,4,2,7,3,6,1,4
    1066373,3,2,1,1,1,1,2,1,1,2

When looking at the results from sklearn.model_selection.train_test_split I found out something weird (at least to me). If I run

    print(type(y_test[0]))
    print()
    print(type(X_train[:,1][0]))

I get the following output:

<class 'numpy.int64'>
<class 'int'>

Somehow the values in X_train are of the type int and the values in y_test are of the type numpy.int64. I don't know why train_test_split does this - I don't think it has to do with the data that is being split up - and the documentation doesn't seem to mention it either.

Since I want the values in y_test to be regular integers as well, I tried changing the type of y_test with astype(). Unfortunately, the following code

y_test = y_test.astype(int)
print(type(y_test[0]))

returns

<class 'numpy.int64'>

Question: Why does train_test_split return arrays containing values with different kinds of datatypes? Why am I not able to convert the values in y_test to integers?

Edit: The difference in type is caused by the data. If I run

 print(type(X[:,1][0]))
 print(type(y[0])) 

I get

<class 'int'>
<class 'numpy.int64'>

I still would like to know why astype doesn't work though!:)
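
The effect described in the edit can be reproduced without the CSV file. The following is a minimal sketch, not the tutorial's code: the arrays `features` and `labels` are hypothetical stand-ins, and it assumes the feature array ends up with dtype `object` because one column mixes integers with the string '?'. An object array stores Python objects, so its integers come back as `int`, whereas a purely numeric array stores fixed-width NumPy integers, and `astype(int)` only produces another NumPy integer array.

```python
import numpy as np

# Hypothetical stand-ins for the tutorial data: a feature array in which one
# column mixes integers with the string '?', and a purely numeric label array.
features = np.array([[5, 1], [8, '?'], [3, 2]], dtype=object)
labels = np.array([2, 4, 2])

print(features.dtype)        # object: elements are stored as Python objects
print(type(features[0, 0]))  # <class 'int'>
print(labels.dtype.kind)     # i: a fixed-width NumPy integer dtype
print(type(labels[0]))       # a NumPy integer scalar type such as numpy.int64

# astype(int) on an integer array yields another NumPy integer array, so
# indexing it still returns NumPy scalars rather than Python ints.
converted = labels.astype(int)
print(isinstance(converted[0], np.integer))  # True
print(type(converted[0]) is int)             # False
```

In other words, `astype` changes the array's dtype, and every integer dtype NumPy offers is a NumPy type, so the element type never becomes the built-in `int` this way.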

Mad Physicist
Mr. President
  • There's not much of a difference here, other than just a couple of bytes. From my personal experience, numpy prefers to store results in `int64` (so this is for `y_test`), while normal arrays simply store as `int`. You can refer to the differences here: https://stackoverflow.com/questions/9696660/what-is-the-difference-between-int-int16-int32-and-int64 – shiv_90 Oct 18 '18 at 12:49
  • @Shiv_90 Thank you for your reply! There are some practical differences though. For example, inserting the data into a datatable column with type 'numeric' works with `int` but not with `numpy.int64` – Mr. President Oct 18 '18 at 12:51
  • I see. There could be many reasons behind this exclusivity, although I'm not certain of this :) – shiv_90 Oct 18 '18 at 13:00

1 Answer

To convert NumPy values to native Python types, use the item() method (numpy.ndarray.item; NumPy scalars such as numpy.int64 provide the same method):

y_test_int = [v.item() for v in y_test]
print(type(y_test_int[0]))
#<class 'int'>
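
As a possible alternative to the list comprehension, NumPy can do the conversion for the whole array at once. This is a sketch under assumptions: the `y_test` values below are made-up stand-ins, and which variant fits depends on whether a plain list or an array is needed afterwards. `ndarray.tolist()` returns a list of Python objects, while `astype(object)` keeps an array but stores Python `int` elements in it.

```python
import numpy as np

# Stand-in values for y_test; the real array comes from train_test_split.
y_test = np.array([2, 4, 2, 2])

# Option 1: a plain Python list of built-in ints.
as_list = y_test.tolist()
print(type(as_list[0]))    # <class 'int'>

# Option 2: still a NumPy array, but with dtype object, so each element
# is a built-in int instead of a NumPy scalar.
as_object = y_test.astype(object)
print(as_object.dtype)     # object
print(type(as_object[0]))  # <class 'int'>
```

An object array gives up the speed and memory benefits of a numeric dtype, so the list form is probably the better fit when the values are only being handed to code that expects built-in integers.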
STJ