After splitting my data, I'm trying to rank features, but when I access X_train.columns I get this error: 'numpy.ndarray' object has no attribute 'columns'.

 from sklearn.model_selection import train_test_split
 y=df['DIED'].values
 x=df.drop('DIED',axis=1).values
 X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)
 print('X_train',X_train.shape)
 print('X_test',X_test.shape)
 print('y_train',y_train.shape)
 print('y_test',y_test.shape)

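 from sklearn.feature_selection import SelectKBest, chi2
 import pandas as pd

 bestfeatures = SelectKBest(score_func=chi2, k="all")
 fit = bestfeatures.fit(X_train,y_train)
 dfscores = pd.DataFrame(fit.scores_)
 dfcolumns = pd.DataFrame(X_train.columns)  # <- this line raises the AttributeError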

I know that train_test_split returns a NumPy array, but how should I deal with it?

daskalos26
  • Possible duplicate of [How do I create test and train samples from one dataframe with pandas?](https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas). I would personally use [this answer](https://stackoverflow.com/a/30454743/5811400). – ruancomelli Jun 05 '20 at 15:26
  • I assume the "columns" of X_train would be the one from x so `dfcolumns = pd.DataFrame(x.columns)` should work, although I'm not sure of the point of creating this? – Ben.T Jun 05 '20 at 15:26

1 Answer

Maybe this code makes it clear:

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# here I imitate your example data

df = pd.DataFrame(data = np.random.randint(100, size = (50,5)), columns = ['DIED']+[f'col_{i}' for i in range(4)])
df.head()

Out[1]:

        DIED    col_0   col_1   col_2   col_3
0       36      0       23      43      55
1       81      59      83      37      31
2       32      86      94      50      87
3       10      69      4       69      27
4       1       16      76      98      74

# df here is a DataFrame with all its attributes, such as df.columns

y=df['DIED'].values
x=df.drop('DIED',axis=1).values   # <- .values returns a 2-D NumPy array (not a DataFrame), so it has no column names
x

Out[2]:

array([[ 0, 23, 43, 55],
       [59, 83, 37, 31],
       [86, 94, 50, 87],
       [69,  4, 69, 27],
       [16, 76, 98, 74],
       [17, 50, 52, 31],
       [95,  4, 56, 68],
       [82, 35, 67, 76],
       .....

# now you can access columns by index, like this:

x[:,2]    # <- gives you access to the 3rd column

Out[3]:
array([43, 37, 50, 69, 98, 52, 56, 67, 81, 64, 48, 68, 14, 41, 78, 65, 11,
       86, 80,  1, 11, 32, 93, 82, 93, 81, 63, 64, 47, 81, 79, 85, 60, 45,
       80, 21, 27, 37, 87, 31, 97, 16, 59, 91, 20, 66, 66,  3,  9, 88])

# or you can convert the 2-D array back to a DataFrame

pd.DataFrame(data = x, columns = df.columns[1:])

Out[4]:

    col_0   col_1   col_2   col_3
0   0       23      43      55
1   59      83      37      31
2   86      94      50      87
3   69      4       69      27
....

The same approach works for all your variables: X_train, X_test, y_train, y_test.
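Another option, if you want X_train.columns to keep working, is to skip .values altogether: train_test_split returns DataFrames when it is given DataFrames. Below is a minimal sketch of that idea applied to your feature ranking. The data is synthetic (a binary 'DIED' column and four made-up feature columns), and the names `selector` and `feature_scores` are my own, so treat it as an illustration rather than a drop-in solution.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): a binary target 'DIED'
# and four non-negative integer features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(50, 4)),
                  columns=[f'col_{i}' for i in range(4)])
df['DIED'] = rng.integers(0, 2, size=50)

# No .values here, so X stays a DataFrame and y stays a Series.
X = df.drop('DIED', axis=1)
y = df['DIED']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# chi2 requires non-negative feature values.
selector = SelectKBest(score_func=chi2, k='all').fit(X_train, y_train)

# X_train is still a DataFrame, so .columns is available.
feature_scores = pd.DataFrame({'feature': X_train.columns,
                               'score': selector.scores_})
feature_scores = (feature_scores
                  .sort_values('score', ascending=False)
                  .reset_index(drop=True))
print(feature_scores)
```

If you need plain arrays later (for example for a library that does not accept DataFrames), you can still call X_train.values at that point and keep the column names around separately.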

Alex