How to split a (569, 31) DataFrame into two with shapes (569, 30) and (569,)

The DataFrame has 31 columns.

df.columns yields:

Index([u'mean radius', u'mean texture', u'mean perimeter', u'mean area',
       u'mean smoothness', u'mean compactness', u'mean concavity',
       u'mean concave points', u'mean symmetry', u'mean fractal dimension',
       u'radius error', u'texture error', u'perimeter error', u'area error',
       u'smoothness error', u'compactness error', u'concavity error',
       u'concave points error', u'symmetry error', u'fractal dimension error',
       u'worst radius', u'worst texture', u'worst perimeter', u'worst area',
       u'worst smoothness', u'worst compactness', u'worst concavity',
       u'worst concave points', u'worst symmetry', u'worst fractal dimension',
       u'target'],
      dtype='object')

I need to split it into two. I did something like this:

X = df.ix[:,'mean radius': 'worst fractal dimension']

y = df.ix[:,'target': ]

X.shape gives (569, 30), which is as expected, but y.shape gives (569, 1). I don't really understand the difference between (569,) and (569, 1), but the required answer has shape (569,).

MaxU - stand with Ukraine
surabhi gupta

2 Answers

2

y.shape gives you (569, 1) because y = df.ix[:,'target': ] is a column slice, and a slice returns a DataFrame even when it selects only one column.

The difference between the shapes (569,) and (569, 1) is that (569,) belongs to a Series, which has only one dimension, while (569, 1) belongs to a DataFrame, which has two dimensions (569 rows and 1 column).

Calling y = df['target'] returns a Series.
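A minimal sketch of the difference, using a small made-up DataFrame rather than the data from the question (the values are hypothetical, and .loc is used in place of .ix):

```python
import pandas as pd

# Toy frame standing in for the question's data (hypothetical values).
df = pd.DataFrame({"mean radius": [17.99, 20.57, 19.69],
                   "target": [0, 0, 1]})

y_frame = df.loc[:, "target":]   # label slice -> DataFrame, even with one column
y_series = df["target"]          # single label -> Series

print(type(y_frame).__name__, y_frame.shape)    # DataFrame (3, 1)
print(type(y_series).__name__, y_series.shape)  # Series (3,)
```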

Also note that the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers (see http://pandas.pydata.org/). It was deprecated in pandas 0.20 and removed in pandas 1.0, so it only works on older versions.

You can also convert a one-column DataFrame into a Series manually, for example with DataFrame.squeeze().
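One possible way to do such a conversion, sketched on a toy frame (the values are hypothetical; squeeze and iloc are standard pandas calls, but which one fits best depends on your data):

```python
import pandas as pd

# One-column DataFrame, as produced by a column slice (hypothetical values).
df = pd.DataFrame({"target": [0, 1, 1]})
y_frame = df.loc[:, "target":]        # shape (3, 1)

y_a = y_frame.squeeze("columns")      # collapse the single column into a Series
y_b = y_frame.iloc[:, 0]              # or pick the first column by position

print(y_a.shape, y_b.shape)           # (3,) (3,)
```

Passing "columns" to squeeze keeps the result a Series even if the frame happens to have only one row.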

To check the type of a variable, type(y) is very useful and helps solve similar issues.

Evgene
1
X = df[df.columns.drop('target')]
y = df['target']

Alternatively, you can change:

y = df.ix[:,'target': ]

to:

y = df.ix[:,'target']

PS: the .ix[] indexer is deprecated in modern pandas versions (and removed since pandas 1.0), so use .loc[] instead, e.g. y = df.loc[:, 'target'].
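For reference, a sketch of the whole split written with .loc, on a small stand-in frame with the same layout (feature columns first, 'target' last; the values are hypothetical):

```python
import pandas as pd

# Small stand-in for the question's DataFrame (hypothetical values).
df = pd.DataFrame({
    "mean radius": [17.99, 20.57, 19.69],
    "worst fractal dimension": [0.1189, 0.0890, 0.0876],
    "target": [0, 0, 1],
})

# Label slices with .loc include both endpoints.
X = df.loc[:, "mean radius":"worst fractal dimension"]
y = df.loc[:, "target"]               # single label -> Series

# Same features without naming the first and last column.
X_alt = df.drop(columns="target")

print(X.shape, y.shape)               # (3, 2) (3,)
```

On the question's data the same two lines should give shapes (569, 30) and (569,).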

MaxU - stand with Ukraine