
I'm assuming that in the following code, iris is a Bunch object, a container type specific to sklearn.datasets.

# import load_iris function from datasets module
from sklearn.datasets import load_iris

# save "bunch" object containing iris dataset and its attributes
iris = load_iris()

When I check what type of object it is, it reports a Bunch object.

type(iris)
Out[4]:
sklearn.utils.Bunch

Now, I need to use the corr() method to compute the standard correlation between every pair of attributes, but that method works on a DataFrame, not on a Bunch object.

How do I do that? Can I run it on iris.data? I know it is a NumPy array, not a DataFrame.

# check the types of the features
print(type(iris.data))
Out[5]:
<class 'numpy.ndarray'>

Now, if I had used the built-in dataset from seaborn, or loaded the data from the actual source, I would not have this issue. Here iris.corr() works perfectly, because iris is a DataFrame.

import seaborn as sns

iris = sns.load_dataset("iris")
type(iris)
Out[7]:
pandas.core.frame.DataFrame
iris.corr()
Out[8]:

              sepal_length  sepal_width  petal_length  petal_width
sepal_length      1.000000    -0.117570      0.871754     0.817941
sepal_width      -0.117570     1.000000     -0.428440    -0.366126
petal_length      0.871754    -0.428440      1.000000     0.962865
petal_width       0.817941    -0.366126      0.962865     1.000000

How do I run corr() in the previous example, using the sklearn Bunch object? How do I convert the sklearn Bunch object to a DataFrame, or convert the iris.data ndarray to a DataFrame?

Rakibul Hassan
  • 325
  • 3
  • 13
  • 1
    Convert it to a DataFrame first, then use `.corr()`. See [How to convert a Scikit-learn dataset to a Pandas dataset?](https://stackoverflow.com/a/38105540/8421052) – Troy Oct 13 '18 at 19:02
  • 1
    Thank you so much for the heads up. It was helpful. – Rakibul Hassan Oct 14 '18 at 03:37

1 Answer


After reviewing the responses at "How to convert a Scikit-learn dataset to a Pandas dataset?", here is a possible answer. Thanks to everyone for the direction.

from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

data = load_iris()

We can combine the feature columns and the target variable using np.column_stack.

df = pd.DataFrame(np.column_stack((data.data, data.target)), columns = data.feature_names+['target'])
print(df.head())

Output:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2     0.0
1                4.9               3.0                1.4               0.2     0.0
2                4.7               3.2                1.3               0.2     0.0
3                4.6               3.1                1.5               0.2     0.0
4                5.0               3.6                1.4               0.2     0.0
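
As a side note, newer scikit-learn releases (0.23 and later, as far as I know) can hand back pandas objects directly, which avoids the manual np.column_stack step. This is only a minimal sketch; the names iris_bunch and iris_df are just illustrative and not part of the code above.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of ndarrays
# (assumes scikit-learn 0.23 or newer)
iris_bunch = load_iris(as_frame=True)

# .frame holds the four feature columns plus an integer "target" column
iris_df = iris_bunch.frame
print(type(iris_df))
print(iris_df.columns.tolist())
```

Here the target column stays an integer rather than the float that column_stack produces.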

Now we can map the numeric targets to species names by converting target_names to a dictionary, and add the result as a new column:

df['label'] = df.target.replace(dict(enumerate(data.target_names)))
print(df.head())

Output:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target   label
0                5.1               3.5                1.4               0.2     0.0  setosa
1                4.9               3.0                1.4               0.2     0.0  setosa
2                4.7               3.2                1.3               0.2     0.0  setosa
3                4.6               3.1                1.5               0.2     0.0  setosa
4                5.0               3.6                1.4               0.2     0.0  setosa
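
To get back to the original question about corr(): once the features are in a DataFrame, corr() can be called on it. The following is a minimal, self-contained sketch; the names features and corr_matrix are only illustrative. It also cross-checks the result against np.corrcoef run directly on the iris.data ndarray.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()

# Build a DataFrame from the feature array only, so every column is numeric
features = pd.DataFrame(data.data, columns=data.feature_names)

# Pearson correlation between every pair of feature columns
corr_matrix = features.corr()
print(corr_matrix)

# Cross-check with NumPy on the raw ndarray; rowvar=False treats columns as variables
numpy_corr = np.corrcoef(data.data, rowvar=False)
print(np.allclose(corr_matrix.to_numpy(), numpy_corr))
```

If corr() is called on a DataFrame that also contains the string label column, recent pandas versions (2.0 and later, I believe) raise an error unless the column is dropped first or numeric_only=True is passed.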
Rakibul Hassan