How to convert a Scikit-learn dataset to a Pandas dataset

Question

How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame?

from sklearn.datasets import load_iris
import pandas as pd
data = load_iris()
print(type(data))
data1 = pd. # Is there a Pandas method to accomplish this?

score 203 · Accepted Answer · edited Dec 07 '16 at 23:57

Manually, you can use the pd.DataFrame constructor, passing a numpy array (data) and a list of column names (columns). To get everything into one DataFrame, concatenate the features and the target into one numpy array with np.c_[...] (note the square brackets):

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# save load_iris() sklearn dataset to iris
# if you'd like to check dataset type use: type(load_iris())
# if you'd like to view list of attributes use: dir(load_iris())
iris = load_iris()

# np.c_ concatenates arrays column-wise (along the second axis);
# here it joins the iris['data'] and iris['target'] arrays
# for the pandas columns argument: append a one-element list to iris['feature_names'];
# the new column name can be anything you like
# (the original dataset would probably call it 'Species')
data1 = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= iris['feature_names'] + ['target'])
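
One side effect of this approach: np.c_ builds a single array, so the integer target is upcast to float along with the features. The following is a minimal sketch (variable names such as `stacked` and `same` are illustrative, not part of the answer) that checks this, compares the result with an np.hstack construction, and casts the target back to int:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# np.c_ stacks the (150, 4) features and the (150,) target column-wise
stacked = np.c_[iris['data'], iris['target']]

# equivalent construction with np.hstack and an explicit reshape
same = np.hstack((iris['data'], iris['target'].reshape(-1, 1)))

df = pd.DataFrame(stacked, columns=list(iris['feature_names']) + ['target'])

# the target was upcast to float64 together with the features; cast it back
df['target'] = df['target'].astype(int)
```

If the float target is acceptable for your use, the final cast can be dropped.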

edited Dec 07 '16 at 23:57 by rolyat · answered Jun 29 '16 at 13:26 by TomDLT

Can you add a little text to explain this code? This is somewhat brief by our standards. – gung - Reinstate Monica Jun 29 '16 at 14:09
Some bunches have the feature_names as an ndarray, which will break the columns parameter. – Jul 10 '17 at 01:17
Missing "Species" key and values for dataframe. – mastash3ff Jul 11 '17 at 15:24
Species is no longer available in the latest iris data frame, as far as I can tell. They are replaced by target_names. – Kingz Jul 13 '17 at 05:14
This code didn't work as-is for me. For the columns parameter, I needed to pass in columns=np.append(iris['feature_names'], 'target'). Did I do something wrong, or does this answer need an edit? – Josh Davis Oct 02 '17 at 01:35
What failed with `iris['feature_names'] + ['target']`? – TomDLT Oct 02 '17 at 09:54
my error: columns= diab['feature_names'] + ['target']) KeyError: 'feature_names' am I doing something wrong? – Dev_Man Oct 28 '17 at 19:59
@mastash3ff It is caused by specifying a column whose name is "Species". Just use 'target' instead, or replace 'target' with 'Species' in the code above. – Junyong Yao Nov 09 '17 at 01:47
This doesn't work for all datasets, such as `load_boston()`. This answer works more generally: https://stackoverflow.com/a/46379878/1840471 – Max Ghenis Jun 08 '18 at 20:03
Equivalent to c_: `np.hstack((iris["data"], iris["target"].reshape(-1, 1)))` or `np.concatenate((iris["data"], iris["target"].reshape(-1, 1)), axis=1)` – Maximilian Janisch Sep 13 '19 at 10:16
Update v0.23: Please check dheinz's answer, scikit-learn added a parameter `as_frame`. (https://stackoverflow.com/a/61780389/6156647) – TomDLT May 14 '20 at 22:07

justin4480 · Answer 2 · 2022-01-06T12:10:21.753

122

from sklearn.datasets import load_iris
import pandas as pd

data = load_iris()
df = pd.DataFrame(data=data.data, columns=data.feature_names)
df.head()

This tutorial may be of interest: http://www.neural.cz/dataset-exploration-boston-house-pricing.html

edited Jan 06 '22 at 12:10 · answered Apr 21 '17 at 22:40 by justin4480

Need to concatenate the data with target: df = pd.DataFrame(np.concatenate((iris.data, np.array([iris.target]).T), axis=1), columns=iris.feature_names + ['target']) – CyberPlayerOne Apr 26 '17 at 07:06

score 88 · Answer 3 · edited Jun 08 '18 at 20:02

TomDLT's solution is not generic enough for all the datasets in scikit-learn. For example, it does not work for the Boston housing dataset. I propose a different solution that is more universal and does not need numpy.

from sklearn import datasets
import pandas as pd

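# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call only runs on older versions.
boston_data = datasets.load_boston()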
df_boston = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df_boston['target'] = pd.Series(boston_data.target)
df_boston.head()

As a general function:

def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

df_boston = sklearn_to_df(datasets.load_boston())
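
Since load_boston was removed in scikit-learn 1.2, here is a minimal sketch of the same function applied to two datasets that ship with scikit-learn, load_wine and load_diabetes. The choice of datasets is an assumption made for illustration; the function body is the one given above.

```python
import pandas as pd
from sklearn import datasets

def sklearn_to_df(sklearn_dataset):
    # same approach as above: features first, then the target as an extra column
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

# both loaders are bundled with scikit-learn, so no download is needed
df_wine = sklearn_to_df(datasets.load_wine())
df_diabetes = sklearn_to_df(datasets.load_diabetes())
print(df_wine.shape, df_diabetes.shape)
```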

edited Jun 08 '18 at 20:02 by Max Ghenis · answered Sep 23 '17 at 13:03 by Nilav Baran Ghosh

I think `pd.Series(sklearn_dataset.target)` can be replaced with `sklearn_dataset.target`? At least it works for me on pandas 1.1.3 – 3142 maple Oct 28 '20 at 13:40
I find this solution easier to understand – Max Segal Feb 03 '21 at 17:52

score 19 · Answer 4 · answered Feb 01 '18 at 09:03

Took me 2 hours to figure this out

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
# iris.keys()  # uncomment to list the available keys

df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])

df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

The last line adds the species names back to the DataFrame as a categorical column.
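
To see what pd.Categorical.from_codes does in isolation, here is a small sketch with made-up stand-ins for iris.target and iris.target_names (the arrays `codes` and `names` are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# stand-ins for iris.target and iris.target_names (illustrative values only)
codes = np.array([0, 0, 1, 2, 1])
names = np.array(['setosa', 'versicolor', 'virginica'])

# each integer code is used as a position in the list of category names
species = pd.Series(pd.Categorical.from_codes(codes, names))

print(species.tolist())            # the names, one per row
print(species.cat.codes.tolist())  # the original integer codes
```

The integer codes stay recoverable through the cat accessor, so no information is lost by storing the names.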

BhishanPoudel · Answer 5 · 2021-03-05T15:12:07.697

New Update

You can use the parameter as_frame=True to get pandas dataframes.

If the as_frame parameter is available (e.g. load_iris)

from sklearn import datasets
X,y = datasets.load_iris(return_X_y=True) # numpy arrays

dic_data = datasets.load_iris(as_frame=True)
print(dic_data.keys())

df = dic_data['frame'] # pandas dataframe data + target
df_X = dic_data['data'] # pandas dataframe data only
ser_y = dic_data['target'] # pandas series target only
dic_data['target_names'] # numpy array
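
As a possible follow-up, the combined frame can be given a readable label column by indexing target_names with the integer target. This is a sketch only; the column name `species` is a hypothetical choice, not something the loader provides.

```python
from sklearn.datasets import load_iris

bunch = load_iris(as_frame=True)
df = bunch.frame.copy()   # features plus a 'target' column

# hypothetical extra column: look up each integer code in target_names
df['species'] = bunch.target_names[df['target'].to_numpy()]
print(df.head(2))
```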

If the as_frame parameter is NOT available (e.g. load_boston)

import pandas as pd
from sklearn import datasets

fnames = [ i for i in dir(datasets) if 'load_' in i]
print(fnames)

fname = 'load_boston'
loader = getattr(datasets,fname)()
df = pd.DataFrame(loader['data'],columns= loader['feature_names'])
df['target'] = loader['target']
df.head(2)
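
The two cases can be folded into one helper. The sketch below is an assumption-laden illustration: `load_as_dataframe` is a hypothetical name, and it relies on a loader without an as_frame parameter raising TypeError for the unknown keyword.

```python
import pandas as pd
from sklearn import datasets

def load_as_dataframe(name):
    """Hypothetical helper: return features plus 'target' as one DataFrame."""
    loader = getattr(datasets, name)
    try:
        # loaders that accept as_frame (scikit-learn 0.23 or later)
        return loader(as_frame=True).frame
    except TypeError:
        # loader has no as_frame parameter: build the frame by hand
        bunch = loader()
        df = pd.DataFrame(bunch['data'], columns=list(bunch['feature_names']))
        df['target'] = bunch['target']
        return df

df = load_as_dataframe('load_iris')
print(df.shape)
```

Catching TypeError is deliberately broad here; a stricter version could inspect the loader's signature instead.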

Finally - can load boston not just iris etc! This split is brilliantly clear and works perfectly. — TickboxPhil, Jun 15 '21 at 10:05

score 15 · Answer 6 · answered Oct 07 '17 at 18:48

Just as an alternative that I could wrap my head around much easier:

import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']
df.head()

Basically, instead of concatenating from the start, build a DataFrame from the feature matrix and then add the target column by assigning data['target'] (or whatever the key is named in your dataset) to a new column.

answered Oct 07 '17 at 18:48 by daguito81

Simple answers are the best... – Brian Wylie Nov 13 '21 at 21:12

Paul Rougieux · Answer 7 · 2020-05-16T07:46:15.897

11

Otherwise use seaborn data sets which are actual pandas data frames:

import seaborn
iris = seaborn.load_dataset("iris")
type(iris)
# <class 'pandas.core.frame.DataFrame'>

Compare with scikit-learn data sets:

from sklearn import datasets
iris = datasets.load_iris()
type(iris)
# <class 'sklearn.utils.Bunch'>
dir(iris)
# ['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']

edited May 16 '20 at 07:46 · answered Feb 12 '20 at 10:25 by Paul Rougieux

score 9 · Answer 8 · answered Sep 03 '20 at 14:26

This is an easy method that worked for me.

import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

boston = load_boston()
boston_frame = pd.DataFrame(data=boston.data, columns=boston.feature_names)
boston_frame["target"] = boston.target

This can be applied to load_iris as well.

answered Sep 03 '20 at 14:26 by user3151256

This worked a charm for me! – Oct 27 '21 at 04:05

Liquidgenius · Answer 9 · 2022-02-15T14:50:12.250

Many of the solutions are either missing column names or the species target names. This solution provides target_name labels.

@Ankit-mathanker's solution works; however, it iterates over the target Series to substitute the iris species names for the integer identifiers.

Based on the adage 'Don't iterate a DataFrame if you don't have to,' the following solution uses Series.replace() to accomplish the replacement more concisely.

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris['data'], columns = iris['feature_names'])
df['target'] = pd.Series(iris['target'], name = 'target_values')
df['target_name'] = df['target'].replace([0,1,2],
['iris-' + species for species in iris['target_names'].tolist()])

df.head(3)

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target  target_name
0                5.1               3.5                1.4               0.2       0  iris-setosa
1                4.9               3.0                1.4               0.2       0  iris-setosa
2                4.7               3.2                1.3               0.2       0  iris-setosa

Representing the target variable with a pd.Categorical as in [this answer](https://stackoverflow.com/a/48558847/3388962) is more elegant. — normanius, May 31 '23 at 14:00

score 6 · Answer 10 · answered Jul 20 '17 at 04:11

This works for me.

dataFrame = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                         columns=list(iris['feature_names']) + ['target'])

answered Jul 20 '17 at 04:11 by Mukul Aggarwal

score 6 · Answer 11 · answered Apr 11 '18 at 01:35

Another way to combine the features and the target variable is np.column_stack:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(np.column_stack((data.data, data.target)), columns = data.feature_names+['target'])
print(df.head())

Result:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)     target
0                5.1               3.5                1.4               0.2     0.0
1                4.9               3.0                1.4               0.2     0.0 
2                4.7               3.2                1.3               0.2     0.0 
3                4.6               3.1                1.5               0.2     0.0
4                5.0               3.6                1.4               0.2     0.0

If you need the string label for the target, you can convert target_names to a dictionary, use replace, and add a new column:

df['label'] = df.target.replace(dict(enumerate(data.target_names)))
print(df.head())

Result:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)     target  label 
0                5.1               3.5                1.4               0.2     0.0     setosa
1                4.9               3.0                1.4               0.2     0.0     setosa
2                4.7               3.2                1.3               0.2     0.0     setosa
3                4.6               3.1                1.5               0.2     0.0     setosa
4                5.0               3.6                1.4               0.2     0.0     setosa
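
The dict(enumerate(...)) idiom can be examined on its own. The sketch below uses made-up stand-ins for data.target_names and the float target column, and uses Series.map after an explicit cast to int; the cast is a cautious choice so the lookup does not depend on 0.0 matching the key 0.

```python
import numpy as np
import pandas as pd

# stand-ins for data.target_names and the float 'target' column (illustrative)
target_names = np.array(['setosa', 'versicolor', 'virginica'])
target = pd.Series([0.0, 1.0, 2.0, 1.0])

# enumerate pairs each name with its position: {0: 'setosa', 1: ..., 2: ...}
lookup = dict(enumerate(target_names.tolist()))

# cast to int first so the keys match exactly, then map code -> name
labels = target.astype(int).map(lookup)
print(labels.tolist())
```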

score 4 · Answer 12 · answered May 13 '20 at 16:53

As of version 0.23, you can directly return a DataFrame using the as_frame argument. For example, loading the iris data set:

from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
df = iris.data

As I understand the provisional release notes, this works for the breast_cancer, diabetes, digits, iris, linnerud, wine and california_housing data sets.
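
Rather than relying on a list of supported data sets, you can check a loader's signature at run time. This is a sketch under the assumption that inspecting the signature is sufficient; `supports_as_frame` is a hypothetical helper, not part of scikit-learn.

```python
import inspect
from sklearn import datasets

def supports_as_frame(loader):
    """Hypothetical helper: True if the loader's signature has an as_frame parameter."""
    return 'as_frame' in inspect.signature(loader).parameters

for name in ('load_iris', 'load_wine', 'load_digits'):
    print(name, supports_as_frame(getattr(datasets, name)))
```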

score 3 · Answer 13 · answered Jul 08 '20 at 07:39

Here is another example, using the loader's built-in parameters, that may be helpful.

from sklearn.datasets import load_iris
iris_X, iris_y = load_iris(return_X_y=True, as_frame=True)
type(iris_X), type(iris_y)

The features iris_X are returned as a pandas DataFrame and the target iris_y as a pandas Series.
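
If a single DataFrame is wanted, the two pieces can be joined column-wise. A minimal sketch, assuming the target column should be called 'target' (the explicit rename makes that independent of the Series' own name):

```python
import pandas as pd
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True, as_frame=True)

# X and y share the same index, so concat along axis=1 lines the rows up
df = pd.concat([X, y.rename('target')], axis=1)
print(df.shape)
```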

score 2 · Answer 14 · edited Dec 10 '19 at 18:47

Basically, what you need is the "data", which you already have in the scikit-learn Bunch, plus the "target" (the value to predict), which is also in the Bunch.

So you just need to join these two to make the data complete:

  import pandas as pd
  from sklearn.datasets import load_breast_cancer

  # assumption: cancer is the Bunch returned by load_breast_cancer()
  cancer = load_breast_cancer()
  data_df = pd.DataFrame(cancer.data,columns=cancer.feature_names)
  target_df = pd.DataFrame(cancer.target,columns=['target'])

  final_df = data_df.join(target_df)

edited Dec 10 '19 at 18:47 by Govinda Sakhare · answered Oct 17 '19 at 05:39 by Dhiraj Himani

score 2 · Answer 15 · answered May 15 '20 at 15:14

The API is a little cleaner than the other responses suggest. Here is a version that uses as_frame and makes sure to include a response column as well.

import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine(as_frame=True)  # load the dataset once
df = wine.data
df['target'] = wine.target

df.head(2)

score 1 · Answer 16 · answered Jul 10 '17 at 02:09

Working off the best answer and addressing my comment, here is a function for the conversion

import numpy as np
import pandas as pd

def bunch_to_dataframe(bunch):
  # list() accepts both a list and an ndarray, and building a new list
  # avoids appending 'target' to the Bunch's own feature_names in place
  features = list(bunch.feature_names) + ['target']
  return pd.DataFrame(data=np.c_[bunch['data'], bunch['target']],
                      columns=features)
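
To illustrate the ndarray case this function addresses, here is a sketch using a toy Bunch with made-up values. `bunch_to_df` is a hypothetical variant written for this example, and the toy data is not from any real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.utils import Bunch

def bunch_to_df(bunch):
    # hypothetical variant for this example; list() accepts a list or an ndarray
    columns = list(bunch['feature_names']) + ['target']
    return pd.DataFrame(np.c_[bunch['data'], bunch['target']], columns=columns)

# toy Bunch whose feature_names is an ndarray (made-up values)
toy = Bunch(data=np.array([[1.0, 2.0], [3.0, 4.0]]),
            target=np.array([0, 1]),
            feature_names=np.array(['a', 'b']))

df = bunch_to_df(toy)
print(df)
```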

Jeff Hernandez · Answer 17 · 2018-08-05T16:15:42.230

1

This snippet is only syntactic sugar built upon what TomDLT and rolyat have already contributed and explained. The only differences are that load_iris returns a tuple instead of a dictionary and that the column names are plain integers.

df = pd.DataFrame(np.c_[load_iris(return_X_y=True)])

edited Aug 05 '18 at 16:15 · answered Aug 02 '18 at 23:10 by Jeff Hernandez

Thank you for this code snippet, which might provide some limited, immediate help. A [proper explanation would greatly improve its long-term value](//meta.stackexchange.com/q/114762/206345) by showing _why_ this is a good solution to the problem, and would make it more useful to future readers with other, similar questions. Please [edit] your answer to add some explanation, including the assumptions you've made. – Blue Aug 03 '18 at 01:24

score 1 · Answer 18 · answered Jan 15 '20 at 22:11

I took a couple of ideas from your answers and I don't know how to make it shorter :)

import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris['feature_names'])
df['target'] = iris['target']

This gives a Pandas DataFrame with feature_names plus target as columns and RangeIndex(start=0, stop=len(df), step=1). I would like to have a shorter code where I can have 'target' added directly.

score 1 · Answer 19 · answered Nov 30 '20 at 08:27

You can use the pd.DataFrame constructor, passing a numpy array (data) and a list of column names (columns). To have everything in one DataFrame, you can concatenate the features and the target into one numpy array with np.c_[...] (note the square brackets, not parentheses). You can also run into trouble if you don't convert the feature names (iris['feature_names']) to a list before concatenation:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                     columns= list(iris['feature_names']) + ['target'])

Richard Jarram · Answer 20 · 2021-08-04T22:04:07.537

Plenty of good responses to this question; I've added my own below.

import pandas as pd
from sklearn.datasets import load_iris

df = pd.DataFrame(
    # load all 4 dimensions of the dataframe EXCLUDING species data
    load_iris()['data'],
    # set the column names for the 4 dimensions of data
    columns=load_iris()['feature_names']
)

# we create a new column called 'species' with 150 rows of numerical data 0-2 signifying a species type. 
# Our column `species` should have data such `[0, 0, 1, 2, 1, 0]` etc.
df['species'] = load_iris()['target']
# we map the numerical data to string data for species type
df['species'] = df['species'].map({
    0 : 'setosa',
    1 : 'versicolor',
    2 : 'virginica'   
})

df.head()

Breakdown

load_iris()['feature_names'] has only 4 entries (sepal length, sepal width, petal length, petal width), and load_iris()['data'] contains data only for those features.
The species names are stored separately in load_iris()['target_names'] == array(['setosa', 'versicolor', 'virginica']).
The species of each row is stored in load_iris()['target'] as an integer code with 3 distinct values (0, 1, 2).
Our goal was simply to add a new column called species, using map to convert the integer codes 0-2 into the 3 species names.
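
The pieces described in the breakdown can be inspected directly. The sketch below loads the dataset once and derives the code-to-name mapping from target_names instead of hard-coding it; `species_map` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()   # load once and reuse

print(iris['data'].shape)          # (150, 4)
print(iris['target'].shape)        # (150,)
print(np.unique(iris['target']))   # [0 1 2]

# build the code -> name mapping from the dataset itself
species_map = dict(enumerate(iris['target_names'].tolist()))
print(species_map)
```

The resulting dictionary can be passed to map in place of the literal one used above.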

SamithaP · Answer 21 · 2022-09-03T23:47:00.177

This is an easy way that works with the majority of datasets in sklearn

import pandas as pd
from sklearn import datasets

# load the iris data set (bundled with scikit-learn)
iris = datasets.load_iris()

# load feature columns to DataFrame
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# add a column to df called 'target_c', then assign it the target data of iris
df['target_c'] = iris.target

# view final DataFrame
df.head()

score 1 · Answer 22 · answered Nov 30 '22 at 19:23

A simpler, more approachable way that I tried

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

X= pd.DataFrame(iris['data'], columns= iris['feature_names'])
y = pd.DataFrame(iris['target'],columns=['target'])
df = X.join(y)

answered Nov 30 '22 at 19:23 by Oloyede Abdulganiyu

score 0 · Answer 23 · answered Jun 29 '16 at 17:09

There might be a better way but here is what I have done in the past and it works quite well:

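# Note: this indexing only works on Python 2; on Python 3, data.items() must be wrapped in list(),
# and the positions used below depend on the key order of the Bunch, which may differ between versions.
items = data.items()                          #Gets all the data from this Bunch - a huge list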
mydata = pd.DataFrame(items[1][1])            #Gets the Attributes
mydata[len(mydata.columns)] = items[2][1]     #Adds a column for the Target Variable
mydata.columns = items[-1][1] + [items[2][0]] #Gets the column names and updates the dataframe

Now mydata will have everything you need: attributes, target variable and column names

answered Jun 29 '16 at 17:09 by HakunaMaData

The solution by TomDLT is far superior to what I am suggesting above. It does the same thing but is very elegant and easy to understand. Use that! – HakunaMaData Jun 29 '16 at 17:22
`mydata = pd.DataFrame(items[1][1])` throws `TypeError: 'dict_items' object does not support indexing` – SANBI samples Jul 01 '16 at 07:31

score 0 · Answer 24 · answered Aug 25 '18 at 09:21

TomDLT's answer may not work for some of you:

data1 = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                 columns= iris['feature_names'] + ['target'])

This fails when iris['feature_names'] is a numpy array. For an array, the + operator means element-wise addition, not list concatenation, so adding the list ['target'] does not append a column name. You need to convert the array to a list first and then add.

You can do

data1 = pd.DataFrame(data= np.c_[iris['data'], iris['target']],
                 columns= list(iris['feature_names']) + ['target'])

This works fine.
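
The difference can be seen without loading any dataset. Below is a minimal sketch with a made-up array standing in for a feature_names attribute; it shows the list() conversion and the np.append alternative mentioned in a comment on the accepted answer.

```python
import numpy as np

# stand-in for a feature_names attribute stored as an ndarray (made-up names)
feature_names = np.array(['f1', 'f2', 'f3'])

# list() turns the ndarray into a plain list, so + means list concatenation
columns = list(feature_names) + ['target']

# np.append gives the same names, but as an ndarray
columns_np = np.append(feature_names, 'target')

print(columns)
print(columns_np)
```

Either result can be passed as the columns argument of pd.DataFrame.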

score 0 · Answer 25 · answered Feb 07 '19 at 13:23

import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
X = iris['data']
y = iris['target']
iris_df = pd.DataFrame(X, columns = iris['feature_names'])
iris_df.head()

answered Feb 07 '19 at 13:23 by Manideep Pullalachervu

score 0 · Answer 26 · edited May 03 '19 at 12:29

One of the best ways:

data = pd.DataFrame(digits.data)

Here digits is the scikit-learn dataset (a Bunch, presumably the one returned by load_digits()), which I converted to a pandas DataFrame

edited May 03 '19 at 12:29 by mechnicov · answered May 03 '19 at 10:34 by Shilp Baroda

score 0 · Answer 27 · answered Jul 18 '20 at 14:18

from sklearn.datasets import load_iris
import pandas as pd

iris_dataset = load_iris()

datasets = pd.DataFrame(iris_dataset['data'], columns = 
           iris_dataset['feature_names'])
target_val = pd.Series(iris_dataset['target'], name = 
            'target_values')

species = []
for val in target_val:
    if val == 0:
        species.append('iris-setosa')
    if val == 1:
        species.append('iris-versicolor')
    if val == 2:
        species.append('iris-virginica')
species = pd.Series(species)

datasets['target'] = target_val
datasets['target_name'] = species
datasets.head()
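
The per-row loop can also be replaced by array indexing. The following sketch is one alternative, not the author's method: it prefixes the three species names once and then indexes that small array with the integer target, so each row gets its label in a single vectorized step.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df['target'] = iris['target']

# prefix the three species names once (only three items, not 150 rows)
prefixed = np.array(['iris-' + name for name in iris['target_names']])

# index the names with the integer codes: one lookup per row, no explicit loop
df['target_name'] = prefixed[iris['target']]
print(df.head())
```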

score 0 · Answer 28 · answered May 31 '23 at 14:23

So many answers, so much noise... The following is simple and uses pd.Categorical for the target variable.

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

#      sepal_length  sepal_width  petal_length  petal_width    species
# 0             5.1          3.5           1.4          0.2     setosa
# 1             4.9          3.0           1.4          0.2     setosa
# 2             4.7          3.2           1.3          0.2     setosa
# 3             4.6          3.1           1.5          0.2     setosa
# 4             5.0          3.6           1.4          0.2     setosa
# ..            ...          ...           ...          ...        ...
# 145           6.7          3.0           5.2          2.3  virginica
# 146           6.3          2.5           5.0          1.9  virginica
# 147           6.5          3.0           5.2          2.0  virginica
# 148           6.2          3.4           5.4          2.3  virginica
# 149           5.9          3.0           5.1          1.8  virginica
# 
# [150 rows x 5 columns]

To extract the integer codes of the target variable, use the cat accessor.

df.species.cat.codes

# 0      0
# 1      0
# 2      0
# 3      0
# 4      0
#       ..
# 145    2
# 146    2
# 147    2
# 148    2
# 149    2
# Length: 150, dtype: int8
