How do I create a sklearn.datasets.base.Bunch object in scikit-learn from my own data?

Question

In most scikit-learn algorithms, the data must be loaded as a Bunch object. In many of the tutorial examples, load_files() or other functions are used to populate the Bunch object. Functions like load_files() expect the data to be in a certain format, but my data is stored in a different format, namely a CSV file with strings for each field.

How do I parse this and load data in the Bunch object format?

To be sure: none of the algorithms load `Bunch` objects. The example scripts use those, but the algorithms all want arrays or sparse matrices. — Fred Foo, Dec 11 '13 at 12:23
@Blake, the fit method of the classifier takes a couple of list objects: a list of data (`Bunch.data`) followed by a list of targets (`Bunch.target`), i.e. `clf.fit(Bunch.data, Bunch.target)`. – Vivek, Mar 10 '18 at 01:58

score 24 · Answer 1 · answered Dec 21 '16 at 12:15

You can do it like this:

import numpy as np
from sklearn.utils import Bunch  # sklearn.datasets.base is no longer importable in recent releases

# Three toy text documents
examples = []
examples.append('some text')
examples.append('another example text')
examples.append('example 3')

# One integer class label per document
target = np.zeros((3,), dtype=np.int64)
target[0] = 0
target[1] = 1
target[2] = 0
dataset = Bunch(data=examples, target=target)
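
For readers unsure what the resulting object is: a Bunch is essentially a dictionary whose keys can also be read as attributes. The following is a minimal sketch with the same toy data as the snippet above. It imports Bunch from `sklearn.utils`, which is an assumption about the installed version (that path works in recent scikit-learn releases).

```python
import numpy as np
from sklearn.utils import Bunch

# Toy data mirroring the answer's snippet (illustrative only).
examples = ["some text", "another example text", "example 3"]
target = np.array([0, 1, 0], dtype=np.int64)

dataset = Bunch(data=examples, target=target)

# Attribute access and key access return the very same objects.
same_data = dataset.data is dataset["data"]

# A Bunch exposes the usual dict interface, e.g. keys().
keys = sorted(dataset.keys())
```

Since estimators only need the arrays themselves, `dataset.data` and `dataset.target` can be passed around without the Bunch wrapper.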

answered Dec 21 '16 at 12:15

Hugh Perkins

@MachineEpsilon The question shows a misunderstanding in how you feed data to a classifier in `scikit-learn`, so even though this answers the literal question, it doesn't clear up the original misunderstanding. This link makes it pretty clear: http://scikit-learn.org/stable/faq.html#how-can-i-create-a-bunch-object – Perry Oct 26 '18 at 09:29

score 20 · Accepted Answer · edited Dec 11 '13 at 12:23

You don't have to create Bunch objects. They are just useful for loading the internal sample datasets of scikit-learn.

You can directly feed a list of Python strings to your vectorizer object.
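
As a rough illustration of that point, the sketch below passes a plain list of strings to a vectorizer and then fits a classifier on the result. The choice of `CountVectorizer` and `MultinomialNB`, and the toy texts and labels, are assumptions made for the example; any vectorizer and estimator would follow the same pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Plain Python lists: no Bunch involved (toy data, illustrative only).
texts = ["some text", "another example text", "example 3"]
labels = [0, 1, 0]

# The vectorizer turns the strings into a sparse document-term matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# The classifier is fitted on the matrix and the list of labels.
clf = MultinomialNB()
clf.fit(X, labels)

# New text must go through the same fitted vectorizer before predicting.
predictions = clf.predict(vectorizer.transform(["another text"]))
```

With the default tokenizer, single-character tokens such as "3" are dropped, so the matrix here has four columns (another, example, some, text).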

edited Dec 11 '13 at 12:23

Fred Foo

answered Dec 10 '13 at 10:14

ogrisel


Thanks. Is there any utility function to load a CSV? I have a CSV with four columns (all strings). Right now I am using the Python csv reader: http://docs.python.org/2/library/csv.html – David Dec 10 '13 at 23:49

I would recommend pandas: http://pandas.pydata.org/. You need to convert the pandas.DataFrame to a NumPy array before feeding it to sklearn, though. – Andreas Mueller Dec 11 '13 at 00:47
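
A minimal sketch of the pandas route, assuming a CSV with four string columns. The column names (`id`, `author`, `text`, `label`) and the rows are hypothetical, and `io.StringIO` stands in for a real file path so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice pass a file path to read_csv.
csv_text = io.StringIO(
    "id,author,text,label\n"
    "a1,ann,some text,spam\n"
    "a2,bob,another example text,ham\n"
    "a3,cat,example 3,spam\n"
)

# dtype=str keeps every field as a string, as described in the question.
df = pd.read_csv(csv_text, dtype=str)

# Pull out the pieces scikit-learn needs: a list of documents and an array of labels.
texts = df["text"].tolist()
labels = df["label"].to_numpy()
```

From here `texts` can go straight into a vectorizer and `labels` into `fit()`.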

score 0 · Answer 3 · edited Dec 17 '20 at 12:51

This example uses the Breast Cancer Wisconsin (Diagnostic) Data Set; the CSV file is available on Kaggle.

Columns 2 through 31 of the CSV file (usecols=range(2,32)) hold the feature data, which is the source of X_train and X_test. They are stored under the Bunch key named data:
```
from numpy import genfromtxt
data = genfromtxt("YOUR DATA DIRECTORY", delimiter=',', skip_header=1,  usecols=range(2,32))
```
The target is column B of the CSV file (index 1 in the NumPy array, hence usecols=(1)). It is the source of y_train and y_test and is stored under the Bunch key named target:
```
import pandas as pd
target = genfromtxt("YOUR DATA DIRECTORY", delimiter=',', skip_header=1, usecols=(1), dtype=str)
```
A few tricks are needed to transform the target into the form sklearn uses. Of course this can be done with a single variable; the separate names target, target2, ... are only there to explain what I did.
First, convert the NumPy array into a pandas Series:
```
target2 = pd.Series(target)
```
The Series is only there so that the rank function can be used; you could skip step number 5:
```
target3 = target2.rank(method='dense', axis=0)
```
This only transforms the target into 0 or 1, like the example in the book:
```
target4 = (target3 % 2 == 0) * 1 
```
Get the values back into a NumPy array:
```
target5 = target4.values
```
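
As an aside, the rank and modulo steps can probably be collapsed into a single call. The sketch below swaps in `LabelEncoder` from `sklearn.preprocessing`, which is not what the steps above use. It assumes the target column contains only the two diagnosis labels "B" and "M", in which case it produces the same mapping (B to 0, M to 1). The four sample labels are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, shaped like the output of genfromtxt(..., dtype=str).
target = np.array(["M", "B", "B", "M"])

# LabelEncoder assigns integers in sorted label order: "B" -> 0, "M" -> 1.
encoder = LabelEncoder()
encoded = encoder.fit_transform(target)

# encoder.classes_ records which label each integer stands for.
classes = encoder.classes_.tolist()
```

Keeping the fitted encoder around also allows mapping predictions back to the original labels with `encoder.inverse_transform`.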

Here I copied Hugh Perkins's solution:

from sklearn.utils import Bunch  # sklearn.datasets.base is no longer importable in recent releases
dataset = Bunch(data=data, target=target5)
