20

In most of the scikit-learn algorithms, the data must be loaded as a Bunch object. In many examples in the tutorial, load_files() or other functions are used to populate the Bunch object. Functions like load_files() expect the data to be present in a certain format, but my data is stored in a different format, namely a CSV file with strings for each field.

How do I parse this and load data in the Bunch object format?

MatthewMartin
David
  • 1
    To be sure: none of the algorithms load `Bunch` objects. The example scripts use those, but the algorithms all want arrays or sparse matrices. – Fred Foo Dec 11 '13 at 12:23
  • @Blake, the fit method of the classifier takes a couple of list objects: a list of data (`Bunch.data`) followed by a list of targets (`Bunch.target`), as in `clf.fit(Bunch.data, Bunch.target)`. – Vivek Mar 10 '18 at 01:58
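To illustrate the point made in the comments, here is a minimal sketch of fitting an estimator on plain NumPy arrays, with no Bunch involved. The toy data and the choice of `LogisticRegression` are assumptions made only for illustration; they do not come from the question.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: 4 samples with 2 numeric features each
X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)  # the estimator receives plain arrays, not a Bunch

# Predict labels for two new samples
pred = clf.predict(np.array([[0.1, 0.0], [1.0, 1.0]]))
```

If the data already lives in a Bunch, the equivalent call would pass its `data` and `target` attributes, since those are the arrays the estimator actually consumes.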

3 Answers

24

You can do it like this:

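import numpy as np
# In recent scikit-learn releases Bunch is importable from sklearn.utils;
# the older sklearn.datasets.base path is no longer available.
from sklearn.utils import Bunch

# The raw text examples
examples = []
examples.append('some text')
examples.append('another example text')
examples.append('example 3')

# One integer class label per example
target = np.zeros((3,), dtype=np.int64)
target[0] = 0
target[1] = 1
target[2] = 0
dataset = Bunch(data=examples, target=target)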
Hugh Perkins
  • 2
    @MachineEpsilon The question shows a misunderstanding in how you feed data to a classifier in `scikit-learn`, so even though this answers the literal question, it doesn't clear up the original misunderstanding. This link makes it pretty clear: http://scikit-learn.org/stable/faq.html#how-can-i-create-a-bunch-object – Perry Oct 26 '18 at 09:29
20

You don't have to create Bunch objects. They are just useful for loading the internal sample datasets of scikit-learn.

You can directly feed a list of Python strings to your vectorizer object.
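As a rough sketch of what that looks like, assuming a toy corpus and using `CountVectorizer` with `MultinomialNB` (both chosen here only for illustration, the answer does not name a specific vectorizer or classifier):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels
texts = ['some text', 'another example text', 'example 3']
labels = [0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # sparse matrix, one row per input string

clf = MultinomialNB()
clf.fit(X, labels)

# New text must go through the same fitted vectorizer before prediction
pred = clf.predict(vectorizer.transform(['more example text']))
```

The list of strings goes straight into the vectorizer; the classifier only ever sees the resulting sparse matrix and the label list.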

Fred Foo
ogrisel
  • Thanks. Is there any utility function to load a .CSV? I have a CSV with four columns (all strings). Right now I am using the Python csv reader http://docs.python.org/2/library/csv.html – David Dec 10 '13 at 23:49
  • 6
    I would recommend pandas (http://pandas.pydata.org/). You need to convert the pandas.DataFrame to a NumPy array before feeding it to sklearn, though. – Andreas Mueller Dec 11 '13 at 00:47
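A minimal sketch of the pandas route suggested in the comment above. The CSV content, the column names (`title`, `body`, `author`, `label`), and the use of `CountVectorizer` are all hypothetical stand-ins; an in-memory string replaces the real file so the example is self-contained.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical CSV content standing in for a file on disk (four string columns)
csv_text = """title,body,author,label
hello,some text,ann,spam
news,another example text,bob,ham
note,example 3,cat,spam
"""

# dtype=str keeps every field as a string instead of letting pandas infer types
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

texts = df['body'].tolist()   # list of Python strings for the vectorizer
y = df['label'].to_numpy()    # NumPy array of string labels

X = CountVectorizer().fit_transform(texts)  # sparse feature matrix
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path. `X` and `y` can then be passed to an estimator's `fit` method.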
0

This is an example using the Breast Cancer Wisconsin (Diagnostic) Data Set; you can find the CSV file on Kaggle:

  1. Columns 2 through 31 of the CSV file hold the X_train and X_test data (usecols=range(2,32)); this is stored under the Bunch object key named data.

    from numpy import genfromtxt
    data = genfromtxt("YOUR DATA DIRECTORY", delimiter=',', skip_header=1,  usecols=range(2,32))
    
  2. I am interested in column B of the CSV file (column 1 in the NumPy array, usecols=(1)) because it holds the output for y_train and y_test; it is stored under the Bunch object key named target.

    import pandas as pd
    target = genfromtxt("YOUR DATA DIRECTORY", delimiter=',', skip_header=1, usecols=(1), dtype=str)
    

    A few tricks are needed to transform the target into the form it has in sklearn. Of course, this could be done with a single variable; target, target2, ... are kept separate only to explain what I did.

  3. First, convert the NumPy array into a pandas Series

    target2 = pd.Series(target)
    
  4. The conversion makes it possible to use the rank function; you could skip step 5

    target3 = target2.rank(method='dense', axis=0)
    
  5. This only transforms the target into 0 or 1, like the example in the book

    target4 = (target3 % 2 == 0) * 1 
    
  6. Get the values back into a NumPy array

    target5 = target4.values
    

Here I copied Hugh Perkins's solution:

from sklearn.utils import Bunch  # sklearn.datasets.base is no longer available in recent scikit-learn releases
dataset = Bunch(data=data, target=target5)
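The label-encoding steps above can also be sketched more compactly. Note that this swaps in a different technique: scikit-learn's `LabelEncoder` instead of the rank-and-modulo approach used in the answer. The small `data` and `labels` arrays are hypothetical stand-ins for what `genfromtxt` would return, and `target_names` is an extra key added here for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import Bunch

# Hypothetical stand-ins for the arrays read from the CSV file
data = np.array([[17.99, 10.38], [13.54, 14.36], [12.45, 15.70], [20.57, 17.77]])
labels = np.array(['M', 'B', 'B', 'M'])

# LabelEncoder sorts the distinct labels and maps them to 0..n-1,
# so here 'B' becomes 0 and 'M' becomes 1
encoder = LabelEncoder()
target = encoder.fit_transform(labels)

# Bunch is a dict-like container with attribute access to its keys
dataset = Bunch(data=data, target=target, target_names=encoder.classes_)
```

Keeping `encoder.classes_` around makes it possible to map the integer predictions back to the original string labels later.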
Jim O'Brien