
I just followed the code here (with minor modifications for sklearn 0.17). In that example, the data are just lists or numpy arrays. Now I want to prepare a toy training dataset on disk and use datasets.load_files to load it for multilabel classification. However, simply following the load_files convention and then copying the same file into multiple folders doesn't produce a list of lists (i.e., label sets) for dataset.target.

So what is the correct way to prepare a dataset for multilabel classification?


1 Answer


I don't think load_files supports multilabel classes. To be honest, I've never used scikit-learn to load data; I always do my initial data load and preprocessing with pandas. One option in your case would be to store your data as CSV, serializing each label set as a pipe-delimited list.

For example, your file data.csv might be:

recipe_name,classes
stir fried broccoli,chinese|vegetarian
kung po chicken,chinese|meat
sauerkraut salad,vegetarian|polish

And you would load it as follows:

import pandas as pd

# Each row holds a document and its pipe-delimited label set
df = pd.read_csv('data.csv')
X_train = df.recipe_name
# Split 'chinese|vegetarian' into ['chinese', 'vegetarian']
y_train = df.classes.str.split('|')
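
Note that most scikit-learn estimators expect multilabel targets as a binary indicator matrix rather than a list of lists. As a minimal sketch building on the toy data.csv above, you could binarize y_train with MultiLabelBinarizer (available in sklearn 0.17):

from sklearn.preprocessing import MultiLabelBinarizer

# Turn the list of label sets into a binary indicator matrix,
# e.g. ['chinese', 'vegetarian'] -> [1, 0, 1] given the classes
# ('chinese', 'meat', 'vegetarian')
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(y_train)
print(mlb.classes_)  # the column order of Y_train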
  • Thanks @maxymoo, this is a good point. I crawl and store multiple files, so perhaps I will just name each document with its list of labels (no more folder structure) and write a function to parse the file names and read the contents (see the sketch after these comments)... – treslumen May 03 '16 at 01:55
  • If you are crawling, you might want to consider using a database like MongoDB or Postgres; you might be glad for it in the long run rather than having a bunch of files floating around. Also, you can do some of the preprocessing in the database, which can be handy. – maxymoo May 03 '16 at 02:53
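
A minimal sketch of the filename-based approach mentioned in the first comment. The double-underscore separator and the load_labeled_files helper are hypothetical choices, not part of scikit-learn; in practice you would also want a unique suffix in each filename so that two documents with the same label set don't collide:

import os

def load_labeled_files(directory, sep='__'):
    # Hypothetical naming scheme: 'chinese__vegetarian.txt' holds a
    # document whose label set is ['chinese', 'vegetarian'].
    texts, label_sets = [], []
    for fname in os.listdir(directory):
        path = os.path.join(directory, fname)
        if not os.path.isfile(path):
            continue
        stem = os.path.splitext(fname)[0]
        label_sets.append(stem.split(sep))
        with open(path) as f:
            texts.append(f.read())
    return texts, label_sets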