
I'm loading a CSV with NumPy as the dataset for a decision tree model in Python. The extract below places columns 0-7 in X and the last column, as the target, in Y.

import numpy as np
from sklearn import tree

#load and set data
data = np.loadtxt("data/tmp.csv", delimiter=",")
X = data[:,0:8] #identify columns 0-7 as features
Y = data[:,8] #identify last column as target

#create model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

What I'd like to know is whether the target can be in any column. For example, if it were in the fourth column, would the following code still fit the model correctly, or would it produce errors when it comes to predicting?

#load and set data
data = np.loadtxt("data/tmp.csv", delimiter=",")
X = data[:,0:8] #identify columns as data sets
Y = data[:,3] #identify fourth column as target

#create model
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
user2249567

2 Answers


If you have more than 4 columns, and the one at index 4 is the target while the others are features, here's one way (out of many) to load them:

# load data
import numpy as np
data = np.loadtxt("data/tmp.csv", delimiter=",")

X = np.hstack([data[:, :4], data[:, 5:]]) # features: every column except index 4
Y = data[:, 4] # target

# process X & Y

(with belated thanks to @omerbp for reminding me hstack takes a tuple/list, not naked arguments!)
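
A possibly simpler equivalent (my addition, not from the original answer): NumPy's `np.delete` drops a column without manual stacking. The array below is a toy stand-in for the loaded CSV:

```python
import numpy as np

data = np.arange(18.0).reshape(3, 6)  # toy stand-in for the loaded CSV
Y = data[:, 4]                        # target: column index 4
X = np.delete(data, 4, axis=1)        # features: every column except index 4
print(X.shape)                        # (3, 5)
```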

Ahmed Fasih
  • Hi, I believe you meant `X = np.hstack([data[:, :3], data[:, 5:]])`, not `X = np.hstack(data[:, :3], data[:, 5:])`... good answer though, +1. I added in my answer a time comparison between the method you suggested and [this one](http://stackoverflow.com/questions/4857927/swapping-columns-in-a-numpy-array) – AvidLearner Jun 15 '15 at 18:43
  • Thanks @omerbp, unforgivably I was too busy to be polite when I incorporated your fix. Edited to give credit where credit is due. – Ahmed Fasih Jun 18 '15 at 13:32
  • haha you went far with it :) upvote is more then enough in this case :) (as I said I upvoted yours, great answer) – AvidLearner Jun 18 '15 at 13:36

First of all, as suggested by @mescalinum in a comment on the question, think of this situation:

.... 4th_feature ...    label
....      1      ...      1
....      0      ...      0
....      1      ...      1
............................

In this example, the classifier (any classifier, not DecisionTreeClassifier in particular) will learn that the 4th feature best predicts the label, since the 4th feature is the label. Unfortunately, this kind of leakage happens a lot, usually by accident.
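
A quick way to see this leakage in code (a sketch; the column layout and random data here are made up, not from the question's CSV):

```python
import numpy as np
from sklearn import tree

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=100).astype(float)
noise = rng.random((100, 4))
# build a feature matrix whose column 2 is (accidentally) the label itself
X = np.column_stack([noise[:, 0], noise[:, 1], labels, noise[:, 2], noise[:, 3]])

clf = tree.DecisionTreeClassifier().fit(X, labels)
print(clf.score(X, labels))   # 1.0: "perfect" accuracy, but nothing real was learned
print(clf.tree_.feature[0])   # 2: the root split is on the leaked label column
```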

Secondly, if you want the 4th column to be the label, you can just swap the columns:

arr[:,[frm, to]] = arr[:,[to, frm]]
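
For example (a toy sketch; the array shape and values are made up), swapping the target into the last column and then slicing as usual:

```python
import numpy as np

data = np.arange(15.0).reshape(3, 5)  # toy stand-in: 5 columns, target in column 3
target = data[:, 3].copy()            # remember the target values
data[:, [3, 4]] = data[:, [4, 3]]     # swap columns 3 and 4 in place
X = data[:, :-1]                      # features
Y = data[:, -1]                       # target, now in the last column
print(np.array_equal(Y, target))      # True
```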

@Ahmed Fasih's answer can also do the trick; however, it is around 10 times slower:

import timeit


setup_code = """
import numpy as np
i, j = 400000, 200
my_array = np.arange(i*j).reshape(i, j)
"""

swap_cols = """
def swap_cols(arr, frm, to):
    arr[:,[frm, to]] = arr[:,[to, frm]]
"""

stack = "np.hstack([my_array[:, :3], my_array[:, 5:]])"
swap = "swap_cols(my_array, 4, 8)"

print("hstack - total time:", min(timeit.repeat(stmt=stack, setup=setup_code, number=20, repeat=3)))
#hstack - total time: 3.29988478635
print("swap - total time:", min(timeit.repeat(stmt=swap, setup=setup_code + swap_cols, number=20, repeat=3)))
#swap - total time: 0.372791106328
AvidLearner