
I am trying to use the scikit-learn Python library to classify a bunch of URLs for the presence of certain keywords matching a user profile. A user has a name, email address ... and a URL assigned to them. I have created a text file with the result of each profile data match on each link, so it is in the format:

Name  Email  Address
  0     1      0      =>Relevant
  1     1      0      =>Relevant
  0     1      1      =>Relevant
  0     0      0      =>Not Relevant

Here a 0 or 1 signifies whether the attribute was found on the page (each row is a webpage). How do I give this data to scikit-learn so it can use it to run a classifier? The examples I have seen either load their data from a predefined scikit-learn dataset such as digits or iris, or generate it already in the format the library expects. I just don't know how to get from the data format I have to what the library needs.

The above is a toy example, and I have many more features than 3.

  • Just a note: if your data has just 3 features (Name, Email, Address), each 0 or 1, then ANY machine learning is a bad idea. You have just 8 (!) possibilities: 000, 100, 010, 001, 110, 101, 011, 111. You can develop perfect (in the sense of your data representation) rules by hand – lejlot Feb 01 '14 at 09:14

1 Answer


The data needed is a numpy array (in this case a "matrix") with the shape (n_samples, n_features).

A simple way to read a CSV file into the right format is to use numpy.genfromtxt. Also see this thread.

Let the contents of a csv file (say file.csv in the current working directory) be:

a,b,c,target
1,1,1,0
1,0,1,0
1,1,0,1
0,0,1,1
0,1,1,0

To load it we do

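import numpy as np
data = np.genfromtxt('file.csv', delimiter=',', skip_header=1)

delimiter=',' tells genfromtxt to split each line on commas (by default it splits on whitespace), and skip_header=1 skips the first line so that the header row (the a,b,c,target line) is not read as data. Refer to numpy's documentation for more details.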
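As a self-contained sketch of the loading step, the snippet below reads the same CSV contents from an in-memory string instead of file.csv, so it runs without any file on disk. It assumes the columns are comma-separated, so the delimiter is passed explicitly; the csv_text variable is only a stand-in for the real file.

```python
import io
import numpy as np

# Stand-in for file.csv (an assumption so the sketch runs on its own).
csv_text = """a,b,c,target
1,1,1,0
1,0,1,0
1,1,0,1
0,0,1,1
0,1,1,0
"""

# delimiter=',' splits on commas; skip_header=1 skips the header line.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

print(data.shape)  # (5, 4): five samples, three features plus the target
print(data.dtype)  # float64, the default dtype genfromtxt produces
```

Passing the path 'file.csv' in place of the StringIO object should give the same array for a file with these contents.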

Once you load the data, you need to do some pre-processing based on your input data format. The preprocessing could be something like splitting the input and the targets (classification) or splitting the whole dataset into a training and validation set (for cross-validation).

To split the input (feature matrix) from the output (target vector) we do

features = data[:, :3]
targets = data[:, 3]   # The last column is identified as the target
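The other kind of split mentioned above, into a training and a validation set, could be sketched as follows. This assumes a recent scikit-learn release, where train_test_split lives in sklearn.model_selection; the test_size and random_state values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same values as the CSV example: three feature columns, then the target.
data = np.array([[1, 1, 1, 0],
                 [1, 0, 1, 0],
                 [1, 1, 0, 1],
                 [0, 0, 1, 1],
                 [0, 1, 1, 0]])

features = data[:, :3]
targets = data[:, 3]

# Hold out 40% of the rows (2 of 5) for validation; random_state fixes the shuffle.
X_train, X_val, y_train, y_val = train_test_split(
    features, targets, test_size=0.4, random_state=0)

print(X_train.shape, X_val.shape)  # (3, 3) (2, 3)
```

The model would then be fitted on X_train and y_train only, with X_val and y_val kept aside to check how well it generalizes.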

For the toy data in the question (with Relevant encoded as 1 and Not Relevant as 0), the arrays you will use would look like:

features = array([[ 0, 1, 0],
              [ 1, 1, 0],
              [ 0, 1, 1],
              [ 0, 0, 0]])  # shape = ( 4, 3)

targets = array([ 1, 1, 1, 0])  # shape = ( 4, )
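If the data is still in the text layout shown in the question rather than a CSV, one possible way to turn it into these arrays is sketched below. The exact file layout is an assumption: a header line, whitespace-separated 0/1 values, then "=>" followed by the label. The raw string stands in for the contents of the text file.

```python
import numpy as np

# Stand-in for the contents of the text file from the question.
raw = """Name  Email  Address
  0     1      0      =>Relevant
  1     1      0      =>Relevant
  0     1      1      =>Relevant
  0     0      0      =>Not Relevant
"""

rows, labels = [], []
for line in raw.strip().splitlines()[1:]:      # [1:] skips the header line
    values, label = line.split("=>")
    rows.append([int(v) for v in values.split()])
    # Labels starting with "Not" become 0, everything else becomes 1.
    labels.append(0 if label.strip().lower().startswith("not") else 1)

features = np.array(rows)    # shape (4, 3)
targets = np.array(labels)   # shape (4,)

print(features)
print(targets)  # [1 1 1 0]
```

With many more features the loop stays the same, since each row is split on whitespace regardless of how many columns it has.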

Now these arrays are passed to the estimator object's fit function. If you are using the popular SVM classifier, then:

>>> from sklearn.svm import LinearSVC
>>> linear_svc_model = LinearSVC()
>>> linear_svc_model.fit(X=features, y=targets) 
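Putting it together, a minimal sketch of fitting the classifier on the toy arrays and then asking it about pages it has not seen might look like this. The new_pages rows are hypothetical inputs made up for illustration, and with only four training samples the predictions should not be read as meaningful.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data from the question: one row per webpage, one column per attribute.
features = np.array([[0, 1, 0],
                     [1, 1, 0],
                     [0, 1, 1],
                     [0, 0, 0]])
targets = np.array([1, 1, 1, 0])  # 1 = Relevant, 0 = Not Relevant

model = LinearSVC()
model.fit(features, targets)

# Hypothetical unseen pages, in the same (n_samples, n_features) layout.
new_pages = np.array([[1, 0, 1],
                      [0, 0, 0]])
predictions = model.predict(new_pages)

print(predictions)  # one 0/1 label per row of new_pages
```

Other scikit-learn classifiers expose the same fit and predict methods, so swapping LinearSVC for a different estimator leaves the rest of the code unchanged.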
  • Can you add a short example of exactly what this array is supposed to look like? – John Baum Feb 03 '14 at 18:36
  • Yes, that clears it up a lot. Out of curiosity, would I use a Bernoulli form of a naive Bayes classifier for this example, since my features are all binary? – John Baum Feb 04 '14 at 00:51
  • Did the answer satisfy the bounty requirements, or should I update it? @JohnBaum – SlimJim Feb 10 '14 at 22:05