
Given a classification problem, the training data looks as follows:

input - output
--------------
A       100
B       150
C       170
..............

where A, B, C are large data sets, each with 6 variables and around 5000 rows.

The problem: how do I represent these data-set inputs so that I can train a classification algorithm and then apply it to other, similar data sets?

I tried attaching the training label to every row of its data set and training on the rows individually. For a new data set, each row is classified on its own, and I take the mean (average) of the per-row predictions as the prediction for the entire data set. But I did not get very good results with naive Bayes.
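
Roughly, this is what I tried; a minimal sketch, with random stand-in data since the real data sets aren't shown here:

    import numpy as np
    import pandas as pd
    from sklearn.naive_bayes import GaussianNB

    cols = ['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6']

    def fake_event(seed):
        # stand-in for one training data set (A, B, C, ...): ~5000 rows, 6 variables
        rng = np.random.RandomState(seed)
        return pd.DataFrame(rng.rand(5000, 6) * 100, columns=cols)

    # each training example is an entire data set paired with a single label
    events = [(fake_event(0), 100), (fake_event(1), 150), (fake_event(2), 170)]

    # attach the data-set-level label to every row, then stack all rows together
    train = pd.concat([df.assign(OUT=label) for df, label in events])

    clf = GaussianNB().fit(train[cols], train['OUT'])

    # for a new data set, classify each row separately, then average the
    # per-row predictions into one value for the whole data set
    new_event = fake_event(3)
    prediction = clf.predict(new_event[cols]).mean()
    print(prediction)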

Should I keep pursuing this method with other classifiers? What other options could I look into?

Edit

Sample data from two events:

    OUT Var1    Var2    Var3    Var4    Var5    Var6
0   93  209.2   49.4    5451.0  254.0   206.0   37.7
1       344.9   217.6   14590.5 191.7   175.5   106.8
2       663.3   97.2    17069.2 144.4   2.8     59.9
3       147.4   137.7   12367.4 194.1   237.7   116.2
4       231.8   162.2   11938.4 71.3    149.1   116.3

    OUT Var1    Var2    Var3    Var4    Var5    Var6
964 100 44.5    139.7   10702.5 151.4   36.0    17.9
966     59.8    148.9   3184.9  103.0   96.5    12.8
967     189.7   194.4   7569.6  49.9    82.6    55.2
969     158.5   88.2    2932.4  159.8   232.8   125.2
971     226.4   155.2   3156.3  85.0    4010.5  69.9

For a similar data set, I need to predict the OUT value. I have many samples like these.

Is it correct to apply the same OUT value to all rows of an event?

  • What? I would recommend reading something like this: [An introduction to machine learning in scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html). It seems you're missing some fundamental tenets of machine learning classification. But, yes, you *could* try other classifiers. – blacksite Dec 08 '16 at 13:27
  • It's not about the classifier; the problem is how to handle the data. The inputs I'm dealing with are not strings or numbers, they are entire data sets. – ashcrok Dec 08 '16 at 13:35
  • This would also be useful: [Classifier comparison](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) – blacksite Dec 08 '16 at 13:44

1 Answer


Look into one-hot encoding. Given an input variable x with three distinct classes (such a variable is often called a "factor"), you create one binary-encoded column per unique value of x, so the machine learning algorithm understands what it's dealing with (i.e. it can learn how a given class 'A' corresponds to various output values).
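
For instance, on a toy pandas series (made-up values), pd.get_dummies produces exactly those binary columns:

    import pandas as pd

    # a categorical input with three distinct classes
    x = pd.Series(['A', 'B', 'C', 'A'], name='input')

    # one binary indicator column per unique value of the input:
    # the 'A' rows get a truthy value in column A, and so on
    print(pd.get_dummies(x))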

Edit:

Another important note: what you're looking to do (i.e. map some features onto a continuous output variable) is not called "classification"; it is called "regression." In your case, a classification problem would be if you flipped your data and tried to predict the inputs (i.e. A, B, or C) given your outputs. I'll show below how to do regression in your case. If you try classification the way you describe, you'll end up with len(set(df['outputs'])) different classes to predict. Classification is not the approach to take in your scenario.

A quick-and-dirty example follows:

import random
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import pandas as pd
import numpy as np

inputs = ['A', 'B', 'C']

# create some random data similar to yours
df = pd.DataFrame({'input': [random.choice(inputs) for _ in range(5000)],
                   'output': [int(abs(n) * 100) for n in np.random.randn(5000)]})

# one-hot-encode the categorical variable 'input' for use in the model
dummies = pd.get_dummies(df['input'])

# merge the one-hot-encoded dummies back with the original data
df = df.join(dummies)

# our feature matrix (input values as dummies)
X = df[['A', 'B', 'C']]

# our outcome variable
y = df['output']

# split the data set into train and test sets so we can gauge how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.80, random_state=100)

# our model instance
model = LinearRegression()

# fit the regression model
model.fit(X_train, y_train)

# use the trained model from above to predict the output for "new" data
predicted = model.predict(X_test)

# let's see how well the model performed
print(r2_score(y_test, predicted))

Edit 2:

To answer your edited question: as long as the rows within each sample come from the same class, then yes, you should apply the same value to each row in the sample. For your first "event" above, if all of the rows (at indices 0 through 4) belong to the same class/group, then you should apply 93 to all of them.
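
In pandas that's a single broadcast assignment; a sketch using the first few columns of your first event:

    import pandas as pd

    # the rows of one event (only three of the six variables shown, taken from the question)
    event = pd.DataFrame({'Var1': [209.2, 344.9, 663.3, 147.4, 231.8],
                          'Var2': [49.4, 217.6, 97.2, 137.7, 162.2],
                          'Var3': [5451.0, 14590.5, 17069.2, 12367.4, 11938.4]})

    # broadcast the single event-level value to every row before training
    event['OUT'] = 93
    print(event)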

  • Yes, I understand the difference between classification and regression. I do not have problems applying it to a data set, but in this case I have an issue with the data set itself. I edited the post. Thank you anyway for explaining, I appreciate it :). – ashcrok Dec 08 '16 at 14:53
  • See the second edit above. I hope that answers your question. – blacksite Dec 08 '16 at 15:06