Questions tagged [classification]

In machine learning and statistics, classification is the problem of identifying which of a set of categories a new observation belongs to, on the basis of a training set of data containing observations whose category membership (label) is known.

In machine learning and statistics, classification refers to the problem of predicting category memberships based on a set of pre-labeled examples. It is thus a type of supervised learning.

Some of the most important classification algorithms are support vector machines, logistic regression, naive Bayes, random forest and artificial neural networks.

When we wish to associate inputs with continuous values in a supervised framework, the problem is instead known as regression. The unsupervised counterpart to classification is known as clustering (or cluster analysis), and involves grouping data into categories based on some measure of inherent similarity.
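
As a rough illustration of the supervised setting described above, the sketch below learns from a handful of labeled points and then predicts the category of new observations. The toy data, the label names, and the helper functions `fit_centroids` and `predict` are assumptions made for illustration. Nearest-centroid is simply one of the smallest classifiers to write down and is not one of the algorithms listed above.

```python
import math
from collections import defaultdict


def fit_centroids(points, labels):
    """Compute the mean point (centroid) of each labeled class."""
    grouped = defaultdict(list)
    for point, label in zip(points, labels):
        grouped[label].append(point)
    return {
        label: tuple(sum(coords) / len(members) for coords in zip(*members))
        for label, members in grouped.items()
    }


def predict(centroids, point):
    """Return the label whose centroid lies closest to the point."""
    return min(centroids, key=lambda label: math.dist(centroids[label], point))


# Hypothetical training set: observations with known category membership.
train_points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
train_labels = ["small", "small", "large", "large"]

centroids = fit_centroids(train_points, train_labels)
print(predict(centroids, (2.0, 1.5)))  # new observation near the "small" group
print(predict(centroids, (7.0, 9.0)))  # new observation near the "large" group
```

Dropping the labels and grouping the same points by similarity alone would turn this into a clustering problem instead.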

7859 questions

575 votes, 5 answers

A simple explanation of Naive Bayes Classification

I am finding it hard to understand the process of Naive Bayes, and I was wondering if someone could explain it with a simple step by step process in English. I understand it takes comparisons by times occurred as a probability, but I have no idea…
Aeonitis

395 votes, 6 answers

What are advantages of Artificial Neural Networks over Support Vector Machines?

ANN (Artificial Neural Networks) and SVM (Support Vector Machines) are two popular strategies for supervised machine learning and classification. It's not often clear which method is better for a particular project, and I'm certain the answer is…
Channel72

254 votes, 6 answers

Save classifier to disk in scikit-learn

How do I save a trained Naive Bayes classifier to disk and use it to predict data? I have the following sample program from the scikit-learn website: from sklearn import datasets iris = datasets.load_iris() from sklearn.naive_bayes import…
garak

203 votes, 20 answers

Difference between classification and clustering in data mining?

Can someone explain what the difference is between classification and clustering in data mining? If you can, please give examples of both to understand the main idea.

136 votes, 6 answers

Why is the F-Measure a harmonic mean and not an arithmetic mean of the Precision and Recall measures?

When we calculate the F-Measure considering both Precision and Recall, we take the harmonic mean of the two measures instead of a simple arithmetic mean. What is the intuitive reason behind taking the harmonic mean and not a simple average?
London guy
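
One way to see the difference the question asks about is to compute both means for a lopsided case. The sketch below is an illustration under assumed numbers: the precision and recall values describe a hypothetical classifier that labels nearly everything as positive, and the function names are made up for this example.

```python
def arithmetic_mean(precision, recall):
    """Simple average of precision and recall."""
    return (precision + recall) / 2


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifier: perfect recall, but almost every prediction is wrong.
precision, recall = 0.01, 1.0

print(arithmetic_mean(precision, recall))  # roughly 0.5, looks acceptable
print(f1_score(precision, recall))         # roughly 0.02, dominated by the weak measure
```

The harmonic mean stays close to the smaller of the two values, so a classifier cannot score well by maximizing only one of precision or recall.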

123 votes, 9 answers

How to fix RuntimeError "Expected object of scalar type Float but got scalar type Double for argument"?

I'm trying to train a classifier via PyTorch. However, I am experiencing problems with training when I feed the model with training data. I get this error on y_pred = model(X_trainTensor): RuntimeError: Expected object of scalar type Float but got…
Shawn Zhang

119 votes, 20 answers

Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative

My problem: I have a dataset which is a large JSON file. I read it and store it in the trainList variable. Next, I pre-process it - in order to be able to work with it. Once I have done that I start the classification: I use the kfold cross…
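
For reference, the four quantities named in the title can be counted directly from true and predicted labels in the binary case. The following is a minimal pure-Python sketch, not the asker's code: the label lists and the function name `confusion_counts` are assumptions for illustration. In scikit-learn itself, `sklearn.metrics.confusion_matrix` reports the same counts in matrix form.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels for eight samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)
```

With k-fold cross-validation, one option is to call such a function once per fold and sum the counts across folds.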

113 votes, 6 answers

scikit-learn .predict() default threshold

I'm working on a classification problem with unbalanced classes (5% 1's). I want to predict the class, not the probability. In a binary classification problem, is scikit's classifier.predict() using 0.5 by default? If it doesn't, what's the default…
ADJ

91 votes, 10 answers

Higher validation accuracy than training accuracy using TensorFlow and Keras

I'm trying to use deep learning to predict income from 15 self reported attributes from a dating site. We're getting rather odd results, where our validation data is getting better accuracy and lower loss, than our training data. And this is…
Jasper

88 votes, 5 answers

Scikit-learn train_test_split with indices

How do I get the original indices of the data when using train_test_split()? What I have is the following from sklearn.cross_validation import train_test_split import numpy as np data = np.reshape(np.randn(20),(10,2)) # 10 training examples labels =…
CentAu
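
The underlying idea of keeping track of original positions through a random split can be sketched without any library. This is an illustration only and not scikit-learn's implementation: the function name `split_with_indices`, its parameters, and the toy data are all assumptions.

```python
import random


def split_with_indices(data, labels, test_fraction=0.3, seed=0):
    """Shuffle the row indices, then split data and labels by those indices."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    n_test = max(1, round(len(data) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    train = ([data[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([data[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test, train_idx, test_idx


# Hypothetical data: 10 examples with 2 features each, and binary labels.
data = [[i, i * 10] for i in range(10)]
labels = [i % 2 for i in range(10)]

(train_x, train_y), (test_x, test_y), train_idx, test_idx = split_with_indices(data, labels)
print(test_idx)  # original row positions of the test examples
```

Because the indices are shuffled once and reused for every array, each split row can always be traced back to its position in the original data.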

85 votes, 13 answers

How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

See below for 50 tweets about "apple." I have hand labeled the positive matches about Apple Inc. They are marked as 1 below. Here are a couple of lines: 1|“@chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”.…
SAL

83 votes, 5 answers

What is the relation between the number of Support Vectors and training data and classifiers performance?

I am using LibSVM to classify some documents. The documents seem to be a bit difficult to classify, as the final results show. However, I have noticed something while training my models, and that is: If my training set is for example 1000 around 800…
Hossein

82 votes, 5 answers

Use scikit-learn to classify into multiple categories

I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of all the algorithms I tried just returns one match. For example I have a piece of text: "Theaters in…
CodeMonkeyB

78 votes, 6 answers

Mixing categorical and continuous data in Naive Bayes classifier using scikit-learn

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier but my problem is that I have a mix of categorical data (ex: "Registered…

75 votes, 11 answers

FailedPreconditionError: Attempting to use uninitialized in Tensorflow

I am working through the TensorFlow tutorial, which uses a "weird" format to upload the data. I would like to use the NumPy or pandas format for the data, so that I can compare it with scikit-learn results. I get the digit recognition data from…
user3654387