Questions tagged [text-classification]

Simply put, text classification is about assigning a piece of text to one or more of a set of (mostly predefined) categories. It is an important problem that occurs in many real-world applications. For example, an automated call centre may want to sort incoming complaints automatically into the most appropriate bucket of problems.

Text classification is a sub-problem of the more general problem of classification. Here, the input is a piece of text (rather than images, sounds, videos, etc.). The output could be:

binary (binary classification)
one category out of k possible categories (multi-class)
a set of categories out of k possible categories (multi-label).
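
The three output types can be illustrated with a minimal sketch in plain Python. The category names and the `to_indicator` helper are hypothetical, chosen only to mirror the call-centre example; they do not come from any particular library.

```python
# Hypothetical categories for a complaint-routing example (illustrative only).
CATEGORIES = ["billing", "network", "hardware"]

# Binary: a single yes/no decision (e.g. "is this text a complaint?").
binary_output = True

# Multi-class: exactly one category out of k.
multiclass_output = "network"

# Multi-label: any subset of the k categories.
multilabel_output = {"billing", "network"}


def to_indicator(labels, categories=CATEGORIES):
    """Encode a collection of labels as a 0/1 vector of length k."""
    return [1 if category in labels else 0 for category in categories]


# A multi-class target has exactly one 1; a multi-label target may have several.
multiclass_vector = to_indicator({multiclass_output})
multilabel_vector = to_indicator(multilabel_output)
```

Seen this way, multi-class is the special case of multi-label in which exactly one entry of the indicator vector is set.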

In text classification, the features extracted from the text are usually sparse (rather than dense, as in image classification).
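
To see why text features tend to be sparse, here is a rough bag-of-words sketch in plain Python. The toy documents and the `bag_of_words` helper are assumptions made for illustration; real pipelines would typically rely on a library vectorizer instead.

```python
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "my internet connection keeps dropping",
    "i was charged twice on my bill",
    "the router will not turn on",
]

# Build a vocabulary over the whole corpus: one feature column per distinct word.
vocabulary = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}


def bag_of_words(doc):
    """Return a sparse {column index: count} mapping for one document."""
    counts = Counter(doc.split())
    return {index[word]: count for word, count in counts.items() if word in index}


sparse_rows = [bag_of_words(doc) for doc in docs]

# Fraction of non-zero entries in the full document-term matrix.
nonzero = sum(len(row) for row in sparse_rows)
density = nonzero / (len(docs) * len(vocabulary))
```

Each document only touches the columns for the words it contains, so as the vocabulary grows the fraction of non-zero entries shrinks, which is why sparse storage is the usual choice.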

1694 questions

156

votes

3 answers

How can I plot a confusion matrix?

I am using scikit-learn to classify text documents (22000) into 100 classes. I use scikit-learn's confusion matrix method for computing the confusion matrix. model1 = LogisticRegression() model1 = model1.fit(matrix, labels) pred =…

asked Feb 23 '16 at 08:06

minks

votes

9 answers

How to use Bert for long text classification?

We know that BERT has a maximum length limit of 512 tokens. So if an article is much longer than that, such as 10,000 tokens, how can BERT be used?

nlp text-classification bert-language-model

asked Oct 31 '19 at 03:34

user1337896

votes

4 answers

ROC for multiclass classification

I'm doing different text classification experiments. Now I need to calculate the AUC-ROC for each task. For the binary classifications, I already made it work with this code: scaler = StandardScaler(with_mean=False) enc = LabelEncoder() y =…

python scikit-learn text-classification roc multiclass-classification

asked Jul 26 '17 at 16:16

Bambi

votes

2 answers

Multilabel Text Classification using TensorFlow

The text data is organized as a vector with 20,000 elements, like [2, 1, 0, 0, 5, ...., 0]. The i-th element indicates the frequency of the i-th word in a text. The ground truth label data is also represented as a vector with 4,000 elements, like [0, 0,…

python tensorflow text-classification multilabel-classification

asked Feb 15 '16 at 01:10

Benben

votes

3 answers

Information Gain calculation with Scikit-learn

I am using Scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is…

python machine-learning scikit-learn text-classification feature-selection

asked Oct 15 '17 at 07:17

Roman Purgstaller

votes

3 answers

adding words to stop_words list in TfidfVectorizer in sklearn

I want to add a few more words to stop_words in TfidfVectorizer. I followed the solution in "Adding words to scikit-learn's CountVectorizer's stop list". My stop word list now contains both 'english' stop words and the stop words I specified. But…

python scikit-learn classification stop-words text-classification

asked Nov 09 '14 at 07:24

ac11

votes

2 answers

How to add another feature (length of text) to current bag of words classification? Scikit-learn

I am using bag of words to classify text. It's working well but I am wondering how to add a feature which is not a word. Here is my sample code. import numpy as np from sklearn.pipeline import Pipeline from sklearn.feature_extraction.text import…

python machine-learning scikit-learn classification text-classification

asked Aug 24 '16 at 10:42

aaravam

votes

4 answers

CountVectorizer: AttributeError: 'numpy.ndarray' object has no attribute 'lower'

I have a one-dimensional array with large strings in each of the elements. I am trying to use a CountVectorizer to convert text data into numerical vectors. However, I am getting an error saying: AttributeError: 'numpy.ndarray' object has no…

python numpy scikit-learn text-classification

asked Oct 14 '14 at 17:48

ashu

votes

3 answers

Naive Bayes: Imbalanced Test Dataset

I am using scikit-learn Multinomial Naive Bayes classifier for binary text classification (classifier tells me whether the document belongs to the category X or not). I use a balanced dataset to train my model and a balanced test set to test it and…

python machine-learning classification scikit-learn text-classification

asked Jun 23 '14 at 13:25

Erol

votes

2 answers

UserWarning: Label not :NUMBER: is present in all training examples

I am doing multilabel classification, where I try to predict correct labels for each document and here is my code: mlb = MultiLabelBinarizer() X = dataframe['body'].values y = mlb.fit_transform(dataframe['tag'].values) classifier = Pipeline([ …

python scikit-learn classification text-classification multilabel-classification

asked Mar 15 '17 at 21:48

PeterB

votes

2 answers

sklearn classifier get ValueError: bad input shape

I have a CSV whose structure is CAT1,CAT2,TITLE,URL,CONTENT; CAT1, CAT2, TITLE and CONTENT are in Chinese. I want to train LinearSVC or MultinomialNB with X(TITLE) and feature(CAT1,CAT2), but both give this error. Below is my code: PS: I write below code through…

python scikit-learn classification text-classification

asked Jul 09 '15 at 00:59

Mithril

votes

2 answers

How to use vector representation of words (as obtained from Word2Vec,etc) as features for a classifier?

I am familiar with using BOW features for text classification, wherein we first find the size of the vocabulary for the corpus which becomes the size of our feature vector. For each sentence/document, and for all its constituent words, we then put…

text vector nlp text-classification word2vec

asked Oct 26 '14 at 03:45

Satarupa Guha

votes

1 answer

What are differences between AutoModelForSequenceClassification vs AutoModel

We can create a model with the AutoModel (TFAutoModel) function: from transformers import AutoModel model = AutoModel.from_pretrained('distilbert-base-uncased') On the other hand, a model is created by…

nlp text-classification huggingface-transformers

asked Nov 10 '21 at 03:33

Tan Phan

votes

6 answers

With BERT Text Classification, ValueError: too many dimensions 'str' error occurring

I am trying to make a sentiment classifier for texts with a BERT model but am getting ValueError: too many dimensions 'str'. This is the DataFrame of values for the training data, so they are train_labels 0 notr 1 notr 2 notr 3 negative 4 notr ...…

python tensor text-classification bert-language-model mlp

asked Jan 20 '21 at 07:12

KazımTibetSar

votes

2 answers

unable to use FeatureUnion in scikit-learn due to different dimensions

I'm trying to use FeatureUnion to extract different features from a data structure, but it fails due to different dimensions: ValueError: blocks[0,:] has incompatible row dimensions Implementation My FeatureUnion is built the following way: …

python scikit-learn classification text-classification

asked Sep 11 '14 at 19:22

jwacalex
