Questions tagged [text-classification]

Simply put, text classification means assigning a piece of text to one or more of a set of (mostly predefined) categories. It is an important problem that arises in many real-world applications. For example, an automated call centre may want to sort incoming complaints automatically into the most appropriate bucket of problems.

Text classification is a sub-problem of the more general problem of classification. Here, the input is a piece of text (rather than images, sounds, videos, etc.). The output could be:

  • binary (binary classification)
  • one category out of k possible categories (multi-class)
  • a set of categories out of k possible categories (multi-label).
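
The three output types above can be sketched as target encodings. This is only an illustration in plain Python: the category names and the helper functions `one_hot` and `multi_hot` are hypothetical and not taken from any library.

```python
# Hypothetical category names, chosen only for illustration.
CATEGORIES = ["billing", "delivery", "technical"]


def one_hot(category):
    """Multi-class target: exactly one of the k categories is set."""
    return [1 if c == category else 0 for c in CATEGORIES]


def multi_hot(categories):
    """Multi-label target: any subset of the k categories may be set."""
    chosen = set(categories)
    return [1 if c in chosen else 0 for c in CATEGORIES]


# Binary: a single yes/no decision, e.g. "is this text a complaint?"
binary_target = 1

# Multi-class: one category out of k.
multiclass_target = one_hot("delivery")

# Multi-label: a set of categories out of k.
multilabel_target = multi_hot(["billing", "technical"])

print(binary_target, multiclass_target, multilabel_target)
```

The multi-class vector always sums to 1, while the multi-label vector may contain any number of ones, including none.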

In text classification, the features extracted from the text are usually sparse (rather than dense, as in image classification).
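
To see why text features tend to be sparse, consider a minimal bag-of-words sketch in plain Python. The toy corpus and the helper `count_vector` are invented for illustration; a real project would more likely use a library vectorizer that stores only the non-zero entries.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "my parcel arrived late",
    "the invoice amount is wrong",
    "the app crashes when I log in",
]

# Naive whitespace tokenization; real tokenizers are more careful.
tokenized = [d.lower().split() for d in docs]

# The vocabulary is every distinct token across the whole corpus.
vocab = sorted({tok for doc in tokenized for tok in doc})
index = {tok: i for i, tok in enumerate(vocab)}


def count_vector(tokens):
    """Return a word-count vector with one slot per vocabulary entry."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokens).items():
        vec[index[tok]] = n
    return vec


vectors = [count_vector(t) for t in tokenized]

# Each document uses only a few vocabulary words, so most slots stay zero.
zeros = sum(v.count(0) for v in vectors)
sparsity = zeros / (len(vectors) * len(vocab))
print(f"vocabulary size: {len(vocab)}, fraction of zeros: {sparsity:.2f}")
```

Even with three short documents, most entries are zero, and the fraction of zeros grows as the vocabulary grows with the corpus.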

1694 questions
156
votes
3 answers

How can I plot a confusion matrix?

I am using scikit-learn to classify text documents (22,000) into 100 classes. I use scikit-learn's confusion matrix method to compute the confusion matrix. model1 = LogisticRegression() model1 = model1.fit(matrix, labels) pred =…
minks
  • 2,859
  • 4
  • 21
  • 29
80
votes
9 answers

How to use BERT for long text classification?

We know that BERT has a maximum input length of 512 tokens. If an article is much longer than that, say 10,000 tokens, how can BERT be used?
user1337896
  • 1,081
  • 1
  • 10
  • 15
43
votes
4 answers

ROC for multiclass classification

I'm doing different text classification experiments. Now I need to calculate the AUC-ROC for each task. For the binary classifications, I already made it work with this code: scaler = StandardScaler(with_mean=False) enc = LabelEncoder() y =…
35
votes
2 answers

Multilabel Text Classification using TensorFlow

The text data is organized as a vector with 20,000 elements, like [2, 1, 0, 0, 5, ...., 0]. The i-th element indicates the frequency of the i-th word in a text. The ground truth label data is also represented as a vector with 4,000 elements, like [0, 0,…
Benben
  • 1,355
  • 5
  • 18
  • 31
32
votes
3 answers

Information Gain calculation with Scikit-learn

I am using Scikit-learn for text classification. I want to calculate the Information Gain for each attribute with respect to a class in a (sparse) document-term matrix. The Information Gain is defined as H(Class) - H(Class | Attribute), where H is…
28
votes
3 answers

adding words to stop_words list in TfidfVectorizer in sklearn

I want to add a few more words to stop_words in TfidfVectorizer. I followed the solution in "Adding words to scikit-learn's CountVectorizer's stop list". My stop word list now contains both 'english' stop words and the stop words I specified. But…
ac11
  • 927
  • 2
  • 11
  • 18
21
votes
2 answers

How to add another feature (length of text) to current bag of words classification? Scikit-learn

I am using bag of words to classify text. It's working well but I am wondering how to add a feature which is not a word. Here is my sample code. import numpy as np from sklearn.pipeline import Pipeline from sklearn.feature_extraction.text import…
18
votes
4 answers

CountVectorizer: AttributeError: 'numpy.ndarray' object has no attribute 'lower'

I have a one-dimensional array with large strings in each of the elements. I am trying to use a CountVectorizer to convert text data into numerical vectors. However, I am getting an error saying: AttributeError: 'numpy.ndarray' object has no…
ashu
  • 491
  • 2
  • 5
  • 13
17
votes
3 answers

Naive Bayes: Imbalanced Test Dataset

I am using scikit-learn Multinomial Naive Bayes classifier for binary text classification (classifier tells me whether the document belongs to the category X or not). I use a balanced dataset to train my model and a balanced test set to test it and…
16
votes
2 answers

UserWarning: Label not :NUMBER: is present in all training examples

I am doing multilabel classification, where I try to predict correct labels for each document and here is my code: mlb = MultiLabelBinarizer() X = dataframe['body'].values y = mlb.fit_transform(dataframe['tag'].values) classifier = Pipeline([ …
15
votes
2 answers

sklearn classifier get ValueError: bad input shape

I have a CSV whose structure is CAT1,CAT2,TITLE,URL,CONTENT; CAT1, CAT2, TITLE, and CONTENT are in Chinese. I want to train LinearSVC or MultinomialNB with X (TITLE) and feature (CAT1, CAT2), and both give this error. Below is my code: PS: I write below code through…
Mithril
  • 12,947
  • 18
  • 102
  • 153
15
votes
2 answers

How to use vector representation of words (as obtained from Word2Vec,etc) as features for a classifier?

I am familiar with using BOW features for text classification, wherein we first find the size of the vocabulary for the corpus which becomes the size of our feature vector. For each sentence/document, and for all its constituent words, we then put…
Satarupa Guha
  • 1,267
  • 13
  • 20
14
votes
1 answer

What are the differences between AutoModelForSequenceClassification and AutoModel?

We can create a model with the AutoModel (TFAutoModel) function: from transformers import AutoModel model = AutoModel.from_pretrained('distilbert-base-uncased') On the other hand, a model is created by…
Tan Phan
  • 337
  • 1
  • 4
  • 14
14
votes
6 answers

With BERT Text Classification, ValueError: too many dimensions 'str' error occurring

I am trying to build a sentiment classifier for texts with a BERT model, but I get ValueError: too many dimensions 'str'. This is the DataFrame with the values of the train data, so they are train_labels 0 notr 1 notr 2 notr 3 negative 4 notr ...…
14
votes
2 answers

Unable to use FeatureUnion in scikit-learn due to different dimensions

I'm trying to use FeatureUnion to extract different features from a data structure, but it fails due to different dimensions: ValueError: blocks[0,:] has incompatible row dimensions Implementation My FeatureUnion is built the following way: …
jwacalex
  • 517
  • 1
  • 5
  • 17