Questions tagged [text-mining]

Text Mining is a process of deriving high-quality information from unstructured (textual) information.

Possible applications of text mining include:

  • Comments in survey responses
  • Customer messages, emails, complaints, etc.
  • Investigating competitors by crawling their websites

2607 questions
351 votes, 7 answers

What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. Entropy is defined as: "Entropy is the sum of the probability of each label times the log probability of that same label." How can I apply entropy and maximum entropy in terms of text mining? Can…
TIMEX
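
As a rough illustration of the definition quoted in this question, here is a minimal Python sketch. It assumes the standard Shannon form, in which the sum is negated so the result is non-negative, and it measures entropy in bits. The function names `entropy` and `information_gain` are chosen for this example and are not taken from NLTK.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of labels."""
    total = len(labels)
    counts = Counter(labels)
    # Negated sum of p * log2(p) over the distinct labels.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent, subsets):
    """Entropy of `parent` minus the size-weighted entropy of its `subsets`."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted
```

Under these assumptions, a feature that splits a mixed set of labels into pure subsets has high information gain, while a feature that leaves every subset as mixed as the parent has a gain of zero.
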
87 votes, 1 answer

Inconsistent behaviour with tm_map transformation functions when using multiple cores

Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size, and object size matter?" I have a corpus I am running some transformations on using the tm package. Since the…
Doug Fir
68 votes, 2 answers

How do I search for a pattern within a text file using Python combining regex & string/file operations and store instances of the pattern?

So essentially I'm looking specifically for a 4-digit code within two angle brackets within a text file. I know that I need to open the text file and then parse line by line, but I am not sure of the best way to go about structuring my code after…
Carl Carlson
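
One possible way to structure the search described in this question is sketched below in Python. It assumes the codes look like `<1234>` (exactly four digits directly between `<` and `>`), which is a guess at the format. The names `CODE_PATTERN`, `find_codes`, and `find_codes_in_file` are hypothetical.

```python
import re

# Assumed format: exactly four digits enclosed in angle brackets, e.g. <1234>.
CODE_PATTERN = re.compile(r"<(\d{4})>")


def find_codes(lines):
    """Collect every 4-digit code found in an iterable of text lines."""
    codes = []
    for line in lines:
        # findall returns only the captured digits, without the brackets.
        codes.extend(CODE_PATTERN.findall(line))
    return codes


def find_codes_in_file(path):
    """Read a text file line by line and return all codes it contains."""
    with open(path, encoding="utf-8") as handle:
        return find_codes(handle)
```

Keeping the matching logic in `find_codes` separate from the file handling makes it easy to try the pattern on a list of strings before pointing it at a real file.
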
67 votes, 2 answers

What is CoNLL data format?

I am using an open-source jar (Mate Parser) which outputs in the CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for information extraction; however, I only understand part of the output in the CoNLL data…
42 votes, 7 answers

Detect text language in R

I have a list of tweets and I would like to keep only those that are in English. How can I do this?
zoltanctoth
37 votes, 1 answer

Using Sklearn's TfidfVectorizer transform

I am trying to get the tf-idf vector for a single document using Sklearn's TfidfVectorizer object. I create a vocabulary based on some training documents and use fit_transform to train the TfidfVectorizer. Then, I want to find the tf-idf vectors for…
Sterling
36 votes, 14 answers

R tm package invalid input in 'utf8towcs'

I'm trying to use the tm package in R to perform some text analysis. I tried the following: require(tm) dataSet <- Corpus(DirSource('tmp/')) dataSet <- tm_map(dataSet, tolower) Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger…
maiaini
32 votes, 4 answers

R-Project no applicable method for 'meta' applied to an object of class "character"

I am trying to run this code (Ubuntu 12.04, R 3.1.1) # Load requisite packages library(tm) library(ggplot2) library(lsa) # Place Enron email snippets into a single vector. text <- c( "To Mr. Ken Lay, I’m writing to urge you to donate the millions…
user990137
30 votes, 3 answers

How to find the closest word to a vector using word2vec

I have just started using Word2vec and I was wondering how we can find the closest word to a given vector. Suppose I have this vector, which is the average vector for a set of vectors: array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32) Is…
sel
28 votes, 8 answers

Large scale Machine Learning

I need to run various machine learning techniques on a big dataset (10-100 billion records). The problems are mostly around text mining/information extraction and include various kernel techniques but are not restricted to them (we use some Bayesian…
user387263
28 votes, 1 answer

Data sets for emotion detection in text

I'm implementing a system that could detect the human emotion in text. Are there any manually annotated data sets available for supervised learning and testing? Here are some interesting datasets: https://dataturks.com/projects/trending
ekka
26 votes, 2 answers

Save and reuse TfidfVectorizer in scikit learn

I am using TfidfVectorizer in scikit-learn to create a matrix from text data. Now I need to save this object to reuse it later. I tried to use pickle, but it gave the following…
Joswin K J
25 votes, 5 answers

Are there APIs for text analysis/mining in Java?

I want to know if there is an API to do text analysis in Java. Something that can extract all words in a text, separate words, expressions, etc. Something that can inform if a word found is a number, date, year, name, currency, etc. I'm starting…
Renato Dinhani
25 votes, 3 answers

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weights for each document vector. Then I can use this matrix to train a model for document classification. I would like to classify…
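
A common approach to the situation described here is to compute the IDF weights once from the training collection and reuse them unchanged for each new document. The Python sketch below illustrates that idea under stated assumptions: it uses raw term counts for TF and a smoothed IDF of `log((1 + N) / (1 + df)) + 1`, which is only one of several variants, and it ignores terms that never appeared in training. The names `fit_idf` and `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter


def fit_idf(tokenized_docs):
    """Compute a smoothed IDF weight for every term in the training documents."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    """Weight a new document using IDF values learned from the training set."""
    # Terms unseen during training have no IDF and are dropped.
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}
```

Because the new document does not change the stored IDF values, its vector lives in the same feature space as the training matrix and can be passed to a model trained on that matrix.
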
24 votes, 7 answers

Finding 2 & 3 word Phrases Using R TM Package

I am trying to find code that actually works to find the most frequently used two- and three-word phrases with the R text mining package (maybe there is another package for it that I do not know of). I have been trying to use the tokenizer, but seem to have…
appletree
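
The counting idea behind this question can be sketched independently of the tm package. The example below is in Python rather than R and only illustrates the technique: slide a window of `n` tokens over the text and tally each phrase. It assumes the text is already tokenized, and the function name `ngram_counts` is hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in a token list."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


tokens = "the cat sat on the cat sat".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
```

Calling `bigrams.most_common(10)` or `trigrams.most_common(10)` then lists the most frequent two- and three-word phrases.
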