Questions tagged [text-mining]

Text Mining is a process of deriving high-quality information from unstructured (textual) information.

Possible applications for text mining are:

Comments in survey responses
Customer messages, emails, complaints etc.
Investigating competitors by crawling their web sites

What is "entropy and information gain"?

I am reading this book (NLTK) and it is confusing. Entropy is defined as: "Entropy is the sum of the probability of each label times the log probability of that same label." How can I apply entropy and maximum entropy in terms of text mining? Can…

asked Dec 07 '09 at 11:54

TIMEX
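
The definition quoted in this question can be made concrete with a small sketch. It assumes the usual Shannon form with a leading minus sign and base-2 logarithms, H = -sum(p * log2(p)), and treats information gain as the drop in entropy after splitting a set of labels. The function names and the spam/ham labels are illustrative only and do not come from the book.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    # H = -sum(p * log2(p)) over the labels that actually occur
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted

labels = ["spam", "spam", "ham", "ham"]
# A hypothetical feature that separates the two classes perfectly
split = [["spam", "spam"], ["ham", "ham"]]
print(entropy(labels), information_gain(labels, split))
```

An evenly mixed label set has the highest entropy and a pure one has zero, so a feature that splits the data into pure groups has the largest gain.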

1 answer

Inconsistent behaviour with tm_map transformation functions when using multiple cores

Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size, and object size matter?" I have a corpus I am running some transformations on using the tm package. Since the…

r parallel-processing text-mining tm doparallel

asked Aug 25 '17 at 06:21

Doug Fir

2 answers

How do I search for a pattern within a text file using Python combining regex & string/file operations and store instances of the pattern?

So essentially I'm looking specifically for a 4-digit code between two angle brackets within a text file. I know that I need to open the text file and then parse it line by line, but I am not sure of the best way to go about structuring my code after…

python regex file-io text-mining string-parsing

asked May 07 '12 at 05:53

Carl Carlson
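
One way to approach the task described in this question is sketched below: compile a pattern for four digits between angle brackets, walk the input line by line, and collect the matches in a list. The pattern, the function name `find_codes`, and the sample text are assumptions made for illustration; an in-memory `io.StringIO` stands in for the real file so the snippet runs on its own.

```python
import io
import re

# Exactly four ASCII digits between angle brackets, e.g. "<1234>"
CODE_PATTERN = re.compile(r"<([0-9]{4})>")

def find_codes(lines):
    """Collect every 4-digit code found in an iterable of lines."""
    codes = []
    for line in lines:
        # findall returns the captured digits for each match on the line
        codes.extend(CODE_PATTERN.findall(line))
    return codes

# Stand-in for an opened text file; the content is made up
sample = io.StringIO("order <1234> shipped\nskip <12> and <12345>\n<0007> ok\n")
codes = find_codes(sample)
print(codes)
```

Because an open file object is also an iterable of lines, the same function should work unchanged when passed the result of `open(...)` inside a `with` block.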

2 answers

What is CoNLL data format?

I am using an open-source jar (Mate Parser) which outputs in the CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for Information Extraction; however, I only understand part of the output in the CoNLL data…

nlp text-parsing text-mining information-extraction

asked Dec 11 '14 at 05:45

swapna sourav rout

7 answers

Detect text language in R

I have a list of tweets and I would like to keep only those that are in English. How can I do this?

r text-mining

asked Nov 10 '11 at 11:11

zoltanctoth

1 answer

Using Sklearn's TfidfVectorizer transform

I am trying to get the tf-idf vector for a single document using Sklearn's TfidfVectorizer object. I create a vocabulary based on some training documents and use fit_transform to train the TfidfVectorizer. Then, I want to find the tf-idf vectors for…

python document text-mining tf-idf

asked Nov 21 '13 at 21:18

Sterling

14 answers

R tm package invalid input in 'utf8towcs'

I'm trying to use the tm package in R to perform some text analysis. I tried the following: require(tm) dataSet <- Corpus(DirSource('tmp/')) dataSet <- tm_map(dataSet, tolower) Error in FUN(X[[6L]], ...) : invalid input 'RT @noXforU Erneut riesiger…

r utf-8 iconv text-mining

asked Mar 09 '12 at 16:10

maiaini

4 answers

R-Project no applicable method for 'meta' applied to an object of class "character"

I am trying to run this code (Ubuntu 12.04, R 3.1.1) # Load requisite packages library(tm) library(ggplot2) library(lsa) # Place Enron email snippets into a single vector. text <- c( "To Mr. Ken Lay, I’m writing to urge you to donate the millions…

r text-mining tm

asked Jul 16 '14 at 02:15

user990137

3 answers

How to find the closest word to a vector using word2vec

I have just started using Word2vec and I was wondering how we can find the closest word to a given vector. Suppose I have this vector, which is the average vector for a set of vectors: array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32) Is…

python text-mining data-analysis word2vec

asked Sep 24 '15 at 11:03

sel
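
The idea behind this question, finding the vocabulary word whose vector lies nearest to an arbitrary query vector, can be sketched without any particular library by ranking words on cosine similarity. Everything below is an assumption made for illustration: the three-dimensional `toy_embeddings` are invented rather than taken from a trained model, and the helper names are hypothetical. A word2vec library will normally offer its own, faster lookup.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_word(query, embeddings):
    """Return the word whose vector has the highest cosine similarity to query."""
    return max(embeddings, key=lambda word: cosine(query, embeddings[word]))

# Invented toy vectors, not output from a real model
toy_embeddings = {
    "king": [0.9, 0.1, 0.0],
    "queen": [0.8, 0.3, 0.1],
    "apple": [0.0, 0.2, 0.9],
}

# Average of two word vectors, mirroring the "average vector" in the question
average = [sum(col) / 2 for col in zip(toy_embeddings["king"], toy_embeddings["queen"])]
print(closest_word(average, toy_embeddings))
```

Cosine similarity ignores vector length, which matters for averaged vectors because averaging tends to shrink the norm.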

8 answers

Large scale Machine Learning

I need to run various machine learning techniques on a big dataset (10-100 billion records). The problems are mostly around text mining/information extraction and include various kernel techniques but are not restricted to them (we use some Bayesian…

java c++ machine-learning mapreduce text-mining

asked Jul 08 '10 at 23:58

user387263

1 answer

Data sets for emotion detection in text

I'm implementing a system that could detect the human emotion in text. Are there any manually annotated data sets available for supervised learning and testing? Here are some interesting datasets: https://dataturks.com/projects/trending

database dataset nlp text-mining emotion

asked Jun 08 '15 at 07:34

ekka

2 answers

Save and reuse TfidfVectorizer in scikit-learn

I am using TfidfVectorizer in scikit-learn to create a matrix from text data. Now I need to save this object to reuse it later. I tried to use pickle, but it gave the following…

python nlp scikit-learn pickle text-mining

asked Jun 15 '15 at 10:35

Joswin K J

5 answers

Are there APIs for text analysis/mining in Java?

I want to know if there is an API to do text analysis in Java. Something that can extract all words in a text, separate words, expressions, etc. Something that can inform if a word found is a number, date, year, name, currency, etc. I'm starting…

java api nlp analysis text-mining

asked Jul 23 '11 at 12:56

Renato Dinhani

3 answers

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weight for each document vector. Then I could use this matrix to train a model for document classification. I am now looking to classify…

machine-learning classification information-retrieval text-mining document-classification

asked Apr 01 '14 at 15:59

smwikipedia
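
A common way to handle the situation in this question is to learn the IDF weights once from the training collection and reuse them to weight any new document, ignoring terms the training data never saw. The sketch below assumes raw term counts for TF and the plain log(N / df) form for IDF; libraries often use smoothed variants, so the numbers will differ from theirs. The function names and the tiny corpus are made up for illustration.

```python
import math
from collections import Counter

def fit_idf(tokenized_docs):
    """Learn IDF weights from the training collection only."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Plain log(N / df); an assumption, not any specific library's formula
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """Weight a new document with the stored IDF; unseen terms are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}

# Made-up training collection, already tokenized
train = [["text", "mining", "rocks"], ["text", "classification"], ["mining", "data"]]
idf = fit_idf(train)

# New document to classify; "unseen" has no IDF and is left out
new_vec = tfidf_vector(["text", "text", "data", "unseen"], idf)
print(new_vec)
```

Keeping the IDF fixed means the new vector lives in the same feature space as the training matrix, so a model trained on that matrix can score it directly.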

7 answers

Finding 2 & 3 word Phrases Using R TM Package

I am trying to find code that actually works for finding the most frequently used two- and three-word phrases with the R text mining package (maybe there is another package for it that I do not know). I have been trying to use the tokenizer, but seem to have…

r data-mining text-mining

asked Jan 17 '12 at 16:53

appletree
