Questions tagged [tm]

The `tm` package (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

source: http://tm.r-forge.r-project.org/

tm - Text Mining Package

The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the use of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. Advanced metadata management is implemented for collections of text documents, which eases working with large document sets enriched with metadata.

The package provides native support for reading in several classic file formats (e.g. plain text, PDFs, or XML files). There is also a plug-in mechanism to handle additional file formats.

The data structures and algorithms can be extended to fit custom demands, since the package is designed in a modular way to enable easy integration of new file formats, readers, transformations and filter operations.
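As a minimal sketch of this extensibility, a user-defined function can be wrapped with `content_transformer()` and applied through `tm_map()`. The helper name `to_space` and the sample strings are hypothetical and chosen only for illustration.

```r
library(tm)

# Hypothetical custom transformation: replace every match of `pattern`
# with a single space. content_transformer() wraps a plain character
# function so that it operates on the content of a text document.
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Two made-up documents held in an in-memory corpus.
corpus <- VCorpus(VectorSource(c("foo/bar", "a@b")))

# Extra arguments to tm_map() are passed on to the transformation.
corpus <- tm_map(corpus, to_space, "[/@]")

first_doc  <- content(corpus[[1]])
second_doc <- content(corpus[[2]])
print(c(first_doc, second_doc))
```

The same wrapper approach should work for other base R string functions, such as `tolower`, that do not know about tm document classes.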

tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, and stopword deletion. A generic filter architecture is also available for filtering documents by given criteria or performing full-text search. The package supports exporting document collections to term-document matrices.
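The preprocessing steps and the matrix export described above might look roughly like the following sketch. The three sample sentences are made up. Stemming is left out because `stemDocument()` additionally needs the SnowballC package to be installed.

```r
library(tm)

# Three tiny example documents (invented for illustration).
texts <- c("The  quick brown fox.",
           "A quick   brown dog!",
           "The lazy dog sleeps.")

corpus <- VCorpus(VectorSource(texts))

# Chain a few built-in transformations.
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # stopword deletion
corpus <- tm_map(corpus, stripWhitespace)                    # collapse whitespace

# Export the collection to a document-term matrix (documents in rows).
dtm <- DocumentTermMatrix(corpus)
inspect(dtm)
```

`TermDocumentMatrix(corpus)` gives the transposed layout, with terms in rows.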

tm is freely available under the GNU General Public License (GPL).

Resources:

CRAN summary page
R-Forge project page
FAQ
Ingo Feinerer, Kurt Hornik, and David Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008.

1083 questions

3 answers

How to flatten a list of lists?

The tm package extends c so that, if given a set of PlainTextDocuments it automatically creates a Corpus. Unfortunately, it appears that each PlainTextDocument must be specified separately. e.g. if I had: foolist <- list(a, b, c); # where a,b,c are…

r list tm

asked Apr 30 '13 at 12:49

dnagirl

1 answer

Inconsistent behaviour with tm_map transformation functions when using multiple cores

Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size, and object size matter?" I have a corpus I am running some transformations on using the tm package. Since the…

r parallel-processing text-mining tm doparallel

asked Aug 25 '17 at 06:21

Doug Fir

4 answers

DocumentTermMatrix error on Corpus argument

I have the following code: # returns string w/o leading or trailing whitespace trim <- function (x) gsub("^\\s+|\\s+$", "", x) news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings. corpus_clean <- tm_map(news_corpus,…

r tm corpus

asked Jun 12 '14 at 18:44

user1477388

4 answers

Error converting text to lowercase with tm_map(..., tolower)

I tried using the tm_map. It gave the following error. How can I get around this? require(tm) byword<-tm_map(byword, tolower) Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "character"

r tm lowercase term-document-matrix

asked Nov 30 '12 at 06:35

jackStinger

4 answers

R-Project no applicable method for 'meta' applied to an object of class "character"

I am trying to run this code (Ubuntu 12.04, R 3.1.1) # Load requisite packages library(tm) library(ggplot2) library(lsa) # Place Enron email snippets into a single vector. text <- c( "To Mr. Ken Lay, I’m writing to urge you to donate the millions…

r text-mining tm

asked Jul 16 '14 at 02:15

user990137

2 answers

Topic models: cross validation with loglikelihood or perplexity

I'm clustering documents using topic modeling. I need to come up with the optimal topic numbers. So, I decided to do ten fold cross validation with topics 10, 20, ...60. I have divided my corpus into ten batches and set aside one batch for a holdout…

r tm cross-validation topic-modeling

asked Jan 25 '14 at 17:52

user37874

5 answers

How to determine which older version of the R package is compatible with my R version

I am trying to install the "tm" package but then I get an error saying that "tm" is not available for my R version package ‘tm’ is not available (for R version 3.0.2) But then I saw that someone suggested I download the archived version from…

r package tm

asked Feb 27 '15 at 14:04

London guy

13 answers

dependency ‘slam’ is not available when installing TM package

I was able to use the library(tm) in r without problem until today, when loading tm shows: library(tm) Loading required package: NLP Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called…

r tm slam

asked Oct 05 '16 at 23:52

Carl H

2 answers

Use R to convert PDF files to text files for text mining

I have nearly one thousand pdf journal articles in a folder. I need to text mine on all article's abstracts from the whole folder. Now I am doing the following: dest <- "~/A1.pdf" # set path to pdftotxt.exe and convert pdf to text exe <-…

r text-mining tm pdftotext

asked Jan 30 '14 at 00:33

S Das

3 answers

How does the removeSparseTerms in R work?

I am using the removeSparseTerms method in R and it required a threshold value to be input. I also read that the higher the value, the more will be the number of terms retained in the returned matrix. How does this method work and what is the logic…

r tm lda

asked Feb 27 '15 at 10:55

London guy

3 answers

LDA with topicmodels, how can I see which topics different documents belong to?

I am using LDA from the topicmodels package, and I have run it on about 30.000 documents, acquired 30 topics, and got the top 10 words for the topics, they look very good. But I would like to see which documents belong to which topic with the…

r lda topic-modeling tm

asked Feb 14 '13 at 12:22

d12n

6 answers

R tm package vcorpus: Error in converting corpus to data frame

I am using the tm package to clean up some data using the following code: mycorpus <- Corpus(VectorSource(x)) mycorpus <- tm_map(mycorpus, removePunctuation) I then want to convert the corpus back into a data frame in order to export a text file…

r tm corpus

asked Jul 11 '14 at 18:11

lmcshane

6 answers

Adding custom stopwords in R tm

I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords tm_map(abs, removeWords, stopwords("english")) Is there a way to add my own custom stop words to this list?

r text-mining stop-words corpus tm

asked Aug 26 '13 at 14:22

Brian

8 answers

tm_map has parallel::mclapply error in R 3.0.1 on Mac

I am using R 3.0.1 on Platform: x86_64-apple-darwin10.8.0 (64-bit) I am trying to use tm_map from the tm library. But when I execute this code library(tm) data('crude') tm_map(crude, stemDocument) I get this error: Warning message: In…

r parallel-processing tm mclapply

asked Aug 17 '13 at 10:55

Dominik

6 answers

R text file and text mining...how to load data

I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words. I don't understand the documentation on how to load a text file and to create the necessary objects to start using features such…

r load text-mining tm

asked Oct 28 '11 at 09:20

user959129
