Questions tagged [tm]

The `tm` package (shorthand for Text Mining Infrastructure in R) provides a framework for text mining applications within R.

source: http://tm.r-forge.r-project.org/

tm - Text Mining Package

The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the use of heterogeneous text formats in R. The package has integrated database back-end support to minimize memory demands. Advanced metadata management is implemented for collections of text documents, which makes large, metadata-enriched document sets easier to work with.

The package provides native support for reading in several classic file formats (e.g. plain text, PDFs, or XML files). There is also a plug-in mechanism to handle additional file formats.
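
As a rough illustration of the reader mechanism described above, the sketch below lists the readers that ship with tm and builds a corpus from the sample plain-text files bundled with the package. The variable names (`readers`, `txt_dir`, `ovid`) are chosen for this example only, and it assumes tm is installed with its sample texts.

```r
library(tm)

# List the reader functions shipped with tm (e.g. readPlain, readPDF, readXML).
readers <- getReaders()
print(readers)

# Directory of sample plain-text files bundled with tm.
txt_dir <- system.file("texts", "txt", package = "tm")

# Build a corpus from every file in that directory using the plain-text reader.
ovid <- VCorpus(DirSource(txt_dir, encoding = "UTF-8"),
                readerControl = list(reader = readPlain, language = "lat"))
print(ovid)
```

Swapping `readPlain` for another entry returned by `getReaders()` is how a different file format would typically be selected.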

The data structures and algorithms can be extended to fit custom demands, since the package is designed in a modular way to enable easy integration of new file formats, readers, transformations and filter operations.

tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal, stemming, and stopword deletion. Furthermore, a generic filter architecture is available to filter documents by certain criteria or to perform full-text search. The package also supports exporting document collections to term-document matrices.
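
A minimal sketch of such a pipeline follows. The two documents in `docs` are invented for illustration, and stemming is left out because `stemDocument` additionally needs the SnowballC package.

```r
library(tm)

# Hypothetical toy documents, invented for illustration only.
docs <- c("The quick brown fox   jumps over the lazy dog.",
          "Text mining in R:   corpora, terms, and matrices.")

corpus <- VCorpus(VectorSource(docs))

# Base R functions such as tolower must be wrapped in content_transformer().
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Keep only the documents whose text mentions "fox".
with_fox <- tm_filter(corpus, FUN = function(doc) any(grepl("fox", content(doc))))

# Export the cleaned corpus to a term-document matrix (terms in rows).
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```

The order of the steps is a choice made for this sketch: lowercasing comes first so that stopword matching is case-insensitive, and whitespace is stripped last to clean up the gaps left by removed words.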

tm is freely available under the GNU General Public License (GPL).

1083 questions
90 votes, 3 answers

How to flatten a list of lists?

The tm package extends c so that, if given a set of PlainTextDocuments it automatically creates a Corpus. Unfortunately, it appears that each PlainTextDocument must be specified separately. e.g. if I had: foolist <- list(a, b, c); # where a,b,c are…
dnagirl

87 votes, 1 answer

Inconsistent behaviour with tm_map transformation functions when using multiple cores

Another potential title for this post could be "When parallel processing in R, does the ratio between the number of cores, loop chunk size, and object size matter?" I have a corpus I am running some transformations on using the tm package. Since the…
Doug Fir

56 votes, 4 answers

DocumentTermMatrix error on Corpus argument

I have the following code: # returns string w/o leading or trailing whitespace trim <- function (x) gsub("^\\s+|\\s+$", "", x) news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings. corpus_clean <- tm_map(news_corpus,…
user1477388

48 votes, 4 answers

Error converting text to lowercase with tm_map(..., tolower)

I tried using the tm_map. It gave the following error. How can I get around this? require(tm) byword<-tm_map(byword, tolower) Error in UseMethod("tm_map", x) : no applicable method for 'tm_map' applied to an object of class "character"
jackStinger

32 votes, 4 answers

R-Project no applicable method for 'meta' applied to an object of class "character"

I am trying to run this code (Ubuntu 12.04, R 3.1.1) # Load requisite packages library(tm) library(ggplot2) library(lsa) # Place Enron email snippets into a single vector. text <- c( "To Mr. Ken Lay, I’m writing to urge you to donate the millions…
user990137

29 votes, 2 answers

Topic models: cross validation with loglikelihood or perplexity

I'm clustering documents using topic modeling. I need to come up with the optimal topic numbers. So, I decided to do ten fold cross validation with topics 10, 20, ...60. I have divided my corpus into ten batches and set aside one batch for a holdout…
user37874

27 votes, 5 answers

How to determine which older version of the R package is compatible with my R version

I am trying to install the "tm" package but then I get an error saying that "tm" is not available for my R version package ‘tm’ is not available (for R version 3.0.2) But then I saw that someone suggested I download the archived version from…
London guy

25 votes, 13 answers

dependency ‘slam’ is not available when installing TM package

I was able to use the library(tm) in r without problem until today, when loading tm shows: library(tm) Loading required package: NLP Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : there is no package called…
Carl H

21 votes, 2 answers

Use R to convert PDF files to text files for text mining

I have nearly one thousand PDF journal articles in a folder. I need to text mine the abstracts of all the articles in that folder. Now I am doing the following: dest <- "~/A1.pdf" # set path to pdftotxt.exe and convert pdf to text exe <-…
S Das

20 votes, 3 answers

How does the removeSparseTerms in R work?

I am using the removeSparseTerms method in R and it required a threshold value to be input. I also read that the higher the value, the more will be the number of terms retained in the returned matrix. How does this method work and what is the logic…
London guy

19 votes, 3 answers

LDA with topicmodels, how can I see which topics different documents belong to?

I am using LDA from the topicmodels package, and I have run it on about 30.000 documents, acquired 30 topics, and got the top 10 words for the topics, they look very good. But I would like to see which documents belong to which topic with the…
d12n

17 votes, 6 answers

R tm package vcorpus: Error in converting corpus to data frame

I am using the tm package to clean up some data using the following code: mycorpus <- Corpus(VectorSource(x)) mycorpus <- tm_map(mycorpus, removePunctuation) I then want to convert the corpus back into a data frame in order to export a text file…
lmcshane

17 votes, 6 answers

Adding custom stopwords in R tm

I have a Corpus in R using the tm package. I am applying the removeWords function to remove stopwords tm_map(abs, removeWords, stopwords("english")) Is there a way to add my own custom stop words to this list?
Brian

17 votes, 8 answers

tm_map has parallel::mclapply error in R 3.0.1 on Mac

I am using R 3.0.1 on Platform: x86_64-apple-darwin10.8.0 (64-bit) I am trying to use tm_map from the tm library. But when I execute this code library(tm) data('crude') tm_map(crude, stemDocument) I get this error: Warning message: In…
Dominik

16 votes, 6 answers

R text file and text mining...how to load data

I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words. I don't understand the documentation on how to load a text file and to create the necessary objects to start using features such…
user959129