
I have read this website and am new to R.

The following code works well for CSV files with 100 rows (tested), but gives an error message for CSV files with 500,000 rows that exceed 1 GB:

library(tm)
library(RWeka)
setwd("c:/textanalysis/")  # the working directory path needs a slash after the drive letter
data <- read.csv("postsdataset.csv", header=FALSE, stringsAsFactors=FALSE)
data <- data[,2]

source("GenerateTDM.R") # generatetdm function in appendix 
tdm.generate <- function(string, ng){   
    # tutorial on rweka - http://tm.r-forge.r-project.org/faq.html

    corpus <- Corpus(VectorSource(string)) # create corpus for TM processing
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers) 
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
    # corpus <- tm_map(corpus, removeWords, stopwords("english")) 
    options(mc.cores=1) # http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka/20251039#20251039
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng)) # create n-grams
    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))   # create tdm from n-grams
    tdm
}

tdm <- tdm.generate(data, 2)

I want to clean the text data (online posts collected in a CSV file) by removing URLs, usernames, and blank rows, then explore the data and run a clustering analysis on n-grams weighted with tf-idf.
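For the cleaning step, here is a minimal sketch, assuming the posts are plain character strings in `data` as read above; the regular expressions are illustrative, not a definitive grammar for URLs or usernames:

```r
data <- gsub("http[s]?://\\S+", "", data)  # strip URLs
data <- gsub("@\\w+", "", data)            # strip @usernames
data <- trimws(data)                       # trim leading/trailing whitespace
data <- data[nzchar(data)]                 # drop rows that are now blank
```

Doing this on the character vector before `Corpus(VectorSource(...))` avoids writing custom `tm_map` transformers.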

How do I use source("GenerateTDM.R")?
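For context, `source()` simply executes an R file in the current session. If `GenerateTDM.R` contains the `tdm.generate()` definition shown above, sourcing it makes the function available, and the inline definition here becomes redundant:

```r
# GenerateTDM.R should hold only the definition of tdm.generate()
source("GenerateTDM.R")   # runs the file; tdm.generate is now defined
tdm <- tdm.generate(data, 2)
```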

honk
bsolomon
  • What is the exact error message that you get? – Stibu Oct 02 '15 at 09:12
  • I have only gone as far as trying to clean the data and make the matrix but also need the tf/idf for ngrams and the clustering analysis – bsolomon Oct 02 '15 at 09:20
  • If I remember correctly, the error came up once I ran the command to build the matrix; it was creating a 1 GB file and produced some messages. I will try to get the exact error message. – bsolomon Oct 02 '15 at 09:23
  • I changed the code for tf/idf to `tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer, weighting=weightTfIdf)) # create tdm from n-grams`. Running `tdm <- tdm.generate(data, 1)` gave: `Warning message: In weighting(x) : empty document(s): 495 498 727 955 959 1157 1408 1433 1590` etc. [... truncated]. Then running `tdm.matrix <- as.matrix(tdm)` gave: `Error: cannot allocate vector of size 11.8 Gb` with `Warning messages: 1: In matrix(vector(typeof(x$v), nr * nc), nr, nc) : Reached total allocation of 16299Mb: see help(memory.size)` – bsolomon Oct 02 '15 at 10:03
  • It seems that you have reached the limits of your computer's memory. – Stibu Oct 02 '15 at 10:36
  • Don't use dense data types for text. – Has QUIT--Anony-Mousse Oct 02 '15 at 14:34
  • I would really appreciate it if someone could give me an example of a suitable R script to load the data from the CSV, clean it of URLs and usernames, get rid of blank rows after cleaning, etc., then build a bigram tf/idf and do a clustering analysis. – bsolomon Oct 03 '15 at 08:09
  • I improved the wording a bit and removed unnecessary information. However, please don't provide additional information (e.g. on the error message) in the comments. Please put the information directly into the question. Please use the [edit](http://stackoverflow.com/posts/32903971/edit) button for that. That might help readers to help you more easily. – honk Oct 04 '15 at 15:01
  • Please also note that Stack Overflow is no tutorial service. Asking for extensive examples is frowned upon on SO. You have to come up with an initial solution (even if it's not working) by yourself. If you then get stuck, we can help best. – honk Oct 04 '15 at 15:05
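The allocation error quoted in the comments comes from `as.matrix()` materialising the full dense term-document matrix. A hedged sketch of working with the sparse structure instead (assuming `tdm` is the result of `tdm.generate()` above; `removeSparseTerms` is from tm and `row_sums` from slam, which tm uses internally):

```r
library(tm)
library(slam)  # tm stores TDMs as sparse simple triplet matrices

# Drop n-grams that occur in almost no documents before densifying.
tdm.small <- removeSparseTerms(tdm, sparse = 0.99)

# Term frequencies can be computed directly on the sparse structure.
freq <- sort(row_sums(tdm.small), decreasing = TRUE)
head(freq, 20)

# Only convert to a dense matrix once it is small enough, e.g. for clustering.
m <- as.matrix(tdm.small)
d <- dist(scale(m))                  # distances between terms
fit <- hclust(d, method = "ward.D2") # hierarchical clustering of the n-grams
```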

0 Answers