
I have read this website and am new to R.

The following code works well for CSV files with 100 rows (tested), but gives an error message for CSV files with 500,000 rows that exceed 1 GB:

library(tm)
library(RWeka)
setwd("c:/textanalysis/")  # the working directory path needs a slash after the drive letter
data <- read.csv("postsdataset.csv", header=FALSE, stringsAsFactors=FALSE)
data <- data[,2]

source("GenerateTDM.R") # generatetdm function in appendix 
tdm.generate <- function(string, ng){   
    # tutorial on rweka - http://tm.r-forge.r-project.org/faq.html

    corpus <- Corpus(VectorSource(string)) # create corpus for TM processing
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removeNumbers) 
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
    # corpus <- tm_map(corpus, removeWords, stopwords("english")) 
    options(mc.cores=1) # http://stackoverflow.com/questions/17703553/bigrams-instead-of-single-words-in-termdocument-matrix-using-r-and-rweka/20251039#20251039
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng)) # create n-grams
    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))   # create tdm from n-grams
    tdm
}

tdm <- tdm.generate(data, 2)

I want to clean the text data (online posts collected in a CSV file) by removing URLs, usernames, and blank rows, then explore the data and run a clustering analysis on n-grams weighted with tf-idf.
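For the cleaning step, here is a minimal sketch, assuming the posts are plain character strings in `data` as read above; the regular expressions are illustrative, not a definitive grammar for URLs or usernames:

```r
data <- gsub("http[s]?://\\S+", "", data)  # strip URLs
data <- gsub("@\\w+", "", data)            # strip @usernames
data <- trimws(data)                       # trim leading/trailing whitespace
data <- data[nzchar(data)]                 # drop rows that are now blank
```

Doing this on the character vector before `Corpus(VectorSource(...))` avoids writing custom `tm_map` transformers.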

How do I use source("GenerateTDM.R")?
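For context, `source()` simply executes an R file in the current session. If `GenerateTDM.R` contains the `tdm.generate()` definition shown above, sourcing it makes the function available, and the inline definition here becomes redundant:

```r
# GenerateTDM.R should hold only the definition of tdm.generate()
source("GenerateTDM.R")   # runs the file; tdm.generate is now defined
tdm <- tdm.generate(data, 2)
```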

honk
bsolomon
  • What is the exact error message that you get? – Stibu Oct 02 '15 at 09:12
  • I have only gone as far as trying to clean the data and make the matrix but also need the tf/idf for ngrams and the clustering analysis – bsolomon Oct 02 '15 at 09:20
  • If I remember correctly, the error came up once I ran the command to build the matrix; it was creating a 1 GB file and produced some messages. I will try to get the exact error message. – bsolomon Oct 02 '15 at 09:23
  • I changed the code for tf/idf to `tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer, weighting=weightTfIdf)) # create tdm from n-grams`. Running `tdm <- tdm.generate(data, 1)` gave: `Warning message: In weighting(x) : empty document(s): 495 498 727 955 959 1157 1408 1433 1590` etc. [... truncated]. Then running `tdm.matrix <- as.matrix(tdm)` gave: `Error: cannot allocate vector of size 11.8 Gb` with `Warning messages: 1: In matrix(vector(typeof(x$v), nr * nc), nr, nc) : Reached total allocation of 16299Mb: see help(memory.size)` – bsolomon Oct 02 '15 at 10:03
  • It seems that you have reached the limits of your computer's memory. – Stibu Oct 02 '15 at 10:36
  • Don't use dense data types for text. – Has QUIT--Anony-Mousse Oct 02 '15 at 14:34
  • I would really appreciate it if someone could give me an example of a suitable R script to load the data from the CSV, clean it of URLs and usernames, get rid of blank rows after cleaning, etc., then build a bigram tf/idf and do a clustering analysis. – bsolomon Oct 03 '15 at 08:09
  • I improved the wording a bit and removed unnecessary information. However, please don't provide additional information (e.g. on the error message) in the comments. Please put the information directly into the question. Please use the [edit](http://stackoverflow.com/posts/32903971/edit) button for that. That might help readers to help you more easily. – honk Oct 04 '15 at 15:01
  • Please also note that Stack Overflow is no tutorial service. Asking for extensive examples is frowned upon on SO. You have to come up with an initial solution (even if it's not working) by yourself. If you then get stuck, we can help best. – honk Oct 04 '15 at 15:05
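The allocation error quoted in the comments comes from `as.matrix()` materialising the full dense term-document matrix. A hedged sketch of working with the sparse structure instead (assuming `tdm` is the result of `tdm.generate()` above; `removeSparseTerms` is from tm and `row_sums` from slam, which tm uses internally):

```r
library(tm)
library(slam)  # tm stores TDMs as sparse simple triplet matrices

# Drop n-grams that occur in almost no documents before densifying.
tdm.small <- removeSparseTerms(tdm, sparse = 0.99)

# Term frequencies can be computed directly on the sparse structure.
freq <- sort(row_sums(tdm.small), decreasing = TRUE)
head(freq, 20)

# Only convert to a dense matrix once it is small enough, e.g. for clustering.
m <- as.matrix(tdm.small)
d <- dist(scale(m))                  # distances between terms
fit <- hclust(d, method = "ward.D2") # hierarchical clustering of the n-grams
```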

0 Answers