
With the following code I try to find the tf-idf of each term over all the docs I have in a CSV (200,000 docs), and then I want to produce a one-column CSV that contains each term with its tf-idf, sorted in non-decreasing order. I tried it on a small sample and I think it works, but with the big CSV RStudio always crashes. Any ideas?

library(tm)

# read text converted to csv
myfile3 <- "tweetsc.csv"
x <- read.csv(myfile3, header = FALSE)
# make data frame of character columns
x <- data.frame(lapply(x, as.character), stringsAsFactors = FALSE)
# make corpus from data frame source
dd <- Corpus(DataframeSource(x))
# from tm package, calculate the tf-idf weighted document-term matrix
xx <- as.matrix(DocumentTermMatrix(dd, control = list(weighting = weightTfIdf)))
# sum tf-idf per term and sort in increasing order
freq <- data.frame(sort(colSums(as.matrix(xx)), decreasing = FALSE))
write.csv2(freq, "important_tweets.csv")
PVoulg
    Welcome to SO. You could improve your question. Please read [how to provide minimal reproducible examples in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#answer-5963610). Then edit & improve it accordingly. A good post usually provides minimal input data, the desired output data & code tries - all copy-paste-run'able in a new/clean R session. Your code, however, produces _"cannot open file 'tweetsc.csv': No such file or directory"_ thus not making your example reproducible. – lukeA Dec 19 '16 at 16:29
  • My problem is at `freq <- data.frame(sort(colSums(as.matrix(xx)), decreasing=FALSE))` — I run out of 16 GB of RAM, the CPU overheats, and RStudio crashes without any error message. – PVoulg Dec 19 '16 at 16:41

1 Answer


Do not coerce the DTM to a matrix. With so many documents, that will most likely cause an integer overflow issue. The tm package uses the slam package to represent its TDM/DTM objects as sparse matrices, and slam has functions for row- and column-wise operations that avoid coercing to a dense matrix.

library(tm)
library(slam)

# read text converted to csv
myfile3 <- "tweetsc.csv"
x <- read.csv(myfile3, header = FALSE)
# make data frame of character columns
x <- data.frame(lapply(x, as.character), stringsAsFactors = FALSE)
# make corpus from data frame source
dd <- Corpus(DataframeSource(x))
# from tm package, calculate the tf-idf weighted document-term matrix (kept sparse)
xx <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
# sum tf-idf per term on the sparse matrix and sort in increasing order
freq <- as.data.frame(sort(col_sums(xx), decreasing = FALSE))
write.csv2(freq, "important_tweets.csv")
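
To see where the memory goes, here is a minimal sketch (not part of the original answer) that compares the footprint of the sparse DTM against its dense equivalent. It assumes a small hypothetical subset file `tweets_sample.csv`, since coercing the full 200,000-document matrix to dense is exactly what exhausts the RAM:

library(tm)
library(slam)

# "tweets_sample.csv" is a hypothetical small subset of the data, used only to
# illustrate the size difference; do not run as.matrix() on the full corpus
s <- read.csv("tweets_sample.csv", header = FALSE, stringsAsFactors = FALSE)
dtm <- DocumentTermMatrix(Corpus(DataframeSource(s)),
                          control = list(weighting = weightTfIdf))

print(object.size(dtm), units = "MB")             # sparse simple_triplet_matrix: stores non-zero entries only
print(object.size(as.matrix(dtm)), units = "MB")  # dense matrix: one cell for every doc x term pair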

One thing to note: you mention you want to calculate "each term with its tfidf...". The tf-idf is specific to each term in each document, so summing the tf-idf over documents may not be a meaningful measure, because it obscures the weight of a term in any given document.
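
If a single ranked score per term is still the goal, one alternative is sketched below (a suggestion, not part of the original answer); assuming `xx` is the sparse tf-idf DocumentTermMatrix built above, it averages each term's tf-idf over documents with slam's `col_means` instead of summing, which keeps the result on a per-document scale:

library(slam)

# average tf-idf per term across documents, still on the sparse matrix
avg_tfidf <- sort(col_means(xx), decreasing = TRUE)
head(avg_tfidf, 20)  # the 20 terms with the highest average weight

# or keep the full per-document detail for a few terms of interest
# inspect(xx[, c("some", "terms")])  # column names here are placeholders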

emilliman5