
I have been using the tm package to run some text analysis. My problem is with creating a list of words and the frequencies associated with each.

library(tm)
library(RWeka)

txt <- read.csv("HW.csv", header = TRUE)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"

myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'),"originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

#building the TDM

btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))

I typically use the following code for generating a list of words in a frequency range:

frq1 <- findFreqTerms(myTdm, lowfreq=50)

Is there any way to automate this such that we get a data frame with all words and their frequencies?

The other problem that I face is converting the term-document matrix into a data frame. As I am working on large samples of data, I run into memory errors. Is there a simple solution for this?

ProcRJ

6 Answers


Try this:

data("crude")
myTdm <- as.matrix(TermDocumentMatrix(crude))
FreqMat <- data.frame(ST = rownames(myTdm), 
                      Freq = rowSums(myTdm), 
                      row.names = NULL)
head(FreqMat, 10)
#            ST Freq
# 1       "(it)    1
# 2     "demand    1
# 3  "expansion    1
# 4        "for    1
# 5     "growth    1
# 6         "if    1
# 7         "is    2
# 8        "may    1
# 9       "none    2
# 10      "opec    2
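If an ordered list is wanted, sorting the resulting data frame by the Freq column is a one-liner (sketched here on a small stand-in data frame in place of the FreqMat built above):

```r
# stand-in for the FreqMat built from the term-document matrix above
FreqMat <- data.frame(ST = c("opec", "oil", "prices"),
                      Freq = c(2, 5, 3))
# sort descending by frequency; head() then shows the most common terms
FreqMat <- FreqMat[order(FreqMat$Freq, decreasing = TRUE), ]
head(FreqMat, 10)
```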
David Arenburg
  • Brilliant thank you! One note though for beginners: myTdm <- TermDocumentMatrix(crude)[1:10, 1:10] gives you a 10 by 10 tdm so if the corpus is bigger the [1:10, 1:10] should not be used – Simone Oct 06 '16 at 15:36
  • I thought so. In the beginnings R can be quite confusing sometimes so added it for R newbies. – Simone Oct 07 '16 at 16:56
  • it is enough if you do: `FreqMat <- as.data.frame(as.table(myTdm))` – jibiel Jul 05 '17 at 09:24
  • When I try inspect() I only get out [1:10,1:10] regardless of the size of the tdm/dtm. – user1603472 Jul 31 '17 at 12:08
  • @user1603472 If you do `myTdm <- TermDocumentMatrix(crude)` you'll get the full view. – David Arenburg Jul 31 '17 at 12:17
  • @jibiel `head(as.data.frame(as.table(myTdm)), 10)` doesn't give me the same result. – David Arenburg Jul 31 '17 at 12:23
  • @user1603472 Btw, I've just checked `inspect` source code and it just does `as.matrix` and then prints it. So no need to use it at all. – David Arenburg Jul 31 '17 at 12:25
  • @Simone You are quite right and that was an error on my side. Also, I've just checked `inspect` source code and it just does `as.matrix` and then prints it so I fixed the code accordingly in order to avoid the unnecessary print. – David Arenburg Jul 31 '17 at 12:29

The following lines in R can help to create word frequencies and put them in a table. The code reads a text file in .txt format and creates the frequencies of words; I hope this helps anyone interested.

avisos<- scan("anuncio.txt", what="character", sep="\n")
avisos1 <- tolower(avisos)
avisos2 <- strsplit(avisos1, "\\W")
avisos3 <- unlist(avisos2)
freq<-table(avisos3)
freq1<-sort(freq, decreasing=TRUE)
temple.sorted.table <- paste(names(freq1), freq1, sep = "\t")
# write to a new file so the input file isn't overwritten
cat("Word\tFREQ", temple.sorted.table, file = "anuncio_freq.txt", sep = "\n")
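Two hedged tweaks to the above: splitting on `\\W` leaves empty strings behind wherever punctuation sits next to a word boundary, and the table can also be kept as a data frame inside R rather than only written to a file (sketched on a tiny stand-in vector in place of the real scanned text):

```r
avisos3 <- c("la", "", "casa", "la")   # stand-in for unlist(strsplit(...))
avisos3 <- avisos3[avisos3 != ""]      # drop empty strings left by "\\W" splits
freq1 <- sort(table(avisos3), decreasing = TRUE)
freq_df <- data.frame(word = names(freq1), freq = as.integer(freq1),
                      row.names = NULL)
freq_df
```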
alejandro
  • this has been very helpful for one of my tiny pet projects in text mining.. thanks a lot :)) – LearneR Dec 18 '16 at 10:20
  • also, one question.. if i want to count the frequency of a particular phrase or a sentence in a dump of text, is there a way to do it? for example: let's say I want to find the frequency of set of words 'what a strange incident' in the entire book.. what changes should I do to the above code? – LearneR Dec 18 '16 at 14:56

Looking at the source of findFreqTerms, it appears that the function slam::row_sums does the trick when called on a term-document matrix. Try, for instance:

data(crude)
slam::row_sums(TermDocumentMatrix(crude))
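`row_sums` returns a named numeric vector, so turning it into the data frame the question asks for takes only one more step (sketched on a mock vector standing in for the real `slam::row_sums` output):

```r
freqs <- c(oil = 5, opec = 3, prices = 7)  # mock of slam::row_sums(myTdm)
FreqMat <- data.frame(term = names(freqs), freq = unname(freqs),
                      row.names = NULL)
FreqMat[order(FreqMat$freq, decreasing = TRUE), ]
```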
Daniel Janus

Depending on your needs, using some tidyverse functions might be a rough solution that offers some flexibility in terms of how you handle capitalization, punctuation, and stop words:

text_string <- 'I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same. I typically use the following code for generating list of words in a frequency range. Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem that i face is with converting the term document matrix into a data frame. As i am working on large samples of data, I run into memory errors. Is there a simple solution for this?'

stop_words <- c('a', 'and', 'for', 'the') # just a sample list of words I don't care about

library(tidyverse)
tibble(text = text_string) %>% 
  mutate(text = tolower(text)) %>% 
  mutate(text = str_remove_all(text, '[[:punct:]]')) %>% 
  mutate(tokens = str_split(text, "\\s+")) %>%
  unnest(tokens) %>% 
  count(tokens) %>% 
  filter(!tokens %in% stop_words) %>% 
  mutate(freq = n / sum(n)) %>% 
  arrange(desc(n))


# A tibble: 64 x 3
  tokens      n   freq
  <chr>   <int>  <dbl>
1 i           5 0.0581
2 with        5 0.0581
3 is          4 0.0465
4 words       3 0.0349
5 into        2 0.0233
6 list        2 0.0233
7 of          2 0.0233
8 problem     2 0.0233
9 run         2 0.0233
10 that       2 0.0233
# ... with 54 more rows
sbha
library(plyr)  # count() comes from plyr

a <- scan(file = '~/Desktop//test.txt', what = "list")
a1 <- data.frame(lst = a)
count(a1, vars = "lst")

seems to work to get simple frequencies. I've used scan because I had a txt file, but it should work with read.csv too.
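Note that `count()` here comes from plyr; if you'd rather stay in base R, `table()` gives the same frequencies without any extra packages (a minimal sketch on a stand-in vector):

```r
a <- c("oil", "opec", "oil")       # stand-in for the scanned words
df <- as.data.frame(table(lst = a))
df
```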

Tahnoon Pasha
  • the above doesn't help me figure out n-grams and word associations. I am interested in evaluating the frequency of the n-grams that have been generated – ProcRJ Aug 07 '13 at 11:20

Does apply(myTdm, 1, sum) or rowSums(as.matrix(myTdm)) give the ngram counts you're after?
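Both give the same per-term totals; `rowSums()` is just the vectorised form. A tiny stand-in matrix (terms in rows, documents in columns, as in `as.matrix(myTdm)`) illustrates this:

```r
# two trigram terms across two documents
m <- matrix(c(1, 0, 2, 1), nrow = 2,
            dimnames = list(c("crude oil prices", "oil prices rose"),
                            c("d1", "d2")))
rowSums(m)        # per-term totals across documents
apply(m, 1, sum)  # same result, computed row by row
```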

Ben