How to transform a Document Term Matrix in R?

Question

Hello, I have a document-term matrix that I transformed with the tidy() function, and that works perfectly. I want to plot a word cloud based on word frequency. My transformed table looks like this:

> head(Wcloud.Data)
# A tibble: 6 x 3
  document term       count
  <chr>    <chr>      <dbl>
1 1        accept         1
2 1        access         1
3 1        accomplish     1
4 1        account        4
5 1        accur          2
6 1        achiev         1

I have 33,647,383 observations, so it is a very big data frame. If I use the max() function I get a very high number (64116), but no word in my data frame has a frequency of 64116. Also, if I plot the data frame in Shiny with wordcloud(), it plots the same words several times. And sorting my count column does not work: sort(Wcloud.Data$count, decreasing = TRUE). So something is not correct, but I don't know what it is or how to solve it. Does anybody have an idea?

This is the summary of my document-term matrix before transforming it into a data frame:

> observations.tf
<<DocumentTermMatrix (documents: 76717, terms: 4234)>>
Non-/sparse entries: 33647383/291172395
Sparsity           : 90%
Maximal term length: 15
Weighting          : term frequency (tf)

Update: I added a picture of my data frame.

Can you provide us with a subset of the data `Wcloud.Data` (maybe using `dput`) so we can reproduce the problem on your dataset? I think I have a solution for you but need to confirm locally. Thanks :) – mysteRious, Jun 17 '18 at 14:45
Seeing the same word more than once is normal, as you have multiple documents (76717): if a word appears in multiple docs with a high frequency, it will get printed multiple times. If you want a wordcloud of only the words, get rid of the document column and aggregate the counts per word. – phiver, Jun 17 '18 at 14:46
@phiver Thanks for your answer. How can I solve that automatically? I don't want words to show up multiple times. – Belfort90, Jun 17 '18 at 15:07
@mysteRious I don't know why, but I have a problem with the dput output. Or R is still calculating and needs some time. What is your idea? – Belfort90, Jun 17 '18 at 15:08
Anything that would get 100-1000 rows of `Wcloud.Data` to work with would be helpful. – mysteRious, Jun 17 '18 at 15:33
@mysteRious That will not help, because my dataset looks just "normal". But as I have multiple docs, the count column gets calculated differently. How can I solve this? I just want to work with the numbers that are shown in my count column (see the picture above in my question). – Belfort90, Jun 17 '18 at 15:52
@mysteRious Check my question. I posted a picture that shows a word being plotted several times, and that's my problem. – Belfort90, Jun 17 '18 at 16:13
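
For illustration, phiver's suggestion (drop the document column and sum the counts per word) can be sketched in base R. The toy table below is hypothetical and only mimics the shape of `Wcloud.Data`. The sketch also shows one possible reason the `sort()` call seemed to do nothing: `sort()` returns a sorted copy of a single vector and leaves the data frame untouched, so reordering the rows needs `order()`.

```r
# Hypothetical toy table: one row per (document, term) pair, as tidy() returns.
toy <- data.frame(
  document = c("1", "1", "2", "2", "3"),
  term     = c("account", "access", "account", "accur", "account"),
  count    = c(4, 1, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Sum the counts over all documents so that each term appears exactly once.
term_totals <- aggregate(count ~ term, data = toy, FUN = sum)

# sort() only returns a sorted copy of one vector; to reorder the whole
# table, index its rows with order().
term_totals <- term_totals[order(term_totals$count, decreasing = TRUE), ]
print(term_totals)
```

After this step the table has a single row per term, which is the shape `wordcloud()` expects for its `words` and `freq` arguments.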

Carles · Answer 1 · 2018-06-17T16:36:09.873

With dplyr, you can sum the counts per term as follows:

library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
library("dplyr") # provides %>%, group_by() and summarise() used below

Wcloud.Data<- data.frame(Document= c(rep(1,6)), 
                         term = c("accept", "access","accomplish", "account", "accur", "achiev"),
                         count = c(1,1,1,4,2,1))

Data<-Wcloud.Data %>% 
  group_by(term) %>% 
  summarise(Frequency = sum(count))
set.seed(1234)
wordcloud(words = Data$term, freq = Data$Frequency, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Alternatively, the quanteda and tibble packages can help you create the term frequency matrix. Here is an example to work with:

library(tibble)
library(quanteda)
Data <- tibble(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "this is china",
                              "china is here",
                              'hello china',
                              "Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "this is china",
                              "china is here",
                              'hello china',
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              'japan'))
DocTerm <- quanteda::dfm(quanteda::tokens(Data$text)) # tokenize first: recent quanteda versions expect a tokens object here
DocTerm
# Document-feature matrix of: 19 documents, 11 features (78.5% sparse).
# 19 x 11 sparse Matrix of class "dfm"
# features
# docs     chinese beijing shanghai this is china here hello kyoto japan tokyo
# text1        2       1        0    0  0     0    0     0     0     0     0
# text2        2       0        1    0  0     0    0     0     0     0     0
# text3        0       0        0    1  1     1    0     0     0     0     0
# text4        0       0        0    0  1     1    1     0     0     0     0
# text5        0       0        0    0  0     1    0     1     0     0     0
# text6        2       1        0    0  0     0    0     0     0     0     0
# text7        2       0        1    0  0     0    0     0     0     0     0
# text8        0       0        0    1  1     1    0     0     0     0     0
# text9        0       0        0    0  1     1    1     0     0     0     0
# text10       0       0        0    0  0     1    0     1     0     0     0
# text11       0       0        0    0  0     0    0     0     1     1     0
# text12       1       0        0    0  0     0    0     0     0     1     1
# text13       0       0        0    0  0     0    0     0     1     1     0
# text14       1       0        0    0  0     0    0     0     0     1     1
# text15       0       0        0    0  0     0    0     0     1     1     0
# text16       1       0        0    0  0     0    0     0     0     1     1
# text17       0       0        0    0  0     0    0     0     1     1     0
# text18       1       0        0    0  0     0    0     0     0     1     1
# text19       0       0        0    0  0     0    0     0     0     1     0

Mat <- quanteda::convert(DocTerm, to = "data.frame")[, -1] # Convert to a data frame and drop the first column (the document IDs)
Result <- colSums(Mat) # Total count per term: this is what you are interested in
names(Result) <- colnames(Mat)
# > Result
# chinese  beijing shanghai     this       is    china     here    hello    kyoto    japan    tokyo
#      12        2        2        2        4        6        2        2        4        9        4
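
As a possible shortcut for the original tm object, the per-term totals could also be taken from the DocumentTermMatrix itself, without first expanding it into 33 million tidy rows. The sketch below assumes the matrix was built with tm, which stores it as a sparse `simple_triplet_matrix` from the slam package; the three toy texts and the names `term_totals` and `cloud_data` are hypothetical.

```r
library(tm)    # DocumentTermMatrix(), VCorpus(), VectorSource()
library(slam)  # col_sums() for sparse simple_triplet_matrix objects

# Hypothetical toy corpus standing in for the real 76,717 documents.
texts <- c("account access account", "account accuracy", "access account")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Sum each term's column over all documents without densifying the matrix.
term_totals <- sort(slam::col_sums(dtm), decreasing = TRUE)

# One row per term, ready for wordcloud(words = ..., freq = ...).
cloud_data <- data.frame(term = names(term_totals),
                         Frequency = as.numeric(term_totals),
                         row.names = NULL)
print(cloud_data)
```

On the real data this would amount to calling `slam::col_sums(observations.tf)`, which yields one total for each of the 4234 terms.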

edited Jun 17 '18 at 16:36

answered Jun 17 '18 at 14:52

Thank you Carles, but my problem is something different. phiver found out where my problem comes from. – Belfort90 Jun 17 '18 at 15:09
Okay, now I have edited my answer to address your question. Also, if you just do sum(count) grouped by "term", it should work out. – Carles Jun 17 '18 at 15:45
That's my problem: seeing the same word more than once is normal, as you have multiple documents (76717), and if a word appears in multiple docs with a high frequency it will get printed multiple times. – Belfort90 Jun 17 '18 at 15:51
Yes, that is what the term frequency matrix is. A count of words per document. If the most typical word happens to be in more than one document, it appears multiple times. – Carles Jun 17 '18 at 15:55
So what can I do? – Belfort90 Jun 17 '18 at 15:57
Then I do not understand your question well. What exactly do you want to do? Also, as they said, try to provide the data you are using, so that people can play with it. – Carles Jun 17 '18 at 15:59
If I plot the wordcloud, it plots the same words several times and it counts my whole count column multiple times. I just want to work with the numbers in the count column, nothing more (see my picture above). – Belfort90 Jun 17 '18 at 16:02
I still do not understand what you want to do, what is your objective? – Carles Jun 17 '18 at 16:05
Bro, I just want to plot a word cloud with the wordcloud function. The wordcloud function takes the "word" and the "frequency" of the word and plots it. My problem is that my dataset is counted multiple times, and I don't know how to get rid of that. I just don't want it counted multiple times. – Belfort90 Jun 17 '18 at 16:08
Check my question. I posted a picture that shows a word being plotted several times, and that's my problem. – Belfort90 Jun 17 '18 at 16:12
I think I answered that in my code. Please tell me if that is not the case. – Carles Jun 17 '18 at 16:12
I tried your filter code, and it didn't work. Your other code generates a new table, but I already have a data frame. – Belfort90 Jun 17 '18 at 16:18
To me it works with your example. I have rewritten my code with the beginning of your example provided. Check again. – Carles Jun 17 '18 at 16:36
Bro I think it worked!!! Let me make some adjustments and I let you know – Belfort90 Jun 17 '18 at 17:39
If I put the max.words higher I am getting a lot of errors like this for a lot of words: `Warnung in wordcloud(words = WData$term, freq = WData$Frequency, min.freq = 1, subtyp could not be fit on page. It will not be plotted.` – Belfort90 Jun 17 '18 at 17:43
But I need to. I want to plot them all. – Belfort90 Jun 17 '18 at 20:51
https://stackoverflow.com/questions/27981651/text-wordcloud-plotting-error. I think that solves your problem, Belfort. Cheers! – Carles Jun 17 '18 at 21:34
Still not working. I am getting the same error, even if I try with the scale argument. I don't know how to solve this, man. Maybe I should ask the same question again? – Belfort90 Jun 18 '18 at 10:44
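
One thing that might help with the "could not be fit on page" warning is to combine a smaller `scale` with a larger plotting canvas, since both give long or frequent words more room. The sketch below is only an illustration: the table `WData` is hypothetical, the canvas size and `scale` values are guesses that would need tuning, and nothing guarantees that every word of a 4234-term vocabulary fits.

```r
library(wordcloud)

# Hypothetical aggregated table: one row per term with its total count.
WData <- data.frame(term = c("account", "access", "accomplish", "accur"),
                    Frequency = c(40, 12, 5, 2))

# Draw into a large PNG instead of the default device to gain space.
out_file <- tempfile(fileext = ".png")
png(out_file, width = 2000, height = 2000, res = 200)
set.seed(1234)
# scale = c(largest, smallest) word size; smaller values shrink every word.
wordcloud(words = WData$term, freq = WData$Frequency, min.freq = 1,
          max.words = Inf, random.order = FALSE, scale = c(3, 0.4))
dev.off()
```

If words are still dropped, lowering the first `scale` value further or enlarging the canvas again are the two knobs to try.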
