
I have a large (2.1 GB) dfmSparse object that is tokenized into ngrams (unigrams, bigrams, trigrams, and four-grams), and I want to convert it to a data frame or data table with two columns: Content and Frequency.

I tried to unlist... but it didn't work. I'm new to NLP, and I don't know which method to use; I'm out of ideas and couldn't find a solution here or on Google.

Some info about the data:

>str(tokfreq)
Formal class 'dfmSparse' [package "quanteda"] with 11 slots
  ..@ settings    :List of 1
  .. ..$ : NULL
  ..@ weighting   : chr "frequency"
  ..@ smooth      : num 0
  ..@ ngrams      : int [1:4] 1 2 3 4
  ..@ concatenator: chr "_"
  ..@ Dim         : int [1:2] 167500 19765478
  ..@ Dimnames    :List of 2
  .. ..$ docs    : chr [1:167500] "character(0).content" "character(0).content" "character(0).content" "character(0).content" ...
  .. ..$ features: chr [1:19765478] "add" "lime" "juice" "tequila" ...
  ..@ i           : int [1:54488417] 0 75 91 178 247 258 272 327 371 391 ...
  ..@ p           : int [1:19765479] 0 3218 3453 4015 4146 4427 4637 140665 140736 142771 ...
  ..@ x           : num [1:54488417] 1 1 1 1 5 1 1 1 1 1 ...
  ..@ factors     : list()

>summary(tokfreq)
       Length         Class          Mode 
3310717565000     dfmSparse            S4

Thanks!

EDITED: This is how I created the dataset from a corpus:

# tokenize
tokenized <- tokenize(x = teste, ngrams = 1:4)
# Creating the dfm
tokfreq <- dfm(x = tokenized)
Diego Gaona

2 Answers


This should do it, if I've understood what you mean by "Content" and "Frequency". Note that with this approach the data.frame is no larger than the sparse matrix, since you are recording only the total counts, not the per-document distributions.

myDfm <- dfm(data_corpus_inaugural, ngrams = 1:4, verbose = FALSE)
head(myDfm)
## Document-feature matrix of: 57 documents, 314,224 features.
## (showing first 6 documents and first 6 features)
##                  features
## docs              fellow-citizens  of the senate and house
##   1789-Washington               1  71 116      1  48     2
##   1793-Washington               0  11  13      0   2     0
##   1797-Adams                    3 140 163      1 130     0
##   1801-Jefferson                2 104 130      0  81     0
##   1805-Jefferson                0 101 143      0  93     0
##   1809-Madison                  1  69 104      0  43     0

# convert to a data.frame
df <- data.frame(Content = featnames(myDfm), Frequency = colSums(myDfm), 
                 row.names = NULL, stringsAsFactors = FALSE)
head(df)
##           Content Frequency
## 1 fellow-citizens        39
## 2              of      7055
## 3             the     10011
## 4          senate        15
## 5             and      5233
## 6           house        11
tail(df)
##                           Content Frequency
## 314219         and_may_he_forever         1
## 314220       may_he_forever_bless         1
## 314221     he_forever_bless_these         1
## 314222 forever_bless_these_united         1
## 314223  bless_these_united_states         1
## 314224     these_united_states_of         1    

object.size(df)
## 25748240 bytes
object.size(myDfm)
## 29463592 bytes
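For an object as large as yours, a data.table variant of the same idea may be more comfortable to work with afterwards (sorting, keyed lookups). This is a hedged sketch assuming the data.table package is installed; it builds exactly the same two columns as the data.frame above:

```r
library(quanteda)
library(data.table)

myDfm <- dfm(data_corpus_inaugural, ngrams = 1:4, verbose = FALSE)

# One row per feature: the n-gram and its total count over all documents
dt <- data.table(Content   = featnames(myDfm),
                 Frequency = colSums(myDfm))
setorder(dt, -Frequency)   # most frequent n-grams first
head(dt, 3)
```

setorder() sorts by reference, so no extra copy of the table is made, which matters at your scale.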

Added 2018-02-25

In quanteda >= 1.0.0 there is a function textstat_frequency() that will produce the data.frame that you want, e.g.

textstat_frequency(data_dfm_lbgexample) %>% head()
#   feature frequency rank docfreq group
# 1       P       356    1       5   all
# 2       O       347    2       4   all
# 3       Q       344    3       5   all
# 4       N       317    4       4   all
# 5       R       316    5       4   all
# 6       S       280    6       4   all
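As a hedged aside, textstat_frequency() also takes n (keep only the top features) and groups (a docvar name) arguments, which is handy if you want per-group frequency tables rather than one overall table; data_corpus_inaugural has a "President" docvar, for example:

```r
library(quanteda)

dfmat <- dfm(data_corpus_inaugural, ngrams = 1:2)

# Top 5 features per President instead of one overall table
res <- textstat_frequency(dfmat, n = 5, groups = "President")
head(res)
```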
Ken Benoit
  • Yes! That's it! It worked very well, without memory problems, and very fast. The size is 1561MB, more than 500MB less than the dfm. Thank you! A simple solution, but I wouldn't have had any idea how to do it without your help. – Diego Gaona Mar 24 '16 at 17:11
  • Happy to help, and glad to hear about any experiences/problems/feature requests for **quanteda**. – Ken Benoit Mar 24 '16 at 21:05
  • Hi, this is very good for a dfm built from a corpus with just one document. I wonder, is there an easy way to extend this to a multi-document corpus dfm? Thanks. – Simon Nov 07 '16 at 14:55
  • One way is to specify the document, like this: data.frame(..., Frequency = colSums(myDfm[1,]) ) – Simon Nov 07 '16 at 15:03
  • @KenBenoit, in Diego's example, `dfm()` returns ngram of length 1 to 4. Any simple way to create an additional column, `n`, in the call to `data.frame` to mark the length of each ngram? – Conner M. May 29 '19 at 03:17
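Re the last comment: a hedged base-R sketch for the extra n column. Since quanteda joins ngram parts with the concatenator "_", the ngram length is simply the number of parts after splitting on it (assuming no original feature contains a literal "_" of its own):

```r
# Length of each ngram = number of concatenator-separated parts
ngram_length <- function(x, concatenator = "_") {
  lengths(strsplit(x, concatenator, fixed = TRUE))
}

ngram_length(c("add", "lime_juice", "add_lime_juice_tequila"))
# [1] 1 2 4

# e.g. df$n <- ngram_length(df$Content)
```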

Speaking of "too large", you could run into memory problems. Take for instance:

library(quanteda)
mydfm <- dfm(subset(inaugCorpus, Year>1980))
class(mydfm)
# [1] "dfmSparse"
# attr(,"package")
# [1] "quanteda"
print(object.size(mydfm), units="KB")
# 273.6 Kb

You could transform the sparse matrix (which uses compressed/efficient storage methods for data with many zeros) into a long data frame like this:

library(reshape2)
df <- melt(as.matrix(mydfm))
head(df)
#           docs features value
# 1  1981-Reagan  senator     2
# 2  1985-Reagan  senator     4
# 3    1989-Bush  senator     2
# 4 1993-Clinton  senator     0
# 5 1997-Clinton  senator     0
# 6    2001-Bush  senator     0
print(object.size(df), units="KB")
# 619.2 Kb

As you can see, the long format requires much more RAM (and the conversion itself may need additional memory, too). The sparsity (percentage of zeros) here is sum(mydfm == 0) / length(mydfm) = 0.759289.
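If as.matrix() fails with a Cholmod "problem too large" error, one way around it is to skip the densifying step entirely and read the nonzero triplets straight out of the underlying sparse matrix. This is a hedged sketch (sparse_to_long is a hypothetical helper name); note that, unlike melt(), it omits the zero cells:

```r
library(Matrix)

# Long-format conversion that never builds the dense matrix:
# coerce to triplet (i, j, x) storage, which stays sparse,
# then look up the dimnames for each nonzero entry.
sparse_to_long <- function(m) {
  tm <- as(m, "TsparseMatrix")   # triplet form; @i and @j are 0-based
  data.frame(docs     = rownames(tm)[tm@i + 1],
             features = colnames(tm)[tm@j + 1],
             value    = tm@x,
             stringsAsFactors = FALSE)
}

# e.g. df <- sparse_to_long(mydfm)
```

Memory use is then proportional to the number of nonzero cells (54 million in your str() output) rather than to nrow * ncol, which is what makes the dense route blow up.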


With regards to your comments, here's a reproducible example:

dfm <- dfm(inaugCorpus, ngrams = 1L:16L)
print(object.size(dfm), units="MB")
# 254.1 Mb

library(reshape2)
df <- melt(as.matrix(dfm))
print(object.size(df), units="MB")
# 1884.6 Mb

memory.size()
# [1] 3676.43
memory.size(TRUE)
# [1] 3858.12
memory.limit()
# [1] 8189
lukeA
  • Thanks! I tried with your suggestion using reshape2, using a part of my data (144MB) and even so, I receive: `teste <- melt(as.matrix(tokfreq1))` `Error in asMethod(object) : Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105 Error during wrapup: cannot open the connection` – Diego Gaona Mar 23 '16 at 16:41
  • What's your `memory.size()` and `memory.limit()`? Enough hard-disk space (maybe some routines use temporary files)? I don't have much experience with code profiling or looking at what's going on under R's hood. – lukeA Mar 23 '16 at 16:46
  • I tried to clean the memory and tried again, but the same error (the error happen instantly). The details: `> memory.size() [1] 263.75` and `> memory.limit() [1] 8134` I don't think the problem is with the HD (more than 60GB free). – Diego Gaona Mar 23 '16 at 19:06