I have a small corpus e.g.

library(tm)

myvec <- c("n417", "disturbance", "grand theft auto", "assault", "burglary", 
"vandalism", "atmt to locate", "drug arrest", "traffic stop", 
"larceny", "graffiti complaint / reporting")

corpus <- VCorpus(VectorSource(myvec))

If I wanted to make corpus 10 times bigger, how would I do that so that the resulting variable is a VCorpus and not a list?

Tried:

corpus <- replicate(10, corpus) # returns a list
corpus <- VCorpus(replicate(10, corpus)) # Error: inherits(x, "Source") is not TRUE
corpus <- c(corpus, corpus, corpus, corpus, corpus, corpus, corpus) # works, returns a corpus 7 times bigger, but involves a lot of typing

If I have a small corpus and I want to make it ten times larger for example purposes, how could I do that?

Doug Fir

1 Answer

We can put copies of the corpus in a list and combine them with do.call and c:

library(tm)
do.call(c, rep(list(corpus), 7))
#<<VCorpus>>
#Metadata:  corpus specific: 0, document level (indexed): 0
#Content:  documents: 77

The same works with replicate:

do.call(c, replicate(7, corpus, simplify = FALSE))
#<<VCorpus>>
#Metadata:  corpus specific: 0, document level (indexed): 0
#Content:  documents: 77

In this case, replicate also gives the same result without simplify = FALSE:

do.call(c, replicate(7, corpus))
#<<VCorpus>>
#Metadata:  corpus specific: 0, document level (indexed): 0
#Content:  documents: 77
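
For repeated use, the same idea can be wrapped in a small function. This is only a sketch: the name `grow_corpus` and its arguments are hypothetical and not part of tm, and it assumes the input is a `VCorpus` so that tm's `c()` method for corpora is used.

```r
library(tm)

# Hypothetical helper (not part of tm): repeat a corpus `times` times
# and return a single VCorpus instead of a list of corpora.
grow_corpus <- function(corpus, times) {
  stopifnot(inherits(corpus, "VCorpus"), times >= 1)
  # rep(list(corpus), times) builds a plain list holding `times` copies;
  # do.call(c, ...) then combines them with tm's c() method for VCorpus.
  do.call(c, rep(list(corpus), times))
}

myvec <- c("disturbance", "assault", "burglary")
small <- VCorpus(VectorSource(myvec))
big <- grow_corpus(small, 10)

inherits(big, "VCorpus")  # TRUE
length(big)               # 30 documents (3 documents x 10 copies)
```

Wrapping the corpus in `list()` before `rep` matters: it makes each list element a whole corpus, so `c` receives ten corpora rather than the corpus's internal components.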
akrun
  • Thank you very much, accepting when the timer comes off. I also noticed something: I pasted only a small sample in myvec above. In my instance of R, `VCorpus(VectorSource(myvec))` as above actually returns a list, not a corpus. R only seems to make a corpus out of it, with the exact same commands, when the vector is larger. Is that correct? – Doug Fir Aug 28 '17 at 07:01
  • @DougFir Sorry, I am confused about your query. Are you saying that for large vectors this is not working? – akrun Aug 28 '17 at 07:02
  • @DougFir I was working on a text analytics problem a few days ago, and I noticed the same behaviour. When the data is small, it shows as a `list` type object in the environment, while for large data it shows as a `VCorpus` type object. But if you check the object with `class`, even with a small amount of data it gives you `VCorpus`. – tushaR Aug 28 '17 at 07:07
  • No, it's working, thanks again. What I'm saying is that if I paste `corpus <- VCorpus(VectorSource(myvec))` into the console, then the variable that appears in the environment pane shows as a list, not a corpus. Is that expected? – Doug Fir Aug 28 '17 at 07:07
  • @tushaR thank you for clarifying that, it was confusing – Doug Fir Aug 28 '17 at 07:08
  • @DougFir If you are talking about `str`, then it is stored as a `list`, but the attributes are `- attr(*, "class")= chr [1:2] "VCorpus" "Corpus"` – akrun Aug 28 '17 at 07:14
  • @DougFir Could you check whether there is any difference in behavior between the replicated and the non-replicated corpus? – akrun Aug 28 '17 at 07:15
  • Hi @akrun, both return a corpus, and here is the speed of each: `> system.time(corpus1 <- do.call(c, replicate(250, corpus, simplify = F))) user system elapsed 8.180 0.076 8.252 > system.time(corpus2 <- do.call(c, rep(list(corpus), 250))) user system elapsed 6.652 0.024 6.671`. However, using rep crashed my R session twice. Note that I actually used a 2k-document sample for corpus, not the tiny sample I initially posted here in the myvec variable. – Doug Fir Aug 28 '17 at 07:44
  • @DougFir so, it might be adding the list attribute there – akrun Aug 28 '17 at 07:45
  • @akrun don't follow? – Doug Fir Aug 28 '17 at 07:47
  • @DougFir if `replicate` is working, use that. `rep` may not work without the `list` – akrun Aug 28 '17 at 07:47
  • psst cheeky promo that I have an open bounty over here too https://stackoverflow.com/questions/45875482/inconsistent-behaviour-with-tm-map-transformation-functions – Doug Fir Aug 28 '17 at 07:53
  • @DougFir Thanks for notifying me. If I get enough time to go through it, will attempt it – akrun Aug 28 '17 at 07:56
  • Yeah I'm getting desperate on that one and it has a few upvotes too! – Doug Fir Aug 28 '17 at 07:56
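
The point raised in the comments, that a corpus is stored as a list but carries the `VCorpus` class, can be checked directly. A minimal sketch, assuming tm is installed; the two-document corpus is illustrative only:

```r
library(tm)

corpus <- VCorpus(VectorSource(c("assault", "burglary")))

# The underlying storage type is a list, which is what some
# environment panes and str() report.
typeof(corpus)               # "list"
is.list(corpus)              # TRUE

# The class attribute is what makes it a corpus for tm's functions.
class(corpus)                # "VCorpus" "Corpus"
inherits(corpus, "VCorpus")  # TRUE
```

So checking with `class()` or `inherits()` is a more reliable test than the type shown in the environment pane.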