How to take first 25 words of each corpus (in R)?

Question

I'm guessing that the technique for this is similar to taking the first N characters from any dataframe, regardless of whether it is a corpus or not.

My attempt:

create.greetings <- function(corpus, create_df = FALSE) {
  for(i in length(Charlotte.corpus.raw)) {
    Doc1<-Charlotte.corpus.raw[i]
    Word1<-Doc1[1:25]
    Greetings[i]<-Word1
  }
  return(VCorpus)
}

Here, Greetings begins as a corpus with n=6, because I couldn't figure out how to make a null corpus, or a corpus large enough to hold the results. I have a corpus of 200 documents here (Charlotte.corpus.raw). Unlike vectors (and by extension, dataframes), there doesn't seem to be an easy way to create null corpora.

Part of the problem is that R doesn't seem to recognize the class of "document". It only recognizes corpus. That is, to R, a single document is a corpus of n=1.
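A small sketch of the distinction at play here, using two made-up strings in place of real files (the name `toy` and its contents are illustrative, not from my data). With `tm`, single brackets return a corpus of length 1, while double brackets return the document itself:

```r
library(tm)  # attaching tm also attaches NLP, which provides content()

# Two toy documents standing in for files read from disk (illustrative only).
toy <- VCorpus(VectorSource(c("dear sir i write to you", "hello old friend")))

sub_corpus <- toy[1]    # single brackets: still a corpus, of length 1
one_doc    <- toy[[1]]  # double brackets: the document itself

inherits(sub_corpus, "VCorpus")         # TRUE
inherits(one_doc, "PlainTextDocument")  # TRUE
content(one_doc)                        # the document's text as a character vector
```

So indexing with `[i]` inside a loop keeps handing back corpora, which would explain why a single document looks like a corpus of n=1.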

Reproducible sample: You will need the 'tm', 'dplyr', and 'NLP' packages, as well as more common R packages.

library(tm)     # DirSource, VCorpus, tm_map, content_transformer
library(dplyr)  # %>%

read.corpus <- function(directory, pattern = "", to.lower = TRUE) {
  # Read files and create a `VCorpus` object
  corpus <- DirSource(directory = directory, pattern = pattern) %>%
    VCorpus
  # Lowercase text
  if (to.lower) corpus <- tm_map(corpus, content_transformer(tolower))
  return(corpus)
}

Then run the function on any directory that contains a few txt documents, and you have a corpus to work with. Replace Charlotte.corpus.raw above with whatever you name your corpus.

Is your "corpus" essentially just a vector of strings, each being sentences/paragraphs with space-separated words? Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), specifically small but representative sample data. — r2evans, Aug 18 '16 at 21:29
@r2evans how about what I just edited in? I think it's more convenient for you guys to use data already stored on your computers. — Antecedent, Aug 18 '16 at 21:44
Sorry, I can't install `tm` on this system (package `slam` isn't available for R-3.2.5/win) so I can't test with your code. Unless it's possible to do this without `tm_map`, I'm out. — r2evans, Aug 18 '16 at 21:53

score 0 · Answer 1 · answered Aug 18 '16 at 22:12

Each row of greetings will contain the first 25 words of each document:

greetings <- c()
for(i in 1:length(corpus)) {
  row <- unlist(corpus[i])[1:25]
  greetings <- rbind(greetings, row)
}
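If the corpus is a `tm` `VCorpus` rather than a plain list, a possible alternative is to let `tm_map` build the result, which avoids creating an empty corpus first. This is only a sketch under that assumption: `keep_first_words` and `toy` are hypothetical names, words are taken to be whitespace-separated, and `n = 3` is used so the toy output is easy to check.

```r
library(tm)

# Keep only the first n whitespace-separated words of a document's text.
# `x` may hold several lines, so they are joined before splitting.
keep_first_words <- function(x, n = 25) {
  words <- strsplit(trimws(paste(x, collapse = " ")), "\\s+")[[1]]
  paste(head(words, n), collapse = " ")
}

# Toy corpus standing in for the real one (illustrative only).
toy <- VCorpus(VectorSource(c("one two three four five", "six seven")))

# tm_map() returns a new corpus of the same length as its input.
greetings <- tm_map(toy, content_transformer(keep_first_words), n = 3)

content(greetings[[1]])  # text of the first truncated document
```

Documents shorter than `n` words are returned unchanged, since `head` simply stops at the end of the vector.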

AidanGawronski


If `corpus` is just a list of character vectors, would it be easier to do `lapply(corpus, head, n = 25)`? (BTW: starting with an empty vector and appending to it is bad practice and absolutely avoidable when you know the desired size. Though it'll work with small numbers, realize that with each iteration of the `for` loop, R is making a complete copy of `greetings` with the next row. This gets expensive.) – r2evans Aug 18 '16 at 22:26
This post is tagged with for loop so figured I would do it as asked. But certainly a good point. – AidanGawronski Aug 18 '16 at 22:28
Your point about using a `for` loop is salient, thanks. I've been dealing a lot with a GB of data at a time, so I'm beginning to cringe with code that doesn't scale well ... my eye starts twitching :-) – r2evans Aug 18 '16 at 22:30
The problem is that I'm pretty sure R views Corpus as being made up of other Corpora until you have a Corpus of n=1. So it doesn't read as a character vector. This isn't working for me, anyhow. Any thoughts? – Antecedent Aug 18 '16 at 22:57
Works for me... have any more info? What is class(corpus) for you? – AidanGawronski Aug 18 '16 at 23:22
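For the case raised in the comments, where the "corpus" really is just a list of strings, a base R sketch of the `lapply` idea might look like this. The data and the helper `first_n_words` are made up for illustration, and words are assumed to be separated by whitespace:

```r
# Hypothetical stand-in for a corpus: one string per document.
texts <- list(
  doc1 = "dear charlotte i hope this letter finds you well",
  doc2 = "hello again"
)

# Split on runs of whitespace and keep at most the first n words.
first_n_words <- function(text, n = 25) {
  words <- strsplit(trimws(paste(text, collapse = " ")), "\\s+")[[1]]
  head(words, n)
}

# lapply() allocates the result list once, so nothing is regrown
# on each iteration the way rbind() inside a for loop would be.
greetings <- lapply(texts, first_n_words, n = 3)

# vapply() collapses each word vector back into a single string.
greeting_strings <- vapply(greetings, paste, character(1), collapse = " ")
```

Using `head` directly on each element, as in `lapply(corpus, head, n = 25)`, only gives words if every element has already been split into one word per entry; otherwise it returns the first 25 lines or strings.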
