I need to break a corpus into chunks of N words each. Say this is my corpus:
corpus <- "I need to break this corpus into chunks of ~3 words each"
One way around this problem is to turn the corpus into a data frame and tokenize it
library(tidytext)
library(dplyr)  # unnest_tokens expects a data frame; dplyr provides %>%
corpus_df <- data.frame(text = corpus)             # one row, one "text" column
tokens <- corpus_df %>% unnest_tokens(word, text)  # one row per word
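For reference, with the default word tokenizer this should leave a one-column data frame of lowercase words (the ~ gets stripped as punctuation):

nrow(tokens)
# [1] 12
head(tokens$word)
# [1] "i"      "need"   "to"     "break"  "this"   "corpus"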
and then split the data frame rowwise using the code below (taken from here).
chunk <- 3                                         # target words per chunk
n <- nrow(tokens)
r <- rep(1:ceiling(n / chunk), each = chunk)[1:n]  # chunk id for each row
d <- split(tokens, r)                              # list of chunk-sized data frames
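With this corpus that should give a list of four three-row data frames:

length(d)
# [1] 4
d[[1]]$word
# [1] "i"    "need" "to"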
This works, but there must be a more direct way. Any takers?
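For what it's worth, a plain base-R sketch also seems to work, assuming simple whitespace splitting is acceptable (unlike unnest_tokens it keeps the ~ and the original case), but it bypasses the tidy workflow:

words <- strsplit(corpus, "\\s+")[[1]]                     # split on whitespace
chunks <- split(words, ceiling(seq_along(words) / chunk))  # chunk id per word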