I am using the R programming language. Using the following 3 "articles" (Shakespeare's plays), I created a "term document matrix" (an R object used for text analytics).
First, I create these 3 articles:
#load libraries
library(dplyr) # provides tibble(), %>%, mutate(), select(), anti_join()
library(pdftools)
library(tidytext)
library(textrank)
library(tm)
#1st document
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words_1 <- article_words %>%
anti_join(stop_words, by = "word")
#2nd document
url <- "https://shakespeare.folger.edu/downloads/pdf/macbeth_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words_2 <- article_words %>%
anti_join(stop_words, by = "word")
#3rd document
url <- "https://shakespeare.folger.edu/downloads/pdf/othello_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words_3 <- article_words %>%
anti_join(stop_words, by = "word")
From here, I create the actual "term document matrix":
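library(tm)
# one string per article, so that each article becomes one document (column) of the matrix;
# passing the rbind() of the data frames to VectorSource() would treat each column as a document instead
docs <- sapply(list(article_words_1, article_words_2, article_words_3), function(x) paste(x$word, collapse = " "))
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))
inspect(tdm)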
My question: From this "term document matrix" (the "tdm" object created in the step above), is it possible to "extract" (i.e. ungroup) each of these 3 articles? Can you go back and forth between the 3 individual articles and the term document matrix? If I save this term document matrix as an "RDS" file (e.g. tdm.RDS), close RStudio and then re-import this file (tdm.RDS) back into R, will I be able to separate "tdm.RDS" back into "article_words_1", "article_words_2" and "article_words_3"?
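To make the question concrete, here is a minimal sketch of the round trip I have in mind. It uses three short made-up strings instead of the real plays, and the document names and the temporary file path are placeholders I chose for illustration. As far as I can tell, the re-imported matrix only gives back term counts per document, not the original word order.

```r
library(tm)

# toy stand-ins for the three articles (made-up text, not the real plays)
docs <- c(doc_1 = "ghost king ghost revenge",
          doc_2 = "king witch dagger king",
          doc_3 = "jealousy handkerchief jealousy")

tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))

# save the matrix to an RDS file and read it back, as if in a new R session
path <- tempfile(fileext = ".RDS")
saveRDS(tdm, path)
tdm2 <- readRDS(path)

# convert to an ordinary matrix: rows are terms, columns are documents
m <- as.matrix(tdm2)

# split the matrix back into one named vector of term counts per document,
# keeping only the terms that actually occur in that document
per_doc <- lapply(seq_len(ncol(m)), function(j) {
  counts <- m[, j]
  counts[counts > 0]
})

per_doc[[1]] # term counts for the first toy document
```

What I would like to know is whether anything beyond these per-document counts (for example the original sequence of words in "article_words_1") can be recovered from the saved object.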
I found some related Stack Overflow questions, but they do not seem to answer this specific question (e.g. "inspect specific document from DocumentTermMatrix for specific terms" and "Extract top features by frequency per document from a dtm in R").
Thanks