
I am using the R programming language. From the following 3 "articles" (Shakespeare plays), I created a "term document matrix" (an R object used for text analytics).

First, I create these 3 articles:

#load libraries
library(pdftools)
library(dplyr)    # %>%, mutate(), select(), anti_join()
library(tibble)   # tibble()
library(tidytext)
library(textrank)
library(tm)

#1st document
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_1 <- article_words %>%
  anti_join(stop_words, by = "word")

#2nd document
url <- "https://shakespeare.folger.edu/downloads/pdf/macbeth_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_2 <- article_words %>%
  anti_join(stop_words, by = "word")


#3rd document
url <- "https://shakespeare.folger.edu/downloads/pdf/othello_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_3 <- article_words %>%
  anti_join(stop_words, by = "word")
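
The three blocks above differ only in the URL, so the same steps could be wrapped in a function. This is only a sketch: `words_from_pages` is a hypothetical name, and the toy input stands in for the character vector that `pdf_text(url)` returns, so nothing is downloaded here.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical helper: `pages` is a character vector, one element per page,
# such as the result of pdftools::pdf_text(url).
words_from_pages <- function(pages) {
  tibble(text = pages) %>%
    unnest_tokens(sentence, text, token = "sentences") %>%  # one row per sentence
    mutate(sentence_id = row_number()) %>%
    select(sentence_id, sentence) %>%
    unnest_tokens(word, sentence) %>%                       # one row per word
    anti_join(stop_words, by = "word")                      # drop stop words
}

# Toy input standing in for pdf_text(url); not the real play text
toy_pages <- c("To be, or not to be. That is the question.",
               "The rest is silence.")
toy_words <- words_from_pages(toy_pages)
```

With the real files, the equivalent call would be `article_words_1 <- words_from_pages(pdf_text(url))` for each URL.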

From here, I create the actual "term document matrix":

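library(tm)

# One string per play, so the corpus holds three documents; rbind() on the data frames
# would give VectorSource() one element per column (sentence_id, word), not one per play.
docs <- sapply(list(article_words_1, article_words_2, article_words_3),
               function(x) paste(x$word, collapse = " "))
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)))
inspect(tdm)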

My question: from this term document matrix (the "tdm" object created in the step above), is it possible to "extract" (i.e. ungroup) each of the 3 articles? Can you go back and forth between the 3 individual articles and the term document matrix? If I save the term document matrix as an RDS file (e.g. tdm.RDS), close RStudio, and then re-import the file into R, will I be able to separate it back into "article_words_1", "article_words_2" and "article_words_3"?

I found some related Stack Overflow questions (e.g. "inspect specific document from DocumentTermMatrix for specific terms" and "Extract top features by frequency per document from a dtm in R"), but they do not seem to answer this specific question.

Thanks

stats_noob
  • Not really into your article_words_# model, which is grouped by sentences (i.e. sentence_id). With tidytext::tidy(tdm) you can retrieve the tdm's count of words per document (or book), but not per book per sentence. – Nicolás Velasquez Apr 14 '21 at 22:08
  • thank you for your reply! suppose it was like this : article_1 <- pdf_text(url_1) ; article_2 <- pdf_text(url_2) ; article_3 <- pdf_text(url_3) ; tdm <- TermDocumentMatrix(Corpus(VectorSource(rbind(article_1, article_2, article_3)))) ; – stats_noob Apr 15 '21 at 02:12
  • now, would there have been a way to separate "tdm" back into "article_1", "article_2" and "article_3"? – stats_noob Apr 15 '21 at 02:13
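
The scenario in the comments can be tried on a toy corpus. The sketch below assumes three short made-up strings in place of the plays and a temporary file in place of tdm.RDS. It saves the matrix, reads it back, and uses `tidytext::tidy()` as suggested in the first comment. What comes back is a count per term per document. Word order and sentence boundaries are not stored in a term document matrix, so the original text itself is not recoverable from it.

```r
library(tm)
library(tidytext)

# Toy documents standing in for the three plays (assumed data, not the real texts)
docs <- c("ghost king revenge ghost",
          "witches king dagger",
          "handkerchief jealousy")

tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# Round trip through an RDS file (a temp file stands in for tdm.RDS)
path <- tempfile(fileext = ".rds")
saveRDS(tdm, path)
tdm2 <- readRDS(path)

# One row per (term, document) pair with a nonzero count
counts <- tidy(tdm2)

# Separate the counts by document: one data frame per original document
per_doc <- split(counts, counts$document)
```

Each element of `per_doc` holds the terms and counts for one document, which is the closest analogue to "article_1", "article_2" and "article_3" that the matrix alone can give.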
