I am using the R programming language. I learned how to read PDF files from the internet into R. For example, below I load three plays by Shakespeare into R:

library(pdftools)
library(dplyr)     # provides tibble(), %>%, mutate(), row_number(), select(), anti_join()
library(tidytext)
library(textrank)
library(tm)

#1st document
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_1 <- article_words %>%
  anti_join(stop_words, by = "word")

#2nd document
url <- "https://shakespeare.folger.edu/downloads/pdf/macbeth_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_2 <- article_words %>%
  anti_join(stop_words, by = "word")


#3rd document
url <- "https://shakespeare.folger.edu/downloads/pdf/othello_PDF_FolgerShakespeare.pdf"

article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
  unnest_tokens(sentence, text, token = "sentences") %>%
  mutate(sentence_id = row_number()) %>%
  select(sentence_id, sentence)


article_words <- article_sentences %>%
  unnest_tokens(word, sentence)


article_words_3 <- article_words %>%
  anti_join(stop_words, by = "word")
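
The three blocks above repeat the same steps, so as a rough sketch they could be wrapped in one helper. The name `pages_to_words` and the two demo strings are made up for illustration; the helper takes the character vector that `pdf_text()` returns (one string per page) and applies the same tokenising and stop-word steps as above.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: turn a character vector of page texts
# (the kind of value pdf_text() returns) into one non-stop-word per row.
pages_to_words <- function(pages) {
  tibble(text = pages) %>%
    unnest_tokens(sentence, text, token = "sentences") %>%  # one row per sentence
    mutate(sentence_id = row_number()) %>%
    select(sentence_id, sentence) %>%
    unnest_tokens(word, sentence) %>%                       # one row per word
    anti_join(stop_words, by = "word")                      # drop stop words
}

# Small made-up input so the sketch runs without downloading anything
demo_pages <- c("To be, or not to be. That is the question.",
                "The rest is silence.")
demo_words <- pages_to_words(demo_pages)
print(demo_words)
```

With the real files, this would be called as `pages_to_words(pdf_text(url))` once per URL, for example through `lapply()` over a vector of the three URLs.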

Each of these objects (e.g. article_words_1) is now a tibble. From here, I want to convert them into a "document term matrix" so that I can perform text mining and NLP on them:

#convert to document term matrix
myCorpus <- Corpus(VectorSource(article_words_1, article_words_2, article_words_3))
tdm <- TermDocumentMatrix(myCorpus)
inspect(tdm)

But this seems to result in an error:

Error in VectorSource(article_words_1, article_words_2, article_words_3) : 
  unused arguments (article_words_2, article_words_3)

Can someone please show me what I am doing wrong?

Thanks

stats_noob

1 Answer

As the error message says, VectorSource takes only one argument. You can rbind the datasets together and pass the result to the VectorSource function.

library(tm)

tdm <- TermDocumentMatrix(Corpus(VectorSource(rbind(article_words_1, article_words_2, article_words_3))))
inspect(tdm)

#<<TermDocumentMatrix (terms: 14952, documents: 2)>>
#Non-/sparse entries: 14952/14952
#Sparsity           : 50%
#Maximal term length: 25
#Weighting          : term frequency (tf)
#Sample             :
#            Docs
#Terms        1     2
#  "act",     0   397
#  "cassio",  0   258
#  "ftln",    0 10303
#  "hamlet",  0   617
#  "iago",    0   371
#  "lord",    0   355
#  "macbeth", 0   386
#  "othello", 0   462
#  "sc",      0   337
#  "thou",    0   346
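
Note that the output above reports 2 documents, not 3. As far as I can tell, this is because the combined data frame has two columns (sentence_id and word) and VectorSource treats each column as one element, so the columns rather than the plays become the documents. If the goal is one document per play, one possible approach is to collapse each word column into a single string and pass a character vector of length three. This is only a sketch: the small data frames below are made-up stand-ins for article_words_1, article_words_2 and article_words_3, and the names `words_1`, `words_2`, `words_3` and `docs` are illustrative.

```r
library(tm)

# Made-up stand-ins for article_words_1, article_words_2, article_words_3
# (assumption: each has a `word` column, as in the question)
words_1 <- data.frame(word = c("hamlet", "ghost", "hamlet"))
words_2 <- data.frame(word = c("macbeth", "witches", "dagger"))
words_3 <- data.frame(word = c("othello", "iago", "handkerchief"))

# Collapse each play into one string, giving one vector element per play
docs <- c(
  hamlet  = paste(words_1$word, collapse = " "),
  macbeth = paste(words_2$word, collapse = " "),
  othello = paste(words_3$word, collapse = " ")
)

# A single character vector is the one argument VectorSource expects
corpus <- Corpus(VectorSource(docs))
tdm <- TermDocumentMatrix(corpus)
inspect(tdm)
```

With the real data, the same `paste(..., collapse = " ")` step would be applied to `article_words_1$word` and the other two tibbles.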
Ronak Shah
  • Thank you for your answer @RonakShah! Can you please take a look at this question https://stackoverflow.com/questions/67027186/creating-a-loop-for-load-and-save-processes ? – stats_noob Apr 10 '21 at 05:02
  • can you please take a look at this question if you have time? https://stackoverflow.com/questions/67096056/r-extracting-individual-terms-from-a-matrix thank you – stats_noob Apr 15 '21 at 02:16
  • Hi @Ronak Shah, if you have time, can you please take a look at this question? https://stackoverflow.com/questions/67394744/r-convert-a-term-document-matrix-to-a-corpus thanks – stats_noob May 05 '21 at 03:26