I am using the R programming language. I learned how to take pdf files from the internet and load them into R. For example, below I load 3 different books by Shakespeare into R:
library(pdftools)
library(tidytext)
library(textrank)
library(tm)
#1st document
url <- "https://shakespeare.folger.edu/downloads/pdf/hamlet_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words_1 <- article_words %>%
anti_join(stop_words, by = "word")
#2nd document
url <- "https://shakespeare.folger.edu/downloads/pdf/macbeth_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words_2<- article_words %>%
anti_join(stop_words, by = "word")
#3rd document
url <- "https://shakespeare.folger.edu/downloads/pdf/othello_PDF_FolgerShakespeare.pdf"
article <- pdf_text(url)
article_sentences <- tibble(text = article) %>%
unnest_tokens(sentence, text, token = "sentences") %>%
mutate(sentence_id = row_number()) %>%
select(sentence_id, sentence)
article_words <- article_sentences %>%
unnest_tokens(word, sentence)
article_words_3 <- article_words %>%
anti_join(stop_words, by = "word")
Each one of these files (e.g. article_words_1) is now a "tibble" file. From here, I want to convert these into a "document term matrix" so that I can perform text mining and NLP on these :
#convert to document term matrix
myCorpus <- Corpus(VectorSource(article_words_1, article_words_2, article_words_3))
tdm <- TermDocumentMatrix(myCorpus)
inspect(tdm)
But this seems to result in an error:
Error in VectorSource(article_words_1, article_words_2, article_words_3) :
unused arguments (article_words_2, article_words_3)
Can someone please show me what I am doing wrong?
Thanks