I am extracting text from PDF files, removing punctuation, and counting how often the key repeated words appear.
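library(pdftools)
library(tm)

setwd("S:/Shared Folders/Impact Investing/Investment/Scripts/PDF")

# All PDF files in the working directory
files <- list.files(pattern = "pdf$")

# Raw text of each PDF, one list element per file
opinions <- lapply(files, pdf_text)

# Build a corpus by reading the same PDFs through tm's PDF reader
corp <- Corpus(URISource(files),
               readerControl = list(reader = readPDF))

# Term-document matrix; bounds keeps only terms found in at least 3 documents
opinions.tdm <- TermDocumentMatrix(corp,
                                   control =
                                     list(removePunctuation = TRUE,
                                          stopwords = TRUE,
                                          tolower = TRUE,
                                          stemming = TRUE,
                                          removeNumbers = TRUE,
                                          bounds = list(global = c(3, Inf))))

# This is the line that fails
inspect(opinions.tdm[1:10,])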
I am currently getting an error:
Error in `[.simple_triplet_matrix`(opinions.tdm, 1:10, ) : subscript out of bounds
My opinions.tdm has the following characteristics:
opinions.tdm: list of length 6; nrow: integer [1]; ncol: integer [1]; dimnames: list [2]; attributes: [3]
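One possible cause, offered as an assumption rather than a confirmed diagnosis: `bounds = list(global = c(3, Inf))` keeps only terms that occur in at least three documents, so the matrix may end up with fewer than 10 rows, and `1:10` then runs past the end. The sketch below uses three made-up one-line documents (a hypothetical stand-in for the PDFs) to check the dimensions first and cap the row index at `nrow()`.

```r
library(tm)

# Hypothetical stand-in for the PDF text: three tiny documents.
docs <- c("impact investing fund report",
          "impact fund annual report",
          "impact investing strategy")
corp <- VCorpus(VectorSource(docs))

# Same global bound as in the question: keep terms present in >= 3 documents.
tdm <- TermDocumentMatrix(corp,
                          control = list(bounds = list(global = c(3, Inf))))

# Look at the size before subsetting: rows are terms, columns are documents.
dim(tdm)

# Never ask for more rows than the matrix has.
n <- min(10, nrow(tdm))
if (n > 0) {
  inspect(tdm[seq_len(n), ])
} else {
  message("No terms survived the filters; try loosening 'bounds'.")
}
```

If `dim(opinions.tdm)` reports fewer than 10 terms on the real data, lowering the first value in `global` (or dropping `bounds`) would be one thing to try. It may also be worth checking that `opinions` actually contains text, since scanned PDFs can come back empty.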