I am extracting text from a PDF, removing punctuation, and counting how often the key repeated words appear.

library(pdftools)
library(tm)

setwd("S:/Shared Folders/Impact Investing/Investment/Scripts/PDF")

files <- list.files(pattern = "pdf$")
opinions <- lapply(files, pdf_text)

corp <- Corpus(URISource(files),
               readerControl = list(reader = readPDF))

opinions.tdm <- TermDocumentMatrix(corp,
                                   control = list(removePunctuation = TRUE,
                                                  stopwords = TRUE,
                                                  tolower = TRUE,
                                                  stemming = TRUE,
                                                  removeNumbers = TRUE,
                                                  bounds = list(global = c(3, Inf))))

inspect(opinions.tdm[1:10,])

I am currently getting an error:

Error in [.simple_triplet_matrix(opinions.tdm, 1:10, ) : subscript out of bounds

My opinions.tdm has the following characteristics:

opinions.tdm is a list of length 6, with nrow (integer [1]), ncol ([1]), dimnames (list [2]), and attributes ([3]).
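
For reference, "subscript out of bounds" on `[1:10, ]` generally means the object has fewer than 10 rows. One possible cause here (an assumption, since the PDFs are not shown) is `bounds = list(global = c(3, Inf))`, which keeps only terms that occur in at least 3 documents, so a folder with fewer than three PDFs would leave no terms at all. A minimal sketch of checking the size before subsetting, using a plain matrix as a stand-in for the term-document matrix; `first_rows` is a hypothetical helper, not part of tm:

```r
# Hypothetical helper: return at most the first n rows of a matrix-like object.
first_rows <- function(x, n = 10) {
  k <- min(n, nrow(x))            # never ask for more rows than exist
  x[seq_len(k), , drop = FALSE]
}

# Stand-in for a small term-document matrix: 3 terms x 2 documents.
m <- matrix(1:6, nrow = 3,
            dimnames = list(Terms = c("fund", "impact", "invest"),
                            Docs  = c("a.pdf", "b.pdf")))

dim(m)             # check the size first; m[1:10, ] would fail here
first_rows(m, 10)  # returns the 3 available rows instead of an error
```

With the real object, the same idea would be to run `dim(opinions.tdm)` first and then subset with `seq_len(min(10, nrow(opinions.tdm)))` rather than a fixed `1:10`.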

zx8754
Will
    Your question lacks sufficient information to give a meaningful answer. At minimum you should present the structure of your data, which will enable people who are familiar with the functions to give an informed guess. If you want help from a broader audience, make your code self-contained and reproducible (code, data, the whole shebang). See [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for some tips on how to do that. – Roman Luštrik Sep 26 '19 at 07:24

0 Answers