I am trying to count how often certain keywords occur in multiple PDF files.

library(tm)
library(pdftools)

files <- list.files(pattern = "pdf$")
Rpdf <- readPDF(control = list(text = "-layout"))
corp <- Corpus(URISource(files), readerControl = list(reader = Rpdf))

words <- c("example", "keyword", "test")
dt <- DocumentTermMatrix(corp, control=list(dictionary=words))

When I run the code, I always get these errors:

PDF error: May not be a PDF file (continuing anyway)
PDF error (3): Illegal character <21> in hex string
PDF error (5): Illegal character <4f> in hex string
PDF error (7): Illegal character <54> in hex string
PDF error (8): Illegal character <59> in hex string
PDF error (9): Illegal character <50> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_text(loadfile(pdf), opw, upw) : PDF parsing failure.
In addition: There were 12 warnings (use warnings() to see them)

If you have any suggestions, please let me know. Thank you!

phiver
  • I can't reproduce your error. You will have to point to an example PDF that generates it. Also, please add the output of `warnings()` to your question. – phiver Nov 24 '18 at 11:07
  • You did a `library(pdftools)`. What happens when you try to use it? – hrbrmstr Nov 24 '18 at 17:44
  • `library(pdftools)` works fine; there is no error at all. – Daniel Meyer Nov 24 '18 at 18:46
  • @DanielMeyer - did you manage to find a solution to this? I am getting a similar error on one specific PDF file in a large set of files, `PDF error (21): Illegal character '{'`, and it aborts all my processing up to that point. How did you get around this error? – Lazarus Thurston Dec 13 '18 at 16:35

1 Answer

I guess your PDFs are binary files and should therefore be downloaded and read as binary files. I had a similar issue when downloading PDF files with `download.file`: afterwards I couldn't extract any information from them with pdftools. It turned out that the files were broken because I hadn't downloaded them in binary mode (try opening one in any PDF reader; it should report that the file is broken). On Windows, adding `mode = "wb"` to the `download.file` call makes sure the files are stored correctly. After that I could run the pdftools functions on them without that error message. Hope that helps. I got the idea from this SO question: Problems with Downloading pdf file using R
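One way to find out which files are affected is to look at their first bytes, since a PDF normally starts with the signature `%PDF-`. The sketch below uses only base R; `is_pdf` is a hypothetical helper name, not a function from pdftools or tm, and the URL in the commented-out download call is a placeholder. As a side note, the codes in your log (<21>, <4f>, <54>, <59>, <50>) are the ASCII values of `!`, `O`, `T`, `Y` and `P`, which would be consistent with a file that begins with `<!DOCTYPE`, i.e. an HTML page saved under a .pdf name, but that is only a guess.

```r
# Hypothetical helper (not part of pdftools or tm): returns TRUE if the
# file begins with the "%PDF-" signature that PDF files normally start with.
is_pdf <- function(path) {
  con <- file(path, open = "rb")          # open in binary mode
  on.exit(close(con))                     # always release the connection
  header <- readBin(con, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Check every PDF candidate in the working directory.
files <- list.files(pattern = "pdf$")
ok <- vapply(files, is_pdf, logical(1))
files[!ok]  # files that do not look like PDFs

# When re-downloading on Windows, request binary mode explicitly.
# The URL and file name below are placeholders.
# download.file("https://example.com/report.pdf", "report.pdf", mode = "wb")
```

Files listed by `files[!ok]` are the ones to re-download or exclude before building the corpus. The check is deliberately strict: it expects the signature at the very first byte.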

Same error message as yours:

pdf_toc(example_path)
PDF error (1151926): Illegal character <3a> in hex string
PDF error (1151929): Illegal character <73> in hex string
[...omitted for brevity...]
PDF error (1152006): Illegal character <22> in hex string
PDF error: Couldn't find trailer dictionary
PDF error: Couldn't read xref table
Error in poppler_pdf_toc(loadfile(pdf), opw, upw) : PDF parsing failure.
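If a single broken file should not abort the whole run (as in the comment about a large set of files), the read can be wrapped in `tryCatch`. This is a minimal sketch rather than a drop-in replacement for the tm pipeline: `safe_pdf_text` and `count_keywords` are hypothetical helper names, the text is read with `pdftools::pdf_text`, and keywords are counted as plain case-insensitive substrings, so "test" also matches "testing", which differs from the token-based counts of `DocumentTermMatrix`.

```r
# Hypothetical wrapper: returns the text of one PDF as a single string,
# or NA if the reader fails, so one broken file does not stop the loop.
safe_pdf_text <- function(path, reader = pdftools::pdf_text) {
  tryCatch(
    paste(reader(path), collapse = "\n"),   # pdf_text returns one string per page
    error = function(e) {
      message("Skipping ", path, ": ", conditionMessage(e))
      NA_character_
    }
  )
}

# Hypothetical helper: counts case-insensitive substring matches of each
# keyword in one string; returns NA for files that could not be read.
count_keywords <- function(text, words) {
  text <- tolower(text)
  vapply(words, function(w) {
    if (is.na(text)) return(NA_integer_)
    lengths(regmatches(text, gregexpr(w, text, fixed = TRUE)))
  }, integer(1))
}

words <- c("example", "keyword", "test")
files <- list.files(pattern = "pdf$")
texts <- vapply(files, safe_pdf_text, character(1))

# One row per file, one column per keyword (NULL if there are no files).
counts <- do.call(rbind, lapply(texts, count_keywords, words = words))
```

Rows of `NA` in `counts` mark the files that failed to parse, so they can be inspected separately.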
ToWii