
I am trying to import the text of a PDF file into RStudio using {readtext}. In the past this has worked smoothly, and for the most part it still does. However, there are a handful of PDF files I cannot import: RStudio simply aborts, without any error message, when I try to read them in.

Essentially, this is what I am doing:

library(readtext)

readtext::readtext("pdf_1.pdf")

#> readtext::readtext("pdf_1.pdf")
#readtext object consisting of 1 document and 0 docvars.
## Description: df [1 × 2]
#doc_id    text               
#<chr>     <chr>              
#  1 pdf_1.pdf "\"      DEMO\"..."

readtext::readtext("pdf_2.pdf")

# RStudio aborts.

The funny thing is that both PDF files are remarkably similar in terms of usage rights, file size, contents (text surrounded by images), and creator. I am using the most recent versions of R and the RStudio IDE, as well as the most recent version of {readtext}, namely V 0.81.

Since I cannot provide the PDF files directly, please allow me to refer you to the following links, where the PDFs can be downloaded.

PDF that I can import: link

PDF that I cannot import: link

Word of advice: Don't spend too much time reading. They are the weekly newspapers of the German anti-lockdown movement, Querdenken. I am importing them into R for research purposes only. :)

Any help with this is much appreciated. I've run out of ideas.

Dr. Fabian Habersack
  • Which OS are you working on? If Windows, a non-R solution is to reprint the problematic PDFs with, for example, "Microsoft Print to PDF", and then read the reprinted PDF instead of the original one. – Nicolás Velasquez Mar 08 '23 at 16:24
  • Thanks again, that actually worked! Still, I'm wondering what the problem with the original PDF was and if there's a way to overcome it. :) – Dr. Fabian Habersack Mar 08 '23 at 16:37
  • I was able to reproduce the problem on Windows. It seems that the re-saving into a new PDF can be done programmatically with `{qpdf}`. – Nicolás Velasquez Mar 08 '23 at 19:44

1 Answer


This trick simply rewrites the problematic PDF. It uses qpdf::pdf_combine() to "fake combine" the file with nothing else, which still writes out a new PDF that R should be able to read on your OS.

library(readtext)
library(qpdf)

# Rewrite the problematic PDF by "combining" it with nothing else
pdf_combine(input = "problematic_01.pdf", output = "working_01.pdf")

# The rewritten copy can now be read
readtext("working_01.pdf")

readtext object consisting of 1 document and 0 docvars.
# Description: df [1 × 2]
  doc_id         text               
  <chr>          <chr>              
1 working_01.pdf "\"          \"..."
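
If there are several problematic files, the same rewrite-then-read step can be wrapped in a small helper. This is only a sketch: `rewritten_path()` and `read_rewritten_pdfs()` are hypothetical names made up for illustration, the `rewritten_` prefix and the use of `tempdir()` are arbitrary choices, and it assumes that {qpdf} and {readtext} are installed and that rewriting fixes every file the way it fixed the one above.

```r
# Hypothetical helper: path for the rewritten copy of a PDF.
rewritten_path <- function(path, out_dir = tempdir()) {
  file.path(out_dir, paste0("rewritten_", basename(path)))
}

# Hypothetical helper: rewrite each PDF with qpdf, then read the copies.
read_rewritten_pdfs <- function(paths, out_dir = tempdir()) {
  # Both packages must be installed for this to work.
  stopifnot(requireNamespace("qpdf", quietly = TRUE),
            requireNamespace("readtext", quietly = TRUE))

  out <- rewritten_path(paths, out_dir)

  # "Fake combine" each file on its own, writing a fresh copy.
  for (i in seq_along(paths)) {
    qpdf::pdf_combine(input = paths[i], output = out[i])
  }

  # Read the rewritten copies instead of the originals.
  readtext::readtext(out)
}

# Example call (file names are placeholders):
# read_rewritten_pdfs(c("problematic_01.pdf", "problematic_02.pdf"))
```

Note that the `doc_id` column will then carry the names of the rewritten copies, not the original file names.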
Nicolás Velasquez