I am trying to import text contained in PDF files into RStudio using {readtext}. In the past, this has worked smoothly, and for the most part it still does. However, there are a handful of PDF files I cannot import: RStudio aborts the session (no error message!) when I try to read in the file.
Essentially, this is what I am doing:
library(readtext)
readtext::readtext("pdf_1.pdf")
#> readtext object consisting of 1 document and 0 docvars.
#> # Description: df [1 × 2]
#>   doc_id    text
#>   <chr>     <chr>
#> 1 pdf_1.pdf "\" DEMO\"..."
readtext::readtext("pdf_2.pdf")
# RStudio aborts the session here, without an error message.
The funny thing is that both PDF files are remarkably similar in terms of usage rights, file size, contents (text surrounded by images), and creator. I am using the most recent versions of R and the RStudio IDE, as well as the most recent version of {readtext}, namely v0.81.
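One way to narrow this down without losing the session each time might be to run every import in a separate R process and only look at whether that process exits cleanly. The following is a minimal sketch using base R only. `run_isolated` and `pdf_reads_ok` are hypothetical helper names of my own, not part of {readtext}, and the file names are the ones from the example above. It only shows which files bring the child process down, not why.

```r
# Hypothetical helper (not part of {readtext}): evaluate an R expression,
# given as a string, in a fresh Rscript process and report whether that
# process exited with status 0. A crash in the child leaves this session alive.
run_isolated <- function(expr) {
  rscript <- file.path(R.home("bin"), "Rscript")
  status <- system2(rscript, c("-e", shQuote(expr)),
                    stdout = FALSE, stderr = FALSE)
  identical(as.integer(status), 0L)
}

# Hypothetical wrapper: build the readtext() call for one file.
# deparse() quotes and escapes the file name for use inside the expression.
pdf_reads_ok <- function(path) {
  run_isolated(sprintf("invisible(readtext::readtext(%s))", deparse(path)))
}

# Assumed file names, taken from the example above.
files <- c("pdf_1.pdf", "pdf_2.pdf")

# Usage: one TRUE/FALSE per file, computed without touching the main session.
# vapply(files, pdf_reads_ok, logical(1))
```

A `FALSE` here covers both an ordinary R error and a hard crash of the child process, so it is a filter for problematic files rather than a diagnosis.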
Since I cannot provide the PDF files directly, please allow me to refer you to the following links, where the PDFs can be downloaded.
PDF that I can import: link
PDF that I cannot import: link
Word of advice: Don't spend too much time reading them. They are the weekly newspapers of the German anti-lockdown movement, Querdenken. I am importing them into R purely for research purposes. :)
Any help with this is much appreciated. I've run out of ideas.