Convert .pdf to .txt

Question

The problem is not new on Stack Overflow, but I am pretty sure I am missing something obvious.

I am trying to convert a few .pdf files into .txt files in order to mine their text. I based my approach on this excellent script. The text in the .pdf files does not consist of images, so no OCR is required.

# Load tm package
library(tm)

# The folder containing my PDFs
dest <- "./pdfs"

# Correctly installed xpdf from http://www.foolabs.com/xpdf/download.html

file.exists(Sys.which(c("pdfinfo", "pdftotext")))
[1] TRUE TRUE

# make a vector of PDF file names
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# Delete white spaces from pdfs' names
sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})

# refresh the vector so it holds the renamed files
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

lapply(myfiles, function(i) system(paste('"C:/Program Files/xpdf/bin64/pdftotext.exe"', 
paste0('"', i, '"')), wait = FALSE))

This should create a .txt copy of every .pdf file in the dest folder. I checked for problems with the path, for white spaces in the path, and for common xpdf installation issues, but nothing happens.

Here is the repository I am working on. If it would be useful, I can paste the output of sessionInfo(). Thanks in advance.

Does the program work if you enter the commands on the command line? — Jongware, Jun 16 '16 at 20:25
Sorry for the late answer. I just tried, but nothing happened. — Worice, Jun 17 '16 at 19:13
@PeterEllis thank you for the suggestion, next time I will try it for sure. — Worice, Jan 09 '17 at 14:06

score 1 · Accepted Answer · answered Jun 30 '18 at 11:35

Late answer, but I recently discovered that with current versions of tm (0.7-4) you can read PDFs directly into a corpus if you have pdftools installed (install.packages("pdftools")).

library(tm)

directory <- getwd() # change this to the directory where the pdf files are located

# read the pdfs with readPDF; the default engine is pdftools, see ?readPDF for more info
# pattern is a regular expression, so anchor it to match only names ending in .pdf
my_corpus <- VCorpus(DirSource(directory, pattern = "\\.pdf$"),
                     readerControl = list(reader = readPDF))
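
If you still want plain .txt files on disk, as the question asks, here is a minimal sketch that skips the corpus and calls pdftools directly. It assumes pdftools is installed and that the PDFs have a text layer. The helper names txt_path_for and convert_pdfs_to_txt are my own, not part of tm or pdftools, and I have not tested this against the files in the question.

```r
# Hypothetical helper: map each PDF path to a .txt path in the same folder.
# Only a trailing ".pdf" (any case) is replaced.
txt_path_for <- function(pdf_paths) {
  sub("\\.pdf$", ".txt", pdf_paths, ignore.case = TRUE)
}

# Hypothetical helper: extract the text of every PDF in `dir` and write it
# next to the source file. Assumes the pdftools package is installed.
convert_pdfs_to_txt <- function(dir) {
  pdf_paths <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                          ignore.case = TRUE)
  for (pdf in pdf_paths) {
    # pdf_text() returns a character vector with one element per page
    pages <- pdftools::pdf_text(pdf)
    writeLines(pages, txt_path_for(pdf))
  }
  # return the paths of the written files without printing them
  invisible(txt_path_for(pdf_paths))
}

# Example call, using the folder name from the question:
# convert_pdfs_to_txt("./pdfs")
```

Since this runs entirely inside R, it does not depend on the xpdf binaries or on quoting paths for system().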
