I need to convert 24 PDF files in a folder into txt files so that I can perform semantic analysis on them. I took a look at this question, and proceeded from there. However, after getting the code to work the first time, I then changed some things around, and now I am getting the following error:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
Because of this, what is saved in the bodies variable in the code below is just a list of 24 blanks, and I end up with 24 blank text files (in addition to the 24 text files that are created by converting the PDFs into txt). I'm not sure what I've done wrong - at one point, this code worked!
I've already looked through what I could find about this error, but those are associated with read.csv, and the fixes they suggested (setting white.space=TRUE and quote="") did not work.
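For reference, here is a minimal sketch of how this warning can arise from scan() itself, independent of read.csv. It assumes, hypothetically, that the converted text contains a token that starts with an unmatched apostrophe; the sample text and temporary file are made up for illustration and are not taken from my files. As far as I understand the scan() documentation, quote is an argument of scan() too, and for character input read with the default whitespace separator it defaults to both ' and ".

```r
# Hypothetical sample text: one token starts with an apostrophe that is never closed
tmp <- tempfile(fileext = ".txt")
writeLines(c("Published in The Egoist",
             "'Tis a note from the editor",
             "Prepaid Advertisements"), tmp)

# Collect any warnings instead of printing them, but keep scan()'s return value
caught <- character()
default_words <- withCallingHandlers(
  scan(tmp, what = character(), quiet = TRUE),
  warning = function(w) {
    caught <<- c(caught, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# With quoting disabled, every whitespace-separated token is returned on its own
plain_words <- scan(tmp, what = character(), quote = "", quiet = TRUE)

print(caught)
print(length(default_words))
print(length(plain_words))
```

If the real files behave like this sample, passing quote = "" directly to scan() (rather than to read.csv) may be the relevant change, though that is an assumption about the cause.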
Here's the code (the warning comes from the scan() call inside the bodies <- lapply(...) block):
# folder with journal articles
PDFfolder_path <- "~/Dropbox/The Egoist PDFs/PDFs"
# vector of PDF file names
PDFfiles <- list.files(path=PDFfolder_path, pattern="*.pdf", full.names=TRUE)
# location of pdftotext.exe file
converter <- "~/Widgets/PDFConverter/bin64/pdftotext"
# folder with text files
textfolder_path <- "~/Dropbox/The Egoist PDFs/textfiles"
# convert PDFs in origin folder into txt files
lapply(PDFfiles, function(i) {
  system(paste(converter, paste0('"', i, '"')), wait=FALSE)
})
# it takes DropBox a bit of time to catch all of the folders
# without this we only end up with 23 txt files for some reason
Sys.sleep(.5)
txtfiles_in_PDFfolder_path <- list.files(path=PDFfolder_path, pattern="*.txt", full.names=TRUE)
# extracting only the Bodies of the articles
bodies <- lapply(txtfiles_in_PDFfolder_path, function(i){
  j <- paste0(scan(i, what = character()), collapse = " ")
  regmatches(j, gregexpr("(?<=Published).*?(?=Prepaid Advertisements)", j, perl=TRUE))
})
# write article-bodies into txt files
lapply(1:length(bodies), function(i){
  write.table(bodies[i], file=paste(txtfiles_in_PDFfolder_path[i], "body", "txt", sep="."), quote=FALSE, row.names=FALSE, col.names=FALSE, eol=" ")
})
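One thing worth checking in the listing steps above: the pattern argument of list.files() is a regular expression, not a shell wildcard, and the body files are written as <name>.txt.body.txt into the same folder as the converted text files. The sketch below uses a hypothetical scratch folder and made-up file names to show how the listing could be narrowed; whether leftover body files from an earlier run are actually being picked up in my case is an assumption.

```r
# Hypothetical scratch folder standing in for the PDF folder
demo_dir <- tempfile("egoist_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("issue1.pdf", "issue1.txt", "issue1.txt.body.txt")))

# pattern is a regular expression, so anchor the extension explicitly
pdf_files <- list.files(demo_dir, pattern = "\\.pdf$")

# glob2rx() translates a shell-style wildcard into an equivalent regex
pdf_files_glob <- list.files(demo_dir, pattern = utils::glob2rx("*.pdf"))

# Every file ending in .txt, which includes body files left by an earlier run
txt_files <- list.files(demo_dir, pattern = "\\.txt$")

# Drop previously written body files before processing
converted_only <- txt_files[!grepl("\\.body\\.txt$", txt_files)]

print(pdf_files)
print(txt_files)
print(converted_only)
```

Writing the body files to the separate textfiles folder defined earlier would avoid the filtering step altogether.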
EDIT: A bit more on the result of the bodies variable: the result is a list of 24, which takes the following form (as displayed in RStudio; I'm not sure of the actual name of this view):
bodies: list of 24
:List of 1
..$ : chr(0)
:List of 1
..$ : chr(0)
(repeating 24 times)
But I can't for the life of me figure out why it's chr(0) - I think it has something to do with the same kind of thing that's going on here - I'm definitely not capturing all of the lines.
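To narrow down where chr(0) can come from: regmatches() combined with gregexpr() returns a list holding character(0) whenever the pattern finds nothing, which matches the structure shown above. One possible reason for no match, sketched below, is a line break sitting between the two markers, because "." does not match a newline by default when perl=TRUE. The sample string is hypothetical, and I have not confirmed that the collapsed text from my files actually contains line breaks.

```r
# Hypothetical text with a line break between the two markers
x <- "Published first line\nsecond line Prepaid Advertisements"
pattern <- "(?<=Published).*?(?=Prepaid Advertisements)"

# By default "." does not match a newline under perl = TRUE, so nothing is found
no_match <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# The inline flag (?s) lets "." match newlines as well
with_dotall <- regmatches(x, gregexpr(paste0("(?s)", pattern), x, perl = TRUE))

str(no_match)
str(with_dotall)
```

Checking nchar() of the collapsed string and searching it for "Published" with grepl(..., fixed = TRUE) would be a quick way to tell an empty read apart from a failed match.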
I've tried everything I can think of, even switching readLines() for scan(), and I've looked to see if that might help. I've even switched scan() for read.table(), but it turns out that read.table() itself relies on scan! So... I'm stuck, and am just working my way in circles.