I need to convert 24 PDF files in a folder into txt files so that I can perform semantic analysis on them. I took a look at this question, and proceeded from there. However, after getting the code to work the first time, I then changed some things around, and now I am getting the following error:

In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

Because of this, what is saved in the bodies variable in the code below is just a list of 24 blanks, and I end up with 24 blank text files (in addition to the 24 text files that are created by converting the PDFs into txt). I'm not sure what I've done wrong - at one point, this code worked!

I've already looked through what I could find about this error, but those are associated with read.csv, and the fixes they suggested (setting white.space=TRUE and quote="") did not work.
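For context, here is a minimal sketch of one way this warning can arise with `scan()`. It assumes the converted text contains a token that begins with a quote character that is never closed (the sample sentence and the names `tmp`, `warn_msg`, `default_read`, and `plain_read` are invented for illustration). By default `scan()` treats `"` and `'` as quoting characters, so it keeps reading until the end of the file while looking for the closing quote. Passing `quote = ""` to `scan()` itself switches that off; since that setting reportedly did not help when tried here, this may not be the whole story.

```r
# Write a small file containing an opening double quote that is never closed.
tmp <- tempfile(fileext = ".txt")
writeLines(c("He said \"stop and never closed the quote.",
             "Second line here."), tmp)

# Default quoting: capture the warning text instead of printing it.
warn_msg <- NA_character_
default_read <- withCallingHandlers(
  scan(tmp, what = character(), quiet = TRUE),
  warning = function(w) {
    warn_msg <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

# Quoting disabled: every whitespace-separated token comes back on its own.
plain_read <- scan(tmp, what = character(), quote = "", quiet = TRUE)

print(warn_msg)
print(length(plain_read))
```

If the second call returns many more items than the first, unbalanced quotes in the text are a likely culprit.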

Here's the code (the error occurs on lines 20-23 of the code, in the block that builds `bodies`):

# folder with journal articles
PDFfolder_path <- "~/Dropbox/The Egoist PDFs/PDFs"
# vector of PDF file names
PDFfiles <- list.files(path=PDFfolder_path, pattern="*.pdf", full.names=TRUE)
# location of pdftotext.exe file
converter <- "~/Widgets/PDFConverter/bin64/pdftotext"
# folder with text files
textfolder_path <- "~/Dropbox/The Egoist PDFs/textfiles"

# convert PDFs in origin folder into txt files
lapply(PDFfiles, function(i) {
  system(paste(converter, paste0('"', i, '"')), wait=FALSE)
})
# it takes DropBox a bit of time to catch all of the folders
# without this we only end up with 23 txt files for some reason
Sys.sleep(.5)
txtfiles_in_PDFfolder_path <- list.files(path=PDFfolder_path, pattern="*.txt", full.names=TRUE)

# extracting only the Bodies of the articles
bodies <- lapply(txtfiles_in_PDFfolder_path, function(i){
  j <- paste0(scan(i, what = character()),  collapse = " ")
  regmatches(j, gregexpr("(?<=Published).*?(?=Prepaid Advertisements)", j, perl=TRUE))
})

# write article-bodies into txt files
lapply(1:length(bodies), function(i){
  write.table(bodies[i], file=paste(txtfiles_in_PDFfolder_path[i], "body", "txt", sep="."), quote=FALSE, row.names=FALSE, col.names=FALSE, eol=" ")
})
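Two details of the file-handling steps above may be worth checking. First, the `pattern` argument of `list.files()` is a regular expression, not a shell glob, so an anchored pattern such as `"\\.txt$"` is more predictable than `"*.txt"`. Second, because the `.body.txt` files are written into the same folder as the converted files, a second run of the script would pick them up as input. The sketch below uses a throwaway folder and made-up file names (`issue1.pdf` and so on) to illustrate both points; it is not the original data.

```r
# Build a temporary folder that mimics the PDF folder after one full run.
demo_dir <- tempfile("egoist_demo_")
dir.create(demo_dir)
invisible(file.create(file.path(
  demo_dir,
  c("issue1.pdf", "issue1.txt", "issue1.txt.body.txt", "my_pdf_notes.md")
)))

# Anchored regular expressions match the extension only at the end of the name.
pdf_files <- list.files(demo_dir, pattern = "\\.pdf$")
txt_files <- list.files(demo_dir, pattern = "\\.txt$")

# Drop the output of a previous run so it is not processed again.
converted <- txt_files[!grepl("\\.body\\.txt$", txt_files)]

print(pdf_files)
print(txt_files)
print(converted)
```

Separately, `system()` waits for the command to finish by default (`wait = TRUE`). With `wait = FALSE`, the fixed `Sys.sleep(.5)` may not be long enough for all the conversions to complete, so some text files could be missing or only partly written when they are read.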

EDIT: A bit more on the `bodies` variable: it is a list of 24, which RStudio displays as follows (I'm not sure what this view is actually called): `bodies: list of 24`, followed by `:List of 1` and `..$ : chr(0)`, repeated 24 times.

But I can't for the life of me figure out why it's chr(0). I think it has something to do with the same kind of thing that's going on here: I'm definitely not capturing all of the lines.
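As a point of reference, `regmatches()` combined with `gregexpr()` returns `character(0)` for any input in which the pattern does not match at all. The sketch below reuses the regular expression from the code above on two invented strings (`with_markers` and `without_markers`, plus the helper `extract_body()`, are hypothetical) to show what a hit and a miss look like. A `chr(0)` result would therefore be consistent with the collapsed text not containing both "Published" and a later "Prepaid Advertisements".

```r
# Same lookbehind/lookahead pattern as in the question.
pattern <- "(?<=Published).*?(?=Prepaid Advertisements)"

# Hypothetical inputs: one contains both markers, the other contains neither.
with_markers <- "Masthead Published the body of the article Prepaid Advertisements rates"
without_markers <- "Masthead the body of the article rates"

# Return the matches for a single string as a plain character vector.
extract_body <- function(txt) {
  regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]
}

hit <- extract_body(with_markers)
miss <- extract_body(without_markers)

print(hit)
print(length(miss))
```

Checking `grepl("Published", j, fixed = TRUE)` and `grepl("Prepaid Advertisements", j, fixed = TRUE)` on each collapsed string would show which marker, if any, is missing.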

I've tried everything I can think of, even switching readLines() for scan(), and I've looked to see if that might help. I've even switched scan() for read.table(), but it turns out that read.table() itself relies on scan()! So I'm stuck, and am just going in circles.

mlinegar
  • Does this happen with one PDF in particular? (try them each separately). Have you inspected the contents of your converted files? Without a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610) it will be very difficult to help you. – MrFlick Jul 06 '15 at 23:29
  • It happens for every PDF. And I've looked through the contents for the converted files, but they're relatively long, and I'm not sure what to look for. But I made a pastebin for one of the converted documents, hopefully that will help! http://pastebin.com/dDfzJC91 – mlinegar Jul 06 '15 at 23:35
  • Well, it sure sounds like there is bad data in the converted text files. Unless you can extract a portion that reproduces this behavior, it's very difficult to tell you exactly what's wrong. – MrFlick Jul 06 '15 at 23:38
  • Are you sure you need `scan()` rather than `readLines()`? (Note they do return different vectors but it doesn't seem like you need to break by word) – MrFlick Jul 06 '15 at 23:40
  • Hmm, okay. I'll try to find a portion that causes this to happen. However, at one point this was working fine - when the code was going well, R converted and cut the txt files perfectly well (and very quickly). But then I changed something (not at all sure what) and for each file R started to say things like `Read 10648 items`, and began to take noticeably longer. – mlinegar Jul 06 '15 at 23:42
  • Progress! When I changed to `readLines( )`, I get `Warning message: In readLines(i) : incomplete final line found on '/Users/mlinegar/Dropbox/The Egoist PDFs/PDFs/The Egoist 2.1.1914.txt'`. But I'm not sure what would cause an incomplete final line... any ideas? – mlinegar Jul 06 '15 at 23:45
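
On the `incomplete final line` warning in the last comment: `readLines()` issues it when the last line of a file is not terminated by a newline, and it still returns that line. A minimal sketch, assuming a file whose text simply stops without a trailing newline (the contents and the names `tmp`, `lines_read`, and `collapsed` are made up for illustration):

```r
# Create a file whose final line has no trailing newline.
tmp <- tempfile(fileext = ".txt")
cat("first line\nlast line without newline", file = tmp)

# warn = FALSE suppresses the warning; the last line is read either way.
lines_read <- readLines(tmp, warn = FALSE)

# Collapse to one string, as the question does before applying the regex.
collapsed <- paste(lines_read, collapse = " ")

print(lines_read)
print(collapsed)
```

On its own, then, this warning is unlikely to explain the empty results; it only says the converted file does not end with a newline.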

0 Answers