Text extraction from PDF returns strange results in R

Question

I am trying to mine text from a bunch of PDFs, but when I read them into R using `pdf_text` from the pdftools package, the text it returns is garbled and looks nothing like what is actually in the PDF file. OneDrive link: https://1drv.ms/b/s!AlTtlgN0WIa3s2qeq4yrv9fUu-Z6 . Here's the sample code I use:

library(pdftools)
pdf1 <- pdf_text("https://dl.dropboxusercontent.com/s/308gpdijvnw18mf/2018REQ118030709.pdf?dl=0")
pdf1   

     ## c("(’-*)&&$(&’-’’’’)*,&’$)’&/.\r\n     itiCHMON&\\     4Q\\a WN BQKPUWVL
     ##FQZOQVQI                                          )’(/ 7QZ[\\ 9ITN BMIT
     ##6[\\I\\M DI‘ 3QTT\r\n                    5Q^Q[QWV WN 4WTTMK\\QWV[\r\n                   
     ##FE 8_h -10+0\r\n                    HYSX]_^T’ L7 -.-1,(10+0                                                 
     ##3QTT >]UJMZ (/’*’.’0\r\n   IBKHHO F7L;HI ?D9                                                        
     ##@TMI[M ZMKWZL 3QTT >]UJMZ QV UMUW [MK\\QWV WN KPMKS\r\n   ,0+, L7BB;O H:\r\n  
     ##H?9>CED: L7 -.---(0/+1                                                         
     ##IVL QVKT]LM QV ITT WVTQVM JIVSQVO \\ZIV[IK\\QWV[\r\n                                
     ##@ZWXMZ\\a :VNWZUI\\QWV                                                          
     ##DI‘ :VNWZUI\\QWV\r\n     JQh OUQb5                                                          
     ##-+,3 J_dQ\\ 7TZecdUT 7^^eQ\\ 9XQbWUc5                                     
     ##!,+’/+/)++\r\n     3QTT >]UJMZ1                                .
     ##.. <truncated>

I am pretty new to R. Any idea what I may be doing wrong? Any help with this would be appreciated.

Edit: I have replaced the url with a working url and I have also included the results that I am getting.

We cannot reproduce your problem, because we do not have access to your pdf document. Try making a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). — phiver, Mar 03 '18 at 10:13
I have edited the url to the pdf, this one works fine now and is 100% reproducible. Thanks — Somto, Mar 03 '18 at 14:35
The fonts in question neither contain an **Encoding** nor a **ToUnicode** entry. Thus, text extraction from your pdf is educated guesswork at best. Apparently `pdf_text` does not guess correctly. — mkl, Mar 03 '18 at 20:50
@mkl Thanks for your effort. Is there any possible solution to extract text from the pdf? — Somto, Mar 04 '18 at 03:01
I don't know about `pdf_text`. One can guess more successfully here, though. Adobe reader for example does. — mkl, Mar 04 '18 at 08:36

score 5 · Accepted Answer · answered Mar 04 '18 at 08:51

Your pdf is a pdf image. It looks like a scan. pdftools cannot convert this directly into text. You can use pdftools to convert the page into a png and then the tesseract package to extract the text from it.

The code below will transform the first page into text. I will let you do the rest of the pages. Remember that OCR to text isn't perfect. You need to check the outcome.

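library(pdftools)
library(tesseract)

# Render page 1 of the pdf to a 600 dpi png
pdf_convert("https://dl.dropboxusercontent.com/s/308gpdijvnw18mf/2018REQ118030709.pdf?dl=0",
            pages = 1,
            dpi = 600,
            filenames = "page1.png")

# Run OCR on the rendered image and print the recognized text
text <- ocr("page1.png")
cat(text)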

More information is available in the tesseract vignette.
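
For the remaining pages, one possible extension of the single-page code is sketched here. It is only a sketch under some assumptions: `page_png_names` and `ocr_pdf_pages` are hypothetical helper names, the 300 dpi default is an arbitrary choice (the single-page example uses 600), and the pdf is assumed to be available as a local file. A comment on this answer reports crashes when reading from the URL but not when reading the pdf locally, so downloading first may help.

```r
# Sketch only: OCR every page of a local pdf.
# Assumes the pdftools and tesseract packages are installed.
# page_png_names() and ocr_pdf_pages() are hypothetical helpers.

# Build one png file name per page, e.g. page001.png, page002.png, ...
page_png_names <- function(n_pages, prefix = "page") {
  sprintf("%s%03d.png", prefix, seq_len(n_pages))
}

# Render all pages to png files and run OCR on each one.
# Returns a character vector with one element per page.
ocr_pdf_pages <- function(pdf_path, dpi = 300, out_dir = tempdir()) {
  n_pages <- pdftools::pdf_info(pdf_path)$pages
  png_files <- file.path(out_dir, page_png_names(n_pages))
  pdftools::pdf_convert(pdf_path, format = "png", dpi = dpi,
                        filenames = png_files)
  vapply(png_files,
         function(f) paste(tesseract::ocr(f), collapse = "\n"),
         character(1), USE.NAMES = FALSE)
}

# Usage (not run): download once, then OCR the local copy.
# pdf_url is a placeholder for the link in the question.
# local_pdf <- file.path(tempdir(), "document.pdf")
# download.file(pdf_url, local_pdf, mode = "wb")
# pages <- ocr_pdf_pages(local_pdf)
# cat(pages[1])
```

A lower dpi renders faster but may hurt recognition quality, so the result of each page still needs to be checked by hand.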

You also might want to remove access to this pdf. I'm not sure if this data should be publicly available.

"Your pdf is a pdf image" - That's not correct. Only the backgrounds are an image. The text is text. Nonetheless rendering as image and ocr'ing may return the contents better than text extraction as the pdf font encoding is missing. — mkl, Mar 04 '18 at 08:53
Thanks a lot for this answer. But for some reason, this crashes my RStudio as well as RGui. But when I read the pdf locally, it doesn't. Also the `pdf_convert` doesn't seem to render the pdf to "png" properly and consequently, the `ocr` has nothing concrete to work on. — Somto, Mar 04 '18 at 21:39
