I am trying to mine text from a bunch of PDFs, but when I read them into R using pdf_text from the pdftools package, the text it produces is garbled and nothing like what is actually in the PDF file. OneDrive link: https://1drv.ms/b/s!AlTtlgN0WIa3s2qeq4yrv9fUu-Z6. Here's the sample code I use:

library(pdftools)
pdf1 <- pdf_text("https://dl.dropboxusercontent.com/s/308gpdijvnw18mf/2018REQ118030709.pdf?dl=0")
pdf1   

     ## c("(’-*)&&$(&’-’’’’)*,&’$)’&/.\r\n     itiCHMON&\\     4Q\\a WN BQKPUWVL
     ##FQZOQVQI                                          )’(/ 7QZ[\\ 9ITN BMIT
     ##6[\\I\\M DI‘ 3QTT\r\n                    5Q^Q[QWV WN 4WTTMK\\QWV[\r\n                   
     ##FE 8_h -10+0\r\n                    HYSX]_^T’ L7 -.-1,(10+0                                                 
     ##3QTT >]UJMZ (/’*’.’0\r\n   IBKHHO F7L;HI ?D9                                                        
     ##@TMI[M ZMKWZL 3QTT >]UJMZ QV UMUW [MK\\QWV WN KPMKS\r\n   ,0+, L7BB;O H:\r\n  
     ##H?9>CED: L7 -.---(0/+1                                                         
     ##IVL QVKT]LM QV ITT WVTQVM JIVSQVO \\ZIV[IK\\QWV[\r\n                                
     ##@ZWXMZ\\a :VNWZUI\\QWV                                                          
     ##DI‘ :VNWZUI\\QWV\r\n     JQh OUQb5                                                          
     ##-+,3 J_dQ\\ 7TZecdUT 7^^eQ\\ 9XQbWUc5                                     
     ##!,+’/+/)++\r\n     3QTT >]UJMZ1                                .
     ##.. <truncated>

I am pretty new to R. Any idea what I may be doing wrong? Any help with this would be appreciated.

Edit: I have replaced the URL with a working one and included the results I am getting.

Somto
  • We cannot reproduce your problem, because we do not have access to your pdf document. Try making a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – phiver Mar 03 '18 at 10:13
  • I have edited the url to the pdf, this one works fine now and is 100% reproducible. Thanks – Somto Mar 03 '18 at 14:35
  • Nope. Can't access the file. – phiver Mar 03 '18 at 18:12
  • The fonts in question contain neither an **Encoding** nor a **ToUnicode** entry. Thus, text extraction from your PDF is educated guesswork at best. Apparently `pdf_text` does not guess correctly. – mkl Mar 03 '18 at 20:50
  • @phiver Please try it again, it really works fine now – Somto Mar 04 '18 at 03:00
  • @mkl Thanks for your effort. Is there any possible solution to extract text from the pdf? – Somto Mar 04 '18 at 03:01
  • I don't know about `pdf_text`. One can guess more successfully here, though. Adobe Reader, for example, does. – mkl Mar 04 '18 at 08:36

1 Answer

Your PDF is a PDF image; it looks like a scan. pdftools cannot convert this directly into text. You can use pdftools to render each page as a PNG and then the tesseract package to extract the text from that image with OCR.

The code below converts the first page into text. I will let you do the rest of the pages. Remember that OCR isn't perfect, so you need to check the output.

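library(pdftools)
library(tesseract)

# Render page 1 of the PDF as a PNG; a high dpi gives the OCR engine more detail
pdf_convert("https://dl.dropboxusercontent.com/s/308gpdijvnw18mf/2018REQ118030709.pdf?dl=0",
            pages = 1,
            dpi = 600,
            filenames = "page1.png")

# Run OCR on the rendered image and print the recognised text
text <- ocr("page1.png")
cat(text)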
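
For the remaining pages, something along these lines might work. It is only a sketch: `ocr_pdf_pages` is a helper name made up here (it is not part of pdftools or tesseract), the default of 300 dpi is an assumption chosen to keep memory use lower than 600 dpi, and it expects a PDF that has already been saved locally. It relies on `pdf_convert` rendering all pages when `pages` is omitted and returning the names of the PNG files it wrote to the working directory.

```r
# Hypothetical helper: OCR every page of a locally saved PDF.
# Returns a character vector with one element of recognised text per page.
ocr_pdf_pages <- function(pdf_path, dpi = 300) {
  # Fail early if the file is missing, before touching pdftools or tesseract
  stopifnot(file.exists(pdf_path))

  # With `pages` omitted, pdf_convert renders every page and
  # returns the file names of the PNGs it created
  png_files <- pdftools::pdf_convert(pdf_path, format = "png", dpi = dpi)

  # OCR each rendered page in turn
  vapply(png_files, tesseract::ocr, character(1), USE.NAMES = FALSE)
}

# Usage (not run here): download once, then work on the local copy.
# "local_copy.pdf" is an arbitrary file name chosen for this example.
# download.file("https://dl.dropboxusercontent.com/s/308gpdijvnw18mf/2018REQ118030709.pdf?dl=0",
#               "local_copy.pdf", mode = "wb")
# pages_text <- ocr_pdf_pages("local_copy.pdf")
# cat(pages_text[1])
```

Downloading with `mode = "wb"` first keeps the binary file intact on Windows and avoids fetching the URL again for every call.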

More information is available in the tesseract vignette.
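
Because the result needs checking either way, a crude automatic check can help flag pages where extraction clearly went wrong. The sketch below assumes English text and simply measures how many tokens are very common English words. The name `looks_garbled`, the word list and the 5% threshold are all made up for illustration, so treat it as a starting point rather than a reliable test.

```r
# Hypothetical heuristic: flag text in which almost no token is a common
# English function word. Works on base R only; English-only by design.
looks_garbled <- function(txt, min_share = 0.05) {
  common <- c("the", "of", "and", "in", "to", "is", "for", "on",
              "this", "that", "with", "all")

  # Split on anything that is not a letter and normalise case
  tokens <- tolower(unlist(strsplit(paste(txt, collapse = " "), "[^A-Za-z]+")))
  tokens <- tokens[nzchar(tokens)]

  # No tokens at all also counts as a failed extraction
  if (length(tokens) == 0) return(TRUE)

  mean(tokens %in% common) < min_share
}

# A line taken from the garbled pdf_text output in the question
looks_garbled("@TMI[M ZMKWZL 3QTT >]UJMZ QV UMUW [MK\\QWV WN KPMKS")

# An ordinary English sentence (invented for this example)
looks_garbled("Please record the bill number in the memo section of the check")
```

The same function can be applied to the output of `pdf_text` and of `ocr` to decide which of the two to keep for a given page.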

You also might want to remove access to this PDF. I'm not sure if this data should be publicly available.

phiver
  • "You pdf is a pdf image" - That's not correct. Only the backgrounds are an image. The text is text. Nonetheless rendering as image and ocr'ing may return the contents better than text extraction as the pdf font encoding is missing. – mkl Mar 04 '18 at 08:53
  • Thanks a lot for this answer. But for some reason, this crashes my RStudio as well as RGui. But when I read the pdf locally, it doesn't. Also the `pdf_convert` doesn't seem to render the pdf to "png" properly and consequently, the `ocr` has nothing concrete to work on. – Somto Mar 04 '18 at 21:39