16

I have been trying to do OCR within R (reading data from a PDF whose pages are scanned images). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/

This is a very good post.

Effectively 3 steps:

  1. convert pdf to ppm (an image format)
  2. convert ppm to tif ready for tesseract (using ImageMagick for convert)
  3. convert tif to text file

The code for the above 3 steps, as per the linked post:

lapply(myfiles, function(i){
  # convert pdf to ppm (an image format), just pages 1-10 of the PDF
  # but you can change that easily, just remove or edit the 
  # -f 1 -l 10 bit in the line below
  shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
  # convert ppm to tif ready for tesseract
  shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
  # convert tif to text file
  shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
  # delete tif file
  file.remove(paste0(i, ".tif" ))
  })
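For comparison, here is a minimal sketch of the same three steps using `system2()` with separate argument vectors instead of one pasted and quoted string. It is not from the linked post and is not a confirmed fix for the error described below. The helper names `ocr_steps` and `run_ocr` are hypothetical, the `ocrbook` prefix is taken from the code above, and the executable paths are whatever applies on your machine.

```r
# Hypothetical helper (not from the linked post): build the arguments for
# each of the three steps as character vectors, quoting only the file names.
ocr_steps <- function(pdf, prefix = "ocrbook") {
  tif <- paste0(pdf, ".tif")
  list(
    # step 1: pdf -> ppm, pages 1-10 at 600 dpi, files named <prefix>-NNNNNN.ppm
    pdftoppm  = c("-f", "1", "-l", "10", "-r", "600", shQuote(pdf), prefix),
    # step 2: ppm -> tif (ImageMagick convert)
    convert   = c(paste0(prefix, "*.ppm"), shQuote(tif)),
    # step 3: tif -> text file named <pdf>.txt
    tesseract = c(shQuote(tif), shQuote(pdf), "-l", "eng")
  )
}

# Hypothetical runner: `exe` is a named list of paths to the three executables.
# Stops at the first step that returns a non-zero exit status.
run_ocr <- function(pdf, exe) {
  steps <- ocr_steps(pdf)
  for (name in names(steps)) {
    status <- system2(exe[[name]], steps[[name]])
    if (status != 0) stop(name, " failed with exit status ", status)
  }
  # delete the intermediate tif file, as in the original code
  file.remove(paste0(pdf, ".tif"))
}
```

Usage would look like `run_ocr(myfiles[1], list(pdftoppm = "F:/xpdf/bin64/pdftoppm.exe", convert = "F:/ImageMagick-6.9.1-Q16/convert.exe", tesseract = "F:/Tesseract-OCR/tesseract.exe"))`. Checking the exit status of each step at least shows which of the three programs is failing.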

The first two steps work fine (although they take a good amount of time for a 4-page PDF; I will look into scalability later, once I know whether this works at all).

While running the 3rd step, i.e.

shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))

I get this error:

Error: evaluation nested too deeply: infinite recursion / options(expressions=)?

Or Tesseract is crashing.

Any workaround or root cause analysis would be appreciated.

hcham1
anshuk_pal
  • can you give the content of `myfiles`? – bdecaf Aug 13 '15 at 06:00
  • @bdecaf - Unfortunately I cannot, due to data security issues. Essentially it is company financial statements (scanned images) inside a 4-page PDF. That single PDF is in `myfiles`. I think this is not an R issue but more of a Tesseract issue. – anshuk_pal Aug 13 '15 at 07:36
  • @r_analytics Did you find a solution for your problem? – R Yoda Jun 01 '16 at 14:12
  • I'm having a lot of trouble installing and running Tesseract. Any help on getting it going? – Pierre L Aug 25 '16 at 20:41

2 Answers

13

Using the "tesseract" package, I created a sample script that works. It works for scanned PDFs too.

library(tesseract)
library(pdftools)

# Render the pdf to tiff image(s), one file per page

img_file <- pdftools::pdf_convert("F:/gowtham/A/B/invoice.pdf", format = 'tiff',  dpi = 400)

# Extract text from the tiff image(s)
text <- ocr(img_file)
write.table(text, "F:/gowtham/A/B/mydata.txt")

I'm new to R and programming. Correct me if this is wrong. Hope this helps.

Lakshmana
  • Yes, this was helpful. One can also see a similar example in the `tesseract` vignette, [here](https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html#read_from_pdf_files). A minor note: `library(tesseract)` also loads the `pdftools` package, so there is no need to call `library(pdftools)`, especially since you also used the `pdftools::` style of calling a function. – Valentin_Ștefan Aug 30 '20 at 20:19
  • Also, for plain text files, it may be slightly better to use `writeLines` instead of `write.table`, see [here](https://stackoverflow.com/questions/2470248/write-lines-of-text-to-a-file-in-r). Hope this feedback helps. – Valentin_Ștefan Aug 30 '20 at 20:26
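To illustrate the `writeLines` suggestion in the comments, here is a minimal sketch. The `text` vector below is made-up sample data standing in for the character vector that `ocr()` returns (one element per page), and the output goes to a temporary file rather than the path used in the answer.

```r
# Made-up stand-in for the result of ocr(img_file): one string per page,
# with newlines inside each page's text.
text <- c("Invoice 001\nTotal: 10", "Page two")

# Write the text as-is, one line of output per line of recognised text.
out_file <- tempfile(fileext = ".txt")
writeLines(text, out_file)

# Read it back to check what was written.
lines <- readLines(out_file)
print(lines)
```

Unlike `write.table`, which treats its input as a table and adds a header, row names and quoting, `writeLines` writes only the text itself.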
7

The newly released tesseract package might be worth checking out. It allows you to perform the whole process inside of R without the shell calls.

Following the procedure used in the help documentation of the tesseract package, your function would look something like this:

library(pdftools)
library(tesseract)

lapply(myfiles, function(i){
  # render a page of the pdf to tiff and perform tesseract OCR on the image
  tiff_file <- paste0(i, ".tiff")
  # render the first page; numeric = TRUE returns 0-1 values for writeTIFF
  bitmap <- pdf_render_page(i, page = 1, dpi = 300, numeric = TRUE)
  tiff::writeTIFF(bitmap, tiff_file)
  # perform OCR on the .tiff file
  out <- ocr(tiff_file)
  # delete tiff file
  file.remove(tiff_file)
  # return the recognised text
  out
})
Marijn Stevering
  • just to clarify that this approach will only work on searchable pdfs and not scanned ones – Andres Mora Apr 19 '21 at 03:07
  • You can also consider an approach based on RDCOMClient (e.g. see https://stackoverflow.com/questions/42294770/refine-table-extracted-from-pdf-tabulizer/73750966#73750966). With the RDCOMClient R package, you can convert a PDF to Word using the OCR that is embedded in the Word software. Afterwards, it is possible to extract the text from the Word file. This approach works for both searchable and scanned PDFs. – Emmanuel Hamel Sep 16 '22 at 23:06
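On the searchable versus scanned distinction raised in the comments, here is a minimal sketch that uses the embedded text where a page has any and falls back to OCR otherwise. It assumes that `pdftools::pdf_text()` returns an empty or whitespace-only string for pages without a text layer, which may not hold for every PDF. The names `needs_ocr` and `pdf_text_or_ocr` are hypothetical.

```r
# Hypothetical helper: TRUE for pages whose extracted text is empty or
# whitespace only, which this sketch takes to mean "no text layer".
needs_ocr <- function(page_text) !nzchar(trimws(page_text))

# Hypothetical wrapper: keep embedded text where present, OCR the rest.
# Requires the pdftools and tesseract packages when called.
pdf_text_or_ocr <- function(pdf, dpi = 300) {
  txt <- pdftools::pdf_text(pdf)
  scanned <- which(needs_ocr(txt))
  if (length(scanned) > 0) {
    # render only the pages without text, one tiff file per page
    imgs <- pdftools::pdf_convert(pdf, format = "tiff", pages = scanned, dpi = dpi)
    on.exit(file.remove(imgs))
    txt[scanned] <- vapply(imgs, tesseract::ocr, character(1), USE.NAMES = FALSE)
  }
  txt
}
```

Calling `pdf_text_or_ocr(myfiles[1])` would return one string per page. For a fully scanned PDF every page goes through OCR, so the result should match the `pdf_convert` plus `ocr` approach shown in the first answer.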