
I am trying to use the R interface to tesseract to create a PDF file with embedded text. I have seen the previous question tesseract (v3.03) output as PDF but it is about using the command line interface to tesseract. This question is about the R interface. I set the `tessedit_create_pdf` option to 1, but no new PDF file was created, and I do not see an option for setting the output file. How can I make tesseract create a PDF with embedded text? The code below generates good text in memory, but no PDF file.

library(tesseract)
packageVersion("tesseract")
[1] ‘4.1.1’

eng1P <- tesseract(language = "eng", 
    options = list(tessedit_pageseg_mode = 1,
        tessedit_create_pdf=1))

text0 <- tesseract::ocr("TestImage.png", engine = eng1P)
cat(text0[[1]])

This image can be used for testing.

Test Image

G5W
  • I had no joy getting any output to text or PDF with R using `tessedit_create_txt` or `tessedit_create_pdf`. (For setting the output file, perhaps `document_title` would be used, but nothing is produced.) The obvious easy alternative is to run the tesseract command with a system call, or you could use rmarkdown to render the text, but I'm sure you know this. – user20650 Aug 30 '21 at 00:34
  • Also see related question: https://stackoverflow.com/questions/69020976/convert-scanned-pdf-to-searcheable-pdf-in-r and issue: github.com/ropensci/tesseract/issues/51 – Bryan Shalloway Feb 06 '22 at 02:30
  • @BryanShalloway Thanks for pointing this out. There is a kind of an answer in the comments there, but `rmarkdown::render` requires the external program pandoc. I would like to be able to do this entirely in R. I am glad to see the GitHub issue (and that others want this feature too). – G5W Feb 06 '22 at 14:25
  • Simple R function here if you're using Ubuntu: [PDF to Searchable Text PDF in R](https://stackoverflow.com/questions/69020976/convert-scanned-pdf-to-searcheable-pdf-in-r/72455688#72455688) – Hayward Oblad Jun 01 '22 at 01:58
  • @HaywardOblad I am working in Windows, but thank you for pointing out the earlier question. I had not seen it. – G5W Jun 01 '22 at 12:37

3 Answers


I was able to convert your image to a searchable PDF with the following code, which uses RDCOMClient instead of tesseract. I thought it might be of interest to you.

First, the image is converted to a scanned PDF. Next, Word's OCR is used to convert that PDF to a Word document. Finally, the Word document is saved as a searchable PDF file.

library(RDCOMClient)
library(magick)

################################################
#### Step 1 : We convert the image to a PDF ####
################################################

path_PDF <- "D:\\Dropbox\\Reponses_Stackoverflow\\temp.pdf"
path_PNG <- "D:\\Dropbox\\Reponses_Stackoverflow\\lnCHO.png"
path_Word <- "D:\\Dropbox\\Reponses_Stackoverflow\\temp.docx"

pdf(path_PDF, height = 12, width = 8)

im <- image_read(path_PNG)
plot(im)
dev.off()

####################################################################
#### Step 2 : We use the OCR of Word to convert the PDF to word ####
####################################################################
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE

doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)

doc$SaveAs2(path_Word)

###############################################
#### Step 3 : Convert word document to pdf ####
###############################################
wordApp[["ActiveDocument"]]$SaveAs(path_PDF, FileFormat = 17) # FileFormat = 17 saves as .PDF
doc$Close()
wordApp$Quit() # quit wordApp
Emmanuel Hamel
  • To the best of my understanding, the image in your post first has to be converted to a scanned PDF file. – Emmanuel Hamel Sep 21 '22 at 12:49
  • I am going to leave the question open. Your answer is useful. Thank you. However, in the environment where I intend to use this, I cannot use `RDCOMClient`, nor any package that does not come from CRAN. – G5W Sep 21 '22 at 17:17

At my job, I sometimes call ECopy (http://www.ecopysoftware.com/) from R to convert scanned PDFs to searchable PDFs. ECopy is not free software, but it is powerful.

I use the following function:

ecopy_Scanned_PDF_To_Numeric_PDF <- function(directory_Scanned_PDF, directory_Numeric_PDF)
{
  path_To_BatchConverter <- "C:/Program Files (x86)/Nuance/eCopy PDF Pro Office 6/BatchConverter.com"
  args <- paste0("-I", directory_Scanned_PDF, "\\*.pdf -O", directory_Numeric_PDF, " -Tpdfs -Lfre -W -V1.5 -J -Ao")
  system2(path_To_BatchConverter, args = args)
}

Maybe you can install ECopy in your environment and call it from R.

Emmanuel Hamel
  • Sorry, I cannot. We have a very restricted software list. That is why I asked about R and tesseract. I _do_ have that. Other programs are not useful in my case. – G5W Sep 21 '22 at 21:18

I installed tesseract on my computer (see https://indiantechwarrior.com/how-to-install-tesseract-on-windows/) and was able to convert the image into a searchable PDF with the following code:

path_Tesseract <- "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"
args <- "D:/stackoverflow110.png D:/stackoverflow110 -l eng pdf"  # "pdf" is the tesseract config name; output is D:/stackoverflow110.pdf
system2(command = path_Tesseract, args = args)
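
The call above can be wrapped in a small helper so that paths containing spaces are quoted and a missing executable fails early. This is only a sketch under assumptions: `build_tesseract_args()` and `image_to_searchable_pdf()` are hypothetical names, not part of any package, and the default executable path is the one from this answer, which may differ on your machine. The config name is written in lowercase (`pdf`), matching tesseract's config file name. It still requires the command-line tesseract to be installed.

```r
# Hypothetical helper: build the argument vector for the tesseract CLI.
# `out_base` is the output path WITHOUT extension; tesseract appends ".pdf".
build_tesseract_args <- function(image, out_base, lang = "eng") {
  c(shQuote(image), shQuote(out_base), "-l", lang, "pdf")
}

# Hypothetical wrapper: run tesseract and return the path of the PDF
# it is expected to write. The default path is an assumption.
image_to_searchable_pdf <- function(image, out_base, lang = "eng",
    tesseract_exe = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe") {
  if (!file.exists(tesseract_exe)) {
    stop("tesseract executable not found: ", tesseract_exe)
  }
  if (!file.exists(image)) {
    stop("input image not found: ", image)
  }
  # system2() returns the exit status when output is not captured
  status <- system2(command = tesseract_exe,
                    args = build_tesseract_args(image, out_base, lang))
  if (status != 0) {
    stop("tesseract exited with status ", status)
  }
  paste0(out_base, ".pdf")
}

# Example call (paths taken from this answer):
# image_to_searchable_pdf("D:/stackoverflow110.png", "D:/stackoverflow110")
```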
Emmanuel Hamel
  • As I said in my question: "I have seen the previous question tesseract (v3.03) output as PDF but it is about using the command line interface to tesseract. This question is about the R interface." I am not permitted to install the command-line tesseract on my machine. – G5W Sep 21 '22 at 23:12